Senior Software Engineer - Compute Infrastructure (Orchestration & Scheduling)

ByteDance · Big Tech · San Jose, CA · Infrastructure

Senior Software Engineer focused on building and optimizing large-scale compute infrastructure (Kubernetes, Serverless) for AI and LLM workloads, covering scheduling, resource management, and inference. The role involves improving performance, scalability, and cost efficiency for training and inference across heterogeneous resources (CPU, GPU), with an emphasis on open-sourcing key technologies.

What you'd actually do

  1. Engineer hyper-scale cluster management: Enhance Kubernetes-based cluster platforms to deliver exceptional performance, scalability, and resilience—powering resource management across ByteDance’s massive global infrastructure.
  2. Innovate on core scheduling capabilities: Design and maintain a truly unified scheduling system that powers diverse workloads (containers & VMs, online services, offline computing, AI/ML, CPU/GPU workloads, etc.) in a massive-scale resource pool.
  3. Develop an intelligent scheduling system: Leverage AI models to optimize workload performance and resource utilization across heterogeneous resources (CPU, GPU, memory, network, and power) in global data centers.
  4. Lead infrastructure for next-gen ML workloads: Design and drive the evolution of compute platforms purpose-built for fast, reliable, and cost-effective ML and LLM training/inference.
  5. Deliver quality and innovation: Write high-quality, maintainable code, and stay at the forefront of open-source and research advancements in AI, ML, systems, and Serverless technologies.

Skills

Required

  • Kubernetes
  • Serverless technologies
  • Distributed and parallel systems
  • High-performance networking systems
  • Large-scale software system development
  • Cloud and ML infrastructure
  • Resource management and allocation
  • Job scheduling
  • Monitoring
  • Docker
  • Python
  • Go
  • C++
  • Rust
  • Java

Nice to have

  • Ray
  • YARN
  • Mesos
  • Large-scale resource efficiency management
  • Job scheduling development
  • Application scaling
  • Workload co-location
  • Isolation enhancement
  • AWS SageMaker
  • Azure ML
  • GCP Vertex AI
  • System efficiency
  • Quality
  • Performance
  • Scalability

What the JD emphasized

  • large, reliable, and efficient compute infrastructure
  • AI and LLM workloads
  • building cutting-edge, industry-leading infrastructure that empowers AI innovation
  • high performance, scalability, and reliability to support the most demanding AI/LLM workloads
  • enhancing resource cost efficiency on a massive scale
  • optimize our infrastructure for AI & LLM models
  • better utilize computing resources (including CPU, GPU, power, etc.)
  • directly impacting the performance of all our AI services
  • growing compute infrastructure in overseas regions
  • scale and optimize our infrastructure globally
  • hyper-scale cluster management
  • exceptional performance, scalability, and resilience
  • truly unified scheduling
  • massive-scale resource pool
  • intelligent scheduling system
  • optimize workload performance and resource utilization
  • heterogeneous resources
  • Next-Gen ML Workloads
  • fast, reliable, and cost-effective ML and LLM training/inference
  • cloud and ML infrastructure
  • job scheduling and monitoring
  • large scale resource efficiency management and job scheduling development
  • application scaling, workload co-location, and isolation enhancement

Other signals

  • powering AI innovation
  • AI/ML, CPU/GPU workloads
  • AI models to optimize workload performance and resource utilization
  • heterogeneous resources (CPU, GPU, memory, network, and power)
  • ML and LLM training/inference
  • ML services