Software Engineer - Compute Infrastructure (Orchestration & Scheduling)

ByteDance · Big Tech · Seattle, WA · Infrastructure

Software Engineer role focused on building and optimizing large-scale compute infrastructure (Kubernetes, Serverless) to support AI and LLM workloads, including training and inference. The role involves enhancing cluster management, developing intelligent scheduling systems leveraging AI models for resource optimization, and leading infrastructure for next-gen ML workloads.

What you'd actually do

  1. Engineer hyper-scale cluster management: Enhance Kubernetes-based cluster platforms to deliver exceptional performance, scalability, and resilience—powering resource management across ByteDance’s massive global infrastructure.
  2. Innovate on core scheduling capabilities: Design and maintain a truly unified scheduling system that powers diverse workloads (containers and VMs, online services, offline computing, AI/ML, CPU/GPU workloads, etc.) in a massive-scale resource pool.
  3. Develop an intelligent scheduling system: Leverage AI models to optimize workload performance and resource utilization across heterogeneous resources—including CPU, GPU, memory, network, and power across global data centers.
  4. Lead infrastructure for next-gen ML workloads: Design and drive the evolution of compute platforms purpose-built for fast, reliable, and cost-effective ML and LLM training/inference.
  5. Deliver quality and innovation: Write high-quality, maintainable code, and stay at the forefront of open-source and research advancements in AI, ML, systems, and Serverless technologies.

Skills

Required

  • Kubernetes
  • Serverless technologies
  • distributed and parallel systems
  • high-performance networking systems
  • developing large-scale software systems
  • cloud and ML infrastructure
  • resource management and allocation
  • job scheduling
  • monitoring
  • Docker
  • Python
  • Go
  • C++
  • Rust
  • Java

Nice to have

  • Kubernetes
  • Ray
  • YARN
  • Mesos
  • large-scale resource efficiency management
  • job scheduling development
  • application scaling
  • workload co-location
  • isolation enhancement
  • AWS SageMaker
  • Azure ML
  • GCP Vertex AI

What the JD emphasized

  • large-scale compute infrastructure
  • AI and LLM workloads
  • optimize our infrastructure for AI & LLM models
  • resource cost efficiency on a massive scale
  • AI services
  • ML and LLM training/inference
  • large-scale software systems
  • cloud and ML infrastructure
  • large-scale cluster management systems
  • large-scale resource efficiency management

Other signals

  • support the most demanding AI/LLM workloads