Senior Software Engineer - Compute Infrastructure (Orchestration & Scheduling)

ByteDance · Big Tech · San Jose, CA · Infrastructure

Senior Software Engineer focused on building and optimizing large-scale compute infrastructure (Kubernetes, Serverless) for AI and LLM workloads, covering scheduling, resource management, and inference. The role involves improving performance, scalability, and cost efficiency for training and inference across heterogeneous resources (CPU, GPU), with an emphasis on open-sourcing key technologies.

What you'd actually do

  1. Engineer hyper-scale cluster management: Enhance Kubernetes-based cluster platforms to deliver exceptional performance, scalability, and resilience—powering resource management across ByteDance’s massive global infrastructure.
  2. Innovate on core scheduling capabilities: Design and maintain a truly unified scheduling system that powers diverse workloads (containers & VMs, online services, offline computing, AI/ML, CPU/GPU workloads, etc.) in a massive-scale resource pool.
  3. Develop an intelligent scheduling system: Leverage AI models to optimize workload performance and resource utilization across heterogeneous resources (CPU, GPU, memory, network, and power) in global data centers.
  4. Lead infrastructure for next-gen ML workloads: Design and drive the evolution of compute platforms purpose-built for fast, reliable, and cost-effective ML and LLM training/inference.
  5. Deliver quality and innovation: Write high-quality, maintainable code, and stay at the forefront of open-source and research advancements in AI, ML, systems, and Serverless technologies.

Skills

Required

  • Kubernetes
  • Serverless technologies
  • Distributed and parallel systems
  • High-performance networking systems
  • Large-scale software system development
  • Cloud and ML infrastructure
  • Resource management and allocation
  • Job scheduling
  • Monitoring
  • Docker
  • Python
  • Go
  • C++
  • Rust
  • Java

Nice to have

  • Ray
  • YARN
  • Mesos
  • Large-scale resource efficiency management
  • Job scheduling development
  • Application scaling
  • Workload co-location
  • Isolation enhancement
  • AWS SageMaker
  • Azure ML
  • GCP Vertex AI
  • System efficiency
  • Quality
  • Performance
  • Scalability

What the JD emphasized

  • large, reliable, and efficient compute infrastructure
  • AI and LLM workloads
  • building cutting-edge, industry-leading infrastructure that empowers AI innovation
  • high performance, scalability, and reliability to support the most demanding AI/LLM workloads
  • enhancing resource cost efficiency on a massive scale
  • optimize our infrastructure for AI & LLM models
  • better utilize computing resources (including CPU, GPU, power, etc.)
  • directly impacting the performance of all our AI services
  • growing compute infrastructure in overseas regions
  • scale and optimize our infrastructure globally
  • hyper-scale cluster management
  • exceptional performance, scalability, and resilience
  • truly unified scheduling
  • massive-scale resource pool
  • intelligent scheduling system
  • optimize workload performance and resource utilization
  • heterogeneous resources
  • Next-Gen ML Workloads
  • fast, reliable, and cost-effective ML and LLM training/inference
  • cloud and ML infrastructure
  • job scheduling and monitoring
  • large scale resource efficiency management and job scheduling development
  • application scaling, workload co-location, and isolation enhancement

Other signals

  • powering AI innovation
  • AI/ML, CPU/GPU workloads
  • AI models to optimize workload performance and resource utilization
  • heterogeneous resources (CPU, GPU, memory, network, and power)
  • ML and LLM training/inference
  • ML services