Software Engineer - Compute Infrastructure (Orchestration & Scheduling)

ByteDance · Big Tech · San Jose, CA · Infrastructure

A Software Engineer role focused on building and optimizing large-scale compute infrastructure (Kubernetes, serverless) for AI and LLM workloads, with an emphasis on resource efficiency, scheduling, and reliability. The role involves developing intelligent scheduling systems that leverage AI models, and leading infrastructure for ML training and inference.

What you'd actually do

  1. Engineer hyper-scale cluster management: Enhance Kubernetes-based cluster platforms to deliver exceptional performance, scalability, and resilience—powering resource management across ByteDance’s massive global infrastructure.
  2. Innovate on core scheduling capabilities: Design and maintain a truly unified scheduling system that powers diverse workloads (containers and VMs, online services, offline computing, AI/ML, CPU/GPU workloads, etc.) in a massive-scale resource pool.
  3. Develop an intelligent scheduling system: Leverage AI models to optimize workload performance and resource utilization across heterogeneous resources—including CPU, GPU, memory, network, and power—spanning global data centers.
  4. Lead infrastructure for next-gen ML workloads: Design and drive the evolution of compute platforms purpose-built for fast, reliable, and cost-effective ML and LLM training/inference.
  5. Deliver quality and innovation: Write high-quality, maintainable code and stay at the forefront of open-source and research advancements in AI, ML, systems, and serverless technologies.
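To make the scheduling work above concrete, here is a minimal sketch in Go of a node-scoring routine in the bin-packing ("most allocated") style that schedulers like Kubernetes use to raise utilization. All names (`node`, `scoreMostAllocated`) and the 0–100 scoring convention are illustrative assumptions, not ByteDance's actual system; a real plugin would hook into the Kubernetes scheduler framework's Score extension point rather than stand alone.

```go
package main

import "fmt"

// node describes one node's allocatable CPU and how much of it is
// already requested by running pods. Hypothetical type for illustration.
type node struct {
	name          string
	cpuMilli      int64 // total allocatable CPU, in millicores
	cpuAllocMilli int64 // CPU already requested by running pods
}

// scoreMostAllocated favors tightly packed nodes (bin-packing), which
// tends to improve utilization for batch/offline workloads.
// Returns a score in [0, 100]; 0 means the pod does not fit.
func scoreMostAllocated(n node, reqMilli int64) int64 {
	free := n.cpuMilli - n.cpuAllocMilli
	if reqMilli > free {
		return 0 // pod does not fit on this node
	}
	return (n.cpuAllocMilli + reqMilli) * 100 / n.cpuMilli
}

func main() {
	nodes := []node{
		{name: "node-a", cpuMilli: 8000, cpuAllocMilli: 6000},
		{name: "node-b", cpuMilli: 8000, cpuAllocMilli: 1000},
	}
	req := int64(1000) // a pod requesting 1 CPU
	for _, n := range nodes {
		fmt.Printf("%s score=%d\n", n.name, scoreMostAllocated(n, req))
	}
}
```

A "least allocated" policy (spreading, common for latency-sensitive online services) would simply invert the fraction; running both and weighting them per workload class is one simple way the online/offline co-location mentioned above can be expressed.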

Skills

Required

  • Kubernetes
  • Serverless technologies
  • distributed and parallel systems
  • high-performance networking systems
  • developing large-scale software systems
  • cloud and ML infrastructure
  • resource management
  • allocation
  • job scheduling
  • monitoring
  • Docker
  • Python
  • Go
  • C++
  • Rust
  • Java

Nice to have

  • Ray
  • YARN
  • Mesos
  • large-scale resource efficiency management
  • job scheduling development
  • application scaling
  • workload co-location
  • isolation enhancement
  • AWS
  • Azure
  • GCP
  • AWS SageMaker
  • Azure ML
  • GCP Vertex AI

What the JD emphasized

  • large-scale
  • AI and LLM workloads
  • resource cost efficiency
  • optimize our infrastructure for AI & LLM models
  • better utilize computing resources
  • ML and LLM training/inference

Other signals

  • powers hundreds of large-scale clusters globally
  • millions of online containers and offline jobs daily
  • AI and LLM workloads
  • enhancing resource cost efficiency on a massive scale
  • optimize our infrastructure for AI & LLM models
  • better utilize computing resources (including CPU, GPU, power, etc.)
  • directly impacting the performance of all our AI services
  • growing compute infrastructure in overseas regions
  • scale and optimize our infrastructure globally