Software Engineer - Compute Infrastructure (Orchestration & Scheduling)

ByteDance · Big Tech · Seattle, WA · Infrastructure

Software Engineer role focused on building and optimizing large-scale compute infrastructure (Kubernetes, Serverless) to support AI and LLM workloads, including training and inference. The role involves enhancing cluster management, developing intelligent scheduling systems leveraging AI models for resource optimization, and leading infrastructure for next-gen ML workloads.

What you'd actually do

  1. Engineer hyper-scale cluster management: Enhance Kubernetes-based cluster platforms to deliver exceptional performance, scalability, and resilience—powering resource management across ByteDance’s massive global infrastructure.
  2. Innovate on core scheduling capabilities: Design and maintain a truly unified scheduling system that powers diverse workloads (containers and VMs, online services, offline computing, AI/ML, CPU/GPU workloads, etc.) in a massive-scale resource pool.
  3. Develop an intelligent scheduling system: Leverage AI models to optimize workload performance and resource utilization across heterogeneous resources—including CPU, GPU, memory, network, and power across global data centers.
  4. Lead infrastructure for next-gen ML workloads: Design and drive the evolution of compute platforms purpose-built for fast, reliable, and cost-effective ML and LLM training/inference.
  5. Deliver quality and innovation: Write high-quality, maintainable code, and stay at the forefront of open-source and research advancements in AI, ML, systems, and Serverless technologies.

Skills

Required

  • Kubernetes
  • Serverless technologies
  • distributed and parallel systems
  • high-performance networking systems
  • developing large-scale software systems
  • cloud and ML infrastructure
  • resource management and allocation
  • job scheduling
  • monitoring
  • Docker
  • Python
  • Go
  • C++
  • Rust
  • Java

Nice to have

  • Kubernetes
  • Ray
  • YARN
  • Mesos
  • large-scale resource efficiency management
  • job scheduling development
  • application scaling
  • workload co-location
  • isolation enhancement
  • AWS SageMaker
  • Azure ML
  • GCP Vertex AI

What the JD emphasized

  • large-scale compute infrastructure
  • AI and LLM workloads
  • optimize our infrastructure for AI & LLM models
  • resource cost efficiency on a massive scale
  • AI services
  • ML and LLM training/inference
  • large-scale software systems
  • cloud and ML infrastructure
  • large-scale cluster management systems
  • large-scale resource efficiency management

Other signals

  • support the most demanding AI/LLM workloads