Senior Software Engineer - Compute Infrastructure (Orchestration & Scheduling)

ByteDance · Big Tech · Seattle, WA · Infrastructure

Senior Software Engineer focused on building and optimizing large-scale compute infrastructure (Kubernetes, Serverless) for AI and LLM workloads, including scheduling, resource management, and inference. The role involves developing intelligent scheduling systems using AI models and contributing to open-source projects.

What you'd actually do

  1. Engineer hyper-scale cluster management: Enhance Kubernetes-based cluster platforms to deliver exceptional performance, scalability, and resilience—powering resource management across ByteDance’s massive global infrastructure.
  2. Innovate on core scheduling capabilities: Design and maintain a truly unified scheduling system that powers diverse workloads (containers & VMs, online services, offline computing, AI/ML, CPU/GPU workloads, etc.) in a massive-scale resource pool.
  3. Develop an intelligent scheduling system: Leverage AI models to optimize workload performance and resource utilization across heterogeneous resources—including CPU, GPU, memory, network, and power across global data centers.
  4. Lead infrastructure for next-gen ML workloads: Design and drive the evolution of compute platforms purpose-built for fast, reliable, and cost-effective ML and LLM training/inference.
  5. Deliver quality and innovation: Write high-quality, maintainable code, and stay at the forefront of open-source and research advancements in AI, ML, systems, and Serverless technologies.

Skills

Required

  • B.S./M.S. degree in Computer Science, Computer Engineering, or a related area, with 3+ years of relevant industry experience
  • Experience with Unix/Linux environments
  • Experience with distributed and parallel systems
  • Experience with high-performance networking systems
  • Experience developing large-scale software systems
  • Experience with cloud and ML infrastructure, including resource management, allocation, job scheduling, and monitoring
  • Experience with Docker and Kubernetes
  • Proficiency in Python, Go, C++, Rust, or Java

Nice to have

  • Ph.D. degree and a strong publication record
  • Experience with Kubernetes, Ray, Yarn, or Mesos
  • Experience with large-scale resource efficiency management, job scheduling development, application scaling, workload co-location, or isolation enhancement
  • Experience with cloud platforms (AWS, Azure, GCP) and managed ML services (AWS SageMaker, Azure ML, GCP Vertex AI)
  • Strong communication skills; works well within a team and across engineering teams
  • Passion for system efficiency, quality, performance, and scalability

What the JD emphasized

  • AI and LLM workloads
  • AI innovation
  • optimize our infrastructure for AI & LLM models
  • ML and LLM training/inference