Machine Learning Engineer - Orchestration

ByteDance · Big Tech · San Jose, CA · R&D

Machine Learning Engineer focused on optimizing resource efficiency in distributed orchestration and scheduling for training and inference systems, particularly for large-scale recommendation models. The role involves building and optimizing training system architectures and online inference architectures, integrating with MLOps processes, and working within Kubernetes/Godel ecosystems.

What you'd actually do

  1. Optimize resource efficiency in distributed orchestration and scheduling through engineering means, increasing the scale of business and models supported per unit of computing power
  2. Build a training system architecture for next-generation ultra-large and ultra-deep recommendation models
  3. Construct an online orchestration architecture for the next-generation recommender system

Skills

Required

  • Go
  • Python
  • Linux
  • Kubernetes
  • distributed scheduling frameworks
  • distributed systems principles
  • Machine Learning systems development
  • logical analysis
  • abstraction
  • business logic decomposition

Nice to have

  • PyTorch
  • TensorFlow
  • AI Infrastructure
  • High Performance Computing
  • ML Hardware Architecture
  • veRL
  • vLLM
  • Ray
  • TFX

What the JD emphasized

  • at least 5 years of experience
  • large-scale distributed systems
  • Machine Learning systems

Other signals

  • distributed orchestration
  • training systems
  • inference architecture
  • large-scale models
  • resource efficiency