Tech Lead, AML Orchestration

ByteDance · Big Tech · San Jose, CA · R&D

Tech Lead for an Applied Machine Learning (AML) team focused on building and advancing distributed orchestration platforms for recommendation systems, ads ranking, and search ranking. The role involves leading a team of ML Engineers, setting technical strategy for resource efficiency, distributed training, and online inference systems, and optimizing large-scale distributed orchestration and scheduling strategies.

What you'd actually do

  1. Lead, mentor, and grow a team of orchestration-focused ML engineers; set technical vision and ensure engineering excellence.
  2. Design and optimize distributed orchestration and scheduling strategies across large-scale Kubernetes/Godel environments, ensuring efficiency, reliability, and scalability.
  3. Drive initiatives for autoscaling, resource multiplexing, and preemption across heterogeneous workloads and clusters, including multi-datacenter and multi-cloud setups.
  4. Partner with framework, platform, and research teams to build next-generation distributed training and serving systems for ultra-large, high-dimensional recommendation models.
  5. Architect robust and elastic online orchestration frameworks for large-scale inference, supporting evolving recommendation and ads models.

Skills

Required

  • large-scale distributed systems
  • technical leadership
  • orchestration frameworks
  • Kubernetes
  • system performance optimization
  • resource utilization optimization
  • scheduling strategies
  • Golang
  • Python
  • C++

Nice to have

  • Ray
  • TFX
  • VeRL
  • vLLM
  • Spark
  • Flink
  • ML pipelines
  • open-source contributions
  • multi-tenant environments
  • cloud-native architectures
  • global, cross-functional team collaboration

What the JD emphasized

  • large-scale distributed systems
  • orchestration frameworks
  • online inference systems
  • distributed training
  • serving systems

Other signals

  • large-scale distributed systems
  • orchestration platforms
  • recommendation systems
  • online inference