Senior ML Engineer

Apple · Big Tech · Seattle, WA +2 · Software and Services

As a Senior ML Engineer, you will design, develop, and deploy high-performance systems for batch processing and ML workloads across Apple's global infrastructure. The role focuses on Kubernetes-native scheduling, resource management, and multi-tenant environments for ML training and inference.

What you'd actually do

  1. Design, build, and deploy highly reliable, large-scale distributed systems for batch processing and ML infrastructure across public clouds and Apple data centers using Go, Java, or Python
  2. Architect and operate Kubernetes-native scheduling systems such as Kueue and YuniKorn, building custom operators and CRDs to manage complex ML and data workloads
  3. Implement advanced scheduling strategies including gang scheduling, topology-aware routing, bin-packing, and fair-share queuing to maximize GPU efficiency and hardware utilization
  4. Build and manage secure, multi-tenant Kubernetes environments with strict resource isolation, quota governance, and priority-based preemption
  5. Drive end-to-end observability, monitoring, and incident response practices to ensure high availability and fault tolerance of production systems

Skills

Required

  • 5+ years of experience designing, developing, and operating highly available, large-scale distributed systems and data or ML infrastructure
  • Strong software engineering skills with deep programming expertise in Go, Java, or Python
  • Advanced knowledge of Kubernetes internals including custom controllers, scheduler architecture, resource quotas, and workload lifecycle management
  • Hands-on experience with Kubernetes-native batch scheduling frameworks such as Kueue or YuniKorn and advanced scheduling concepts like gang scheduling, bin-packing, and priority preemption
  • Experience with cloud-native infrastructure across multi-cloud and on-premises environments, including AWS and GCP
  • Strong commitment to operational excellence, system observability, and continuous improvement for mission-critical services

Nice to have

  • Experience with GPU scheduling, accelerator-aware placement, and optimization for large-scale AI/ML workloads
  • Experience with distributed data and ML frameworks such as Apache Spark, Ray, PyTorch, JAX, or Flink at scale
  • Experience contributing to open-source projects in Kubernetes scheduling, container technologies, or ML infrastructure ecosystems such as Apache YuniKorn, Kueue, or similar systems
  • Experience using GenAI technologies to improve developer productivity, streamline engineering processes, and accelerate team execution

What the JD emphasized

  • highly reliable, large-scale distributed systems
  • Kubernetes scheduling
  • ML infrastructure
  • Kubernetes-native batch scheduling frameworks
  • advanced scheduling concepts

Other signals

  • ML infrastructure platform
  • Kubernetes scheduling
  • distributed systems
  • large-scale batch and ML workloads