ML Infra Engineer (supercomputing)

Physical Intelligence · AI Frontier · San Francisco, CA · Machine Learning

The ML Infra Engineer is responsible for designing and building a scheduling and compute layer for large-scale AI model training across heterogeneous GPU/TPU clusters. The role focuses on intelligent resource allocation, utilization, fault tolerance, and making distributed training seamless, and it extends to inference and robot deployment.

What you'd actually do

  1. Own Intelligent Job Scheduling and Placement
  2. Scale Multi-cluster Orchestration
  3. Optimize Accelerator Utilization and Efficiency
  4. Ensure Scaling and Stability
  5. Support Inference and Robot Deployment

Skills

Required

  • Strong software engineering fundamentals
  • Experience building or operating job scheduling / resource management systems at scale
  • Experience with large-scale compute clusters (GPU and/or TPU)
  • Comfort reasoning about resource allocation, bin-packing, priority scheduling, and multi-tenancy
  • Understanding of how ML training workloads behave
  • A bias toward owning systems end-to-end, from design to operation

Nice to have

  • Familiarity with schedulers and orchestration systems (SLURM, Kubernetes, GKE, K3S, or internal equivalents)
  • Enjoyment of working closely with researchers and unblocking fast-moving projects
  • Experience building multi-cluster or federated scheduling systems
  • Experience with TPU infrastructure (GCP TPU slices, Multislice, GKE)
  • Background in cluster resource managers (Borg, YARN, Mesos, or custom schedulers)
  • Linux systems engineering, networking, and infrastructure-as-code
  • NCCL/collective communication and topology-aware placement
  • Experience with capacity planning and cloud cost optimization at scale
  • Familiarity with JAX, PyTorch, or similar ML frameworks at the runtime/systems level

What the JD emphasized

  • systems role
  • Experience building or operating job scheduling / resource management systems at scale
  • Experience with large-scale compute clusters (GPU and/or TPU)

Other signals

  • large-scale distributed training
  • GPU/TPU clusters
  • scheduling and compute layer
  • intelligent resource allocation