ML Infra Engineer

Physical Intelligence · AI Frontier · San Francisco, CA · Machine Learning

The ML Infra Engineer scales and optimizes training systems and core model code, managing GPU/TPU compute, job orchestration, and JAX training pipelines, and collaborates with researchers to translate ideas into production training runs.

What you'd actually do

  1. Own training/inference infrastructure: Design, implement, and maintain systems for large-scale model training, including scheduling, job management, checkpointing, and metrics/logging.
  2. Scale distributed training: Work with researchers to scale JAX-based training across TPU and GPU clusters with minimal friction.
  3. Optimize performance: Profile and improve memory usage, device utilization, throughput, and distributed synchronization.
  4. Enable rapid iteration: Build abstractions for launching, monitoring, debugging, and reproducing experiments.
  5. Manage compute resources: Ensure efficient allocation and utilization of cloud-based GPU/TPU compute while controlling cost.

Skills

Required

  • Software engineering fundamentals
  • ML training infrastructure and internal platforms
  • Large-scale training
  • JAX
  • PyTorch
  • Distributed training and multi-host setups
  • Data loaders and evaluation pipelines
  • Schedulers and cloud platforms: SLURM, Kubernetes, GCP TPU/GKE, AWS
  • Debugging and optimizing performance bottlenecks
  • Cross-functional communication
  • Ownership mindset

Nice to have

  • Deep ML systems background: training compilers, runtime optimization, custom kernels
  • Operating close to hardware; GPU/TPU performance tuning
  • Robotics, multimodal models, or large-scale foundation models
  • Designing abstractions

What the JD emphasized

  • Large-scale training
  • JAX
  • Training infrastructure

Other signals

  • Distributed training
  • GPU/TPU compute