ML Infra Engineer (TPU/JAX/Optimization)

Physical Intelligence · AI Frontier · San Francisco, CA · Machine Learning

This ML Infra Engineer role focuses on scaling and optimizing training systems and core model code, managing GPU/TPU compute and job orchestration, and building efficient JAX training pipelines. The engineer collaborates with researchers to turn ideas into production training runs.

What you'd actually do

  1. Own training/inference infrastructure: Design, implement, and maintain systems for large-scale model training, including scheduling, job management, checkpointing, and metrics/logging.
  2. Scale distributed training: Work with researchers to scale JAX-based training across TPU and GPU clusters with minimal friction.
  3. Optimize performance: Profile and improve memory usage, device utilization, throughput, and distributed synchronization.
  4. Enable rapid iteration: Build abstractions for launching, monitoring, debugging, and reproducing experiments.
  5. Manage compute resources: Ensure efficient allocation and utilization of cloud-based GPU/TPU compute while controlling cost.

Skills

Required

  • Software engineering fundamentals
  • ML training infrastructure or internal platforms
  • Large-scale training experience
  • Distributed training
  • Multi-host setups
  • Data loaders
  • Evaluation pipelines
  • Cloud platforms and cluster schedulers (SLURM, Kubernetes, GCP TPU/GKE, AWS)
  • Debugging and performance optimization
  • Cross-functional communication
  • Ownership mindset

Nice to have

  • Deep ML systems background
  • Training compilers
  • Runtime optimization
  • Custom kernels
  • Operating close to hardware (GPU/TPU performance tuning)
  • Robotics
  • Multimodal models
  • Large-scale foundation models
  • Designing abstractions for researcher flexibility and system reliability

What the JD emphasized

  • large-scale training
  • JAX
  • TPU
  • GPU
  • training infrastructure
  • distributed training
  • performance optimization

Other signals

  • ML Infra
  • Large-scale training
  • JAX
  • TPU/GPU optimization