Senior AI Infrastructure Engineer - Training Platform

Scale AI · Data AI · New York, NY +1 · Research

This role focuses on building and scaling the infrastructure for large-scale AI model training, specifically the 'Operating System' for GPU clusters. It involves architecting a high-performance training platform, managing multi-tenant orchestration, optimizing job scheduling, and ensuring deep observability and reliability for massive workloads. The goal is to maximize the efficiency and velocity of AI researchers training advanced models.

What you'd actually do

  1. Architect and scale a multi-tenant orchestration layer that abstracts away the complexity of GPU clusters, ensuring high utilization and seamless job recovery.
  2. Design and implement scheduling primitives to optimize the lifecycle of training jobs.
  3. Develop deep observability and automated health-checking into the training stack to proactively identify and isolate hardware failures.
  4. Evaluate and integrate emerging technologies in the CNCF and AI ecosystem (e.g. Ray, Kueue), making data-driven build vs. buy decisions that balance velocity with long-term maintainability.
  5. Work closely with Finance and Procurement teams to drive our capacity planning process.

Skills

Required

  • backend or infrastructure engineering
  • orchestrating ML workloads at scale
  • Python
  • Go
  • Rust
  • C++
  • complex compute management systems
  • queueing
  • quotas
  • preemption
  • gang scheduling
  • distributed training infrastructure
  • EFA
  • InfiniBand
  • topology-aware scheduling
  • distributed storage systems
  • Lustre
  • S3
  • Kubernetes internals
  • Custom Resources
  • Operators
  • Admission Controllers
  • device plugins
  • specialized hardware
  • cloud infrastructure
  • AWS
  • GCP
  • infrastructure as code
  • Terraform
  • solve complex problems
  • work independently

Nice to have

  • distributed training techniques
  • DeepSpeed
  • FSDP
  • NVIDIA software and hardware stack
  • CUDA
  • NCCL
  • PyTorch
  • post-training algorithms
  • GRPO
  • Reinforcement Learning

What the JD emphasized

  • orchestrating ML workloads at scale (100+ GPU nodes)
  • Kubernetes internals
  • distributed training infrastructure

Other signals

  • large-scale GPU clusters
  • multi-thousand GPU workloads
  • training platform
  • ML infrastructure