Lead AI Infrastructure Engineer, Reinforcement Learning

AMD · Semiconductors · Santa Clara, CA · Engineering

The Lead AI Infrastructure Engineer, Reinforcement Learning builds and scales distributed RL training infrastructure: policy/value training, rollout generation, logging, checkpointing, and researcher-facing APIs. The role aims to improve throughput, fault tolerance, reproducibility, and observability for RL research programs.

What you'd actually do

  1. Design and implement distributed RL training stacks (data parallel, pipeline parallel, or hybrid) integrated with AMD’s schedulers and storage
  2. Build high-throughput rollout workers, trajectory stores, and reward computation pipelines with versioning and audit trails
  3. Instrument jobs for debugging (NaNs, stragglers, OOMs), and implement autoscaling and preemption-safe checkpointing
  4. Collaborate with research scientists on experiment templates, hyperparameter sweeps, and safe promotion paths from research to wider team use
  5. Drive reliability: on-call rotations, runbooks, and postmortems for infra incidents affecting RL training

Skills

Required

  • PyTorch (or JAX)
  • NCCL/MPI-style distributed training
  • GPU cluster orchestration
  • C++/Python performance tuning
  • I/O optimization
  • Containerized workloads

Nice to have

  • Deep systems expertise
  • Demonstrated technical impact
  • Prior ownership of RL training infra
  • LLM post-training pipelines
  • Large-scale experiment management
  • Master’s or PhD in Computer Science

What the JD emphasized

  • reliable systems
  • researcher time as expensive as GPU time
  • SLAs
  • capacity plans
  • incident patterns
  • cost–quality tradeoffs
  • debugging (NaNs, stragglers, OOMs)
  • autoscaling
  • preemption-safe checkpointing
  • safe promotion paths
  • on-call rotations
  • runbooks
  • postmortems

Other signals

  • distributed RL training stacks
  • high-throughput rollout workers
  • researcher-facing APIs across large GPU fleets
  • improving throughput, fault tolerance, reproducibility, and observability