Training, Process Management Engineer

OpenAI · AI Frontier · London, United Kingdom · Scaling

OpenAI's Training Runtime team is seeking a Process Management Engineer to build and maintain the distributed OS responsible for launching, coordinating, and supervising large-scale machine learning training workloads. This role involves working with Rust and Python to create high-performance, reliable, and observable systems that enable researchers to scale ideas from experiments to production training runs on massive clusters.

What you'd actually do

  1. Design, build, and maintain software to orchestrate and monitor machine learning workloads on our largest supercomputers
  2. Profile and optimize our software stack to support computation orchestration at frontier scale
  3. Improve reliability, observability, and fault tolerance for long-running jobs
  4. Debug complex distributed systems issues across large clusters
  5. Adapt to the changing shapes and needs of ML systems so that our researchers can do their work

Skills

Required

  • Rust
  • Python
  • distributed systems
  • high-performance computing
  • systems-level debugging
  • performance analysis
  • memory profiling
  • asynchronous and concurrent systems
  • Linux

Nice to have

  • C++

What the JD emphasized

  • frontier-scale
  • novel challenges
  • highly ambiguous
  • strong design judgment
  • proficient execution
  • end-to-end platform
  • high-performance architectures
  • rapid pace
  • dynamic and evolving needs

Other signals

  • distributed systems
  • high-performance
  • scalability
  • reliability
  • observability
  • Rust
  • Python