Research Infrastructure Engineer, Training Systems

OpenAI · AI Frontier · San Francisco, CA · Research

This role is for a Research Infrastructure Engineer focused on ML training systems at OpenAI. The engineer will build and maintain the infrastructure that makes novel research ideas practical in large-scale model training, while improving reliability, debuggability, and performance. The work involves debugging across the stack (Python, PyTorch, distributed systems, GPUs, networking, storage) and designing APIs for complex training workflows.

What you'd actually do

  1. Build and maintain infrastructure for large-scale model training and experimentation.
  2. Design APIs and interfaces that make complex training workflows easier to express and harder to misuse.
  3. Improve reliability, debuggability, and performance across training and data pipelines.
  4. Debug issues spanning Python, PyTorch, distributed systems, GPUs, networking, and storage.
  5. Write tests, benchmarks, and diagnostics that catch meaningful regressions.

Skills

Required

  • ML training infrastructure
  • large-scale model training
  • distributed systems
  • Python
  • PyTorch
  • GPU systems
  • networking
  • storage
  • API design
  • systems debugging
  • testing
  • benchmarking

Nice to have

  • novel training approaches
  • performance optimization
  • reliability engineering
  • clean abstractions

What the JD emphasized

  • build systems that enable new model training approaches, not just optimize established ones
  • strong systems instincts and deep care for performance, reliability, and clean abstractions
  • good taste in API and interface design, with empathy for the researchers and engineers using your tools
  • comfortable working across ML research code and production-quality infrastructure
  • debugging from evidence: profiles, traces, logs, tests, and minimal reproductions

Other signals

  • systems engineering role focused on ML training infrastructure
  • build the infrastructure needed to make new training approaches practical at scale
  • systems work is directly tied to research progress