ML Research Engineer - Hardware Codesign

OpenAI · AI Frontier · San Francisco, CA · Scaling

Research Engineer focused on hardware-silicon co-design for AI workloads, optimizing numerics, architecture, and technology bets. The work involves debugging performance gaps, writing quantization kernels, evaluating numerics via model evals, and prototyping RTL for novel numeric modules. The role aims to bridge ML research and hardware development for OpenAI's supercomputing infrastructure.

What you'd actually do

  1. Build on our roofline simulator to track evolving workloads, and deliver analyses that quantify the impact of system architecture decisions and support technology pathfinding.
  2. Debug gaps between performance simulation and real measurements; clearly communicate root cause, bottlenecks, and invalid assumptions.
  3. Write emulation kernels for low-precision numerics and lossy compression schemes, and get Research the information they need to trade off efficiency against model quality.
  4. Prototype numerics modules by pushing RTL through synthesis; hand off novel numerics cleanly, or occasionally own an RTL module end-to-end.
  5. Proactively pull in new ML workloads, prototype them with rooflines and/or functional simulation, and drive initial evaluation of new opportunities or risks.
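
For readers unfamiliar with the roofline analysis mentioned in items 1 and 5, here is a minimal sketch of the idea for a single matrix multiply. It is an illustration only, not the simulator the listing refers to: the `Chip` class, the `matmul_roofline` function, and the peak numbers are all hypothetical, and the model assumes each operand is read or written exactly once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    """Hypothetical accelerator described by two peak rates."""
    peak_flops: float  # peak compute throughput, FLOP/s
    mem_bw: float      # peak memory bandwidth, bytes/s


def matmul_roofline(m: int, n: int, k: int, bytes_per_elem: float, chip: Chip) -> dict:
    """Roofline estimate for C[m,n] = A[m,k] @ B[k,n].

    Assumes A, B, and C each cross the memory interface exactly once
    (an idealization; real kernels may re-read tiles).
    """
    flops = 2.0 * m * n * k  # one multiply and one add per MAC
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    compute_time = flops / chip.peak_flops
    memory_time = bytes_moved / chip.mem_bw
    return {
        "flops": flops,
        "bytes": bytes_moved,
        "intensity": flops / bytes_moved,  # FLOP per byte
        "time_s": max(compute_time, memory_time),  # the slower limit wins
        "bound": "compute" if compute_time >= memory_time else "memory",
    }


if __name__ == "__main__":
    # Made-up peak rates, not any real device.
    chip = Chip(peak_flops=1e15, mem_bw=1e12)
    # Batch-1 decode-style matmul versus a large training-style matmul.
    for shape in [(1, 4096, 4096), (8192, 8192, 8192)]:
        for nbytes in (2, 1):  # e.g. 16-bit versus 8-bit elements
            r = matmul_roofline(*shape, bytes_per_elem=nbytes, chip=chip)
            print(shape, nbytes, r["bound"], f"{r['time_s']:.3e}")
```

Under these assumptions, halving the element width halves the estimated time of a memory-bound matmul but leaves a compute-bound one unchanged unless the peak FLOP/s figure also changes, which is the kind of numerics-versus-architecture tradeoff the role description points at.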

Skills

Required

  • Python
  • C++ or Rust
  • Triton, CUDA, or similar
  • PyTorch or JAX
  • Floating point numerics
  • Transformer models
  • ML workloads
  • Quantization
  • Performance simulation
  • RTL development

Nice to have

  • Experience in large ML codebases
  • Experience writing RTL (especially for floating point logic)
  • Understanding of PPA tradeoffs

What the JD emphasized

  • Exceptional track record of high-quality technical output
  • Bias for shipping a prototype now and iterating later in the absence of clear requirements
  • Strong Python, and C++ or Rust, with a cautious attitude toward correctness and an intuition for clean extensibility
  • Experience writing Triton, CUDA, or similar, and an understanding of the resulting mapping of tensor ops to functional units
  • Practical understanding of floating point numerics, the ML tradeoffs of reduced precision, and the current state of the art in model quantization
  • Deep understanding of transformer models, and strong intuition for transformer rooflines and the tradeoffs of sharded training and inference in large-scale ML systems
  • Experience writing RTL (especially for floating point logic) and understanding of PPA tradeoffs is a plus
  • Strong cross-functional communication (e.g. across ML researchers and hardware engineers); ability to slice ambiguous early-incubation ideas into concrete arenas in which progress can be made

Other signals

  • hardware codesign
  • ML workloads
  • quantization kernels
  • model evals
  • RTL