Inference Technical Lead, On-device Transformers

OpenAI · AI Frontier · San Francisco, CA · Consumer Products

Lead the implementation of the low-level inference stack for on-device transformer models, including kernel development and runtime systems, while collaborating with researchers and hardware vendors to optimize model architectures for deployment constraints.

What you'd actually do

  1. Evaluate and select silicon platforms (GPUs, NPUs, and specialized accelerators) for on-device and edge deployment of OpenAI models.
  2. Work closely with research teams to co-design model architectures that meet real-world deployment constraints such as latency, memory, power, and bandwidth.
  3. Analyze and model system performance, identifying tradeoffs between model design, memory hierarchy, compute throughput, and hardware capabilities.
  4. Partner with hardware vendors and internal infrastructure teams to bring up new accelerators and ensure efficient execution of transformer workloads.
  5. Build and lead a team of engineers responsible for implementing the low-level inference stack, including kernel development and runtime systems.
  6. Push through whatever obstacles stand in the way to turn nascent research capabilities into foundations we can build on.

Skills

Required

  • Experience evaluating or deploying workloads on GPUs, NPUs, or other specialized accelerators.
  • Understanding of the performance characteristics of transformer models, including attention, KV-cache behavior, and memory bandwidth requirements.
  • Experience designing or optimizing high-performance compute systems, such as inference engines, distributed runtimes, or hardware-aware ML pipelines.
  • Experience building or leading teams working on low-level performance-critical software such as CUDA kernels, compilers, or ML runtimes.
  • Experience teaching models to speak and perceive.

Nice to have

  • Experience co-designing model architectures with research teams.
  • Experience analyzing and modeling system performance.
  • Experience partnering with hardware vendors and internal infrastructure teams.
  • Experience leading a team of engineers.

What the JD emphasized

  • low-level inference stack
  • kernel development
  • runtime systems
  • silicon platforms
  • transformer models
  • inference engines
  • distributed runtimes
  • hardware-aware ML pipelines
  • CUDA kernels
  • compilers
  • ML runtimes
  • teaching models to speak and perceive

Other signals

  • on-device transformers
  • inference stack
  • kernel development
  • runtime systems
  • silicon platforms