Software Engineer, Platform Systems

OpenAI OpenAI · AI Frontier · San Francisco, CA · Scaling

Software Engineer on the Platform Systems team at OpenAI, responsible for designing and building distributed systems for failure detection, tracing, and observability in large-scale AI training jobs. This role focuses on operating and improving the reliability and performance of OpenAI's training infrastructure.

What you'd actually do

  1. Design and build distributed failure detection, tracing, and profiling systems for large-scale AI training jobs
  2. Develop tooling to identify slow, faulty, or misbehaving nodes and provide actionable visibility into system behavior
  3. Improve observability, reliability, and performance across OpenAI’s training platform
  4. Debug and resolve issues in complex, high-throughput distributed systems
  5. Collaborate with systems, infrastructure, and research teams to evolve platform capabilities

Skills

Required

  • distributed systems
  • failure detection
  • tracing
  • observability
  • performance analysis
  • debugging
  • low-level software
  • high-performance computing
  • systems engineering

Nice to have

  • operating workflows
  • hardware
  • operating systems
  • networking
  • concurrency

What the JD emphasized

  • core model training software
  • large-scale distributed systems
  • frontier scale training
  • failure detection
  • tracing
  • observability
  • performance bottlenecks
  • massive distributed training jobs
  • critical to operating OpenAI’s training stack
  • systems engineering
  • performance analysis
  • large-scale debugging
  • low-level software
  • high-performance computing
  • low-level systems engineering
  • critical infrastructure that powers frontier AI research

Other signals

  • core AI infrastructure
  • large-scale distributed systems
  • frontier scale training