Software Engineer, Platform Systems

OpenAI OpenAI · AI Frontier · London, United Kingdom · Scaling

Software Engineer on the Platform Systems team at OpenAI, responsible for designing and building distributed systems for failure detection, tracing, and observability to support large-scale AI model training. The role focuses on improving the reliability, performance, and visibility of OpenAI's training infrastructure.

What you'd actually do

  1. Design and build distributed failure detection, tracing, and profiling systems for large-scale AI training jobs
  2. Develop tooling to identify slow, faulty, or misbehaving nodes and provide actionable visibility into system behavior
  3. Improve observability, reliability, and performance across OpenAI’s training platform
  4. Debug and resolve issues in complex, high-throughput distributed systems
  5. Collaborate with systems, infrastructure, and research teams to evolve platform capabilities

Skills

Required

  • distributed systems
  • failure detection
  • tracing
  • observability
  • performance analysis
  • large-scale debugging
  • low-level software
  • high-performance computing
  • systems engineering

Nice to have

  • hardware
  • operating systems
  • networking
  • concurrency

What the JD emphasized

  • critical to operating OpenAI’s training stack
  • actively evolving to support new use cases and increasingly complex workloads
  • critical infrastructure that powers frontier AI research

Other signals

  • train OpenAI’s flagship models
  • large-scale distributed systems
  • frontier scale
  • failure detection, tracing, and observability systems