Software Engineer, Compute Infrastructure

OpenAI · AI Frontier · San Francisco, CA · Scaling

This role focuses on building and optimizing the compute infrastructure platform that supports OpenAI's AI research and products. It involves designing, provisioning, scheduling, operating, and optimizing systems that connect hardware and software components for large-scale AI workloads. The role spans the entire stack, from low-level systems and high-performance computing to distributed infrastructure, reliability, and developer experience, and the posting specifically calls out agent infrastructure.

What you'd actually do

  1. Build and deeply optimize reliable system software for large-scale compute systems that run some of the world's most demanding AI workloads
  2. Design and operate infrastructure across accelerators, CPUs, NICs, switches, networking protocols, storage, data centers, cluster orchestration, scheduling, and fleet health
  3. Profile, benchmark, and optimize training workloads across compute, memory, storage, networking, NCCL and collective communication, and cluster scheduling bottlenecks
  4. Create hardware-aware automation that makes provisioning, firmware and driver upgrades, incident response, and day-to-day operations faster and less error-prone
  5. Build CaaS, agent infrastructure, profiling, observability, benchmarking, and platform tools that help researchers, product engineers, and operators launch, debug, and optimize workloads with less friction

Skills

Required

  • distributed systems
  • infrastructure platforms
  • high-performance computing
  • large-scale networking systems
  • Kubernetes
  • developer tools
  • production systems
  • reliability

Nice to have

  • software
  • hardware
  • networking
  • systems performance
  • reliability
  • user needs

What the JD emphasized

  • agent infrastructure
  • agentic workloads

Other signals

  • building the compute platform behind OpenAI's research and products
  • large-scale compute systems
  • AI workloads
  • agent infrastructure