Software Engineer, ChatGPT Infrastructure

OpenAI · AI Frontier · San Francisco, CA · Applied AI

Software Engineer role focused on building and operating infrastructure platforms for ChatGPT, enabling fast iteration, performance, and reliability. The role involves designing shared systems, data paths, rollout mechanisms, and reliability guardrails, with an emphasis on building platforms rather than providing support. Key areas include platform foundations, scalability, reliability guardrails, developer productivity, observability, safe change management, and interface design.

What you'd actually do

  1. Build and evolve infrastructure platforms used by many engineers and services.
  2. Translate real-world constraints into clean abstractions: simple APIs, enforceable contracts, safe defaults.
  3. Drive improvements in reliability and performance through principled design, measurement, and iterative hardening.
  4. Partner with engineering and product teams to identify systemic pain points and develop reusable solutions.
  5. Own outcomes end-to-end: design → implementation → rollout → operational maturity.

Skills

Required

  • Experience building and operating large-scale distributed systems in production (high throughput, concurrency, and failure handling).
  • Strong fundamentals in systems design, including caching, consistency, queueing/backpressure, and resilient dependency management.
  • Ability to reason about performance (latency distributions, tail behavior, bottlenecks) and translate analysis into concrete engineering work.
  • Track record of building platforms or shared infrastructure that improves velocity and correctness for other teams.
  • Excellent communication and collaboration skills—aligning on interfaces, navigating tradeoffs, and driving cross-team execution.

Nice to have

  • Experience designing paved roads / golden paths (frameworks, libraries, self-serve tooling) that shape engineering behavior at scale.
  • Deep understanding of reliability techniques: graceful degradation, circuit breakers, load shedding, rate limiting, and fault isolation.
  • Experience building systems for safe iteration: progressive delivery, correctness checks, automated rollout gates, and production validation.
  • Strong instincts for API and contract design—how to create interfaces that are stable, evolvable, and hard to misuse.
  • Prior work that demonstrates “force multiplier” impact: enabling many teams via a small set of well-crafted primitives.

What the JD emphasized

  • large-scale distributed systems
  • production
  • high throughput
  • concurrency
  • failure handling
  • systems design
  • performance
  • latency
  • platforms
  • shared infrastructure
  • reliability
  • observability
  • safe change management
  • API design
  • contract design