ChatGPT Performance Engineer

OpenAI · AI Frontier · San Francisco, CA · Applied AI

OpenAI is seeking an experienced Performance Engineer to improve the performance, reliability, and efficiency of its AI systems, including ChatGPT and the OpenAI API. The role requires deep technical expertise in analyzing and optimizing both the infrastructure and application layers, building observability tooling, influencing architecture, and leading investigations into performance issues. This is an individual contributor role focused on high-scale distributed systems.

What you'd actually do

  1. Analyze and optimize performance across the application, middleware, runtime, and infrastructure layers, including networking, storage, the Python runtime, GPU utilization, and beyond.
  2. Develop tooling and metrics that provide deep observability into system performance.
  3. Collaborate closely with infra, platform, training, and product teams to identify key performance goals and drive systemic improvements.
  4. Influence architecture and design decisions to prioritize latency, throughput, and efficiency at scale.
  5. Lead investigations into high-impact performance regressions or scalability issues in production.

Skills

Required

  • software engineering
  • performance engineering
  • reliability engineering
  • distributed systems
  • performance profiling
  • tracing systems
  • observability
  • benchmarking
  • OS internals
  • scheduling
  • memory management
  • IO patterns
  • Python
  • Golang
  • GPU utilization

Nice to have

  • database optimization
  • networking optimization
  • storage optimization
  • application runtime optimization
  • GC tuning
  • Python internals
  • Golang internals

What the JD emphasized

  • 7+ years of experience in software engineering with a strong track record in performance or reliability of high-scale distributed systems
  • deep technical expertise
  • performance profiling tools and tracing systems
  • high-scale distributed systems
  • performance testing strategies
  • SLAs/SLOs

Other signals

  • optimize infrastructure and application-level performance
  • push our latency, throughput, and cost-efficiency to the next level
  • deep systems understanding
  • measurable impact
  • root-cause analysis, profiling, instrumentation, and architecture-level performance improvements