Software Engineer, Infrastructure - Analytics Platform

OpenAI · AI Frontier · San Francisco, CA · Scaling

Software Engineer, Infrastructure - Analytics Platform at OpenAI. The role focuses on designing, building, and operating critical backend and systems infrastructure in Rust or C++ to support research workflows. It emphasizes low-level performance, distributed systems, and hands-on operation of services at scale, including debugging production bottlenecks and improving reliability on Kubernetes.

What you'd actually do

  1. Own critical infrastructure across design, implementation, rollout, operation, and iteration.
  2. Build and operate performant backend systems in Rust or C++ that support core research workflows.
  3. Design and improve distributed data and serving systems, including tradeoffs around partitioning, replication, consistency, retries, backpressure, and failure isolation.
  4. Debug real production bottlenecks across latency, throughput, contention, hot spots, and overload behavior.
  5. Operate business-critical services through on-call, incidents, postmortems, observability, rollout safety, and zero-downtime migrations.

Skills

Required

  • Rust or C++
  • backend/systems engineering
  • low-level performance
  • distributed systems
  • production operation
  • debugging
  • Kubernetes

Nice to have

  • ClickHouse-like systems
  • analytics infrastructure
  • telemetry infrastructure
  • logging infrastructure
  • search infrastructure
  • ingestion infrastructure
  • storage infrastructure
  • query execution infrastructure

What the JD emphasized

  • Strong systems experience in Rust or C++.
  • Strong hands-on experience building performance-sensitive backend systems in Rust or C++.
  • Comfort working below typical service abstractions, including concurrency, async execution, memory behavior, serialization, I/O, networking, profiling, and failure analysis.
  • Experience designing, building, or operating distributed systems or distributed databases at meaningful scale.
  • Hands-on experience operating production-critical systems, including incidents, observability, rollout safety, and recurrence prevention.