Software Engineer, Workload Enablement

OpenAI · AI Frontier · San Francisco, CA · Scaling

This Software Engineer role enables production workloads and end-to-end testing on new platforms for AI infrastructure. It involves creating test harnesses, porting inference and training workloads, analyzing performance, and characterizing system behavior.

What you'd actually do

  1. Port and validate key inference and training workloads on new platforms/SKUs as they arrive; drive correctness, performance, and stability to an internal readiness bar.
  2. Build a suite of benchmarks and stress tests that capture the real end-to-end (E2E) behavior of our workloads by exercising every part of a system: CPU, GPU, memory subsystem, frontend, scale-up, and scale-out networking (including WAN traffic, NVLink, and RDMA collectives), storage, thermals, and any other relevant components.
  3. Deep-dive into the performance of distributed training/inference.
  4. Create repeatable test harnesses that run in CI / lab environments and produce actionable outputs (pass/fail, performance score, regression detection).
  5. Partner with systems and fleet bring-up engineers to ensure the platform is not only stable and performant, but also operationally usable and scalable (containerization, K8s integration, telemetry hooks, failure triage loops).
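
To make the "actionable outputs" in item 4 concrete, here is a minimal sketch of how one benchmark run might be reduced to a pass/fail flag, a performance score, and a regression flag. Everything in it is an illustrative assumption and not taken from the posting: the names `BenchmarkResult` and `evaluate`, the use of the median as the score, the "higher is better" convention, and the 5% tolerance.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class BenchmarkResult:
    name: str
    passed: bool     # overall verdict: correct output and no regression
    score: float     # performance score; higher is better (assumed convention)
    regressed: bool  # True if score fell below baseline by more than tolerance


def evaluate(name, samples, baseline, correct, tolerance=0.05):
    """Summarize one benchmark run against a stored baseline score.

    samples   -- repeated measurements from the run (e.g. GB/s or tokens/s)
    baseline  -- previously accepted score for this benchmark
    correct   -- result of the workload's own correctness check
    tolerance -- allowed fractional drop before flagging a regression (hypothetical default)
    """
    score = median(samples)  # median damps one-off outliers between repeats
    regressed = score < baseline * (1.0 - tolerance)
    return BenchmarkResult(name, correct and not regressed, score, regressed)


# Hypothetical benchmark name and numbers, for illustration only.
result = evaluate("allreduce_bw", samples=[92.0, 95.0, 94.0], baseline=100.0, correct=True)
print(result)
```

In a CI or lab setting, a record like this could be serialized per benchmark and compared across platforms or builds; the threshold and the scoring statistic would need to be tuned to the noise of each workload.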

Skills

Required

  • PyTorch
  • LLM training/inference stacks
  • Large-scale distributed training concepts
  • RDMA
  • debugging/optimizing comms libraries
  • Python
  • profiling/debugging skills

Nice to have

  • workload-shaped benchmarks
  • stress/fault tests
  • RDMA networking and transport tuning
  • Kubernetes
  • early hardware

What the JD emphasized

  • porting existing inference and training workloads
  • performance
  • distributed training/inference
  • performance-critical code
  • ML systems
  • distributed systems
  • HPC

Other signals

  • porting inference and training workloads
  • performance optimization
  • distributed systems
  • ML systems