Software Engineer, Fleet Infrastructure

OpenAI · AI Frontier · San Francisco, CA · Scaling

Software engineering role focused on building and operating infrastructure systems for a large GPU fleet that supports AI model training and deployment. Responsibilities include scheduling, cluster management, and deployment systems; an understanding of AI/ML workloads is a plus.

What you'd actually do

  1. Design, implement, and operate components of our compute fleet, including job scheduling, cluster management, snapshot delivery, and CI/CD systems.
  2. Interface with researchers and product teams to understand workload requirements.
  3. Collaborate with hardware, infrastructure, and business teams to provide a high-utilization, high-reliability service.

Skills

Required

  • hyperscale compute systems
  • strong programming skills
  • public clouds (especially Azure)
  • Kubernetes
  • execution-focused mentality
  • rigorous focus on user requirements

Nice to have

  • understanding of AI/ML workloads

What the JD emphasized

  • world’s largest, most reliable, and frictionless GPU fleet
  • design, write, deploy, and operate infrastructure systems for model deployment and training
  • scale is immense, the timelines are tight, and the organization is moving fast