Software Engineer, Fleet Management

OpenAI · AI Frontier · San Francisco, CA · Scaling

Software Engineer focused on building and managing large-scale cloud and bare-metal compute fleets, including hardware, configurations, and vendor interactions. The role involves developing tools for cluster management, integrating hardware metrics with scheduling, automating infrastructure processes, and leveraging LLMs for vendor coordination and workflow optimization. This position supports OpenAI's AI research and product development by ensuring the reliability and efficiency of the underlying computing environment.

What you'd actually do

  1. Design and build systems to manage both cloud and bare-metal fleets at scale.
  2. Develop tools that integrate low-level hardware metrics with high-level job scheduling and cluster management algorithms.
  3. Leverage LLMs to coordinate vendor operations and optimize infrastructure workflows.
  4. Automate infrastructure processes, reducing repetitive toil and improving system reliability.
  5. Collaborate with hardware, infrastructure, and research teams to ensure seamless integration across the stack.

Skills

Required

  • software engineering skills
  • large-scale infrastructure environments
  • cluster-level systems (e.g., Kubernetes, CI/CD pipelines, Terraform, cloud providers)
  • server-level systems (e.g., systemd, containerization, Chef, Linux kernels, firmware management, host routing)

Nice to have

  • optimizing the performance and reliability of large compute fleets
  • dynamic environments
  • complex infrastructure challenges
  • automation
  • efficiency
  • continuous improvement

What the JD emphasized

  • large-scale infrastructure environments
  • cluster-level systems
  • server-level systems