Software Engineer, Agent Infrastructure

OpenAI OpenAI · AI Frontier · San Francisco, CA · Scaling

Software Engineer role focused on building and scaling the infrastructure for training and deploying AI agents, including a novel container orchestration platform and production deployment systems. Collaborates with researchers and product teams to enable complex agentic models and launch agentic products.

What you'd actually do

  1. Push massive compute clusters to their limits. You will be a core contributor to a novel container orchestration platform built in-house by our team to scale far beyond what’s possible with systems like Kubernetes.
  2. Develop and maintain FastAPI and gRPC APIs that serve as the interface for our agentic infrastructure used both in training and production.
  3. Use Terraform to stand up and evolve complex infrastructure for both research and production.
  4. Collaborate with research teams to stand up and optimize systems for novel AI training runs and experimental applications.

Skills

Required

  • deep experience building AI infrastructure
  • working closely with researchers to build high-performance systems at massive scale for novel use cases
  • large-scale machine learning infrastructure
  • reason about training at scale
  • identifying bottlenecks and engineering solutions to optimize system performance
  • build new things from 0-1 quickly, and then scale them 1,000,000x
  • keen eye for performance and optimization
  • squeeze the most performance out of complex, globally-distributed systems
  • cloud platforms
  • infrastructure-as-code tech like Terraform
  • solving complex, ambiguous problems at the intersection of infrastructure scalability, virtualization efficiency, and agentic capabilities
  • deep technical expertise in virtualization and containerization technologies (e.g. Kata, Firecracker, gVisor, Sysbox)
  • optimizing runtime performance
  • FastAPI
  • gRPC APIs
  • Terraform

Nice to have

  • Kubernetes

What the JD emphasized

  • novel container orchestration platform
  • scale far beyond what’s possible with systems like Kubernetes
  • agentic infrastructure
  • training and production
  • complex infrastructure for both research and production
  • novel AI training runs
  • large-scale machine learning infrastructure
  • training at scale
  • build new things from 0-1 quickly, and then scale them 1,000,000x
  • complex, globally-distributed systems
  • cloud platforms
  • infrastructure-as-code
  • complex, ambiguous problems
  • virtualization efficiency
  • agentic capabilities
  • virtualization and containerization technologies
  • optimizing runtime performance

Other signals

  • building systems that enable training and deployment of highly useful AI agents
  • building and maintaining OpenAI’s core platform for the deployment and execution of agents in production
  • scaling these new capabilities to some of the largest compute clusters in the world
  • building the platform and integrations to launch new agents to hundreds of millions of users worldwide