Software Engineer, Infrastructure Reliability

OpenAI · AI Frontier · San Francisco, CA · Applied AI

Software Engineer focused on reliability and scaling of infrastructure that supports AI systems, including research and products like ChatGPT and the OpenAI API. Responsibilities include designing, building, and operating reliable systems, identifying performance bottlenecks, contributing to incident response, and improving automation. Requires experience with distributed systems, Kubernetes, cloud platforms, and observability tools.

What you'd actually do

  1. Design, build, and operate reliable and performant systems used across engineering.
  2. Identify and fix performance bottlenecks and inefficiencies, ensuring our infrastructure can scale to the next order of magnitude.
  3. Dig deep to resolve complex issues.
  4. Continuously improve automation to reduce manual work. Improve internal tooling and our developer experience.
  5. Contribute to incident response, postmortems, and the development of best practices around system reliability and scalability.

Skills

Required

  • distributed systems principles
  • building and operating scalable and reliable systems
  • performance and optimization
  • operating orchestration systems such as Kubernetes at scale
  • building abstractions over cloud platforms
  • Linux environments
  • Kubernetes
  • Terraform
  • CI/CD pipelines
  • modern observability stacks
  • collaborating with cross-functional teams
  • reliability and scalability
  • humble attitude
  • eagerness to help colleagues
  • desire to do whatever it takes to make the team succeed
  • Own problems end-to-end
  • willing to pick up whatever knowledge you're missing to get the job done
  • comfortable with ambiguity and rapid change
  • 4+ years of relevant industry experience
  • 2+ years leading large scale, complex projects or teams as an engineer or tech lead
  • passion for distributed systems at scale with a focus on reliability, scalability, security, and continuous improvement
  • experience as a reliability engineer, production engineer, or a similar role
  • Strong proficiency in cloud infrastructure (like AWS, GCP, Azure)
  • IaC tools such as Terraform
  • Proficiency in programming / scripting languages
  • Experience with containerization technologies
  • container orchestration platforms like Kubernetes
  • Experience with observability tools such as Datadog, Prometheus, Grafana, Splunk, and the ELK stack
  • Experience with microservices architecture
  • service mesh technologies
  • Knowledge of security best practices in cloud environments
  • Strong understanding of distributed systems, networking, and database technologies
  • Excellent problem-solving skills

Nice to have

  • AI systems
  • research iteration
  • ChatGPT
  • OpenAI API

What the JD emphasized

  • scale to the next order of magnitude
  • system reliability and scalability
  • deep understanding of distributed systems principles
  • proven track record in building and operating scalable and reliable systems
  • keen eye for performance and optimization
  • experience operating orchestration systems such as Kubernetes at scale
  • building abstractions over cloud platforms
  • collaborating with cross-functional teams to ensure that reliability and scalability are considered in the design and development of new features and services
  • passion for distributed systems at scale with a focus on reliability, scalability, security, and continuous improvement
  • Proven experience as a reliability engineer, production engineer, or a similar role in a fast-paced, rapidly scaling company
  • Strong proficiency in cloud infrastructure (like AWS, GCP, Azure) and IaC tools such as Terraform
  • Experience with containerization technologies and container orchestration platforms like Kubernetes
  • Experience with observability tools such as Datadog, Prometheus, Grafana, Splunk, and the ELK stack
  • Strong understanding of distributed systems, networking, and database technologies