Backend Software Engineer (evals)

OpenAI · AI Frontier · San Francisco, CA · User Operations

The Backend Software Engineer will design and build evals infrastructure for measuring the quality of OpenAI's support automation. The role involves building robust systems and backend services, integrating data, and collaborating with data science and research partners. Experience with AI agents, LLM evaluation methods, and distributed systems is required.

What you'd actually do

  1. Design eval pipelines that are reliable, reproducible, and extensible
  2. Build the infrastructure for continuous eval monitoring (regression and drift monitoring, robust golden datasets), along with feedback loops that ultimately strengthen support automation
  3. Design, build, and maintain backend services and APIs to support intelligent automation and knowledge systems
  4. Integrate and structure data across internal platforms, transforming it into formats optimized for use by downstream systems and AI workflows
  5. Collaborate closely with data, research, and engineering teams to integrate OpenAI models into high-leverage workflows

Skills

Required

  • 4+ years of backend engineering experience
  • Proficiency in backend technologies (Python, FastAPI, Postgres)
  • Experience designing and scaling distributed systems, APIs, or data processing pipelines
  • Experience building AI agents or applications, including designing evals and improving performance through prompting or scaffolding
  • Experience creating production evals and/or measuring performance of ML/LLM models at scale

Nice to have

  • Familiarity with evaluation methods for LLMs and experience with patterns like multi-agent workflows, tool use, or long context

What the JD emphasized

  • measures the quality of OpenAI’s support automation
  • design and build evals at scale
  • building robust systems and backend services
  • designing evals and improving performance
  • production evals and/or measuring performance of ML/LLM models at scale

Other signals

  • eval infrastructure
  • ML/LLM-heavy domains
  • backend services and APIs
  • AI agents or applications
  • evaluation methods for LLMs