Site Reliability Engineer, Infrastructure - Analytics Platform

OpenAI · AI Frontier · San Francisco, CA · Scaling

Site Reliability Engineer focused on operating and scaling data infrastructure, including ClickHouse, Kafka, and Snowflake, to support research workloads. The role involves end-to-end lifecycle management, monitoring, incident response, and partnering with software engineers to ensure reliability and performance of data-heavy, low-latency systems.

What you'd actually do

  1. Own infrastructure lifecycle management across provisioning, upgrades, scaling, and decommissioning (IaC-first).
  2. Operate and scale ClickHouse clusters, including sharding, replication, capacity planning, performance tuning, and maintenance.
  3. Operate Kafka as the ingestion backbone, improving throughput, lag, backpressure handling, and failure recovery.
  4. Improve end-to-end latency and reliability for data-heavy serving and query workloads.
  5. Build and maintain strong monitoring and alerting: SLIs/SLOs, dashboards, alert policies, and actionable runbooks.

Skills

Required

  • Experience owning production infrastructure for data-heavy, low-latency systems end to end.
  • Hands-on experience operating ClickHouse, Kafka, and adjacent large-scale data systems.
  • Experience with Snowflake workflows and cross-system data architecture.
  • Ability to independently define operational standards and drive adoption.
  • Operational experience with Kubernetes, Terraform, and cloud infrastructure.
  • Excellent communication and collaboration skills.
  • High personal rigor and organization in high-pressure production environments.
  • Deeply hands-on mindset: willing to debug incidents, tune systems, and implement fixes directly.

What the JD emphasized

  • independently define and raise operational standards across teams (runbooks, incident process, rollout safety) and make them stick