Staff Site Reliability Engineer – Automation and Platform

Cerebras · Semiconductors · AI Cloud

Staff Site Reliability Engineer focused on building and scaling a high-performance SRE function for Cerebras' AI inference services, powered by the company's Wafer-Scale Engine (WSE). The role leads engineering efforts to implement self-service delivery pipelines, shared observability tooling, and GitOps-driven CD for model releases and cluster management. The goal is to enable core teams, product managers, and external customers to operate in a fully self-service model with strong reliability guarantees, while also mentoring early-career SREs. The role emphasizes turning complexity into reliability at scale for frontier AI inference.
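
As one concrete reading of "GitOps-driven CD for model releases", here is a minimal Python sketch of a declarative release record that a CD controller could reconcile from Git. The `ModelRelease` fields, the `validate` gate, and all values are hypothetical illustrations, not Cerebras' actual schema.

```python
# Hypothetical declarative model-release spec for a GitOps pipeline.
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelRelease:
    model: str                 # model identifier
    revision: str              # immutable artifact digest to deploy
    clusters: tuple[str, ...]  # target clusters for the rollout
    canary_percent: int        # traffic share sent to the new revision first

def validate(release: ModelRelease) -> None:
    """Reject specs a CD controller should never act on."""
    if not 0 <= release.canary_percent <= 100:
        raise ValueError("canary_percent must be within [0, 100]")
    if not release.clusters:
        raise ValueError("a release must target at least one cluster")

release = ModelRelease(
    model="example-model",
    revision="sha256:abc123",
    clusters=("dc-east-1",),
    canary_percent=5,
)
validate(release)  # e.g. as a pre-merge check before a tool like Argo CD syncs
```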

What you'd actually do

  1. Define and implement a robust strategy for delivering and running software reliably and at scale across multiple datacenters and cloud-based solutions.
  2. Architect self-service platforms and internal tooling that let product teams, external customers, and cluster operators safely trigger and observe critical workflows with minimal handoffs.
  3. Define and evolve reliability practices for inference workloads, including SLOs and SLIs for latency, throughput, and accuracy stability; error budgets; blameless postmortems; chaos testing; and capacity forecasting across multi-datacenter and on-prem environments (the error-budget math is sketched after this list).
  4. Mentor mid-level SREs, support critical incident escalations, and use production pain points to prioritize the highest-leverage automation work.
  5. Measure and drive impact through clear metrics, including toil reduction, deployment velocity, SLO compliance, MTTR, and adoption of self-service workflows.
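
The SLO, error-budget, and burn-rate vocabulary in items 3 and 5 reduces to simple arithmetic. Below is a minimal Python sketch of that math as described in the general SRE literature, nothing Cerebras-specific; the 14.4x default is the commonly cited 1h/5m paging threshold for a 30-day window, and all numbers are illustrative.

```python
# Generic SRE error-budget math (illustrative, not Cerebras-specific).

def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to miss the SLI, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target

def burn_rate(bad_fraction: float, slo_target: float) -> float:
    """Speed of budget consumption; 1.0 means spending exactly on budget."""
    return bad_fraction / error_budget(slo_target)

# A 99.9% latency SLO while 0.4% of requests exceed the latency threshold:
print(f"burn rate: {burn_rate(0.004, 0.999):.1f}x")  # -> 4.0x

def should_page(long_window_rate: float, short_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Multi-window burn-rate alert: page only when both a long and a short
    window burn fast, which filters out brief blips."""
    return long_window_rate > threshold and short_window_rate > threshold
```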

Skills

Required

  • SRE
  • infrastructure engineering
  • platform engineering
  • automation
  • reliability at scale
  • large-scale heterogeneous clusters
  • proprietary cloud control plane
  • CI/CD
  • GitOps
  • Argo CD
  • observability systems
  • Loki
  • Tempo
  • Mimir
  • Prometheus
  • lead complex projects
  • influence cross-functional stakeholders
  • communicate technical direction

Nice to have

  • Bazel
  • large-scale build systems
  • AI/ML inference systems
  • model serving runtimes
  • GPU or wafer-scale orchestration
  • latency and accuracy SLOs
  • drift monitoring
  • predictive autoscaling
  • chaos engineering
  • cost-aware capacity planning
  • compute-intensive workloads

What the JD emphasized

  • lead the engineering effort to eliminate toil at scale
  • architecting and delivering the "tomorrow" layer
  • fully self-service model with strong reliability guarantees
  • deeply understand their pain points, automate their toil
  • shift reliability from an ops-only burden to a shared engineering discipline
  • turning complexity into elegant reliability at scale
  • robust strategy for delivering and running software reliably and at scale
  • self-service platforms and internal tooling
  • safely trigger and observe critical workflows with minimal handoffs
  • reliability practices for inference workloads
  • SLOs and SLIs for latency, throughput, and accuracy stability
  • error budgets
  • blameless postmortems
  • chaos testing
  • capacity forecasting
  • multi-datacenter and on-prem environments
  • mentor mid-level SREs
  • support critical incident escalations
  • production pain points to prioritize the highest-leverage automation work
  • clear metrics
  • toil reduction
  • deployment velocity
  • SLO compliance
  • MTTR
  • adoption of self-service workflows
  • 8+ years in SRE, infrastructure engineering, or platform engineering
  • strong record of improving automation and reliability at large scale
  • deep expertise operating large-scale heterogeneous clusters with a proprietary cloud control plane
  • proven track record designing and delivering CI/CD or GitOps systems using Argo CD or similar tools, with strong safety and observability built in
  • hands-on experience with observability systems such as Loki, Tempo, Mimir, and Prometheus
  • ability to lead complex projects end to end, influence cross-functional stakeholders, and communicate technical direction clearly
  • background in AI/ML inference systems, including model serving runtimes, GPU or wafer-scale orchestration, latency and accuracy SLOs, or drift monitoring
  • prior work on predictive autoscaling, chaos engineering, or cost-aware capacity planning for compute-intensive workloads (toy forecasting sketch after this list)
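
For the predictive-autoscaling and cost-aware capacity-planning bullets, here is a toy Python sketch of the core idea: fit a linear trend to recent utilization and estimate remaining headroom. Production systems use far richer models and signals; the function name and sample numbers here are illustrative only.

```python
# Toy capacity-forecasting heuristic (illustrative only).
from statistics import mean

def hours_until_exhaustion(utilization: list[float], capacity: float = 1.0) -> float:
    """Least-squares slope over hourly samples -> hours until capacity is hit."""
    n = len(utilization)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(utilization)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, utilization))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    if slope <= 0:
        return float("inf")  # flat or falling utilization: no exhaustion in sight
    return (capacity - utilization[-1]) / slope

# Six hourly utilization samples trending upward:
samples = [0.52, 0.55, 0.59, 0.62, 0.66, 0.70]
print(f"{hours_until_exhaustion(samples):.1f}h of headroom left")  # ~8.3h
```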

Other signals

  • AI inference services
  • Wafer-Scale Engine (WSE)
  • OpenAI partnership
  • leading model builders
  • frontier labs
  • self-service delivery pipelines
  • shared observability tooling
  • declarative GitOps-driven CD for model releases
  • capacity provisioning and cluster upgrades
  • mentor them as platform engineers
  • frontier AI inference at scale