Site Reliability Engineer - Ops & Automation

Cerebras · Semiconductors · Headquarters +2 · AI Cloud

Cerebras is seeking a Site Reliability Engineer to support its high-performance AI inference services powered by the Wafer-Scale Engine. The role involves operational execution, developing self-service CD pipelines, building automation tools, and enhancing observability for large-scale AI infrastructure. The position requires production Kubernetes experience and proficiency in Python or Go.

What you'd actually do

  1. Remain hands-on with operational execution (releases, capacity changes, cluster upgrades) over the next year as we build robust continuous delivery pipelines and self-service capabilities
  2. Contribute to the development of self-service CD pipelines for key workflows using our stack: Kubernetes, Bazel, Prometheus/Grafana/InfluxDB, Python, and Go
  3. Build reusable automation and internal developer tools that minimize operational toil and cross-team friction
  4. Develop and extend telemetry, observability, and alerting solutions to ensure operational reliability at scale
  5. Collaborate with Cluster Ops and development teams to identify high-impact automation opportunities and iterate quickly

Skills

Required

  • SRE
  • operations
  • automation
  • Kubernetes
  • Python
  • Go
  • Prometheus
  • Grafana
  • observability

Nice to have

  • GitOps
  • Argo CD
  • Flux
  • continuous delivery pipelines
  • Bazel
  • capacity planning
  • on-prem
  • multi-datacenter environments

What the JD emphasized

  • production Kubernetes experience
  • Python or Go for building tools and automation
  • Prometheus, Grafana, and observability-driven workflows

Other signals

  • AI inference services
  • Wafer-Scale Engine (WSE)
  • frontier-class models
  • high-performance SRE function
  • cutting-edge AI inference infrastructure