Site Reliability Engineer

Cognition · Coding AI · San Francisco, CA · Research & Development

Site Reliability Engineer role at an applied AI lab building end-to-end software agents (Devin, Windsurf). The role covers production reliability for user-facing products and the platform engineering that lets the team ship quickly and with confidence. Responsibilities include defining SLOs, leading incident response, owning CI/CD pipelines, managing cloud infrastructure as code, capacity planning, and integrating security with reliability. The ideal candidate has deep experience running production systems at scale, strong software engineering fundamentals, and proficiency with cloud infrastructure and CI/CD.

What you'd actually do

  1. Define and own SLOs, SLIs, and error budgets for Devin and Windsurf. Build the monitoring, alerting, and observability systems that give the team a clear, honest picture of service health at all times.
  2. Lead incident response with speed and clarity. Run blameless postmortems that turn outages into durable improvements. Build the runbooks and tooling that make on-call sustainable and effective.
  3. Own the deployment pipelines, release infrastructure, and internal developer tooling that let the team ship fast without breaking things. Reduce toil systematically so engineers spend time on work that matters.
  4. Manage cloud infrastructure through code. Build reproducible, auditable, version-controlled environments that scale with the product and eliminate configuration drift.
  5. Model growth, forecast resource needs, and ensure the infrastructure stays ahead of demand. Profile and improve system performance before users feel it.

Skills

Required

  • SLOs
  • SLIs
  • error budgets
  • on-call rotations
  • incident command
  • software engineering fundamentals
  • cloud infrastructure (AWS, GCP, or Azure)
  • Kubernetes
  • Terraform
  • CI/CD pipelines
  • deployment infrastructure
  • instrumentation
  • dashboards
  • alerting
  • automation
  • incident detection
  • triage
  • mitigation
  • resolution
  • postmortems

Nice to have

  • developer-facing products or platforms

What the JD emphasized

  • production systems at scale
  • SRE at Cognition means writing real code, not just configuring tools
  • cloud infrastructure (AWS, GCP, or Azure)
  • container orchestration (Kubernetes)
  • infrastructure as code (Terraform or equivalent)
  • CI/CD pipelines
  • deployment infrastructure
  • observability instincts
  • reducing toil systematically through automation
  • owning incidents end to end
  • developer-facing products or platforms