Senior Manager, Cloud Platform & Site Reliability

Baseten · Data AI · San Francisco, CA · EPD

Senior Manager role leading Cloud Platform and Site Reliability Engineering at an AI infrastructure company. The role focuses on managing teams, setting technical direction for infrastructure, reliability, and platform engineering, and keeping the cloud infrastructure and SRE practice healthy. It requires expertise in Kubernetes, cloud infrastructure, distributed systems, IaC, CI/CD, and observability. Experience with AI/ML workloads, GPU infrastructure, and AI-assisted incident tooling is a bonus.

What you'd actually do

  1. Lead, grow, and develop team leads across the Cloud Platform and Site Reliability Engineering orgs, building a culture of ownership, technical excellence, and continuous improvement.
  2. Set the technical direction and roadmap for infrastructure, reliability, and platform engineering at the org level — balancing near-term operational needs with long-term strategic investments.
  3. Own the reliability posture of the platform end-to-end, establishing and enforcing org-wide standards for SLOs/SLIs, incident response, observability-as-code, runbooks, and post-incident reviews.
  4. Drive cross-functional collaboration with product, engineering, and customer-facing teams to ensure infrastructure capabilities and reliability investments align with product goals and enterprise customer requirements.
  5. Oversee incident management and escalation processes for high-severity production issues, ensuring clear communication, rapid resolution, and systemic follow-through.

Skills

Required

  • managing managers
  • leading multiple high-performing infrastructure, platform, or SRE teams
  • Kubernetes (multi-cloud)
  • cloud infrastructure
  • distributed systems
  • infrastructure-as-code (Terraform, Pulumi)
  • CI/CD tooling (GitHub Actions, GitLab CI, Jenkins)
  • GitOps workflows (Flux CD, Argo CD, Helm)
  • observability tooling (Prometheus, VictoriaMetrics, Loki, ELK, Grafana, OpenTelemetry)
  • SLOs/SLIs
  • incident management
  • enterprise SLAs
  • multi-stakeholder technical initiatives
  • communication skills
  • executive presence

Nice to have

  • running high-performance AI models and workloads
  • troubleshooting ML pipelines
  • GPU infrastructure
  • fractional GPU provisioning
  • multi-node model serving
  • incident management platforms (incident.io, PagerDuty)
  • AI-assisted tooling for incident triage and response
  • scaling an SRE practice
  • defining runbook standards
  • building self-healing automations
  • converting high-frequency failure patterns into systematic mitigations

What the JD emphasized

  • enterprise customer requirements
  • strict SLAs
  • high-severity production issues
  • multi-cloud capacity
  • GPU inference infrastructure
  • observability platforms