SRE

Baseten · Data AI · San Francisco, CA · EPD

Site Reliability Engineer role defining and codifying gold standards for Day 2 operations of an ML infrastructure platform. The focus is on robust systems, processes, automation, and observability that keep the platform reliable at scale and empower the rest of the organization. The role involves incident response, building observability tooling, and diagnosing runtime issues related to ML model deployment.

What you'd actually do

  1. Own the reliability of Baseten's multi-cloud Kubernetes infrastructure, including incident response, post-mortems, and remediation tracking.
  2. Build and maintain observability infrastructure — metrics, logging, dashboards, and alerting — as code.
  3. Author, validate, and improve runbooks for recurring failure patterns, ensuring they're structured for low-context, safe execution.
  4. Identify high-frequency failure patterns and convert them into automated mitigations or self-healing workflows.
  5. Diagnose and resolve runtime issues related to latency, memory behavior, GPU utilization, concurrency, and model lifecycle management.

Skills

Required

  • Kubernetes
  • scalable infrastructure
  • observability tooling
  • infrastructure-as-code
  • GitOps workflows
  • runbooks
  • incident response
  • post-mortem analysis

Nice to have

  • multi-cloud experience across EKS, GKE, or similar
  • observability-as-code
  • incident management platforms

What the JD emphasized

  • multi-cloud Kubernetes infrastructure
  • observability tooling
  • runbooks
  • incident response
  • model lifecycle management