Senior Site Reliability Engineer, Data Infrastructure

Weights & Biases · Data AI · Bellevue, WA · Information Technology

This role focuses on the reliability, scalability, and security of a Kubernetes-based data platform that supports internal AI workloads. The Senior Site Reliability Engineer owns the platform's reliability and performance, designing and operating highly available, multi-region systems that meet strict uptime and latency targets. Responsibilities include scaling infrastructure, improving deployment pipelines, hardening the security posture, and evolving DevSecOps practices in partnership with engineering teams.

What you'd actually do

  1. Own the reliability and performance of our Kubernetes-based data platform
  2. Design and operate highly available, multi-region systems, ensuring our services meet strict uptime and latency targets
  3. Scale infrastructure, improve deployment pipelines, and harden our security posture
  4. Play a key role in evolving our DevSecOps practices while partnering closely with engineering teams to ensure services are built for reliability from day one

Skills

Required

  • 5+ years of experience in Site Reliability Engineering, Platform Engineering, or Infrastructure Engineering roles
  • Deep expertise in Kubernetes and containerized software services, including cluster design, operations, and troubleshooting in production environments
  • Strong experience building and operating CI/CD systems, including tools such as Argo CD and GitHub Actions
  • Proven experience owning production systems with high availability requirements (≥99.99% uptime), including incident response, SLI/SLO/SLA definition, error budgets, and postmortems
  • Hands-on experience designing and operating geo-replicated, multi-region, active-active systems, including traffic routing, failover strategies, and data consistency tradeoffs
  • Strong experience building and owning observability components, including metrics, logging, and tracing (e.g., Prometheus, Grafana, OpenTelemetry)
  • Experience with infrastructure as code (e.g., Helm, Terraform, Pulumi) and automated environment provisioning
  • Strong understanding of system performance tuning, capacity planning, and resource optimization in distributed systems
  • Experience implementing and operating security best practices in cloud-native environments (e.g., secrets management, network policies, vulnerability scanning)

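The ≥99.99% uptime requirement above implies a concrete downtime budget. A minimal sketch of the arithmetic (Python; the time windows and function names are illustrative, only the 99.99% figure comes from the posting):

```python
# Illustrative error-budget arithmetic for an availability SLO.
# Only the 99.99% target comes from the job description; the
# windows below are common reporting periods, not requirements.

def allowed_downtime_minutes(slo: float, window_minutes: float) -> float:
    """Downtime budget implied by an availability SLO over a window."""
    return (1.0 - slo) * window_minutes

MINUTES_PER_YEAR = 365.25 * 24 * 60   # 525,960
MINUTES_PER_30D = 30 * 24 * 60        # 43,200

yearly = allowed_downtime_minutes(0.9999, MINUTES_PER_YEAR)
monthly = allowed_downtime_minutes(0.9999, MINUTES_PER_30D)

print(f"99.99% SLO -> {yearly:.1f} min/year, {monthly:.2f} min/30 days")
```

At four nines, the budget works out to roughly 52.6 minutes of downtime per year, or about 4.3 minutes in a 30-day window, which is why the posting pairs the target with incident response and error-budget practice.
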
Nice to have

  • Experience operating data platforms or data-intensive workloads (e.g., Spark, Airflow, Kafka, Flink)
  • Familiarity with service mesh technologies (e.g., Istio, Linkerd)
  • Background in building internal developer platforms or self-service infrastructure

What the JD emphasized

  • ≥99.99% uptime
  • incident response, SLI/SLO/SLA definition, error budgets, and postmortems
  • geo-replicated, multi-region, active-active systems
  • observability components, including metrics, logging, and tracing (e.g., Prometheus, Grafana, OpenTelemetry)
  • infrastructure as code (e.g., Helm, Terraform, Pulumi)
  • security best practices in cloud-native environments
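The SLI/SLO and error-budget emphasis above can be made concrete with a small sketch (Python; the request counts and helper names are hypothetical, not from the posting): compute an availability SLI from good/total request counts, then the fraction of the error budget already burned.

```python
# Hypothetical SLI / error-budget check; all counts are illustrative.

def availability_sli(good: int, total: int) -> float:
    """Fraction of successful requests in the measurement window."""
    return good / total if total else 1.0

def budget_burned(sli: float, slo: float) -> float:
    """Share of the error budget consumed (1.0 means fully spent)."""
    allowed_failure = 1.0 - slo
    actual_failure = 1.0 - sli
    return actual_failure / allowed_failure if allowed_failure else float("inf")

sli = availability_sli(good=999_950, total=1_000_000)  # 99.995% measured
print(f"SLI={sli:.5f}, budget burned={budget_burned(sli, 0.9999):.0%}")
```

With a 99.99% SLO, a measured SLI of 99.995% over the window means half the error budget is spent; in practice these counts would come from the metrics stack the posting names (e.g., Prometheus counters) rather than literals.
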