Senior Site Reliability Engineer

Anyscale · Data AI · San Francisco, CA +1 · Engineering

The Senior Site Reliability Engineer leads architectural strategy and operational excellence for Anyscale's global production systems, with a focus on autonomous, self-healing infrastructure and a culture of reliability. The role involves architecting cloud components, aligning deployment methodologies with reliability goals, designing observability systems, building monitoring and alerting, establishing testing infrastructure, defining SLOs and error budgets, implementing incident-management best practices, and coordinating cloud-based service deployment.

What you'd actually do

  1. Architect and develop a unified view of how cloud components are used across the company, accounting for teams' diverse needs and requirements.
  2. Ensure that deployment methodologies align with the company's reliability goals.
  3. Design and implement observability infrastructure for metrics, logging, and tracing that makes production environments understandable and issues quick to identify.
  4. Create monitoring and alerting systems at different levels that teams can easily contribute to and extend.
  5. Establish testing infrastructure to support the team in writing and executing tests effectively.

Skills

Required

  • Site Reliability or DevOps role experience
  • managing large-scale distributed systems and microservices architectures
  • multi-cloud environments (AWS, GCP, or Azure)
  • programming in Python or Go
  • IaC tools like Terraform
  • architecting and troubleshooting production-grade Kubernetes clusters
  • mentoring junior engineers
  • leading complex technical projects
  • influencing engineering culture without direct authority
  • leveraging data from logging and tracing infrastructure

What the JD emphasized

  • proven track record in high-growth environments
  • multi-cloud environments
  • production-grade Kubernetes clusters