Staff Software Engineer - Reliability

Rubrik · Enterprise · Palo Alto, CA · Engineering

This Staff Software Engineer - Reliability role focuses on the reliability, availability, performance, and security of enterprise infrastructure services spanning global SaaS platforms and government-compliant environments. It involves technical leadership in distributed systems and hyperscale automation, as well as leading the Application-SRE team in handling customer escalations and feedback loops. It requires strong expertise in SRE principles, distributed systems, cloud infrastructure, and leadership. US citizenship is a strict regulatory requirement so that the role can support federal and FedRAMP environments.

What you'd actually do

  1. Formulate and execute the architectural vision for Rubrik's Cloud Platform, optimizing backend infrastructure systems like Kubernetes, MySQL, and cloud-native services for performance, security, and multi-region scale.
  2. Build, scale, and maintain sophisticated custom internal tools, platform controllers, and automation frameworks in Go or Python to systematically eliminate operational toil.
  3. Wield engineering-wide influence to create technical consensus among component, platform, and security engineering teams, effectively "shifting left" to embed structural resilience, capacity guards, and compliance from initial feature designs.
  4. Define, audit, and enforce robust Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets across all critical enterprise platform services, translating telemetry insights into actionable product roadmaps during executive reviews.
  5. Serve as a primary Incident Commander for high-severity cloud outages, establishing roles, directing mitigation vectors under pressure, and orchestrating comprehensive, blameless post-mortems that drive durable systemic fixes.

Skills

Required

  • Golang, Python, or Java programming expertise
  • Concurrency models, data structures, and test-driven software design patterns
  • Designing, deploying, analyzing, and auditing complex, large-scale distributed systems
  • Database topologies and high-availability public cloud meshes
  • Unix/Linux operating system environments (process models, file systems, kernels)
  • Systems administration
  • Advanced L4/L7 networking protocols
  • Converting patterns from customer escalations and POCs into prioritized product and reliability feedback
  • Partnering directly with Product, Sales Engineering, and Support leadership
  • Partnering directly with Sales, Support, and customers on escalations and POCs
  • Translating field signals into engineering action
  • Technical leadership
  • Mapping architectural dependencies
  • Managing multi-team technical projects
  • Guiding organizations through critical platform shifts with high technical judgment

Nice to have

  • Extensive production experience provisioning, lifecycle-managing, and recovering enterprise-scale Kubernetes (GKE, EKS) deployments
  • Large-scale relational/non-relational databases (MySQL)
  • Prior experience building, certifying, or auditing infrastructure environments under compliance structures such as FedRAMP (High/Moderate), SOC 2, ISO 27001, or CJIS
  • Fluency in Infrastructure-as-Code (Terraform, Pulumi) module design
  • Multi-tenant state isolation
  • Enterprise observability fabrics (Prometheus, Grafana, Open)

What the JD emphasized

  • Must be a US citizen currently residing in the continental United States (CONUS). This is a strict regulatory requirement to enable support for federal and FedRAMP environments when required.