Senior Site Reliability Engineer, Production Engineering

Anduril · Defense · Seattle, WA · Software : Software Platform : SIE

Senior Site Reliability Engineer at Anduril, a defense technology company, focused on the reliability, performance, and scalability of mission-critical AI-powered systems (Lattice OS). The role involves designing monitoring, driving incident response, building automation, establishing SLOs, partnering with engineering teams on reliability, developing capacity plans, creating runbooks, leading deployment safety efforts, implementing security best practices, and building operational efficiency tooling. Requires 7+ years of engineering experience, including 3+ in SRE/operations, plus Kubernetes expertise, strong programming skills (Go, Python, Rust, or Java), observability stack experience, cloud platform knowledge, and distributed systems debugging skills. Candidates must be U.S. Persons eligible for a Secret security clearance.

What you'd actually do

  1. Design and implement comprehensive monitoring, observability, and alerting systems to ensure early detection of reliability issues across the Lattice platform
  2. Drive incident response and conduct blameless postmortems to identify systemic improvements and prevent recurrence of production issues
  3. Build and maintain infrastructure automation using tools like Terraform, Kubernetes operators, and custom tooling to manage large-scale distributed systems
  4. Establish and track Service Level Objectives (SLOs) and error budgets to balance feature velocity with system reliability
  5. Partner with software engineering teams to improve system architecture for reliability, implementing patterns like circuit breakers, graceful degradation, and chaos engineering

Skills

Required

  • 7+ years of engineering experience, including at least 3 years focused on SRE, production operations, or infrastructure engineering
  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
  • Deep expertise with Kubernetes in production environments, including operational challenges at scale (100+ nodes)
  • Strong programming skills in one or more languages such as Go, Python, Rust, or Java, with the ability to build production-grade tooling
  • Proven experience designing and implementing observability stacks (metrics, logging, tracing) using tools like Prometheus, Grafana, ELK/EFK, or equivalent
  • Hands-on experience with cloud platforms (AWS, Azure, or GCP) and infrastructure as code practices
  • Demonstrated ability to debug complex distributed systems issues across multiple layers of the stack
  • Track record of improving system reliability through architectural changes, not just operational band-aids
  • Strong incident management and communication skills, with experience leading responses to critical outages
  • U.S. Person status
  • Eligibility for U.S. Secret security clearance

Nice to have

  • Experience with defense, aerospace, or other mission-critical systems where downtime has severe consequences
  • Expertise in performance optimization and capacity planning for high-throughput, low-latency systems
  • Knowledge of chaos engineering principles and experience implementing resilience testing frameworks
  • Experience with service mesh technologies (Istio, Linkerd) and advanced traffic management patterns
  • Background in database operations and optimization (PostgreSQL, Cassandra, or similar at scale)
  • Familiarity with CI/CD platforms and deployment automation (ArgoCD, FluxCD, Spinnaker, Jenkins)
  • Understanding of networking fundamentals including load balancing, DNS, TLS/SSL, and network security
  • Experience with configuration management and secrets management solutions (Vault, Sealed Secrets, SOPS)
  • Strong written and verbal communication skills, with the ability to explain technical concepts to non-technical stakeholders

What the JD emphasized

  • Must be a U.S. Person due to required access to U.S. export-controlled information or facilities
  • Eligible to obtain and maintain an active U.S. Secret security clearance