Infrastructure Reliability Engineer

Anduril · Defense · Costa Mesa, CA · Corporate Technology: Infrastructure Engineering

Anduril Industries, a defense technology company, is seeking an Infrastructure Reliability Engineer to manage and operate core developer tools and infrastructure. The role involves owning the full lifecycle of self-hosted tools, designing automation for patching and upgrades, scaling infrastructure, and ensuring reliability through SRE practices. This position requires experience with Docker, Kubernetes, cloud platforms, and Infrastructure-as-Code, with a focus on automation and end-to-end system ownership. The company leverages AI and advanced technologies to transform military capabilities.

What you'd actually do

  1. Own the lifecycle of core self-hosted developer tools (e.g., GitHub Enterprise Server, CircleCI, JFrog Artifactory/Xray)
  2. Design and implement automated systems for patching, backups (with validation), and upgrades
  3. Scale infrastructure to support a fast-growing engineering org
  4. Use Infrastructure-as-Code (Terraform) to manage environments
  5. Operate and troubleshoot systems using Docker, Kubernetes, and cloud platforms (AWS, GCP, Azure)

Skills

Required

  • Operating production systems using Docker and Kubernetes
  • Proficiency with at least one cloud platform (AWS, GCP, or Azure)
  • Managing infrastructure with Infrastructure-as-Code tools (e.g., Terraform)
  • Strong problem-solving skills with a focus on automation
  • Scripting or software development experience (e.g., Python, Go, Bash)
  • Familiarity with CI/CD pipelines and developer tooling
  • Ability to own systems end-to-end, from design to incident resolution
  • Eligible to obtain and maintain an active U.S. Secret security clearance

Nice to have

  • Prior experience with GitHub Enterprise Server, JFrog Artifactory/Xray, or CircleCI
  • Experience maintaining highly available, scalable internal tools
  • Exposure to security best practices, compliance requirements, or auditing
  • Experience supporting large, rapidly scaling engineering organizations
  • Experience with monitoring and observability platforms (e.g., Datadog, Prometheus, Grafana)
  • Background in SRE or hybrid SWE/DevOps roles
  • Experience with on-prem infrastructure operations, reliability, or capacity planning

What the JD emphasized

  • U.S. Secret security clearance