Senior Site Reliability Engineer I

Axon · Enterprise · Boston, MA · 1505 SAAS Ops

Senior Site Reliability Engineer focused on building and evolving Axon's next-generation observability platform, including distributed tracing, log aggregation, and metrics infrastructure. The role involves managing infrastructure as code and partnering with engineering teams to drive adoption of observability practices.

What you'd actually do

  1. Own and evolve Axon's distributed tracing infrastructure, including Jaeger and OpenTelemetry-based instrumentation, driving adoption across Axon's service-oriented architecture
  2. Build and operate Axon's log aggregation platform (Grafana Loki + Alloy), expanding use cases beyond Kubernetes event logs and reducing organizational dependency on expensive third-party log tooling (including Splunk)
  3. Maintain and improve Axon's metrics infrastructure (Cortex, Prometheus, Grafana) — the foundation for alerting, dashboards, and SLO tracking across all of Axon's environments
  4. Write internal tooling and automation that makes observability self-service: toolkit commands, agentic on-call helpers, runbook generation, and dashboard scaffolding
  5. Manage observability infrastructure as code via Terraform, CDK, ArgoCD, and Helm; handle capacity management and cybersecurity and compliance requirements; and participate in the on-call rotation

Skills

Required

  • Linux systems fundamentals
  • Kubernetes
  • Loki
  • Grafana
  • Tempo/Jaeger
  • Mimir/Cortex
  • Terraform
  • Golang
  • Python
  • Java
  • CJIS clearance (or eligibility to obtain it)

Nice to have

  • OpenTelemetry
  • GitOps
  • 24/7 high-volume systems
  • Agentic AI tooling
  • LLM-powered developer tools
  • Complex multi-service distributed systems

What the JD emphasized

  • 7+ years of experience in SRE, platform engineering, or infrastructure engineering
  • United States citizen — able to gain CJIS clearance for full US production access