Senior Site Reliability Engineer I

Axon · Enterprise · Boston, MA · 1505 SAAS Ops

This role focuses on building and evolving Axon's next-generation observability platform, encompassing distributed tracing, log aggregation, and metrics infrastructure. The engineer will work with cloud-native systems and infrastructure as code, and will drive adoption of modern observability practices across the organization.

What you'd actually do

  1. Own and evolve Axon's distributed tracing infrastructure, including Jaeger and OpenTelemetry-based instrumentation, driving adoption across Axon's service-oriented architecture
  2. Build and operate Axon's log aggregation platform (Grafana Loki + Alloy), expanding use cases beyond Kubernetes event logs and reducing organizational dependency on expensive third-party log tooling (including Splunk)
  3. Maintain and improve Axon's metrics infrastructure (Cortex, Prometheus, Grafana) — the foundation for alerting, dashboards, and SLO tracking across all of Axon's environments
  4. Write internal tooling and automation that makes observability self-service: toolkit commands, agentic on-call helpers, runbook generation, and dashboard scaffolding
  5. Manage observability infrastructure as code via Terraform, CDK, ArgoCD, and Helm, covering capacity management and cybersecurity and compliance requirements, and participate in the on-call rotation

Skills

Required

  • Linux systems fundamentals
  • Kubernetes
  • Loki
  • Grafana
  • Tempo/Jaeger
  • Mimir/Cortex
  • Terraform
  • Golang
  • Python
  • Java
  • CJIS clearance (or the ability to gain it)

Nice to have

  • OpenTelemetry
  • GitOps
  • 24/7 high-volume systems
  • Agentic AI tooling
  • LLM-powered developer tools
  • Complex multi-service distributed systems

What the JD emphasized

  • 7+ years of experience in SRE, platform engineering, or infrastructure engineering
  • United States citizen — able to gain CJIS clearance for full US production access