Site Reliability Engineer (aht)

Northrop Grumman · Aerospace · Beavercreek, OH +1 · Software

Site Reliability Engineer role focused on defining and measuring reliability targets, building automation and tooling, and leading incident response for distributed systems. Requires strong systems-thinking, observability fundamentals, and software engineering skills.

What you'd actually do

  1. Lead real-time detection, triage, and resolution of production incidents; conduct postmortems and drive corrective actions.
  2. Identify repetitive operational work, develop automation and runbooks, and implement CI/CD pipelines to reduce manual effort.
  3. Define service level objectives (SLOs) and error budget policies; assess system reliability against those targets using observability data.
  4. Build and maintain shared tooling (e.g., Kubernetes clusters, GitOps workflows); enable development teams with SDKs, instrumentation guidance, and reliability best practices.

Skills

Required

  • Bachelor’s degree in Computer Science or a related STEM field
  • Systems‑thinking mindset
  • Observability fundamentals
  • Basic software‑engineering skills
  • Linux and networking fundamentals
  • Strong communication, collaboration, and organizational abilities
  • Kubernetes
  • Argo CD/GitOps
  • Disaster recovery planning
  • Capacity forecasting
  • OpenTelemetry standards
  • Grafana/Perses
  • Tempo
  • ClickHouse
  • VictoriaMetrics
  • Scripting
  • CI/CD pipeline development
  • Runbook automation
  • DevOps practices
  • Instrumentation SDKs
  • Onboarding of SRE practices for engineering teams
  • High-quality dashboards
  • Alert design
  • Anomaly detection techniques

Nice to have

  • SRE-related certifications
  • Python
  • Go
  • GitLab/GitHub
  • Jenkins
  • Docker
  • Locust/Gatling
  • Prometheus
  • Container orchestration
  • Service mesh
  • Cloud-native infrastructure
  • Security best practices for cloud and on-premises environments

What the JD emphasized

  • Systems‑thinking mindset
  • Observability fundamentals
  • Basic software‑engineering skills
  • Proven track record of driving reliability improvements in large-scale distributed systems