Site Reliability Engineer (aht) at Northrop Grumman

What you'd actually do

Lead real-time detection, triage, and resolution of production incidents; conduct postmortems and drive corrective actions

Identify repetitive operational work, develop automation and runbooks, and implement CI/CD pipelines to reduce manual effort

Define service level objectives (SLOs) and error budget policies; assess system reliability against those targets using observability data

Build and maintain shared tooling (e.g., Kubernetes clusters, GitOps workflows); enable development teams with SDKs, instrumentation guidance, and reliability best practices

Skills

Required

Bachelor’s degree in Computer Science or related STEM degree
Systems‑thinking mindset
Observability fundamentals
Basic software‑engineering skills
Linux and networking fundamentals
Strong communication, collaboration, and organizational abilities
Kubernetes
Argo CD/GitOps
Disaster recovery planning
Capacity forecasting
OpenTelemetry standards
Grafana/Perses
Tempo
ClickHouse
VictoriaMetrics
Scripting
CI/CD pipeline development
Runbook automation
DevOps practices
Instrumentation SDKs
Onboarding of SRE practices for engineering teams
High-quality dashboards
Alert design
Anomaly detection techniques

Nice to have

SRE-related certifications
Python
Go
GitLab/GitHub
Jenkins
Docker
Locust/Gatling
Prometheus
Container orchestration
Service mesh
Cloud-native infrastructure
Security best practices for cloud and on-prem environments

RELOCATION ASSISTANCE: Relocation assistance may be available

CLEARANCE REQUIRED FOR START: No

CLEARANCE TYPE: Top Secret

TRAVEL: Yes, 10% of the Time

Description

At Northrop Grumman, our employees have incredible opportunities to work on revolutionary systems that impact people's lives around the world today, and for generations to come. Our pioneering and inventive spirit has enabled us to be at the forefront of many technological advancements in our nation's history - from the first flight across the Atlantic Ocean, to stealth bombers, to landing on the moon. We look for people who have bold new ideas, courage and a pioneering spirit to join forces to invent the future, and have fun along the way. Our culture thrives on intellectual curiosity, cognitive diversity and bringing your whole self to work — and we have an insatiable drive to do what others think is impossible. Our employees are not only part of history, they're making history.

Northrop Grumman Defense Systems (NGDS), Beavercreek, Ohio, is seeking a Site Reliability Engineer to define what "reliable enough" means from the user’s perspective, instrument and measure against those targets, and build the tooling and runbooks that make failure recoverable. Candidates will partner with dev teams to push operational quality upstream before code ships and will lead problem resolution in production. SREs are comfortable debugging distributed systems, resolving incidents, and translating findings into lasting reliability improvements. They will work closely with software development teams to accomplish the following:

Incident Response – Lead real-time detection, triage, and resolution of production incidents; conduct postmortems and drive corrective actions. Complete work independently and as part of an Agile team
Toil Reduction – Identify repetitive operational work, develop automation and runbooks, and implement CI/CD pipelines to reduce manual effort
Reliability Evaluations – Define service level objectives (SLOs) and error budget policies; assess system reliability against those targets using observability data
Platform Enablement – Build and maintain shared tooling (e.g., Kubernetes clusters, GitOps workflows); enable development teams with SDKs, instrumentation guidance, and reliability best practices

This requisition may be filled at a higher level based on qualifications listed below and is contingent on funding.

**Basic Qualifications:**

**Engineer (Level 2):** 2+ years related experience with Bachelor’s degree in Computer Science or related STEM degree from an accredited institution; 0 years with Master’s degree
**Principal Engineer (Level 3):** 5+ years related experience with Bachelor’s degree in Computer Science or related STEM degree from an accredited institution; 3 years with Master’s degree
U.S. Citizenship and ability to obtain a Top‑Secret security clearance
Systems‑thinking mindset – understand how components fail together and assess blast radius
Observability fundamentals – beyond the three signals, know how to use telemetry to optimize services and engineers’ quality of life
Basic software‑engineering skills – build automation and non‑trivial APIs, follow Git workflows, and actively participate in code reviews
Linux and networking fundamentals
Strong communication, collaboration, and organizational abilities
Specialty Skills (1 or more):
- Platform & Infrastructure – Kubernetes, Argo CD/GitOps, disaster recovery planning, capacity forecasting
- Observability – OpenTelemetry standards, Grafana/Perses, Tempo, ClickHouse, VictoriaMetrics
- Automation & Toil Reduction – Scripting, CI/CD pipeline development, runbook automation, “DevOps” practices
- Developer Enablement – Instrumentation SDKs, onboarding of SRE practices for engineering teams
- Data & Alerting – High-quality dashboards, alert design, anomaly detection techniques

**Preferred Qualifications:**

SRE-related certifications (e.g., DevOps Institute, AWS Solutions Architect, or equivalent)
Hands-on experience with: Python, Go, Kubernetes, Argo CD, GitLab/GitHub, Jenkins, Docker, Locust/Gatling, Prometheus, Grafana/Perses
Experience with container orchestration, service mesh, and cloud-native infrastructure
Proven track record of driving reliability improvements in large-scale, distributed systems
Familiarity with security best practices for cloud and on-prem environments

Primary Level Salary Range: $83,400.00 - $125,200.00

Secondary Level Salary Range: $103,600.00 - $155,400.00

The above salary range represents a general guideline; however, Northrop Grumman considers a number of factors when determining base salary offers such as the scope and responsibilities of the position and the candidate's experience, education, skills and current market conditions.

Depending on the position, employees may be eligible for overtime, shift differential, and a discretionary bonus in addition to base pay. Annual bonuses are designed to reward individual contributions as well as allow employees to share in company results. Employees in Vice President or Director positions may be eligible for Long Term Incentives. In addition, Northrop Grumman provides a variety of benefits including health insurance coverage, life and disability insurance, savings plan, Company paid holidays and paid time off (PTO) for vacation and/or personal business.

The application period for the job is estimated to be 20 days from the job posting date. However, this timeline may be shortened or extended depending on business needs and the availability of qualified candidates.

Northrop Grumman is an Equal Opportunity Employer, making decisions without regard to race, color, religion, creed, sex, sexual orientation, gender identity, marital status, national origin, age, veteran status, disability, or any other protected class. For our complete EEO and pay transparency statement, please visit http://www.northropgrumman.com/EEO. U.S. Citizenship is required for all positions with a government clearance and certain other restricted positions.
