Senior Site Reliability Engineer

NVIDIA · Semiconductors · Santa Clara, CA

This Senior Site Reliability Engineer (SRE) role focuses on improving incident detection, response, and prevention at scale. It involves leading major incidents, building automation for incident management, enhancing observability, and driving root cause analysis. A key aspect is applying AI and data-driven techniques to operational workflows, incident triage, and decision support.

What you'd actually do

  1. Lead major incidents end to end, driving triage, cross-team coordination, decision-making, and executive communication
  2. Build and implement automation for incident detection, triage, communication, and remediation
  3. Improve observability and signal quality to enable earlier detection and reduce reliance on user-reported issues
  4. Drive root cause analysis and translate learnings into systemic fixes, automation, and prevention mechanisms
  5. Leverage AI and data-driven techniques to enhance incident triage, summarization, and decision support

Skills

Required

  • Site Reliability Engineering
  • Production Engineering
  • Incident Management
  • Incident command (acting as Incident Commander)
  • Distributed systems
  • Monitoring
  • Reliability engineering principles
  • Python
  • Java
  • C++
  • AI/ML concepts

Nice to have

  • Building incident management or automation platforms (ChatOps, workflow orchestration, alert intelligence)
  • Applying AI to operations, such as intelligent triage, RCA generation, or signal correlation
  • Reducing MTTD and MTTR through engineering and automation
  • Balancing real-time incident leadership with building scalable, long-term reliability solutions

What the JD emphasized

  • 8+ years of experience in Site Reliability Engineering, Production Engineering, or Incident Management roles
  • Proven experience acting as an Incident Commander or leading major incident response in complex environments
  • Experience or familiarity with AI/ML concepts (e.g., LLMs, anomaly detection, or data-driven operations) applied to operational workflows
  • Hands-on experience applying AI to operations, such as intelligent triage, RCA generation, or signal correlation
  • Track record of reducing MTTD and MTTR through engineering and automation