Senior Manager, Site Reliability Engineering

NVIDIA · Semiconductors · Santa Clara, CA

NVIDIA is hiring a Senior Manager of Site Reliability Engineering to lead and reshape IT operations at scale, building AI-powered systems for reliability, speed, and employee experience. The role focuses on transforming Incident, Problem, and Change Management using observability, AI insights, and orchestration to move toward predictive and autonomous operations.

What you'd actually do

  1. Manage the full lifecycle of Incident, Problem, and Change Management as a 24×7 operational function, ensuring high reliability and minimal business disruption.
  2. Transform incident response by applying AI detection, correlation, and guided remediation to reduce time to detect, respond, and resolve.
  3. Build and scale intelligent incident workflows that integrate monitoring, telemetry, and service context to enable faster and more consistent response.
  4. Evolve Problem Management into a data-driven discipline, using AI and analytics to identify patterns, eliminate recurring issues, and drive systemic fixes.
  5. Modernize Change Management by introducing risk-aware, data-driven decisioning, improving change success rates and reducing blast radius.

Skills

Required

  • leading and managing global IT operations or service management teams
  • Site Reliability Engineering
  • IT Service Management
  • Incident Management
  • Problem Management
  • Change Management
  • applying AI, automation, or advanced analytics to improve operational outcomes
  • observability
  • monitoring ecosystems
  • modern reliability practices
  • SRE principles
  • SLOs
  • error budgets
  • moving organizations from process-heavy to technology-focused operating models
  • leadership capability
  • building and scaling engineering-focused teams
  • executive-level communication
  • translating operational signals into clear, actionable narratives
  • building and leading a high-performing team of SREs and engineers

Nice to have

  • ITIL knowledge and/or certification
  • Experience building or scaling AI-powered operational platforms
  • Ability to challenge traditional ITSM models and introduce innovative, scalable approaches
  • An automation-first mindset that favors prevention over reaction and systems over process

What the JD emphasized

  • AI-powered systems
  • AI insights
  • AI detection
  • AI and analytics
  • automation and orchestration platforms
  • observability
  • SRE principles
  • SLOs
  • error budgets

Other signals

  • AI-powered systems for reliability
  • AI insights and orchestration
  • transform incident response with AI detection, correlation, and guided remediation
  • evolve Problem Management using AI and analytics
  • adoption of observability
  • automation and orchestration platforms