Site Reliability Engineer - Hardware Infrastructure

NVIDIA · Semiconductors · Santa Clara, CA

Site Reliability Engineer focused on hardware infrastructure, responsible for defining, developing, and supporting large-scale production systems with high efficiency and availability. The role involves incident management, root cause analysis, defining reliability metrics, and applying automation and Generative AI/Agentic solutions to improve operations.

What you'd actually do

  1. Develop and support guidelines for incident management, planned maintenance, and blameless postmortems.
  2. Assist teams in responding to high severity incidents, driving root cause analysis, crafting high-quality postmortems, and developing post-incident corrective actions.
  3. Define reliability and supportability metrics, Service Level Objectives, and error budgets.
  4. Develop and drive the adoption of actionable, customer-centric monitoring and alerting.
  5. Apply automation and Generative AI/Agentic solutions to reduce manual, tedious work and improve customer support.

Skills

Required

  • Computer Science degree or equivalent experience
  • 8+ years of experience in SRE, DevOps, or Production Engineering
  • Strong understanding of SRE principles
  • Experience crafting and deploying fault-tolerant, performant, and supportable systems
  • Infrastructure automation
  • Experience running critical services in production
  • Python
  • Go
  • Perl
  • Ruby
  • Observability platforms (Prometheus, Grafana)
  • Communication skills
  • Adaptability

Nice to have

  • Expertise in establishing incident management and postmortem processes
  • Experience driving adoption of common tools and processes across diverse groups
  • Experience working with LLM/Generative AI/Agentic solutions
  • Hands-on expertise operating and scaling distributed systems with tight SLAs

What the JD emphasized

  • 8+ years of experience in SRE, DevOps, or Production Engineering
  • Strong understanding of SRE principles, including incident management, error budgets, SLOs, and SLAs
  • Experience crafting and deploying systems that are fault-tolerant, performant, and supportable
  • Experience running critical services in production
  • Hands-on experience with observability platforms (e.g., Prometheus, Grafana)
  • Experience working with LLM/Generative AI/Agentic solutions to shorten mitigation time, lessen toil, and ensure Service Level Objectives are met