Senior Site Reliability Engineer - Datacenter Automation

NVIDIA · Semiconductors · Bangalore, India

NVIDIA is seeking an experienced Senior Site Reliability Engineer to scale its AI infrastructure, focusing on production systems for large GPU clusters used in AI workloads. The role involves implementing monitoring, health management, and automation for GPU asset provisioning, configuration, and lifecycle management across cloud providers to ensure reliability, availability, and scalability. The engineer will collaborate with other teams to keep AI clusters reliable and performant, evaluate system failures, and improve services.

What you'd actually do

  1. You will be part of a DGX Cloud team responsible for production systems that enable large, scalable GPU clusters to be used for a variety of AI workloads.
  2. Implementing monitoring and health management capabilities that enable industry-leading reliability, availability, and scalability of GPU assets.
  3. Working with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance.

Skills

Required

  • site reliability principles
  • techniques including reliability assessments
  • incident management processes
  • production system observability
  • monitoring and alerting
  • automated deployments
  • toil elimination
  • software engineering discipline
  • systems programming language (Go, Python)
  • solid understanding of data structures and algorithms
  • DevOps/SRE role
  • large-scale production systems

Nice to have

  • managing and automating large-scale distributed systems independent of cloud providers
  • Advanced hands-on experience and deep understanding of cluster management systems (Kubernetes, Slurm, Bright Cluster Manager)
  • Proven operational excellence in maintaining reliable and performant AI infrastructure

What the JD emphasized

  • reliability assessments
  • incident management processes
  • production system observability
  • monitoring and alerting
  • automated deployments
  • toil elimination
  • production systems
  • reliability
  • availability
  • scalability
  • AI infrastructure
  • reliable and performant AI infrastructure

Other signals

  • AI infrastructure
  • GPU clusters
  • large scalable GPU clusters
  • AI workloads
  • production AI clusters