Senior Site Reliability Engineer, AI Factory

NVIDIA · Semiconductors · Santa Clara, CA +1 · Remote

This role focuses on Site Reliability Engineering (SRE) for large-scale, GPU-accelerated data centers that serve as AI factories. The primary responsibility is to architect, automate, and operate these data centers, ensuring their resilience, scalability, and efficiency. This involves managing GPU systems and firmware, monitoring hardware, triaging issues, and collaborating with software and hardware teams to define operational strategies and procedures. The role emphasizes maintaining high availability and performance through robust processes and rapid incident response, ultimately supporting the foundation of global AI infrastructure.

What you'd actually do

  1. Running commissioning and provisioning for GPU systems
  2. Managing the firmware versions of equipment and components, and communicating the supported versions across the organization
  3. Maintaining tight SLOs for efficiency, performance, and availability through Day-2 operations
  4. Monitoring the hardware state of the cluster, finding bottlenecks and hot spots, and helping users consistently attain peak performance
  5. Triaging hardware break-fix issues and making continuous improvements using open-source break-fix solutions

Skills

Required

  • BS or MS degree in Computer Engineering/Science, or related field (or equivalent experience) with 10+ overall years of meaningful work experience
  • Experience managing GPU Fleets
  • 10+ years of expertise in improving data center operations or critical infrastructure
  • Expertise in BMS & Power management
  • Background in working with Provisioning, Commissioning, and Config Management solutions
  • Experience working with Packer and developing QCOW2 images
  • Background in coordinating with remote hands
  • Experience working with Datacenter Inventory Management Systems like NetBox, Nautilus, or others
  • Proven track record of working with multiple teams to achieve operational excellence for an organization
  • Experience driving reliability with robust processes, rapid field response, and recovery

Nice to have

  • History of involvement with Automated Break-Fix solutions at scale
  • Familiarity with handling a Message Bus and Workflow Engine
  • Hands-on involvement with Zero Touch Provisioning solutions for the network and host

What the JD emphasized

  • GPU systems
  • GPU Fleets
  • data center operations
  • critical infrastructure
  • Automated Break-Fix solutions at scale