Senior Site Reliability Engineer

NVIDIA · Semiconductors · Santa Clara, CA

This Senior Site Reliability Engineer role manages NVIDIA's on-prem infrastructure, ensuring uptime, reliability, and readiness of engineering cloud services. The role involves deploying and managing applications on Kubernetes, implementing monitoring and alerting, capacity planning, and driving automation. A key aspect is leveraging AI techniques to analyze machine and job data for operational insights.

What you'd actually do

  1. Manage NVIDIA's on-prem infrastructure. Maintain uptime, reliability, and readiness of the on-prem engineering cloud, which spans multiple data centers.
  2. Guard service level agreements (SLAs) for critical engineering services. Implement monitoring, alerting, and incident response procedures to keep services within defined performance targets. Perform root cause analysis and post-mortems for incidents that breach thresholds.
  3. Deploy, configure, and manage applications and services on Kubernetes clusters. Implement logging, monitoring, and alerting solutions (e.g., Prometheus, Grafana, ELK/EFK). Ensure high availability, fault tolerance, and disaster recovery for Kubernetes workloads.
  4. Contribute to capacity planning, optimization, and resource utilization efforts.
  5. Support user-reported issues. Monitor alerts and take necessary action. Actively participate in war rooms for critical issues.

Skills

Required

  • Experience maintaining cloud infrastructure and highly available production environments.
  • Experience handling and maintaining systems installed in on-premises data centers, with strong hands-on proficiency using BMC interfaces (Redfish), KVM, and IPMI tools for hardware provisioning, remote access, and troubleshooting.
  • Proven background working with databases, including relational (SQL) databases such as MySQL and time-series databases such as Prometheus, with experience in data querying and performance tuning.
  • Solid understanding of networking principles and protocols, including TCP/IP, DNS, DHCP, and VLANs, with the ability to diagnose connectivity issues and support complex, distributed systems.
  • Practical experience with data analytics and visualization tools such as Kibana, Grafana, Splunk, or similar platforms, applied to analyzing logs, metrics, and system behavior for monitoring and troubleshooting.
  • Strong, demonstrable experience with automation tools such as Jenkins and/or Temporal, along with configuration management tools such as Ansible.
  • Proficiency with Kubernetes, Docker, and virtualization technologies, with experience deploying, managing, and operating containerized workloads and virtualized infrastructure in production environments.
  • Advanced knowledge of standard security methodologies and protocols, including system hardening, access control, vulnerability management, and secure operations across infrastructure and application layers.
  • 5+ years of demonstrable experience.
  • Bachelor's degree in Computer Science, Information Technology, or related field, or equivalent experience.

Nice to have

  • Knowledge and understanding of OpenStack architecture and services.
  • Previous experience with SRE teams managing on-prem infrastructure.
  • Experience managing NVIDIA hardware such as GPUs and Tegras.

What the JD emphasized

  • 5+ years of demonstrable experience