Senior Service Reliability Engineer - EDA Infrastructure

NVIDIA · Semiconductors · Bangalore, India

This role is for a Senior Service Reliability Engineer focused on operating and maintaining hardware infrastructure for NVIDIA's global Service Reliability Operations Center. The primary responsibilities include ensuring scalability, resilience, and high availability of large-scale production compute and storage environments, utilizing monitoring and observability tools to detect and respond to incidents, and collaborating with SRE, Security, and DevOps teams. The role requires strong Linux system administration, automation skills, and experience in high-availability environments.

What you'd actually do

  1. Monitor and manage large-scale production compute and storage environments to ensure high availability and performance
  2. Utilize alerts, alarms, and observability tools to proactively detect, prevent, and respond to incidents
  3. Apply deep systems knowledge to analyze logs, metrics, and system behavior to diagnose issues, identify root causes, and implement effective resolutions
  4. Collaborate with SRE, Security, and DevOps to improve reliability, reduce incident frequency and impact, and drive rapid resolution when issues occur
  5. Partner with development teams to implement monitoring, alerting, and observability solutions that proactively detect issues and enhance the customer experience

Skills

Required

  • Linux system administration
  • Automation using Ansible and/or Python
  • Systems Administration
  • SRE
  • NOC

Nice to have

  • Kubernetes
  • SLURM
  • Large-scale cluster management
  • GPU hardware
  • High-performance computing environments
  • Observability tools
  • Incident management tools
  • Grafana
  • OpenTelemetry
  • PagerDuty
  • JIRA
  • Cloud experience (AWS, Azure, GCP)
  • On-prem expertise

What the JD emphasized

  • 5+ years of experience administering large-scale production systems
  • 3+ years of experience in high-availability Internet, Cloud, or Data Center environments (Systems Administration, SRE, or NOC)
  • Expert-level knowledge of Linux system administration and automation using Ansible and/or Python