Senior Site Reliability Engineer - Storage

NVIDIA · Semiconductors · Bangalore, India

NVIDIA is seeking a Senior Site Reliability Engineer specializing in Storage to ensure the reliability, performance, and scalability of its global NAS, SAN, and Object Storage platforms. The role involves leading design, deployment, and operations; developing automation for infrastructure management; participating in incident response; defining SLOs/SLIs; and collaborating with cross-functional teams. Experience with storage for AI/ML workloads is a plus.

What you'd actually do

  1. Lead design, deployment, and operations of production NAS, SAN, and Object Storage platforms, ensuring reliability, performance, and security.
  2. Capture requirements from partner teams, architect storage solutions, and drive end‑to‑end implementation for new and existing services.
  3. Develop, maintain, and improve automation for provisioning, configuration, monitoring, incident response, and lifecycle management of storage infrastructure.
  4. Participate in on‑call and incident response, lead troubleshooting of complex storage and performance issues, and drive root cause analysis and preventive actions.
  5. Define and track SLOs/SLIs and error budgets for storage services, using observability and analytics to continuously improve reliability and efficiency.

Skills

Required

  • 12+ years of experience in Site Reliability, DevOps, or Infrastructure Engineering, with significant focus on storage systems.
  • Strong hands‑on experience with design, deployment, and operations of enterprise‑grade NAS, SAN, and/or Object Storage platforms.
  • Solid understanding of SRE concepts (SLOs/SLIs, error budgets, incident management, observability, postmortems).
  • Proficiency with Infrastructure as Code and configuration management tools (e.g., Terraform, Ansible, Puppet, SaltStack) and source control systems.
  • Experience building and operating highly available, scalable infrastructure, including automation for provisioning, monitoring, and remediation.
  • Experience with container and virtualization platforms (e.g., Docker, Kubernetes, hypervisors) and modern CI/CD and version control tools.
  • Strong scripting or programming skills (e.g., Python, Go, Shell) to build tools, automate workflows, and integrate systems.
  • Excellent communication and collaboration skills, with the ability to work effectively across distributed and cross‑functional teams.
  • Bachelor’s degree in Computer Science, Computer Engineering, or a related technical field (or equivalent practical experience).

Nice to have

  • Experience with storage for high‑performance computing, AI/ML workloads, or large‑scale data analytics.
  • Proven ability to debug complex distributed systems and storage performance issues.
  • History of driving reliability improvements through data‑driven analysis and automation.
  • Experience leading technical initiatives, mentoring engineers, or acting as a technical lead on critical projects.