NVIDIA has been redefining computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s an outstanding legacy of innovation that’s fueled by phenomenal technology – and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

We are seeking a Senior Site Reliability Engineer – Storage. In this role, you will own the reliability, performance, and scalability of our global NAS, SAN, and Object Storage platforms that power critical internal and external services. You will combine deep storage expertise with strong automation and SRE practices to design, build, and operate highly available storage systems at scale.

What You Will Be Doing:

Lead design, deployment, and operations of production NAS, SAN, and Object Storage platforms, ensuring reliability, performance, and security.
Capture requirements from partner teams, architect storage solutions, and drive end‑to‑end implementation for new and existing services.
Develop, maintain, and improve automation for provisioning, configuration, monitoring, incident response, and lifecycle management of storage infrastructure.
Participate in on‑call and incident response, lead troubleshooting of complex storage and performance issues, and drive root cause analysis and preventive actions.
Define and track SLOs/SLIs and error budgets for storage services, using observability and analytics to continuously improve reliability and efficiency.
Build and maintain runbooks, standard operating procedures, and comprehensive documentation for storage services and automation.
Analyze capacity and usage trends, perform forecasting, and recommend scaling or optimization strategies to support business growth.
Collaborate closely with SRE, infrastructure, networking, and application teams in a follow‑the‑sun model to deliver consistent, high‑quality service.
Mentor junior engineers, share best practices, and help drive adoption of SRE principles across the team.

What We Need to See:

12+ years of experience in Site Reliability, DevOps, or Infrastructure Engineering, with a significant focus on storage systems.
Strong hands‑on experience with design, deployment, and operations of enterprise‑grade NAS, SAN, and/or Object Storage platforms.
Solid understanding of SRE concepts (SLOs/SLIs, error budgets, incident management, observability, postmortems).
Proficiency with Infrastructure as Code and configuration management tools (e.g., Terraform, Ansible, Puppet, SaltStack) and source control systems.
Experience building and operating highly available, scalable infrastructure, including automation for provisioning, monitoring, and remediation.
Experience with container and virtualization platforms (e.g., Docker, Kubernetes, hypervisors) and modern CI/CD and version control tools.
Strong scripting or programming skills (e.g., Python, Go, Shell) to build tools, automate workflows, and integrate systems.
Excellent communication and collaboration skills, with the ability to work effectively across distributed and cross‑functional teams.
Bachelor’s degree in Computer Science, Computer Engineering, or a related technical field (or equivalent practical experience).

Ways to Stand Out from the Crowd:

Experience with storage for high‑performance computing, AI/ML workloads, or large‑scale data analytics.
Proven ability to debug complex distributed systems and storage performance issues.
History of driving reliability improvements through data‑driven analysis and automation.
Experience leading technical initiatives, mentoring engineers, or acting as a technical lead on critical projects.

With competitive salaries and a generous benefits package, NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most brilliant and talented people in the world working for us. If you're creative and motivated, we want to hear from you!

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.
