Skills

Required

Go
Java
C/C++
Scala
Python
Elixir
backend
systems
infrastructure engineering
scalability
consistency
performance trade-offs
server-side systems
horizontally scalable
resilient
low-latency services
end-to-end service ownership
architecture
build reviews
implementation
testing
rollout
observability
iterative improvement
GCP
AWS
Azure
cloud-native primitives
CI/CD
GitOps workflows
Infrastructure as Code
problem-solving skills
simplifying complex systems
B.S. in Computer Science or related field
5+ years of relevant experience
communication skills
collaboration skills
guiding technical decisions

Nice to have

HPC clusters
large-scale AI/ML platforms
job schedulers
Slurm
Kubernetes
open source component maintainer

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology—and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

We are looking for a Senior Software Engineer to join our mission to continue improving our HPC infrastructure. Our team builds and operates sophisticated infrastructure to enable business-critical services and AI applications. You will work with a team of passionate and skilled engineers who continuously strive to provide better tools for building and managing this infrastructure. The ideal candidate is strong in software development, skilled at crafting and building reliable distributed systems, and able to implement a well-thought-out, long-term maintenance strategy.

What you’ll be doing:

Apply modern distributed systems patterns to push the limits of scale, latency, and reliability.
Continuously improve infrastructure provisioning and operations with automation, APIs, and self‑service platforms.
Operate in a globally distributed, hybrid multi‑cloud environment (AWS, GCP, on‑prem), building systems that are cloud‑native and location‑agnostic.
Build strong cross-functional relationships and align with collaborators across various business units.
Improve uptime and Quality of Service (QoS) through data-driven operations, strong SLOs, and robust incident practices.
Participate in the team’s on‑call rotation and lead high‑impact incident response when needed.

What we need to see:

Strong coding skills in at least two of: Go, Java, C/C++, Scala, Python, Elixir, with a focus on backend, systems, or infrastructure engineering.
Deep understanding of scalability, consistency, and performance trade‑offs in server‑side systems; ability to build horizontally scalable, resilient, and low‑latency services.
Experience owning services end‑to‑end: architecture, build reviews, implementation, testing, rollout, observability, and iterative improvement.
Hands‑on experience with at least one major cloud provider (GCP, AWS, or Azure) and cloud‑native primitives (managed storage, messaging, compute).
Proficiency with modern CI/CD, GitOps workflows, and Infrastructure as Code practices for safe, repeatable changes.
Bias for action, strong problem‑solving skills, and a track record of simplifying complex systems.
B.S. in Computer Science or related field (or equivalent experience), with 5+ years of relevant experience.
Careful communication and collaboration skills; comfortable guiding technical decisions across teams.

Ways to stand out from the crowd:

Prior experience building core infrastructure or control planes for HPC clusters, large-scale AI/ML platforms, or systems managed by job schedulers (e.g., Slurm or Kubernetes).
Maintainer or co‑maintainer responsibilities for an open source component used in production (plugins, operators, exporters, controllers, or SDKs) at large scale.

#LI-Hybrid

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 152,000 USD - 241,500 USD.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until March 13, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.