Senior Software Engineer, DGX Cloud Production Engineering

NVIDIA · Semiconductors · Santa Clara, CA +1 · Remote

Senior Software Engineer role focused on building and operating large-scale GPU infrastructure for AI workloads using Kubernetes. Responsibilities include developing automation, tooling, and operational systems for cluster lifecycle management, reliability, and scalability. Requires strong programming skills and experience with Linux, Kubernetes, and distributed systems.

What you'd actually do

  1. Build and operate automation for large-scale GPU clusters across NVIDIA Cloud Partners (NCP) and on-prem environments.
  2. Develop tools and services for provisioning, validation, upgrades, monitoring, repair, and cluster lifecycle operations.
  3. Improve Day 0 / Day 1 / Day 2 workflows for cluster bringup, handoff, and production operations.
  4. Reduce manual production touches through APIs, GitOps, automation, and agent-assisted workflows.
  5. Participate in on-call, incident response, debugging, and durable follow-up work.

Skills

Required

  • Python
  • Go
  • Linux
  • Kubernetes
  • containers
  • cloud infrastructure
  • infrastructure automation
  • troubleshooting distributed systems

Nice to have

  • GPU infrastructure
  • Kubernetes operators
  • GitOps
  • Terraform
  • ArgoCD
  • fleet automation
  • SLOs
  • on-call
  • incident response
  • observability
  • reliability practices
  • BMaaS
  • VMaaS
  • managed Kubernetes
  • multi-cloud infrastructure

What the JD emphasized

  • 8+ years of experience building or operating production infrastructure.
  • Ability to troubleshoot distributed systems in production.
  • Experience with GPU infrastructure, Kubernetes operators, GitOps, Terraform, ArgoCD, or fleet automation.
  • Experience with SLOs, on-call, incident response, observability, and reliability practices.