Senior Software Engineer, DGX Cloud Production Engineering

NVIDIA · Semiconductors · Santa Clara, CA +1 · Remote

Senior Software Engineer role focused on building and operating large-scale GPU infrastructure for AI workloads using Kubernetes. Responsibilities include developing automation, tooling, and operational systems for cluster lifecycle management, reliability, and scalability. Requires strong programming skills (Python/Go) and experience with Linux, Kubernetes, and distributed systems.

What you'd actually do

  1. Build and operate automation for large-scale GPU clusters across NVIDIA Cloud Partners (NCP) and on-prem environments.
  2. Develop tools and services for provisioning, validation, upgrades, monitoring, repair, and cluster lifecycle operations.
  3. Improve Day 0 / Day 1 / Day 2 workflows for cluster bring-up, handoff, and production operations.
  4. Reduce manual production touches through APIs, GitOps, automation, and agent-assisted workflows.
  5. Participate in on-call, incident response, debugging, and durable follow-up work.

Skills

Required

  • Python
  • Go
  • Linux
  • Kubernetes
  • containers
  • cloud infrastructure
  • infrastructure automation
  • distributed systems troubleshooting

Nice to have

  • GPU infrastructure
  • Kubernetes operators
  • GitOps
  • Terraform
  • ArgoCD
  • fleet automation
  • SLOs
  • on-call
  • incident response
  • observability
  • reliability practices
  • BMaaS
  • VMaaS
  • managed Kubernetes
  • multi-cloud infrastructure

What the JD emphasized

  • 8+ years of experience building or operating production infrastructure
  • Ability to troubleshoot distributed systems in production
  • Experience with GPU infrastructure
  • Experience with SLOs, on-call, incident response, observability, and reliability practices