Senior Software Engineer, DGX Cloud Production Engineering

NVIDIA · Semiconductors · Santa Clara, CA +1 · Remote

This Senior Software Engineer role builds and operates automation, tooling, and operational systems for large-scale GPU infrastructure that supports AI research and production workloads. The focus is on Kubernetes, cluster operations, reliability, and automation.

What you'd actually do

  1. Build and operate automation for large-scale GPU clusters across NVIDIA Cloud Partners (NCP) and on-prem environments.
  2. Develop tools and services for provisioning, validation, upgrades, monitoring, repair, and cluster lifecycle operations.
  3. Improve Day 0 / Day 1 / Day 2 workflows for cluster bring-up, handoff, and production operations.
  4. Reduce manual production touches through APIs, GitOps, automation, and agent-assisted workflows.
  5. Participate in on-call, incident response, debugging, and durable follow-up work.

Skills

Required

  • Python
  • Go
  • Linux
  • Kubernetes
  • containers
  • cloud infrastructure
  • infrastructure automation
  • distributed systems troubleshooting

Nice to have

  • GPU infrastructure
  • Kubernetes operators
  • GitOps
  • Terraform
  • ArgoCD
  • fleet automation
  • SLOs
  • on-call
  • incident response
  • observability
  • reliability practices
  • BMaaS
  • VMaaS
  • managed Kubernetes
  • multi-cloud infrastructure

What the JD emphasized

  • 8+ years of experience building or operating production infrastructure.
  • Ability to troubleshoot distributed systems in production.
  • Experience with GPU infrastructure, Kubernetes operators, GitOps, Terraform, ArgoCD, or fleet automation.
  • Experience with SLOs, on-call, incident response, observability, and reliability practices.