Principal Software Engineer, DGX Cloud Production Engineering

NVIDIA · Semiconductors · Santa Clara, CA +1 · Remote

NVIDIA is seeking Principal Software Engineers to lead the technical direction for production engineering, automation, and reliability of large-scale GPU infrastructure, focusing on Kubernetes-based operations and distributed systems.

What you'd actually do

  1. Define and execute the technical strategy for DGX Cloud cluster operations, building the automation, GitOps, and Day 2 reliability needed to operate large-scale GPU clusters across NVIDIA Cloud Partners (NCPs) and on-prem environments.
  2. Lead design and implementation of systems for cluster lifecycle, validation, repair, upgrades, observability, and readiness.
  3. Establish patterns for Kubernetes-based GPU cluster operations across partner and on-prem environments.
  4. Identify and eliminate operational toil through software, APIs, automation, and agent-assisted workflows.
  5. Set technical standards for production readiness, SLOs, incident response, handoff gates, and operational acceptance.

Skills

Required

  • Kubernetes
  • Linux
  • infrastructure automation
  • production operations
  • Go
  • Python
  • distributed systems
  • SLOs
  • observability
  • incident response

Nice to have

  • GPU clusters
  • AI/ML infrastructure
  • Kubernetes operators
  • GitOps
  • BMaaS/VMaaS
  • managed Kubernetes
  • multi-cloud fleet operations
  • internal platforms
  • control planes
  • lifecycle automation
  • production readiness frameworks

What the JD emphasized

  • 15+ years of experience building and operating large-scale distributed systems or cloud infrastructure
  • Deep experience with Kubernetes, Linux, infrastructure automation, and production operations
  • Strong programming experience in Go, Python, or similar
  • Proven ability to lead complex cross-org technical initiatives
  • Experience designing reliable systems with clear SLOs, observability, incident response, and automation