Engineering Manager, DGX Cloud Production Engineering

NVIDIA · Semiconductors · CA +2 · Remote

Engineering Manager for NVIDIA DGX Cloud, leading a team focused on reliable, scalable GPU infrastructure operations, automation, and lifecycle tooling for Kubernetes-based environments.

What you'd actually do

  1. Lead a team of software and production engineers building and operating DGX Cloud infrastructure across NVIDIA Cloud Partner (NCP) and on-prem environments.
  2. Drive execution across cluster operations, Kubernetes operability, automation, GitOps, observability, and incident response.
  3. Help define team priorities, roadmap, staffing, and operational ownership.
  4. Partner with platform, workload, storage, networking, security, and TPM teams to improve production readiness.
  5. Build a healthy on-call and incident review culture focused on learning, ownership, and durable fixes.

Skills

Required

  • leading or managing engineers
  • production infrastructure
  • cloud platforms
  • Kubernetes
  • distributed systems
  • reliability engineering
  • automation
  • observability
  • incident response
  • operational excellence
  • cross-team collaboration
  • communication
  • prioritization
  • judgment

Nice to have

  • SRE
  • production engineering
  • infrastructure automation
  • platform teams
  • GPU infrastructure
  • Kubernetes fleet operations
  • GitOps
  • BMaaS/VMaaS
  • managed Kubernetes
  • multi-cloud environments
  • reducing toil
  • improving SLOs
  • software-driven systems

What the JD emphasized

  • 8+ years of overall industry experience, including 2+ years leading or managing engineers
  • Experience building or operating production infrastructure, cloud platforms, Kubernetes environments, or distributed systems
  • Strong understanding of reliability engineering, automation, observability, incident response, and operational excellence
  • Track record of reducing toil, improving SLOs, and turning operational work into software-driven systems