Engineering Manager, DGX Cloud Production Engineering

NVIDIA · Semiconductors · CA +2 · Remote

Engineering Manager to lead a team of software and production engineers focused on Kubernetes-based operations, automation, reliability, and cluster lifecycle tooling for NVIDIA DGX Cloud infrastructure.

What you'd actually do

  1. Lead a team of software and production engineers building and operating DGX Cloud infrastructure across NVIDIA Cloud Partner (NCP) and on-prem environments.
  2. Drive execution across cluster operations, Kubernetes operability, automation, GitOps, observability, and incident response.
  3. Help define team priorities, roadmap, staffing, and operational ownership.
  4. Partner with platform, workload, storage, networking, security, and TPM teams to improve production readiness.
  5. Build a healthy on-call and incident review culture focused on learning, ownership, and durable fixes.

Skills

Required

  • 8+ years of industry experience
  • 2+ years leading or managing engineers
  • building or operating production infrastructure
  • cloud platforms
  • Kubernetes environments
  • distributed systems
  • reliability engineering
  • automation
  • observability
  • incident response
  • operational excellence
  • cross-team collaboration
  • communication
  • prioritization
  • judgment
  • BS/MS in Computer Science or equivalent experience

Nice to have

  • SRE
  • production engineering
  • infrastructure automation
  • platform teams
  • GPU infrastructure
  • Kubernetes fleet operations
  • GitOps
  • BMaaS/VMaaS
  • managed Kubernetes
  • multi-cloud environments
  • reducing toil
  • improving SLOs
  • software-driven systems

What the JD emphasized

  • 8+ years of overall industry experience
  • 2+ years leading or managing engineers
  • Experience building or operating production infrastructure
  • cloud platforms
  • Kubernetes environments
  • distributed systems
  • reliability engineering
  • automation
  • observability
  • incident response
  • operational excellence
  • GPU infrastructure
  • Kubernetes fleet operations
  • GitOps
  • reducing toil
  • improving SLOs
  • turning operational work into software-driven systems