Senior Technical Program Manager, DGX Cloud Software Products and Services

NVIDIA · Semiconductors · Santa Clara, CA

This role is for a Senior Technical Program Manager at NVIDIA, focused on DGX Cloud software products and services. The TPM will lead programs to enhance resilience, reliability, and operational scale for AI training and inference environments. Responsibilities include driving improvements in service stability, defining resilience strategies, and building tooling for observability and recovery. The role requires collaboration with engineering, SRE, and operations teams and with researchers, with a strong emphasis on data-driven insights and metrics.

What you'd actually do

  1. Lead cross-functional programs that improve resilience, reliability, operational scale, and fleet-wide goodput across DGX Cloud.
  2. Partner across infrastructure, platform, site reliability, operational, and tenant teams to identify systemic risks, resolve cross-stack dependencies, and improve end-to-end service stability.
  3. Drive the definition and adoption of resilience reference stacks, operational standards, and scalable guidelines that strengthen service readiness and recovery.
  4. Partner with engineering teams and researchers to support the development and delivery of open, modular software components for resilience, facilitating reusable and extensible capabilities across the platform.
  5. Build and scale resilience tooling and operational mechanisms that improve observability, failure detection and attribution, root cause analysis, recovery orchestration, and operational readiness.

Skills

Required

  • program management of large-scale software or infrastructure projects
  • leading complex cross-functional programs in cloud, infrastructure, distributed systems, or platform environments
  • analytical skills
  • assess issues across infrastructure, software, and operational layers
  • organizational skills
  • project management tools (e.g., Jira, Aha!, Confluence)
  • distributed version control systems (e.g., Git)
  • reliability engineering
  • resilience development
  • service performance metrics
  • goodput, efficiency, and utilization
  • working alongside engineering, SRE, operations, and technical collaborators
  • ambiguous, high-complexity environments
  • communication and presentation skills
  • problem-solving
  • conflict management skills

Nice to have

  • computer science, machine learning, deep learning, open-source software, GPU technology, AI infrastructure, or large-scale compute platforms
  • large-scale AI training environments (e.g., distributed training frameworks, checkpointing, NCCL, Slurm or other schedulers)
  • management of customer workflows using large-scale distributed computing
  • working with AI researchers or directly training and evaluating AI models
  • harnessing AI-enabled workflows and tools to improve program management efficiency, decision-making, execution visibility, and operational efficiency

What the JD emphasized

  • resilience
  • reliability
  • operational scale
  • service stability
  • fault-tolerant
  • high-availability
  • training and inference environments at scale
  • scalable resilience strategies
  • operational performance
  • resilience tooling and operational mechanisms
  • observability
  • failure detection and attribution
  • root cause analysis
  • recovery orchestration
  • operational readiness
  • goodput
  • usable fleet capacity
  • workload efficiency
  • customer outcomes at scale
  • program health
  • reliability posture
  • operational maturity
  • performance
  • large-scale software or infrastructure projects
  • complex cross-functional programs
  • cloud, infrastructure, distributed systems, or platform environments
  • infrastructure, software, and operational layers
  • reliability engineering
  • resilience development
  • service performance metrics
  • goodput, efficiency, and utilization
  • ambiguous, high-complexity environments
  • large-scale AI training environments
  • distributed training frameworks
  • checkpointing
  • NCCL
  • Slurm or other schedulers
  • customer workflows using large-scale distributed computing
  • training and evaluating AI models
  • AI-enabled workflows and tools