Senior Manager, DGX Cloud Technical Program Management

NVIDIA · Semiconductors · Santa Clara, CA +1

NVIDIA is hiring a Senior Manager, Technical Program Management, to lead core infrastructure programs (network, storage, trust services, security, break/fix, telemetry) for DGX Cloud. The role involves managing a team of TPMs, driving operational rigor, and ensuring infrastructure resilience and scalability. It requires extensive experience in technical program management, infrastructure programs, and managing TPM teams, along with a strong understanding of cloud infrastructure and distributed systems. Experience supporting AI/ML platforms is a plus.

What you'd actually do

  1. Lead and nurture a team of Technical Program Managers engaged in DGX Cloud core infrastructure projects.
  2. Drive progress across network, storage, trust services, security programs, telemetry, and break/fix operational workstreams.
  3. Partner with engineering, product, operations, security, and cloud provider teams to define priorities, milestones, dependencies, and delivery plans.
  4. Build clear operating rhythms for infrastructure planning, managing blocking issues, risk tracking, and cross-functional decision-making.
  5. Improve visibility into infrastructure health, delivery status, blockers, and program risks through practical metrics, dashboards, and reporting.

Skills

Required

  • technical program management
  • infrastructure program management
  • directing or supervising TPMs
  • managing infrastructure programs (networking, storage, security, trust services, observability, telemetry, or cloud operations)
  • managing priorities, dependencies, risks, and execution plans
  • building TPM operating rhythms
  • cloud infrastructure
  • distributed systems
  • large-scale platform operations
  • communication skills
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or related field, or equivalent experience

Nice to have

  • supporting infrastructure for AI/ML platforms
  • GPU clusters
  • large-scale cloud services
  • observability and telemetry tools (Grafana, Prometheus, or similar)
  • security, trust, compliance, or reliability programs in cloud infrastructure environments
  • improving operational processes for break/fix, incident response, or infrastructure readiness
  • strong technical judgment
  • partnering closely with engineering leaders
  • developing TPM talent

What the JD emphasized

  • upwards of 3 years directing or supervising TPMs
  • managing infrastructure programs
  • managing priorities, dependencies, risks, and execution plans
  • building TPM operating rhythms
  • cloud infrastructure
  • large-scale platform operations
  • supporting infrastructure for AI/ML platforms
  • observability and telemetry tools
  • security, trust, compliance, or reliability programs in cloud infrastructure environments
  • operational processes for break/fix, incident response, or infrastructure readiness