Senior Technical Program Manager, Cloud Infrastructure, Observability and Systems Monitoring

NVIDIA · Semiconductors · Santa Clara, CA +1 · Remote

Senior Technical Program Manager for NVIDIA DGX Cloud, focused on observability, systems monitoring, and cloud infrastructure operations for AI infrastructure.

What you'd actually do

  1. Establishing a balanced feedback loop between DGX Cloud and other organizations at NVIDIA to align and unify telemetry requirements and guidance for external partners.
  2. Driving the end-to-end telemetry lifecycle for upcoming NVIDIA Cloud Providers (NCPs). This includes ensuring that requirements are committed and that the resulting telemetry is delivered and ingested into a centralized telemetry platform to enable DGX Cloud (DGXC) operations.
  3. Participating in the early product lifecycle (Day -1 / Day 0) to examine the Plan of Record (POR) for new silicon, systems, firmware, and software architectures (for example, VR), so that telemetry requirements for advanced tenants are coordinated.
  4. Collaborating across technical domains (including NVLink, InfiniBand, Spectrum-X, GPU, CPU, and DPU) to establish standard telemetry operations guidance across NVIDIA.
  5. Driving a program for NVIDIA’s attestation platform to verify device integrity, authenticity, and trust across the accelerated computing ecosystem.

Skills

Required

  • 12+ years of technical program management experience
  • driving the planning and execution of large-scale engineering, cloud infrastructure, and observability programs
  • managing cloud infrastructure
  • acting as the interface between cross-functional organizations
  • managing complex feedback loops
  • resolving misaligned requirements
  • Expert-level proficiency with Jira, Smartsheet, or similar program management tools
  • guiding engineering teams on the effective use of these tools within an Agile framework
  • strategic and tactical thinking abilities
  • building consensus
  • driving program success across diverse business units
  • communication and technical presentation skills
  • BS or MS in Electrical Engineering or Computer Science, or equivalent experience

Nice to have

  • Comprehensive knowledge of NVIDIA architectures and interconnects
  • deployment, bring-up, and telemetry requirements for GPUs, NVLink, and InfiniBand
  • OpenTelemetry (OTel)
  • Grafana
  • WarpStream
  • VictoriaMetrics
  • Loki
  • cloud platform architecture
  • cloud-native services
  • Kubernetes
  • enthusiastic, upbeat, responsive, and passionate individual
  • actively identifies process improvement opportunities
  • guides teams through ambiguity

What the JD emphasized

  • telemetry requirements
  • telemetry
  • AI infrastructure