Senior Manager, GPU Cloud Infrastructure - GeForce NOW

NVIDIA · Semiconductors · Santa Clara, CA

Senior Manager to lead the design, scaling, and operation of high-performance networking for GPU-based cloud infrastructure, which is critical for cloud gaming, AI/ML training, and inference platforms.

What you'd actually do

  1. Build and mentor a specialized team of network architects focused on high-performance GPU infrastructure.
  2. Oversee the design of intra-cluster and inter-cluster connectivity, utilizing RoCE, Ethernet-based AI fabrics, and high-bandwidth data center interconnects.
  3. Drive technical tuning to reduce latency and jitter and to increase throughput, while implementing congestion control and packet-loss mitigation strategies.
  4. Define the roadmap for networking strategies that support gaming, AI/ML training, and real-time inference at scale.
  5. Engage with ISPs to optimize low-latency edge networks and ensure a seamless connection from our data centers to end clients.

Skills

Required

  • 12+ years of proven overall experience in networking, cloud infrastructure, or distributed systems
  • 5+ years of experience directly managing technical teams
  • Mastery of data center networking, including Clos/spine-leaf architectures and high-performance fabric technologies such as RDMA, RoCE, or InfiniBand
  • Hands-on experience with BGP, EVPN/VXLAN, and kernel-level development for routing and switching
  • Skilled in using Ansible or Terraform for infrastructure automation, paired with monitoring tools like Prometheus and Grafana
  • Practical experience designing for large-scale configurations using SR-IOV, Xen virtualization, or Open vSwitch
  • Bachelor’s or Master’s degree in Computer Science or a related engineering field (or equivalent experience)
  • Ability to ensure all infrastructure meets rigorous internal policies and regulatory standards like GDPR

Nice to have

  • Proven success managing networking for large-scale GPU clusters or hyperscale cloud environments
  • Familiarity with optical networking and high-speed interconnects reaching 400G or 800G
  • Experience debugging and improving code for Mellanox/Cumulus Linux, or managing Palo Alto and NetScaler appliances
  • A strong grasp of streaming telemetry and operational signals (SNMP, Syslog) to proactively resolve complex architectural bottlenecks
  • Relevant top-tier certifications, such as CCIE or specialized cloud networking designations

What the JD emphasized

  • high-performance GPU infrastructure
  • AI/ML training
  • real-time inference
  • large-scale configurations
  • rigorous internal policies and regulatory standards

Other signals

  • GPU Cloud Infrastructure
  • AI/ML training and inference platforms
  • ultra-low-latency, high-throughput, and highly reliable interconnects