Senior Network Architect

NVIDIA · Semiconductors · Santa Clara, CA

This role focuses on the architecture, design, and deployment of ultra-high-speed, resilient, and scalable network interconnects for GPU-accelerated data centers and compute clusters, specifically supporting AI/ML training and inference environments. It requires expertise in networking technologies, automation, and security standards.

What you'd actually do

  1. Lead the architecture, design, and deployment of global‑scale backbone and data center fabrics that serve large fleets of CPU‑based compute, storage, and GPU/HPC clusters.
  2. Design high‑performance DC fabrics using InfiniBand and high‑throughput Ethernet (RoCE and traditional IP) to support both general compute workloads and GPU‑dense AI/ML training and inference environments.
  3. Engineer and optimize carrier interconnects, metro and long‑haul backbone, and dark‑fiber systems to provide low‑latency, loss‑minimal connectivity between regions, super labs, and data centers.
  4. Partner with systems, OS, GPU, storage, and HPC platform teams to deliver scalable, highly available network architectures that can evolve with rapid growth in both compute and GPU capacity.
  5. Implement and refine network monitoring, rich telemetry, and performance‑engineering practices across fabrics and backbone to detect issues early and continually improve end‑to‑end application experience.

Skills

Required

  • MS or PhD in Electrical Engineering, Computer Science, Computer Engineering, Artificial Intelligence, Data Science, Mathematics, Statistics, or equivalent experience
  • 12+ years of experience building, managing, and supporting large-scale hybrid networks
  • Experience developing automation pipelines with Python, Ruby, Go, or other languages used in infrastructure automation
  • Expertise in networking technologies: TCP/UDP, IPv4/IPv6, BGP/MP-BGP, VPN, L2 switching, EVPN, VXLAN, Segment Routing, MPLS, IS-IS, DWDM
  • Experience automating SDN/NFV/NFVI infrastructure

What the JD emphasized

  • ultra-high-speed
  • resilient
  • scalable interconnects
  • GPU-accelerated data centers
  • AI/ML training and inference environments
  • low-latency
  • loss-minimal connectivity
  • rapid growth
  • security, compliance, and reliability standards