Senior Network Solution Architect – AI Fabrics

NVIDIA · Semiconductors · Santa Clara, CA +4 · Remote

This role focuses on architecting and deploying NVIDIA's AI networking platforms (e.g., Spectrum-X, BlueField DPU, InfiniBand/RoCE) in customer data centers, ensuring the performance and reliability of AI clusters. It requires deep technical expertise in networking protocols, system software, and hardware integration, with a strong emphasis on troubleshooting and customer-facing support.

What you'd actually do

  1. Partner with AI-native / consumer internet customers on large data center GPU and networking deployments. Guide architecture decisions across network, compute, and storage, including fabric design. Support on-site bring-up of server, network, and cluster infrastructure in customer data centers.
  2. Demonstrate expertise in advanced GPU and network systems (Spectrum-X, BlueField DPU, InfiniBand/RoCE, etc.) for key accounts. Run regular technical account reviews covering roadmap alignment, cluster issues, feature discussions, and new technology introductions. Capture customer-specific requirements and translate them into concrete feedback for product, architecture, and engineering teams.
  3. Analyze and debug configuration and performance issues in RoCE and InfiniBand environments. Work across NICs, switches, Linux, and system software to deliver performant, reliable AI clusters.
  4. Identify and shape new project opportunities for NVIDIA GPUs, networking, and software in AI and data center use cases. Collaborate closely with Systems Engineering, Product Management, and Sales to align solutions with customer outcomes. Build targeted POCs that showcase the value of NVIDIA’s networking stack (e.g., Spectrum-X fabrics, BlueField DPUs) in real customer environments.

Skills

Required

  • BS/MS/PhD in Electrical/Computer Engineering, Computer Science, or another engineering field, or equivalent experience
  • 6+ years of hands-on network engineering experience in data center or cloud environments
  • Proven, expert-level troubleshooting of data center networks (packet-level, control plane, and fabric behavior)
  • Deep protocol knowledge of BGP, OSPF, and L2/L3 switching in large-scale data center or cloud networks (ECMP, Clos/leaf–spine)
  • Experience with high-density switching in cloud or hyperscale environments is strongly preferred
  • Experience with InfiniBand or RoCE is a major plus
  • Solid understanding of CPU/GPU server architecture, NICs, Linux, system software, and kernel drivers
  • Strong time management and ability to context-switch across multiple customers and projects
  • Excellent written and verbal communication, including clear design docs, customer presentations, and root-cause summaries

Nice to have

  • Advanced certifications: CCIE, JNCIE, or equivalent expert-level certifications
  • Automation & tooling: Experience in Python, Bash, or C/C++ for automating network workflows, validation, and debug
  • NVIDIA platform experience: Hands-on work with NVIDIA GPUs, NICs, DPUs, or ARM-based CPU platforms
  • Customer-facing background: Pre-sales, post-sales, field engineering, or consulting experience with external enterprise or cloud customers
  • Large-scale deployments: Direct experience bringing up and operating large clusters or supercomputing environments
  • Virtualization / cloud: Familiarity with virtualization, containers, and cloud networking concepts

What the JD emphasized

  • Deep protocol knowledge
  • Proven, expert-level troubleshooting