Manager, AI Networking Performance Research and Analysis

NVIDIA · Semiconductors · Yokneam, Israel

This role manages AI networking performance research and analysis at NVIDIA, optimizing networking technologies (NICs, switches) for AI workloads such as LLM training and inference. It covers end-to-end performance strategy, from pre-silicon modeling to general availability (GA), and building telemetry frameworks and dashboards for performance tracking and root cause analysis. It requires strong experience in high-performance networking, cluster performance, and managing engineering teams, along with proficiency in Python, Bash, and C/C++.

What you'd actually do

  1. Lead performance research and evaluation of advanced networking technologies supporting AI workloads, including LLM training and inference at supercomputing scale.
  2. Define end-to-end performance test plans and methodology for next-generation networking hardware and technologies, including performance expectations and target KPIs.
  3. Drive benchmarking, profiling, reporting, and deep performance characterization of networking workloads and offload features.
  4. Collaborate closely with simulation, architecture, chip-design, firmware, and software teams to assess performance tradeoffs and identify bottlenecks.
  5. Perform deep root cause analysis (RCA) for performance gaps and stability issues, and drive cross-team mitigation plans.

Skills

Required

  • B.Sc. in Computer Science or Software Engineering
  • 5+ years of experience with high-performance Networking technologies (RDMA, Storage, Security, OVS, MPI)
  • 3+ years as an engineering team manager
  • Demonstrated Performance Analysis skills and methodologies
  • Experience with cluster-level performance, telemetry, NICs, DPUs, switches, and GPUs
  • Fast, self-directed learner with strong analytical and problem-solving skills
  • Python
  • Bash
  • C/C++
  • Linux distributions
  • Team player and leader with good communication and interpersonal skills

Nice to have

  • Deep system-level architecture knowledge (Intel / AMD / ARM CPUs, NVIDIA GPUs, HCA/DPU architecture, memory subsystems, PCIe, storage, NVLink)
  • Strong expertise in RDMA networking performance and AI communication stacks (e.g., NCCL)
  • Proven experience analysing AI workload communication patterns and benchmarking distributed LLM training workloads at scale
  • Experience designing telemetry frameworks, monitoring pipelines, and performance dashboards for large clusters
  • Familiarity with modern AI tooling including performance-driven agents, automation pipelines, and RAG-based applications

What the JD emphasized

  • 5+ years of experience with high-performance Networking technologies
  • 3+ years as an engineering team manager
  • Demonstrated Performance Analysis skills and methodologies
  • Experience with cluster-level performance, telemetry, NICs, DPUs, switches, and GPUs

Other signals

  • AI networking cluster-level performance for AI workloads, distributed training, and inference jobs
  • Performance strategy and execution for next-generation NVIDIA NIC, switch, and networking technologies
  • Scalable telemetry frameworks, performance dashboards, and job-level monitoring solutions