Networking Solution Test Engineer - AI IB and Ethernet Cluster Debugging

NVIDIA · Semiconductors · Shanghai, China +1

NVIDIA is seeking a Networking Solution Test Engineer to join their End-to-End Verification team. The role involves working on cutting-edge Ethernet-based AI clusters, debugging complex issues across hardware, system software, and AI workloads. Responsibilities include designing test requirements, building testbeds, owning end-to-end cluster troubleshooting, collaborating with development teams on networking components, defining tests for automation, running regression and performance tests, and profiling deep learning workloads.

What you'd actually do

  1. Design and review test and product requirements across the InfiniBand / Ethernet / NIC / DPU / Switch portfolio, focusing on large‑scale AI cluster behavior.
  2. Build and maintain realistic customer‑like testbeds, including heterogeneous hardware, OS / driver combinations and complex network fabrics.
  3. Own end‑to‑end cluster troubleshooting: reproduce customer scenarios, triage across the stack and drive issues to root cause and fix.
  4. Collaborate closely with development teams to debug NCCL, RoCE/RDMA and related networking components using logs, code inspection and targeted experiments.
  5. Profile and benchmark deep learning training and inference workloads, correlating model‑level metrics with system and network telemetry to uncover bottlenecks.

Skills

Required

  • networking or system-level testing and debugging on Linux
  • Linux networking and debugging skills (for example perf, tcpdump, ethtool, iproute2)
  • production-grade debugging experience
  • host-side NIC validation and tuning
  • AI networking libraries (such as NCCL) and protocols (such as RoCE and RDMA)
  • reading and reasoning about source code (C/C++/Python or similar)
  • scripting and automation skills with Bash / Python / Ansible
  • familiarity with modern AI tools and workflows
  • analytical, problem-solving and communication skills
  • ownership and a collaborative mindset

Nice to have

  • debugging of collective communication libraries (for example NCCL)
  • debugging large-scale LLM training / inference clusters
  • tuning and debugging congestion control and lossless Ethernet for AI workloads
  • NVIDIA networking technologies
  • debugging issues that span multiple layers
  • contributing to open-source networking / AI systems

What the JD emphasized

  • large-scale AI cluster behavior
  • end-to-end cluster troubleshooting
  • debug NCCL, RoCE/RDMA
  • deep learning training and inference workloads
  • production-grade debugging experience
  • AI networking libraries
  • large-scale LLM training / inference clusters

Other signals

  • AI cluster behavior
  • AI workloads
  • deep learning training and inference workloads
  • NCCL
  • RoCE/RDMA