Senior HPC Solutions Architect

NVIDIA · Semiconductors · Santa Clara, CA +1 · Remote

This role focuses on supporting NVIDIA's AI factory deployments by assisting with the deployment, debugging, and optimization of AI workloads on NVIDIA platforms. The Senior HPC Solutions Architect will work with customers to identify and resolve cluster performance and stability issues, benchmark framework features, and guide customers in scaling workloads on NVIDIA GPUs. The role requires strong networking and system-level understanding, with experience in large-scale training workloads and parallel applications.

What you'd actually do

  1. Assisting with deployment, debugging, and improving the efficiency of AI workloads on large-scale NVIDIA platforms.
  2. Identifying hardware issues, tracking them through bug reports, and keeping customers updated on current progress.
  3. Benchmarking new framework features, analyzing performance, and sharing actionable insights with both customers and internal teams.
  4. Working directly with external customers/partners to solve cluster performance and stability issues, identify bottlenecks, and implement effective solutions.
  5. Building expertise and guiding customers in scaling workloads efficiently and reliably on the latest generation of NVIDIA GPUs.

Skills

Required

  • BS/MS/PhD in Electrical/Computer Engineering, Computer Science, Physics, or other Engineering fields, or equivalent experience.
  • 10+ years of experience in designing, managing, and supporting large-scale hybrid networks.
  • Strong programming skills in at least one of the following languages: C, C++, or Python.
  • Practical experience identifying and resolving bottlenecks in large-scale training workloads or parallel applications.
  • Proven understanding of CPU and GPU architectures, CUDA, parallel filesystems, and high-speed interconnects.
  • Experience working with large compute clusters, with an understanding of their internal scheduling and resource management mechanisms (e.g., SLURM or cloud-based clusters).
  • System-level understanding of server/rack-level architecture, BMC, PCIe devices, Network Adapters, Linux OS, and kernel drivers.
  • Excellent communication and liaison skills to work with customers, partners, and internal functions.

Nice to have

  • Systems engineering, coding, and debugging skills, including experience with C/C++, the Linux kernel, and drivers.
  • Hands-on experience with NVIDIA systems/SDKs (e.g., CUDA), NVIDIA Networking technologies (e.g., DPU, RoCE, InfiniBand), and/or ARM CPU solutions.
  • Hands-on experience in the Linux environment and software-defined networking.
  • Experience with system board architectures and familiarity with x86, 64-bit, and low-level hardware programming.

What the JD emphasized

  • large-scale networking projects
  • large-scale training workloads
  • large compute clusters

Other signals

  • AI factory deployments
  • deploying AI workloads
  • large-scale training workloads