Senior System Software Engineer, AI Infrastructure

NVIDIA · Semiconductors · Santa Clara, CA

This is a Senior System Software Engineer role focused on AI infrastructure at NVIDIA. It covers running training and inference jobs on GPU clusters, designing benchmarks, profiling workloads, and providing optimization guidance. The role involves collaborating with various teams to improve NVIDIA's developer products and the customer experience with its AI platforms, SDKs, and libraries.

What you'd actually do

  1. Run multi‑node training/inference jobs on large GPU clusters to assess performance, validate usability, improve products, and create educational material for developers.
  2. Design benchmark suites that spotlight NVIDIA hardware, networking, and software stacks.
  3. Profile deep‑learning workloads, identify bottlenecks, and deliver optimization guidance.
  4. Produce concise tutorials, scripts, and whitepapers for customers and tech press.
  5. Analyze competitive solutions and craft data‑driven product positioning.

Skills

Required

  • 3+ years in software development, tech marketing, evangelism, or similar roles.
  • BS/MS in CS, CE, EE, or related field (or equivalent experience).
  • Strong Python and C++ skills for AI and HPC work.
  • Hands-on multi-node experience with Slurm, Kubernetes, or cloud service provider (CSP) clusters.
  • Solid grasp of DL architectures, PyTorch, and distributed training methods.
  • Understanding of CPU/GPU architecture plus CUDA, cuDNN, TensorRT-LLM, Triton, NCCL.
  • Excellent written and verbal communication for technical and executive audiences.

Nice to have

  • Hands-on experience setting up and tuning HPC clusters with Slurm, Kubernetes, or other schedulers.
  • Public technical blogs, talks, forum activity, or notable open-source projects, as well as prior work with customers or technical press on AI performance topics.
  • Exceptional communication skills that simplify complex technology for diverse audiences.
  • Familiarity with modern LLM architectures and ability to write Torch code and occasional custom GPU kernels.
  • Expertise in InfiniBand, NVLink, RoCE, RDMA, and collective communication libraries.

What the JD emphasized

  • multi-node training/inference jobs
  • GPU clusters
  • deep-learning workloads
  • optimization guidance
  • competitive solutions
  • product positioning
  • PyTorch
  • distributed training methods
  • CPU/GPU architecture
  • CUDA, cuDNN, TensorRT-LLM, Triton, NCCL
  • HPC clusters
  • LLM architectures
  • GPU kernels
  • InfiniBand, NVLink, RoCE, RDMA, collective-comm libraries

Other signals

  • GPU accelerated AI
  • AI Infrastructure
  • developer products
  • AI platforms, SDKs, libraries and AI frameworks
  • benchmark suites