Senior Systems Software Engineer, AI Stack and Performance - DGX Station

NVIDIA · Semiconductors · Santa Clara, CA +1 · Remote

Senior Systems Software Engineer focused on optimizing AI stack performance and readiness on NVIDIA's DGX Station, a workstation-class AI computer. The role involves profiling, identifying bottlenecks, and driving optimizations across the full stack from GPU kernels to applications, ensuring AI workloads like LLM inference and agents run efficiently in multi-GPU, multi-user configurations. Collaboration with framework, compiler, and GPU architecture teams is critical.

What you'd actually do

  1. AI Application Readiness: Own production readiness of AI applications on DGX Station—NemoClaw, Hermes agents, NIM microservices, and key customer workloads. Define “ready to ship” criteria, run validation, and close every gap between “it runs” and “it runs well” across single-GPU and multi-GPU configurations.
  2. DL Framework Performance: Work cross-functionally with other organizations to profile and optimize LLM and deep learning workloads (PyTorch, TensorFlow, JAX) across training and inference on the GB300 Blackwell multi-GPU architecture. Characterize performance across model sizes, batch sizes, precision modes (FP16, INT8, FP8), and GPU scaling (single-GPU vs. multi-GPU with NVLink) to establish benchmarks and identify regressions.
  3. System-Level Optimization: Identify bottlenecks in GPU compute, NVLink bandwidth, host memory, PCIe, and CPU–GPU communication. Implement or drive optimizations across the stack: kernel tuning, memory placement, NVLink utilization, data pipeline efficiency, and scheduling to increase throughput on DGX Station’s multi-GPU topology.
  4. Compiler & Kernel Collaboration: Work with NVIDIA’s framework, compiler (TensorRT, NVCC, Triton), and GPU architecture teams to improve kernel fusion, graph execution, operator scheduling, and memory management for Blackwell GPUs. Translate DGX Station’s platform-specific constraints and multi-GPU topology into actionable optimization requests for upstream teams.
  5. Multi-User & Concurrency: Validate multi-user and concurrent workload scenarios—multiple users running simultaneous training jobs, inference serving alongside development, and resource isolation via MIG or time-slicing. Ensure DGX Station performs reliably as a shared workstation.

Skills

Required

  • systems software engineering
  • AI/ML workload optimization
  • GPU performance analysis
  • deep learning infrastructure
  • PyTorch
  • TensorFlow
  • JAX
  • graph execution
  • operator dispatch
  • memory management
  • custom kernel integration
  • Nsight Systems
  • Nsight Compute
  • CUPTI
  • GPU architecture
  • NVLink
  • multi-GPU scaling
  • inference optimization
  • quantization
  • model compilation
  • batching strategies
  • serving frameworks
  • C++
  • CUDA
  • Python

Nice to have

  • optimizing LLM training or inference on multi-GPU NVIDIA systems
  • contributions to open-source AI frameworks, CUDA libraries, or inference engines
  • multi-GPU communication optimization
  • NCCL tuning
  • NVLink utilization
  • collective operations
  • parallel training strategies
  • collaborating with compiler and hardware architecture teams
  • kernel fusion
  • graph optimization
  • hardware-specific performance improvements

What the JD emphasized

  • AI applications like NemoClaw, LLM inference via NIM, Hermes agents, and deep learning frameworks must run production-ready out of the box
  • Own production readiness of AI applications on DGX Station
  • Ensure DGX Station delivers best-in-class performance for real AI workloads
  • Experience shipping AI-powered products where application performance on specific hardware was a hard shipping requirement

Other signals

  • AI Application Readiness
  • DL Framework Performance
  • System-Level Optimization
  • Compiler & Kernel Collaboration
  • Multi-User & Concurrency
  • Stack Validation
  • Benchmarking & Regression
  • Customer & Partner Alignment