Senior Data Center Performance Engineer - Benchmarking and Optimization

NVIDIA · Semiconductors · Santa Clara, CA +1 · Remote

Senior Data Center Performance Engineer at NVIDIA focused on benchmarking and optimizing data center platforms for AI training, inference, and HPC workloads. Responsibilities include designing benchmarks, characterizing workloads, identifying bottlenecks, and driving performance improvements through system tuning and architectural recommendations.

What you'd actually do

  1. Design and execute comprehensive performance benchmarking strategies for our data center platforms and products
  2. Characterize real-world AI training, inference, and HPC workloads at scale
  3. Define, track, and report key performance indicators (throughput, latency, efficiency, scaling)
  4. Build automation tools and frameworks for performance monitoring and analysis
  5. Identify and analyze performance bottlenecks across compute, memory, network, and storage subsystems

Skills

Required

  • M.S. or Ph.D. in Computer Science, Electrical Engineering, or a related field (or equivalent experience)
  • 8+ years of experience in performance engineering or system architecture
  • Deep understanding of computer architecture, hardware-software interaction, and computing at scale
  • Strong proficiency in performance profiling tools (Linux perf, NVIDIA Nsight Systems)
  • Familiarity with GPU computing and parallel programming (CUDA)
  • Background in HPC networking technologies (InfiniBand, RoCE, NVLink)
  • Programming skills in Python, C++, and shell scripting
  • Excellent analytical and problem-solving abilities

Nice to have

  • Experience with AI/ML frameworks (PyTorch, TensorFlow, JAX)
  • Knowledge of MPI, collective communications (NCCL), and distributed training and inference
  • Familiarity with NVIDIA DGX and HGX platforms and other data center solutions
  • Familiarity with containers, cloud provisioning, and scheduling tools (Docker, Kubernetes, SLURM)
  • Understanding of storage systems and I/O performance
  • Track record of performance optimization in production environments
  • Experience with AI code generation tools

What the JD emphasized

  • performance engineering
  • system architecture
  • performance profiling tools
  • GPU computing
  • performance optimization

Other signals

  • performance benchmarking
  • AI training
  • inference
  • HPC workloads
  • GPU computing