Senior Systems Performance Engineer

NVIDIA · Semiconductors · Santa Clara, CA

Senior Systems Performance Engineer at NVIDIA focused on validating and optimizing GPU-accelerated computing products, specifically for Deep Learning/AI applications. The role involves system architecture, performance modeling, and developing stress and performance testing strategies for ML/LLM workloads.

What you'd actually do

  1. System architecture, design, performance modeling, and estimation across new models and new packages.
  2. Enable GPU SKU bring-up, validation, and model enablement.
  3. Develop system-level stress and performance testing strategies using industry-leading Deep Learning/AI applications.

Skills

Required

  • BSEE, BSCE, or equivalent experience
  • 5+ years of experience validating and debugging complex systems
  • Experience developing and running real-world ML/LLM workloads
  • Dynamo, TensorRT, Slurm, and BCM skills are mandatory
  • Proficiency in CUDA, cuBLAS, and CUTLASS
  • Deep understanding of computing architectures
  • Coding experience in Python and experience running simulators
  • Experience with datacenter products including system management, security, networking, and storage

Nice to have

  • Knowledge of vLLM and SGLang preferred
  • Background with x86/Arm server architectures and accelerated GPU computing
  • Track record of continuous process improvement with a passion for tools and automation

What the JD emphasized

  • Dynamo, TensorRT, Slurm, and BCM skills are mandatory
  • Experience developing and running real-world ML/LLM workloads
  • Proficiency in CUDA, cuBLAS, and CUTLASS

Other signals

  • Experience developing and running real-world ML/LLM workloads
  • Dynamo, TensorRT, Slurm, and BCM skills are mandatory
  • Proficiency in CUDA, cuBLAS, and CUTLASS
  • Deep understanding of computing architectures