Senior HPC Performance Engineer - AI for Science at Scale

NVIDIA · Semiconductors · Santa Clara, CA

Senior HPC Performance Engineer focused on optimizing large-scale, CUDA-backed ML training frameworks for AI for Science applications, particularly in digital biology and chemistry. The role spans kernel design, GPU porting, distributed learning, and algorithmic improvements within HPC software stacks.

What you'd actually do

  1. Design and implement computationally performant features for large-scale, CUDA-backed ML training frameworks, using low-level acceleration and scaling strategies such as kernel design, GPU porting, data-structure innovations, and distributed learning technologies
  2. Optimize the computational performance of a wide range of business-critical ML models via the accelerated hardware and software stack, as well as algorithmic improvements
  3. Develop and maintain the HPC software stack for atomistic modeling and generative machine learning in digital biology and beyond
  4. Collaborate with multiple HPC, AI infrastructure, and research teams
  5. Drive the testing and maintenance of the algorithms and software modules
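To make the distributed-learning responsibility concrete, the core pattern behind data-parallel training is: each worker computes a gradient on its data shard, the gradients are averaged (an all-reduce), and the averaged gradient updates the shared weights. A toy sketch in plain NumPy (illustrative only; function names and the linear-regression setup are invented for this example, not from the posting, and real frameworks would do this with NCCL/PyTorch on GPUs):

```python
import numpy as np

def local_gradient(w, X, y):
    """Gradient of mean squared error for a linear model y ≈ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def data_parallel_step(w, shards, lr=0.1):
    """One toy data-parallel SGD step: each 'worker' computes a gradient
    on its own shard, the gradients are averaged (a simulated all-reduce),
    and the averaged gradient is applied to the shared weights."""
    grads = [local_gradient(w, X, y) for X, y in shards]  # per-worker compute
    g = np.mean(grads, axis=0)                            # simulated all-reduce
    return w - lr * g

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w
shards = [(X[i::4], y[i::4]) for i in range(4)]  # split data across 4 "workers"

w = np.zeros(3)
for _ in range(200):
    w = data_parallel_step(w, shards)
```

Because the shards here are equal-sized, the averaged shard gradients equal the full-batch gradient, so the loop recovers `true_w`; the performance-engineering work in the role is about making the per-worker compute and the all-reduce fast at scale.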

Skills

Required

  • 5+ years of relevant experience
  • performance engineering
  • software design; building, packaging, and launching software products
  • acceleration
  • parallel programming in C++, Python
  • CUDA or OpenAI Triton
  • PyTorch, JAX, Warp
  • applying HPC solutions to research problems in biology or chemistry
  • atomistic simulations
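For context on the atomistic-simulations bullet: the classic inner loop such roles accelerate is a pairwise interatomic potential. A minimal NumPy sketch of the Lennard-Jones total energy (an illustration chosen by the editor, not taken from the posting); the O(N²) all-pairs form shown here is exactly the kind of hotspot that gets rewritten as a CUDA or Triton kernel with neighbor lists in practice:

```python
import numpy as np

def lennard_jones_energy(pos, epsilon=1.0, sigma=1.0):
    """Total Lennard-Jones energy summed over all unique atom pairs.
    pos: (N, 3) array of atomic coordinates."""
    diff = pos[:, None, :] - pos[None, :, :]   # (N, N, 3) pairwise displacements
    r2 = np.sum(diff * diff, axis=-1)          # squared pair distances
    iu = np.triu_indices(len(pos), k=1)        # each pair counted once
    inv_r6 = (sigma**2 / r2[iu]) ** 3          # (sigma/r)^6 per pair
    return float(np.sum(4.0 * epsilon * (inv_r6**2 - inv_r6)))

# Two atoms at the potential minimum r = 2^(1/6) * sigma give energy -epsilon
pos = np.array([[0.0, 0.0, 0.0], [2.0 ** (1.0 / 6.0), 0.0, 0.0]])
e = lennard_jones_energy(pos)  # ≈ -1.0
```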

Nice to have

  • Contributions to a major AI for Science codebase, including acceleration features such as new kernels
  • Familiarity with pioneering language and geometric models used in AI for Science applications in biology and chemistry

What the JD emphasized

  • performance engineering
  • acceleration
  • CUDA
  • PyTorch
  • JAX
  • Warp
  • kernels

Other signals

  • building the next generation of scientific machine learning (ML) frameworks
  • accelerate AI for Science and industries that depend on it
  • Design and implement computationally performant features for large scale, CUDA-backed ML training frameworks
  • Optimize computational performance of wide range of business-critical ML models