Principal AI and ML Infra Software Engineer, GPU Clusters

NVIDIA · Semiconductors · Santa Clara, CA +1

This role focuses on improving the efficiency of AI and ML research on GPU clusters by collaborating with researchers to identify and address infrastructure deficiencies. The engineer will optimize performance, monitor resource utilization, and contribute to the AI/ML infrastructure ecosystem while staying up to date with the latest AI/ML technologies.

What you'd actually do

  1. Engage closely with our AI and ML research teams to discern their infrastructure requirements and barriers, converting those insights into actionable improvements.
  2. Proactively identify bottlenecks in researcher efficiency and lead initiatives to systematically remove them. Drive the direction and long-term roadmaps for such initiatives.
  3. Monitor and optimize the performance of our infrastructure, ensuring high availability, scalability, and efficient resource utilization.
  4. Help define and improve important measures of AI researcher efficiency, ensuring that our actions are in line with measurable results.
  5. Work closely with a variety of teams, such as researchers, data engineers, and DevOps professionals, to develop a cohesive AI/ML infrastructure ecosystem.

Skills

Required

  • BS or similar background in Computer Science or related area (or equivalent experience)
  • 15+ years of demonstrated expertise in AI/ML and HPC tasks and systems
  • Hands-on experience in using or operating High Performance Computing (HPC) grade infrastructure
  • In-depth knowledge of accelerated computing (e.g., GPU, custom silicon), storage (e.g., Lustre, GPFS, BeeGFS), scheduling & orchestration (e.g., Slurm, Kubernetes, LSF), high-speed networking (e.g., InfiniBand, RoCE, Amazon EFA), and container technologies (Docker, Enroot)
  • Ability to supervise and improve large-scale distributed training operations using PyTorch (DDP, FSDP), NeMo, or JAX
  • In-depth understanding of AI/ML workflows, including data processing, model training, and inference pipelines
  • Proficiency in programming & scripting languages such as Python, Go, Bash
  • Familiarity with cloud computing platforms (e.g., AWS, GCP, Azure)
  • Experience with parallel computing frameworks and paradigms
  • Excellent communication and collaboration skills

Nice to have

  • Dedication to ongoing learning and staying updated on new technologies and innovative methods in the AI/ML infrastructure sector

What the JD emphasized

  • 15+ years of demonstrated expertise in AI/ML and HPC tasks and systems
  • Hands-on experience in using or operating High Performance Computing (HPC) grade infrastructure
  • In-depth knowledge of accelerated computing (e.g., GPU, custom silicon), storage (e.g., Lustre, GPFS, BeeGFS), scheduling & orchestration (e.g., Slurm, Kubernetes, LSF), high-speed networking (e.g., InfiniBand, RoCE, Amazon EFA), and container technologies (Docker, Enroot)
  • Ability to supervise and improve large-scale distributed training operations using PyTorch (DDP, FSDP), NeMo, or JAX
  • In-depth understanding of AI/ML workflows, including data processing, model training, and inference pipelines

Other signals

  • Enhancing efficiency for researchers
  • Implementing improvements across the entire stack
  • Pinpointing and addressing infrastructure deficiencies
  • Groundbreaking AI and ML research on GPU clusters
  • Crafting powerful, effective, and scalable solutions