Manager, Next-gen AI Cluster Validation

NVIDIA · Semiconductors · Santa Clara, CA +1 · Remote

Manager to lead a team developing and validating next-generation NVIDIA AI supercomputing systems, integrating new compute, networking, storage, and software. Focus on building a platform for software development, automation, and performance engineering, and supporting large-scale deployments for AI and HPC.

What you'd actually do

  1. Lead a team developing next-generation system designs and integrating new compute, networking, storage, and software systems
  2. Build and support a platform for software development, systems automation, and performance engineering
  3. Develop tooling and documentation to support the development of large-scale supercomputing systems for AI and HPC both inside and outside NVIDIA
  4. Work closely with teams throughout the company on the cluster architecture, at-scale bringup, and integration of new technologies and products
  5. Collaborate closely with partners and customers to support deployment and validation of clusters based on NVIDIA reference architectures

Skills

Required

  • BS (Master's or PhD preferred) in Applied Science or Engineering (or equivalent experience)
  • 8+ years of overall experience in the high-performance computing or machine learning fields
  • 3+ years of technical leadership experience
  • Proficiency in software development and system automation with languages and tools such as Go, Python, or Ansible
  • Creative problem-solver with excellent teamwork and collaboration skills
  • Ability to work as part of a large, diverse team in a remote-friendly environment

Nice to have

  • Experience leading teams building HPC compute and storage systems in a research environment at large scale
  • Well-developed knowledge of deep learning applications, including multi-GPU and multi-node training and inference workloads
  • Expertise with high-performance datacenter networking such as InfiniBand and RoCE
  • Expertise with open-source monitoring technologies such as Prometheus and Grafana
  • Proven track record of growing and managing a team that encourages idea sharing, empowers team members, and provides opportunities for professional growth

What the JD emphasized

  • technical leadership experience
  • lead high-performing engineering teams
  • multi-GPU and multi-node training and inference workloads

Other signals

  • AI supercomputing systems
  • GPU computing
  • machine learning
  • HPC