Distinguished Software Architect - Deep Learning and HPC Communications

NVIDIA · Semiconductors · Santa Clara, CA

Distinguished Software Architect role focused on designing and researching next-generation communication libraries and platforms for Deep Learning and High Performance Computing at NVIDIA. The role involves co-designing HW/SW solutions with GPU, Networking, and SW architects, driving adoption of new communication technologies, and keeping up with DL research. Requires deep expertise in HPC, parallel programming, communication runtimes, system/GPU architecture, and networking, with strong programming skills in C/C++.

What you'd actually do

  1. Research new communication technologies (e.g. expand the GPUDirect technology portfolio) and design new features for our communication libraries
  2. Propose innovative solutions in HW and SW for our next-gen platforms. You will co-design these solutions with the GPU, Networking, and SW architects and ensure seamless integration with the software stacks
  3. Inspire changes based on quantitative data coming from proof-of-concepts or detailed technical analysis/modeling
  4. Drive the adoption of new communication technologies across application verticals
  5. Keep up with the latest DL research and collaborate with diverse internal and external teams, including DL researchers and customers

Skills

Required

  • HPC
  • parallel programming models (MPI, SHMEM)
  • communication runtimes (MPI, NCCL, NVSHMEM, OpenSHMEM, UCX, UCC)
  • computer and system architecture
  • GPU architecture
  • CUDA
  • high performance networking (InfiniBand, Ethernet)
  • network design
  • network topologies
  • network debug and performance analysis
  • ML/DL fundamentals
  • parallel algorithms
  • fault tolerance and resiliency
  • performance analysis and optimizations for parallel applications on large clusters
  • developing applications using DL Frameworks (PyTorch, TensorFlow)
  • C or C++ for systems software development

Nice to have

  • Industry-recognized leader in HPC/DL communications with a history of patents, publications, conference talks, and keynotes
  • Influential role in industry standards (e.g. MPI, OpenSHMEM) and open source software (e.g. PyTorch, UCX, Open MPI)

What the JD emphasized

  • PhD in Computer Science, Computer Engineering, or a related field, or strong equivalent experience; 15+ years of relevant experience in academia or industry
  • Expert in the following areas: HPC, parallel programming models (MPI, SHMEM), at least one communication runtime (MPI, NCCL, NVSHMEM, OpenSHMEM, UCX, UCC), computer and system architecture, GPU architecture, and CUDA
  • Deep understanding of various aspects of high performance networking from prior work experience: network technologies (InfiniBand, Ethernet), network design, network topologies, network debug and performance analysis
  • Strong in at least a few of these areas: ML/DL fundamentals and how they tie to communications, parallel algorithms, fault tolerance and resiliency, competitive assessments, performance analysis and optimizations for parallel applications on large clusters, developing applications using DL Frameworks (PyTorch, TensorFlow)

Other signals

  • communication performance between GPUs has a direct impact on the end-to-end application performance
  • push the limits on the state-of-the-art
  • next generation data center platforms
  • Deep understanding of various aspects of high performance networking
  • ML/DL fundamentals and how they tie to communications