Developer Technology Engineer - AI

NVIDIA · Semiconductors · Shanghai, China +2

The NVIDIA Developer Technology Engineer focuses on optimizing AI workloads, particularly large language models (LLMs), on NVIDIA's GPU platform. The role involves deep dives into application performance, GPU kernel optimization, distributed training and inference, and collaboration with internal teams and external developers. It requires strong software engineering skills, parallel programming expertise, and a focus on performance analysis and tuning.

What you'd actually do

  1. Working directly with key application developers to understand the current and future problems they are solving. You will build and optimize core parallel algorithms and data structures to deliver the most effective solutions using GPUs, through both library development and direct contribution to applications. This includes training and inference optimization for large language models (LLMs) and contributing to frameworks and open-source projects in the LLM ecosystem, such as Megatron, TensorRT-LLM, SGLang, and vLLM, among others.
  2. Collaborating closely with the architecture, research, libraries, tools, and system software teams at NVIDIA to influence the design of next-generation architectures, software platforms, and programming models. This includes investigating the impact on application performance and developer efficiency, and turning real-world developer feedback into actionable platform improvements.
  3. Engaging in deep optimization of high-performance operators, including but not limited to GPU kernel optimization, instruction-level tuning, and compiler optimization. These optimizations will directly support customers or be coordinated within computation libraries such as cuDNN, cuBLAS, and CUTLASS, and within open-source community libraries such as DeepGEMM, FlashMLA, FlashAttention, and FlashInfer, among others.
  4. Improving communication for a broad range of distributed LLM workloads. You will spearhead advancements in distributed training and inference by refining communication libraries (NCCL, NCCL GIN, NVSHMEM) and contributing to open-source communication libraries (such as DeepEP and NCCL EP). This demands in-depth study of interconnect topologies (NVLink) and network protocols (InfiniBand/RoCE) to design efficient data transfer strategies and methods for compute-communication overlap.

Skills

Required

  • C, C++, Python, or Fortran
  • Software development, programming techniques, and algorithms
  • Parallel programming and accelerated computing
  • Full-stack performance analysis and optimization for large language models and high-performance computing
  • Solid software engineering fundamentals and system architecture thinking

Nice to have

  • Master's or doctoral degree
  • GPU programming
  • Expertise ranging from operator-level through framework-level to algorithm-level optimization
  • Experience in distributed communication optimization
  • Strong mathematical fundamentals, including linear algebra and numerical methods

What the JD emphasized

  • training and inference optimization for large language models (LLMs)
  • distributed training and inference
  • GPU kernel optimization
  • distributed communication optimization

Other signals

  • LLM optimization
  • GPU kernel optimization
  • distributed training and inference