Senior LLM Training Framework Engineer

NVIDIA · Semiconductors · Shanghai, China

NVIDIA is seeking a Senior LLM Training Framework Engineer to join the Megatron Core team, focusing on building and developing open-source frameworks for pretraining and post-training LLM and Multimodal foundation models. The role involves addressing AI training and inference challenges across the model lifecycle, advancing distributed training strategies, and optimizing performance on NVIDIA GPUs.

What you'd actually do

  1. Build and develop the open-source Megatron Core framework.
  2. Tackle large-scale AI training and inference challenges across the entire model lifecycle, including orchestration, data preprocessing, model training and tuning, and model deployment.
  3. Work at the intersection of AI applications, libraries, frameworks, and the entire software stack.
  4. Spearhead advancements in model architectures, distributed training strategies, and model-parallel techniques.
  5. Accelerate foundation model training and optimization with mixed-precision recipes and the latest NVIDIA GPU architectures (see the illustrative sketch after this list).
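
As a rough illustration of the kind of work described above (not Megatron Core's actual API), the sketch below shows a minimal mixed-precision, data-parallel training loop in PyTorch. The model, hyperparameters, and launch setup (single node via `torchrun`) are placeholder assumptions.

```python
# Minimal sketch: bf16 mixed precision + DistributedDataParallel in PyTorch.
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; NCCL handles gradient all-reduce.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Toy model and optimizer stand in for a real transformer stack.
    model = torch.nn.Linear(1024, 1024).cuda(rank)
    model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(8, 1024, device=rank)
        # bf16 autocast is the mixed-precision piece of the recipe.
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            loss = model(x).float().pow(2).mean()
        loss.backward()          # gradients are synchronized across ranks here
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```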

Skills

Required

  • MS, PhD, or equivalent experience in Computer Science, AI, Applied Math, or a related field, and 5+ years of industry experience.
  • Experience with AI training frameworks (e.g., PyTorch, JAX) and/or inference and deployment stacks (e.g., TensorRT-LLM, vLLM, SGLang).
  • Proficiency in distributed training.
  • Proficient in Python programming and software development, including debugging, performance analysis, writing tests, and documentation.
  • Strong understanding of AI/Deep-Learning fundamentals and their practical applications.

Nice to have

  • CUDA or collective-communication programming skills are a big plus.
  • Proven record of working effectively across multiple engineering initiatives and contributing innovations to AI libraries.
  • Experience with large-scale AI training and a solid grasp of compute-system concepts such as latency and efficiency.
  • Expertise in distributed computing, model parallelism, and mixed-precision training (see the model-parallel sketch after this list).
  • Prior experience with Generative AI techniques applied to LLM and Multi-Modal learning (Text, Image, and Video).
  • Knowledge of GPU/CPU architecture and related numerical software.
  • Familiarity with cloud computing (e.g., complete pipelines for AI training and inference on CSPs like AWS, Azure, GCP, or OCI).
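
As a second illustrative sketch, the snippet below shows the idea behind column-style tensor (model) parallelism: each rank owns a slice of a linear layer's output features, and a collective reassembles the shards. The class name and structure are assumptions for illustration, not Megatron Core's implementation; the all-gather shown is forward-only.

```python
# Sketch of a column-parallel linear layer. Assumes dist.init_process_group()
# has already been called (e.g., via torchrun). Illustrative only.
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    """Each rank holds a slice of the layer's output columns."""
    def __init__(self, in_features, out_features):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0
        self.local = torch.nn.Linear(in_features, out_features // world_size)

    def forward(self, x):
        # Every rank computes its shard of the output features.
        y_local = self.local(x)
        # Reassemble the full activation along the feature dimension.
        # Note: dist.all_gather is not autograd-aware; a real implementation
        # uses gradient-propagating collectives or keeps activations sharded.
        shards = [torch.empty_like(y_local) for _ in range(dist.get_world_size())]
        dist.all_gather(shards, y_local)
        return torch.cat(shards, dim=-1)
```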

What the JD emphasized

  • AI training and inference challenges
  • entire model lifecycle
  • foundation model training
  • model optimizations
  • AI training frameworks
  • large-scale AI training
  • Generative AI techniques applied to LLM and Multi-Modal learning

Other signals

  • LLM pretraining
  • distributed training algorithms