Senior LLM Training Framework Engineer

NVIDIA · Semiconductors · Shanghai, China

NVIDIA is seeking a Senior LLM Training Framework Engineer to join the Megatron Core team, focusing on building and developing open-source frameworks for pretraining and post-training LLM and Multimodal foundation models. The role involves addressing AI training and inference challenges across the model lifecycle, advancing distributed training strategies, and optimizing performance on NVIDIA GPUs.

What you'd actually do

  1. Build and develop the open-source Megatron Core framework.
  2. Tackle large-scale AI training and inference challenges across the entire model lifecycle, including orchestration, data preprocessing, model training and tuning, and model deployment.
  3. Work at the intersection of AI applications, libraries, frameworks, and the entire software stack.
  4. Spearhead advancements in model architectures, distributed training strategies, and model-parallel techniques.
  5. Accelerate foundation model training and optimization with mixed-precision recipes and the latest NVIDIA GPU architectures (see the illustrative sketch after this list).
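
As a rough illustration of the kind of work described above (not Megatron Core's actual API), the sketch below shows a minimal mixed-precision, data-parallel training loop in PyTorch. The model, hyperparameters, and launch setup (single node via `torchrun`) are placeholder assumptions.

```python
# Minimal sketch: bf16 mixed precision + DistributedDataParallel in PyTorch.
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; NCCL handles gradient all-reduce.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Toy model and optimizer stand in for a real transformer stack.
    model = torch.nn.Linear(1024, 1024).cuda(rank)
    model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(8, 1024, device=rank)
        # bf16 autocast is the mixed-precision piece of the recipe.
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            loss = model(x).float().pow(2).mean()
        loss.backward()          # gradients are synchronized across ranks here
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```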

Skills

Required

  • MS, PhD, or equivalent experience in Computer Science, AI, Applied Math, or a related field, and 5+ years of industry experience.
  • Experience with AI training frameworks (e.g., PyTorch, JAX) and/or inference and deployment stacks (e.g., TensorRT-LLM, vLLM, SGLang).
  • Proficiency in distributed training.
  • Proficient in Python programming and software development, including debugging, performance analysis, writing tests, and documentation.
  • Strong understanding of AI/Deep-Learning fundamentals and their practical applications.

Nice to have

  • CUDA or collective-communication programming skills are a big plus.
  • Proven record of working effectively across multiple engineering initiatives and contributing innovations to AI libraries.
  • Experience with large-scale AI training and a solid grasp of compute-system concepts such as latency and efficiency.
  • Expertise in distributed computing, model parallelism, and mixed-precision training (see the model-parallel sketch after this list).
  • Prior experience with Generative AI techniques applied to LLM and Multi-Modal learning (Text, Image, and Video).
  • Knowledge of GPU/CPU architecture and related numerical software.
  • Familiarity with cloud computing (e.g., complete pipelines for AI training and inference on CSPs like AWS, Azure, GCP, or OCI).
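
As a second illustrative sketch, the snippet below shows the idea behind column-style tensor (model) parallelism: each rank owns a slice of a linear layer's output features, and a collective reassembles the shards. The class name and structure are assumptions for illustration, not Megatron Core's implementation; the all-gather shown is forward-only.

```python
# Sketch of a column-parallel linear layer. Assumes dist.init_process_group()
# has already been called (e.g., via torchrun). Illustrative only.
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    """Each rank holds a slice of the layer's output columns."""
    def __init__(self, in_features, out_features):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0
        self.local = torch.nn.Linear(in_features, out_features // world_size)

    def forward(self, x):
        # Every rank computes its shard of the output features.
        y_local = self.local(x)
        # Reassemble the full activation along the feature dimension.
        # Note: dist.all_gather is not autograd-aware; a real implementation
        # uses gradient-propagating collectives or keeps activations sharded.
        shards = [torch.empty_like(y_local) for _ in range(dist.get_world_size())]
        dist.all_gather(shards, y_local)
        return torch.cat(shards, dim=-1)
```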

What the JD emphasized

  • AI training and inference challenges
  • entire model lifecycle
  • foundation model training
  • model optimizations
  • AI training frameworks
  • large-scale AI training
  • Generative AI techniques applied to LLM and Multi-Modal learning

Other signals

  • LLM pretraining
  • distributed training algorithms