Research Engineer - LLM Training Infrastructure - Seed Infra

ByteDance · Big Tech · Seattle, WA · R&D

Research Engineer role focused on large-scale LLM training infrastructure: optimizing distributed training strategies, hardening system reliability, and improving performance across GPU clusters. The role bridges research and production deployment.

What you'd actually do

  1. Conduct research and development on large-scale LLM training infrastructure and efficiency
  2. Design and optimize distributed training strategies for LLMs, including parallelism schemes, computation and communication optimization, and throughput scaling on large GPU clusters
  3. Investigate system reliability and resilience techniques, such as fast checkpointing, fault tolerance, and failure diagnosis for long-running training workloads
  4. Research and optimize network, scheduling, and GPU memory management across the training stack, driving cross-layer performance improvements
  5. Analyze performance bottlenecks in exascale training systems and propose principled, data-driven optimization methods
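
The "fast checkpointing" idea in item 3 can be illustrated with a minimal, hypothetical sketch (pure Python, not ByteDance's stack): take a cheap in-memory snapshot synchronously, then persist it on a background thread so the training loop stalls only for the copy, not the slow disk write. All names and the toy loop below are illustrative assumptions.

```python
import copy
import os
import pickle
import tempfile
import threading

def async_checkpoint(state: dict, path: str) -> threading.Thread:
    """Snapshot `state` synchronously (cheap deep copy), then persist it
    on a background thread so the caller is blocked only for the copy,
    not for the disk write."""
    snapshot = copy.deepcopy(state)  # consistent point-in-time view
    def _write():
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)
    t = threading.Thread(target=_write, daemon=True)
    t.start()
    return t  # caller should join() before shutdown

# Toy "training loop": the step counter keeps advancing while the
# previous checkpoint may still be writing in the background.
ckpt_path = os.path.join(tempfile.gettempdir(), "ckpt_demo.pkl")
state = {"step": 0, "weights": [0.0] * 4}
writer = None
for step in range(1, 6):
    state["step"] = step
    state["weights"] = [w + 0.1 for w in state["weights"]]
    if step % 2 == 0:               # checkpoint every 2 steps
        if writer is not None:
            writer.join()           # keep at most one write in flight
        writer = async_checkpoint(state, ckpt_path)
writer.join()

with open(ckpt_path, "rb") as f:
    restored = pickle.load(f)
print(restored["step"])  # last checkpointed step: 4
```

In a real training stack the snapshot step itself is the expensive part (GPU-to-host copies of optimizer and model state), which is why production systems overlap it with compute as well; this sketch only shows the overlap of the write.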

Skills

Required

  • large-scale distributed training for LLMs
  • Python
  • C++
  • ML systems / training infrastructure development
  • parallelism strategies (DDP, FSDP, model/pipeline/expert parallelism)
  • training stack internals (PyTorch, CUDA, NCCL)
  • performance optimization (memory, communication, throughput)
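
One back-of-the-envelope number behind the "parallelism strategies" and "throughput" bullets: in a synchronous GPipe-style pipeline schedule with p stages and m microbatches, the idle "bubble" fraction is (p − 1) / (m + p − 1), which is why more microbatches improve pipeline utilization. A small sketch of that standard formula (illustrative only, not tied to any specific framework):

```python
def pipeline_bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle ('bubble') fraction of a synchronous GPipe-style pipeline
    schedule: (p - 1) / (m + p - 1) for p stages and m microbatches.
    More microbatches amortize the fill/drain bubble."""
    p, m = stages, microbatches
    return (p - 1) / (m + p - 1)

# With 8 stages and only 8 microbatches, nearly half the pipeline
# time is bubble; 64 microbatches shrink it below 10%.
print(round(pipeline_bubble_fraction(8, 8), 3))   # 0.467
print(round(pipeline_bubble_fraction(8, 64), 3))  # 0.099
```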

Nice to have

  • distributed training frameworks
  • large-scale LLM infrastructure
  • leading or mentoring engineering teams
  • benchmarking AI accelerators
  • large-scale LLM evaluation

What the JD emphasized

  • large-scale distributed training for LLMs
  • ML systems / training infrastructure development
  • parallelism strategies
  • training stack internals
  • performance optimization

Other signals

  • large-scale distributed training
  • LLM training infrastructure
  • performance optimization
  • GPU clusters