Team Information
The Seed Infrastructures team oversees distributed training, reinforcement learning frameworks, high-performance inference, and heterogeneous hardware compilation technologies for AI foundation models.
Responsibilities
- Conduct research and development on large-scale LLM training infrastructure and training efficiency
- Design and optimize distributed training strategies for LLMs, including parallelism schemes, computation and communication optimization, and throughput scaling on large GPU clusters
- Investigate system reliability and resilience techniques, such as fast checkpointing, fault tolerance, and failure diagnosis for long-running training workloads
- Research and optimize networking, scheduling, and GPU memory management across the training stack, driving cross-layer performance improvements
- Analyze performance bottlenecks in exascale training systems and propose principled, data-driven optimization methods
- Bridge cutting-edge research and large-scale production deployment by translating research ideas into scalable, real-world AI infrastructure solutions
Requirements
Minimum Qualifications
- Experience with large-scale distributed training for LLMs
- Strong programming skills in Python and/or C++
- Strong background in ML systems / training infrastructure development
- Proficiency in parallelism strategies (DDP, FSDP, model/pipeline/expert parallelism)
- Solid understanding of training stack internals (PyTorch, CUDA, NCCL)
- Experience in performance optimization (memory, communication, throughput)
Preferred Qualifications
- Hands-on experience with distributed training frameworks and large-scale LLM infrastructure
- Experience leading or mentoring engineering teams or cross-functional projects
- Publications in top-tier AI, systems, or HPC conferences (e.g., ICML, OSDI, SOSP, NSDI, SIGCOMM, MLSys) or strong open-source contributions
- Familiarity with benchmarking AI accelerators or large-scale LLM evaluation (e.g., ByteMLPerf)