About the team
The Seed Infrastructures team oversees distributed training, reinforcement learning frameworks, high-performance inference, and heterogeneous hardware compilation for AI foundation models.
Responsibilities
- Conduct research and development on large-scale infrastructure to enable efficient training of foundation models, multimodal LLMs, and image/video generation models
- Design and optimize distributed training strategies for multimodal LLMs, including parallelism schemes, computation and communication optimization, and throughput scaling on large GPU clusters
- Investigate system reliability and resilience techniques for long-running training workloads, such as fast checkpointing, fault tolerance, and failure diagnosis
- Research and optimize networking, scheduling, and GPU memory management across the training stack, driving cross-layer performance improvements
- Analyze performance bottlenecks in exascale training systems and propose principled, data-driven optimization methods
- Bridge cutting-edge research and large-scale production deployment by translating research ideas into scalable, real-world infrastructure solutions
Requirements
Minimum Qualifications
- Deep expertise in large-scale distributed training of LLMs and multimodal models
- Strong systems research background with demonstrated ability to design, build, and optimize large-scale ML systems
- Proven experience with parallelism strategies (e.g., data, model, pipeline, expert parallelism) and performance optimization on large GPU clusters
- Strong programming skills and hands-on experience implementing production-grade ML systems or infrastructure
- Solid understanding of algorithm–system co-design and cross-layer optimization for training efficiency, scalability, and reliability