About the Team
The Seed Infrastructures team oversees the distributed training, reinforcement learning framework, high-performance inference, and heterogeneous hardware compilation technologies for AI foundation models.
Responsibilities
- Ensure the training platform operates reliably and efficiently across pre-training, fine-tuning, evaluation, and inference workloads for large models
- Build and maintain system observability, fault detection, and troubleshooting tools, enabling AIOps-driven proactive monitoring of distributed ML workloads
- Maintain the stability, elasticity, and performance of framework and infrastructure components across multi-tenant, multi-cloud, and heterogeneous GPU environments
- Manage cluster governance, optimize resource utilization, and improve operational efficiency and reliability of ML services
- Develop software tools, dashboards, and automation to monitor, manage, and diagnose ML training infrastructure effectively
- Participate in global team rotations for system monitoring, on-call support, and incident response
Requirements
Minimum Qualifications:
- Strong background in computer systems or computer architecture
- Proficiency in C++ and Python, with strong systems engineering skills
- Solid understanding of the PyTorch training framework, including:
  - Training execution flow and runtime behavior
  - CUDA kernel scheduling and synchronization semantics
  - Interactions between PyTorch, CUDA, NCCL, and the networking stack
- Familiarity with distributed training frameworks or parallelization strategies, such as:
  - Megatron-LM (tensor, pipeline, and sequence parallelism)
  - Fully Sharded Data Parallel (FSDP), including parameter sharding, communication, and memory management
- Ability to reason about performance bottlenecks in complex training systems and conduct systematic analysis
Preferred Qualifications:
- Experience with one or more of the following tools or technologies:
  - torch.profiler, Nsight Systems, Nsight Compute
  - CUPTI, NVTX
  - Distributed communication fundamentals (e.g., NCCL, RDMA)