Research Engineer – AI Training Systems Reliability & Performance (Seed Infra)

ByteDance · Big Tech · Seattle, WA · R&D

Research Engineer focused on the reliability and performance of AI training systems, including distributed training, reinforcement learning frameworks, and high-performance inference for large foundation models. Responsibilities include building observability tools, managing cluster governance, and optimizing resource utilization.

What you'd actually do

  1. Ensure the training platform operates reliably and efficiently across pre-training, fine-tuning, evaluation, and inference workloads for large models
  2. Build and maintain system observability, fault detection, and troubleshooting tools, enabling AIOps-driven proactive monitoring of distributed ML workloads
  3. Maintain the stability, elasticity, and performance of framework and infrastructure components across multi-tenant, multi-cloud, and heterogeneous GPU environments
  4. Manage cluster governance, optimize resource utilization, and improve operational efficiency and reliability of ML services
  5. Develop software tools, dashboards, and automation to monitor, manage, and diagnose ML training infrastructure effectively
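As a flavor of the proactive-monitoring work in items 2 and 5, here is a minimal sketch of a health-check rule over per-node GPU samples. The `GpuSample` type, `flag_unhealthy` helper, and thresholds are all hypothetical illustrations, not anything specified in the JD:

```python
from dataclasses import dataclass

@dataclass
class GpuSample:
    node: str
    util_pct: float   # GPU utilization, percent
    ecc_errors: int   # uncorrected ECC error count

def flag_unhealthy(samples, min_util=10.0, max_ecc=0):
    """Flag nodes that look idle-stuck (possible straggler/hang)
    or are reporting uncorrected ECC errors (hardware fault signal)."""
    return [s.node for s in samples
            if s.util_pct < min_util or s.ecc_errors > max_ecc]

samples = [GpuSample("node-a", 92.0, 0),
           GpuSample("node-b", 3.5, 0),    # near-idle GPU: possible straggler
           GpuSample("node-c", 88.0, 2)]   # ECC errors: possible bad hardware
print(flag_unhealthy(samples))  # → ['node-b', 'node-c']
```

In a real system the samples would come from a telemetry pipeline (e.g. DCGM/NVML exporters) and the flags would feed an alerting or auto-remediation loop rather than a print.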

Skills

Required

  • C++
  • Python
  • systems engineering
  • PyTorch
  • distributed training frameworks
  • parallelization strategies
  • performance bottleneck analysis

Nice to have

  • torch.profiler
  • Nsight Systems
  • Nsight Compute
  • CUPTI
  • NVTX
  • Distributed communication fundamentals
  • NCCL
  • RDMA
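To make the profiling tools above concrete, here is a minimal `torch.profiler` sketch: it wraps a forward pass in a named `record_function` range (the same mechanism that emits NVTX ranges visible in Nsight Systems when profiling on GPU) and prints an operator-level summary. The model and shapes are arbitrary placeholders:

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

model = torch.nn.Linear(128, 64)
x = torch.randn(32, 128)

# Profile CPU ops; add ProfilerActivity.CUDA when running on GPU.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("forward_pass"):  # named range, shows up in traces
        y = model(x)

# Operator-level summary, sorted by total CPU time.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

For deeper kernel-level analysis the JD's other tools pick up where this leaves off: Nsight Systems for timeline traces, Nsight Compute for per-kernel metrics, and CUPTI as the underlying instrumentation API.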

What the JD emphasized

  • distributed training
  • reliability
  • performance

Other signals

  • distributed training
  • MLOps
  • performance optimization
  • reliability