Research Engineer – Multimodal Training Infrastructure (Seed Infra)

ByteDance · Big Tech · San Jose, CA · R&D

Research Engineer focused on building and optimizing large-scale distributed training infrastructure for foundation models, including multimodal LLMs and image/video generation models. The role requires deep expertise in parallelism strategies, system reliability, and performance optimization on large GPU clusters, and bridges research and production deployment.

What you'd actually do

  1. Conduct research and development on large-scale infrastructure to enable efficient training of foundation models, multimodal LLMs, and image/video generation models
  2. Design and optimize distributed training strategies for multimodal LLMs, including parallelism schemes, computation and communication optimization, and throughput scaling on large GPU clusters
  3. Investigate system reliability and resilience techniques, such as fast checkpointing, fault tolerance, and failure diagnosis for long-running training workloads (see the sketch after this list)
  4. Research and optimize network, scheduling, and GPU memory management across the training stack, driving cross-layer performance improvements
  5. Analyze performance bottlenecks in exascale training systems and propose principled, data-driven optimization methods
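
For concreteness, here is a minimal sketch of the kind of building block items 2–3 refer to: a data-parallel training loop with periodic checkpointing and resume-on-restart. It assumes PyTorch with a `torchrun` launch; the model, loss, and checkpoint path are illustrative placeholders, not a description of the team's actual stack.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

CKPT_PATH = "/tmp/ckpt.pt"  # illustrative path; assumes a single node or shared filesystem


def main():
    # One process per GPU, launched via `torchrun --nproc_per_node=<gpus> train.py`
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Stand-in for a real foundation model and objective
    model = torch.nn.Linear(1024, 1024).cuda()
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Fault tolerance: if a previous run died, resume from the last checkpoint
    start_step = 0
    if os.path.exists(CKPT_PATH):
        state = torch.load(CKPT_PATH, map_location="cuda")
        model.module.load_state_dict(state["model"])
        opt.load_state_dict(state["opt"])
        start_step = state["step"] + 1

    for step in range(start_step, 1000):
        x = torch.randn(32, 1024, device="cuda")
        loss = model(x).pow(2).mean()  # synthetic loss for illustration
        opt.zero_grad()
        loss.backward()                # DDP all-reduces gradients here
        opt.step()

        # Periodic checkpointing so a failure loses at most 100 steps of work
        if step % 100 == 0:
            if rank == 0:
                torch.save(
                    {"model": model.module.state_dict(),
                     "opt": opt.state_dict(),
                     "step": step},
                    CKPT_PATH,
                )
            dist.barrier()  # keep ranks loosely in sync around checkpoint writes

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

In production each placeholder grows into its own research problem: sharded optimizer state, asynchronous or distributed checkpoint writers, and automatic restart orchestration are where the reliability work described above actually lives.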

Skills

Required

  • large-scale distributed training
  • foundation models
  • multimodal LLMs
  • image/video generation models
  • parallelism schemes
  • computation and communication optimization
  • throughput scaling
  • GPU clusters
  • system reliability
  • resilience techniques
  • fast checkpointing
  • fault tolerance
  • failure diagnosis
  • network optimization
  • scheduling optimization
  • GPU memory management
  • performance optimization
  • exascale training systems
  • data-driven optimization methods
  • algorithm–system co-design
  • cross-layer optimization
  • training efficiency
  • scalability
  • reliability

Nice to have

  • reinforcement learning frameworks
  • high-performance inference
  • heterogeneous hardware compilation

What the JD emphasized

  • Deep expertise in large-scale distributed training of LLMs and multimodal models
  • Strong systems research background with demonstrated ability to design, build, and optimize large-scale ML systems
  • Proven experience with parallelism strategies (e.g., data, model, pipeline, expert parallelism) and performance optimization on large GPU clusters (a small illustration follows this list)
  • Solid understanding of algorithm–system co-design and cross-layer optimization for training efficiency, scalability, and reliability
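
To make the parallelism vocabulary concrete, the sketch below shows one conventional way a fixed pool of GPUs can be factored into tensor-, pipeline-, and data-parallel groups. The grid layout, group sizes, and function name are illustrative assumptions, not ByteDance's actual scheme.

```python
from itertools import product


def build_groups(world_size: int, tp: int, pp: int):
    """Return rank lists for tensor-, data-, and pipeline-parallel groups."""
    assert world_size % (tp * pp) == 0, "world size must factor into tp * pp * dp"
    dp = world_size // (tp * pp)
    # Lay ranks out on a (pp, dp, tp) grid; tp is innermost so tensor-parallel
    # peers sit on adjacent ranks (typically the same node / NVLink domain).
    grid = [[[p * dp * tp + d * tp + t for t in range(tp)]
             for d in range(dp)] for p in range(pp)]
    tp_groups = [grid[p][d] for p, d in product(range(pp), range(dp))]
    dp_groups = [[grid[p][d][t] for d in range(dp)]
                 for p, t in product(range(pp), range(tp))]
    pp_groups = [[grid[p][d][t] for p in range(pp)]
                 for d, t in product(range(dp), range(tp))]
    return tp_groups, dp_groups, pp_groups


if __name__ == "__main__":
    # 16 GPUs split as tensor-parallel 2 x pipeline-parallel 2 x data-parallel 4
    tp_g, dp_g, pp_g = build_groups(world_size=16, tp=2, pp=2)
    print("tensor-parallel groups:  ", tp_g)
    print("data-parallel groups:    ", dp_g)
    print("pipeline-parallel groups:", pp_g)
```

In a real system these rank lists would feed process-group construction (e.g., `torch.distributed.new_group` or a device mesh), and the choice of which dimension is innermost is itself a bandwidth-driven co-design decision of the kind this role emphasizes.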

Other signals

  • large-scale distributed training
  • foundation models
  • multimodal LLMs
  • image/video generation models
  • GPU clusters
  • system reliability
  • exascale training systems