Research Engineer – Reinforcement Learning (RL) Systems & Infrastructure (Seed Infra)

ByteDance · Big Tech · San Jose, CA · R&D

Research Engineer focused on building and optimizing distributed reinforcement learning systems and infrastructure for large-scale AI foundation models. This role involves designing end-to-end RL pipelines, optimizing training performance on GPU clusters, and collaborating with researchers on system-algorithm co-design.

What you'd actually do

Design and build end-to-end reinforcement learning (RL) systems for large-scale models, covering rollout, training, evaluation, and deployment pipelines.
Develop scalable and fault-tolerant RL infrastructure that operates efficiently under dynamic workloads and heterogeneous compute environments.
Optimize distributed training performance across GPU clusters, improving throughput, resource utilization, and system stability.
Collaborate with cross-team researchers on targeted system–algorithm co-design to translate research ideas into robust, production-grade implementations.
Build tooling, monitoring, and debugging frameworks to ensure reliability and observability of large-scale RL training systems.

Skills

Required

distributed systems
large-scale ML systems
deep learning infrastructure
large-scale training systems
Python
C++
PyTorch
distributed training frameworks
GPU optimization
parallelism strategies
system-level performance tuning
reinforcement learning workflows

Nice to have

large-scale agent systems
system design under heterogeneous or dynamic workloads
RL + LLM training
post-training pipelines

What the JD emphasized

large-scale models
reinforcement learning
distributed training
large-scale training systems
reinforcement learning workflows

Other signals

distributed training
reinforcement learning
large-scale models
infrastructure

About the Team

The Seed Infrastructures team oversees the distributed training, reinforcement learning framework, high-performance inference, and heterogeneous hardware compilation technologies for AI foundation models.

Responsibilities

Design and build end-to-end reinforcement learning (RL) systems for large-scale models, covering rollout, training, evaluation, and deployment pipelines.
Develop scalable and fault-tolerant RL infrastructure that operates efficiently under dynamic workloads and heterogeneous compute environments.
Optimize distributed training performance across GPU clusters, improving throughput, resource utilization, and system stability.
Collaborate with cross-team researchers on targeted system–algorithm co-design to translate research ideas into robust, production-grade implementations.
Build tooling, monitoring, and debugging frameworks to ensure reliability and observability of large-scale RL training systems.

Requirements

Minimum Qualifications:

Strong background in distributed systems, large-scale ML systems, or deep learning infrastructure
Experience building or optimizing large-scale training systems (e.g., RL, LLM, multimodal models)
Solid engineering skills in Python/C++ and familiarity with modern ML stacks (PyTorch, distributed training frameworks, etc.)
Experience with GPU optimization, parallelism strategies, and system-level performance tuning
Understanding of reinforcement learning workflows (rollout, policy update, evaluation loops)

Preferred Qualifications:

Experience with large-scale agent systems
Familiarity with system design under heterogeneous or dynamic workloads
Exposure to RL + LLM training or post-training pipelines