Research Engineer – Reinforcement Learning (RL) Systems & Infrastructure (Seed Infra)

ByteDance · Big Tech · San Jose, CA · R&D

Research Engineer focused on building and optimizing distributed reinforcement learning systems and infrastructure for large-scale AI foundation models. This role involves designing end-to-end RL pipelines, optimizing training performance on GPU clusters, and collaborating with researchers on system-algorithm co-design.

What you'd actually do

  1. Design and build end-to-end reinforcement learning (RL) systems for large-scale models, covering rollout, training, evaluation, and deployment pipelines.
  2. Develop scalable and fault-tolerant RL infrastructure that operates efficiently under dynamic workloads and heterogeneous compute environments.
  3. Optimize distributed training performance across GPU clusters, improving throughput, resource utilization, and system stability.
  4. Collaborate with researchers across teams on targeted system–algorithm co-design, translating research ideas into robust, production-grade implementations.
  5. Build tooling, monitoring, and debugging frameworks to ensure reliability and observability of large-scale RL training systems.

Skills

Required

  • distributed systems
  • large-scale ML systems
  • deep learning infrastructure
  • large-scale training systems
  • Python
  • C++
  • PyTorch
  • distributed training frameworks
  • GPU optimization
  • parallelism strategies
  • system-level performance tuning
  • reinforcement learning workflows

Nice to have

  • large-scale agent systems
  • system design under heterogeneous or dynamic workloads
  • RL + LLM training
  • post-training pipelines

What the JD emphasized

  • large-scale models
  • reinforcement learning
  • distributed training
  • large-scale training systems
  • reinforcement learning workflows

Other signals

  • distributed training
  • reinforcement learning
  • large-scale models
  • infrastructure