Research Engineer – Multimodal Training Infrastructure (Seed Infra)

ByteDance · Big Tech · San Jose, CA · R&D

Research Engineer focused on building and optimizing large-scale distributed training infrastructure for foundation models, including multimodal LLMs and image/video generation models. The role requires deep expertise in parallelism strategies, system reliability, and performance optimization on large GPU clusters, and bridges research and production deployment.

What you'd actually do

  1. Conduct research and development on large-scale infrastructure to enable efficient training of foundation models, multimodal LLMs, and image/video generation models
  2. Design and optimize distributed training strategies for multimodal LLMs, including parallelism schemes, computation and communication optimization, and throughput scaling on large GPU clusters
  3. Investigate system reliability and resilience techniques, such as fast checkpointing, fault tolerance, and failure diagnosis for long-running training workloads (see the sketch after this list)
  4. Research and optimize network, scheduling, and GPU memory management across the training stack, driving cross-layer performance improvements
  5. Analyze performance bottlenecks in exascale training systems and propose principled, data-driven optimization methods
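
For concreteness, here is a minimal sketch of the kind of building block items 2–3 refer to: a data-parallel training loop with periodic checkpointing and resume-on-restart. It assumes PyTorch with a `torchrun` launch; the model, loss, and checkpoint path are illustrative placeholders, not a description of the team's actual stack.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

CKPT_PATH = "/tmp/ckpt.pt"  # illustrative path; assumes a single node or shared filesystem


def main():
    # One process per GPU, launched via `torchrun --nproc_per_node=<gpus> train.py`
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Stand-in for a real foundation model and objective
    model = torch.nn.Linear(1024, 1024).cuda()
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Fault tolerance: if a previous run died, resume from the last checkpoint
    start_step = 0
    if os.path.exists(CKPT_PATH):
        state = torch.load(CKPT_PATH, map_location="cuda")
        model.module.load_state_dict(state["model"])
        opt.load_state_dict(state["opt"])
        start_step = state["step"] + 1

    for step in range(start_step, 1000):
        x = torch.randn(32, 1024, device="cuda")
        loss = model(x).pow(2).mean()  # synthetic loss for illustration
        opt.zero_grad()
        loss.backward()                # DDP all-reduces gradients here
        opt.step()

        # Periodic checkpointing so a failure loses at most 100 steps of work
        if step % 100 == 0:
            if rank == 0:
                torch.save(
                    {"model": model.module.state_dict(),
                     "opt": opt.state_dict(),
                     "step": step},
                    CKPT_PATH,
                )
            dist.barrier()  # keep ranks loosely in sync around checkpoint writes

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

In production each placeholder grows into its own research problem: sharded optimizer state, asynchronous or distributed checkpoint writers, and automatic restart orchestration are where the reliability work described above actually lives.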

Skills

Required

  • large-scale distributed training
  • foundation models
  • multimodal LLMs
  • image/video generation models
  • parallelism schemes
  • computation and communication optimization
  • throughput scaling
  • GPU clusters
  • system reliability
  • resilience techniques
  • fast checkpointing
  • fault tolerance
  • failure diagnosis
  • network optimization
  • scheduling optimization
  • GPU memory management
  • performance optimization
  • exascale training systems
  • data-driven optimization methods
  • algorithm–system co-design
  • cross-layer optimization
  • training efficiency
  • scalability
  • reliability

Nice to have

  • reinforcement learning frameworks
  • high-performance inference
  • heterogeneous hardware compilation

What the JD emphasized

  • Deep expertise in large-scale distributed training of LLMs and multimodal models
  • Strong systems research background with demonstrated ability to design, build, and optimize large-scale ML systems
  • Proven experience with parallelism strategies (e.g., data, model, pipeline, expert parallelism) and performance optimization on large GPU clusters (a small illustration follows this list)
  • Solid understanding of algorithm–system co-design and cross-layer optimization for training efficiency, scalability, and reliability
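
To make the parallelism vocabulary concrete, the sketch below shows one conventional way a fixed pool of GPUs can be factored into tensor-, pipeline-, and data-parallel groups. The grid layout, group sizes, and function name are illustrative assumptions, not ByteDance's actual scheme.

```python
from itertools import product


def build_groups(world_size: int, tp: int, pp: int):
    """Return rank lists for tensor-, data-, and pipeline-parallel groups."""
    assert world_size % (tp * pp) == 0, "world size must factor into tp * pp * dp"
    dp = world_size // (tp * pp)
    # Lay ranks out on a (pp, dp, tp) grid; tp is innermost so tensor-parallel
    # peers sit on adjacent ranks (typically the same node / NVLink domain).
    grid = [[[p * dp * tp + d * tp + t for t in range(tp)]
             for d in range(dp)] for p in range(pp)]
    tp_groups = [grid[p][d] for p, d in product(range(pp), range(dp))]
    dp_groups = [[grid[p][d][t] for d in range(dp)]
                 for p, t in product(range(pp), range(tp))]
    pp_groups = [[grid[p][d][t] for p in range(pp)]
                 for d, t in product(range(dp), range(tp))]
    return tp_groups, dp_groups, pp_groups


if __name__ == "__main__":
    # 16 GPUs split as tensor-parallel 2 x pipeline-parallel 2 x data-parallel 4
    tp_g, dp_g, pp_g = build_groups(world_size=16, tp=2, pp=2)
    print("tensor-parallel groups:  ", tp_g)
    print("data-parallel groups:    ", dp_g)
    print("pipeline-parallel groups:", pp_g)
```

In a real system these rank lists would feed process-group construction (e.g., `torch.distributed.new_group` or a device mesh), and the choice of which dimension is innermost is itself a bandwidth-driven co-design decision of the kind this role emphasizes.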

Other signals

  • large-scale distributed training
  • foundation models
  • multimodal LLMs
  • image/video generation models
  • GPU clusters
  • system reliability
  • exascale training systems