Research Engineer - LLM Training Infrastructure - Seed Infra

ByteDance · Big Tech · Seattle, WA · R&D

Research Engineer role focused on large-scale LLM training infrastructure: optimizing distributed training strategies, hardening system reliability, and improving performance across GPU clusters. The role bridges research and production deployment.

What you'd actually do

  1. Conduct research and development on large-scale LLM training infrastructure and efficiency
  2. Design and optimize distributed training strategies for LLMs, including parallelism schemes, computation and communication optimization, and throughput scaling on large GPU clusters
  3. Investigate system reliability and resilience techniques, such as fast checkpointing, fault tolerance, and failure diagnosis for long-running training workloads
  4. Research and optimize network, scheduling, and GPU memory management across the training stack, driving cross-layer performance improvements
  5. Analyze performance bottlenecks in exascale training systems and propose principled, data-driven optimization methods
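
The "fast checkpointing" idea in item 3 can be illustrated with a minimal, hypothetical sketch (pure Python, not ByteDance's stack): take a cheap in-memory snapshot synchronously, then persist it on a background thread so the training loop stalls only for the copy, not the slow disk write. All names and the toy loop below are illustrative assumptions.

```python
import copy
import os
import pickle
import tempfile
import threading

def async_checkpoint(state: dict, path: str) -> threading.Thread:
    """Snapshot `state` synchronously (cheap deep copy), then persist it
    on a background thread so the caller is blocked only for the copy,
    not for the disk write."""
    snapshot = copy.deepcopy(state)  # consistent point-in-time view
    def _write():
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)
    t = threading.Thread(target=_write, daemon=True)
    t.start()
    return t  # caller should join() before shutdown

# Toy "training loop": the step counter keeps advancing while the
# previous checkpoint may still be writing in the background.
ckpt_path = os.path.join(tempfile.gettempdir(), "ckpt_demo.pkl")
state = {"step": 0, "weights": [0.0] * 4}
writer = None
for step in range(1, 6):
    state["step"] = step
    state["weights"] = [w + 0.1 for w in state["weights"]]
    if step % 2 == 0:               # checkpoint every 2 steps
        if writer is not None:
            writer.join()           # keep at most one write in flight
        writer = async_checkpoint(state, ckpt_path)
writer.join()

with open(ckpt_path, "rb") as f:
    restored = pickle.load(f)
print(restored["step"])  # last checkpointed step: 4
```

In a real training stack the snapshot step itself is the expensive part (GPU-to-host copies of optimizer and model state), which is why production systems overlap it with compute as well; this sketch only shows the overlap of the write.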

Skills

Required

  • large-scale distributed training for LLMs
  • Python
  • C++
  • ML systems / training infrastructure development
  • parallelism strategies (DDP, FSDP, model/pipeline/expert parallelism)
  • training stack internals (PyTorch, CUDA, NCCL)
  • performance optimization (memory, communication, throughput)
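
One back-of-the-envelope number behind the "parallelism strategies" and "throughput" bullets: in a synchronous GPipe-style pipeline schedule with p stages and m microbatches, the idle "bubble" fraction is (p − 1) / (m + p − 1), which is why more microbatches improve pipeline utilization. A small sketch of that standard formula (illustrative only, not tied to any specific framework):

```python
def pipeline_bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle ('bubble') fraction of a synchronous GPipe-style pipeline
    schedule: (p - 1) / (m + p - 1) for p stages and m microbatches.
    More microbatches amortize the fill/drain bubble."""
    p, m = stages, microbatches
    return (p - 1) / (m + p - 1)

# With 8 stages and only 8 microbatches, nearly half the pipeline
# time is bubble; 64 microbatches shrink it below 10%.
print(round(pipeline_bubble_fraction(8, 8), 3))   # 0.467
print(round(pipeline_bubble_fraction(8, 64), 3))  # 0.099
```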

Nice to have

  • distributed training frameworks
  • large-scale LLM infrastructure
  • leading or mentoring engineering teams
  • benchmarking AI accelerators
  • large-scale LLM evaluation

What the JD emphasized

  • large-scale distributed training for LLMs
  • ML systems / training infrastructure development
  • parallelism strategies
  • training stack internals
  • performance optimization

Other signals

  • large-scale distributed training
  • LLM training infrastructure
  • performance optimization
  • GPU clusters