Research Engineer – AI Training Systems Reliability & Performance (Seed Infra)

ByteDance · Big Tech · Seattle, WA · R&D

Research Engineer focused on the reliability and performance of AI training systems, including distributed training, reinforcement learning frameworks, and high-performance inference for large foundation models. Responsibilities include building observability tools, managing cluster governance, and optimizing resource utilization.

What you'd actually do

  1. Ensure the training platform operates reliably and efficiently across pre-training, fine-tuning, evaluation, and inference workloads for large models
  2. Build and maintain system observability, fault detection, and troubleshooting tools, enabling AIOps-driven proactive monitoring of distributed ML workloads
  3. Maintain the stability, elasticity, and performance of framework and infrastructure components across multi-tenant, multi-cloud, and heterogeneous GPU environments
  4. Manage cluster governance, optimize resource utilization, and improve operational efficiency and reliability of ML services
  5. Develop software tools, dashboards, and automation to monitor, manage, and diagnose ML training infrastructure effectively
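As a flavor of the proactive-monitoring work in items 2 and 5, here is a minimal sketch of a health-check rule over per-node GPU samples. The `GpuSample` type, `flag_unhealthy` helper, and thresholds are all hypothetical illustrations, not anything specified in the JD:

```python
from dataclasses import dataclass

@dataclass
class GpuSample:
    node: str
    util_pct: float   # GPU utilization, percent
    ecc_errors: int   # uncorrected ECC error count

def flag_unhealthy(samples, min_util=10.0, max_ecc=0):
    """Flag nodes that look idle-stuck (possible straggler/hang)
    or are reporting uncorrected ECC errors (hardware fault signal)."""
    return [s.node for s in samples
            if s.util_pct < min_util or s.ecc_errors > max_ecc]

samples = [GpuSample("node-a", 92.0, 0),
           GpuSample("node-b", 3.5, 0),    # near-idle GPU: possible straggler
           GpuSample("node-c", 88.0, 2)]   # ECC errors: possible bad hardware
print(flag_unhealthy(samples))  # → ['node-b', 'node-c']
```

In a real system the samples would come from a telemetry pipeline (e.g. DCGM/NVML exporters) and the flags would feed an alerting or auto-remediation loop rather than a print.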

Skills

Required

  • C++
  • Python
  • systems engineering
  • PyTorch
  • distributed training frameworks
  • parallelization strategies
  • performance bottleneck analysis

Nice to have

  • torch.profiler
  • Nsight Systems
  • Nsight Compute
  • CUPTI
  • NVTX
  • Distributed communication fundamentals
  • NCCL
  • RDMA
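To make the profiling tools above concrete, here is a minimal `torch.profiler` sketch: it wraps a forward pass in a named `record_function` range (the same mechanism that emits NVTX ranges visible in Nsight Systems when profiling on GPU) and prints an operator-level summary. The model and shapes are arbitrary placeholders:

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

model = torch.nn.Linear(128, 64)
x = torch.randn(32, 128)

# Profile CPU ops; add ProfilerActivity.CUDA when running on GPU.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("forward_pass"):  # named range, shows up in traces
        y = model(x)

# Operator-level summary, sorted by total CPU time.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

For deeper kernel-level analysis the JD's other tools pick up where this leaves off: Nsight Systems for timeline traces, Nsight Compute for per-kernel metrics, and CUPTI as the underlying instrumentation API.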

What the JD emphasized

  • distributed training
  • reliability
  • performance

Other signals

  • distributed training
  • MLOps
  • performance optimization
  • reliability