Tech Lead, Research Scientist/Engineer - AI Infrastructure

ByteDance · Big Tech · San Jose, CA · Algorithm

Research Scientist/Engineer role focused on defining and building next-generation AI infrastructure for large-scale AI workloads (training, RL, and inference) spanning the compute, storage, networking, chip, power, and data layers. The role involves tracking AI trends, optimizing system performance, and aligning cross-functional teams.

What you'd actually do

  1. Design and evaluate scalable architectures across the full AI factory (compute, storage, networking, chips, power, and the data and application layers) for large-scale training, RL, and inference workloads. Develop technical proposals that weigh supply-chain and energy constraints alongside silicon and software trade-offs.
  2. Track emerging trends across AI systems, distributed training and RL, and hardware acceleration, as well as adjacent fields such as cognitive science and psychology that inform AI memory and reasoning substrates. Build prototypes and share insights through technical reports.
  3. Analyze and optimize performance across the ML stack (scheduling, networking, storage, training and RL frameworks, and emerging AI memory systems for long-horizon agents) through benchmarking and bottleneck analysis; a minimal sketch of one such microbenchmark appears after this list.
  4. Work across research, engineering, hardware, data-center, and product teams to translate AI workload requirements into scalable solutions and drive cross-team initiatives spanning the full AI factory.
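To give item 3 concrete flavor, here is a minimal sketch of the kind of communication microbenchmark that bottleneck analysis typically starts from. This is my illustration, not part of the JD; the tensor size, iteration count, and script name are assumptions chosen for clarity. It measures NCCL all-reduce bus bandwidth with PyTorch:

```python
# Minimal NCCL all-reduce bandwidth microbenchmark (illustrative sketch).
# Launch with: torchrun --nproc_per_node=<num_gpus> allreduce_bench.py
import os
import time

import torch
import torch.distributed as dist


def allreduce_bus_bandwidth(num_elems: int = 64 * 1024 * 1024, iters: int = 20) -> float:
    """Measure float32 all-reduce bus bandwidth in GB/s across the process group."""
    world = dist.get_world_size()
    x = torch.randn(num_elems, device="cuda")  # uses the device set in main()

    # Warm-up so NCCL connection setup doesn't pollute the timing.
    for _ in range(5):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    # A ring all-reduce moves ~2*(n-1)/n of the buffer per rank per iteration.
    bus_bytes = x.numel() * x.element_size() * 2 * (world - 1) / world * iters
    return bus_bytes / elapsed / 1e9


if __name__ == "__main__":
    dist.init_process_group(backend="nccl")  # torchrun supplies rank/world-size env vars
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))
    gbps = allreduce_bus_bandwidth()
    if dist.get_rank() == 0:
        print(f"all-reduce bus bandwidth: {gbps:.1f} GB/s")
    dist.destroy_process_group()
```

Comparing the measured figure against the link's advertised bandwidth (NVLink within a node, RDMA across the fabric) is the usual first step in deciding whether a training job is communication-bound.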

Skills

Required

  • PhD in Computer Science, Computer Engineering, Electrical Engineering, or related technical discipline
  • Alternatively, a background in cognitive science, computational neuroscience, or psychology paired with strong systems fundamentals
  • Experience in distributed systems, infrastructure engineering, or ML systems
  • Exposure to large-scale training or RL pipelines
  • Comfort evaluating trade-offs across hardware, software, algorithms, energy, and supply-chain constraints
  • Proficiency in integrating AI tools into knowledge discovery and research workflows
  • Ability to ramp up quickly and stay productive on fast-evolving technical frontiers
  • Excellent communication skills

Nice to have

  • Large-scale model training and inference
  • Distributed pretraining
  • Post-training
  • RL
  • KV cache–aware serving
  • GPU/accelerator optimization
  • High-performance networking (e.g., RDMA, NCCL)
  • Heterogeneous AI compute systems
  • Large-scale training clusters
  • HPC-style distributed workloads
  • Data pipelines for training and evaluation
  • AI memory systems
  • Retrieval-augmented architectures
  • Agent long-term memory designs
  • Cognitive-science or psychology literature on memory and reasoning
  • Chip-level design
  • Data-center energy and cooling
  • AI hardware supply-chain considerations
  • Publications in systems and/or machine learning conferences (e.g., NeurIPS, OSDI, SOSP, ASPLOS, MLSys)
  • Contributions to open-source projects

What the JD emphasized

  • Individuals who are completing or recently completed a PhD in Computer Science, Computer Engineering, Electrical Engineering, or a related technical discipline.
  • Experience in distributed systems, infrastructure engineering, or ML systems — including exposure to large-scale training or RL pipelines — and comfort evaluating trade-offs across hardware, software, algorithms, energy, and supply-chain constraints.
  • Strong proficiency in integrating AI tools into knowledge discovery and research workflows.
  • Publications in systems and/or machine learning conferences (e.g., NeurIPS, OSDI, SOSP, ASPLOS, MLSys).

Other signals

  • AI infrastructure
  • large-scale systems
  • emerging hardware
  • AI workloads
  • AI factory