Senior Research Scientist/Engineer - AI Infrastructure

ByteDance · San Jose, CA · Algorithm

Seeking an experienced Research Scientist/Engineer to design and build next-generation AI infrastructure at ByteDance, focusing on large-scale systems, AI, and emerging hardware to enable efficient and scalable AI workloads. The role involves architecting the end-to-end AI factory, exploring emerging trends, optimizing ML stack performance, and aligning cross-functional teams.

What you'd actually do

  1. Design and evaluate scalable architectures across the full AI factory — compute, storage, networking, chips, power, and the data and application layers — for large-scale training, RL, and inference workloads. Develop technical proposals for supply-chain and energy constraints alongside silicon and software trade-offs.
  2. Track emerging trends across AI systems, distributed training and RL, and hardware acceleration, as well as adjacent fields such as cognitive science and psychology that inform AI memory and reasoning substrates. Build prototypes and share insights through technical reports.
  3. Analyze and optimize performance across the ML stack — scheduling, networking, storage, training and RL frameworks, and emerging AI memory systems for long-horizon agents — through benchmarking and bottleneck analysis.
  4. Work across research, engineering, hardware, data-center, and product teams to translate AI workload requirements into scalable solutions and drive cross-team initiatives spanning the full AI factory.

Skills

Required

  • PhD in Computer Science, Computer Engineering, Electrical Engineering, or a related technical discipline; backgrounds in cognitive science, computational neuroscience, or psychology with strong systems fundamentals are also considered
  • Experience in distributed systems, infrastructure engineering, or ML systems
  • Exposure to large-scale training or RL pipelines
  • Comfort evaluating trade-offs across hardware, software, algorithms, energy, and supply-chain constraints
  • Strong proficiency in integrating AI tools into knowledge discovery and research workflows
  • Demonstrated ability to learn quickly and stay productive on a fast-evolving technical horizon
  • Excellent communication skills

Nice to have

  • Experience with large-scale model training and inference: distributed pretraining, post-training, RL, KV cache–aware serving, GPU/accelerator optimization, and high-performance networking (e.g., RDMA, NCCL)
  • Experience with heterogeneous AI compute systems: large-scale training clusters, HPC-style distributed workloads, and data pipelines for training and evaluation
  • Familiarity with AI memory systems: retrieval-augmented architectures and agent long-term memory designs
  • Exposure to cognitive-science or psychology literature on memory and reasoning
  • Exposure to chip-level design, data-center energy and cooling, and AI hardware supply-chain considerations
  • Publications in systems and/or machine learning conferences (e.g., NeurIPS, OSDI, SOSP, ASPLOS, MLSys)
  • Contributions to open-source projects

What the JD emphasized

  • AI Factory Architecture
  • Research & Technology Exploration
  • AI Memory & System Performance Optimization
  • Cross-Team Technical Alignment
