Large Model Training Acceleration Engineer

ByteDance · Big Tech · San Jose, CA · R&D

ByteDance's Intelligent Creation - AI Platform team is looking for an experienced AI model optimization engineer to optimize large model training pipelines, develop distributed training strategies, and benchmark deep learning models. The role requires expertise in Python, C++, CUDA, deep learning frameworks (PyTorch, Megatron, DeepSpeed), and distributed training techniques, plus working knowledge of transformers and diffusion models.

What you'd actually do

  1. Optimize large model training pipelines to improve efficiency, speed, and scalability (a mixed-precision sketch follows this list).
  2. Develop and improve distributed training strategies, such as data parallelism, model parallelism, pipeline parallelism, and communication optimization, to accelerate model training (see the DDP sketch below).
  3. Benchmark and profile deep learning models to identify performance bottlenecks and make better use of computational resources (see the profiling sketch below).
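
To make item 1 concrete: mixed-precision training is one of the most common pipeline-efficiency levers in this stack. The sketch below uses PyTorch's autocast plus gradient scaling; the toy linear model, tensor sizes, and step count are illustrative assumptions, not details from the JD.

```python
# Mixed-precision sketch (torch.amp): a common first step when tuning
# training-pipeline efficiency on NVIDIA GPUs. Model and shapes are toys.
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for step in range(10):
    x = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass and loss in float16 where it is numerically safe;
    # master weights stay in float32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).square().mean()
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
```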
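For the parallelism strategies in item 2, here is a minimal data-parallel sketch using PyTorch DistributedDataParallel (DDP). The launch method (torchrun, one process per GPU), toy model, and hyperparameters are assumptions for illustration; a production pipeline would layer tensor or pipeline parallelism (e.g. via Megatron or DeepSpeed) on top.

```python
# Minimal data-parallel training sketch (PyTorch DDP).
# Assumed launch: torchrun --nproc-per-node=<num_gpus> train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model; a real pipeline would build a transformer here.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).square().mean()
        optimizer.zero_grad(set_to_none=True)
        loss.backward()  # gradients are all-reduced across ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```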
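For the benchmarking work in item 3, torch.profiler is the usual starting point. A hedged sketch, assuming a CUDA device and a placeholder MLP; sorting by GPU time surfaces the kernels worth optimizing first.

```python
# Profiling sketch to locate training bottlenecks (torch.profiler).
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder MLP standing in for a real transformer/diffusion workload.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda()
x = torch.randn(64, 1024, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    loss = model(x).square().mean()
    loss.backward()

# Rank operators by GPU time to find the hot spots.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```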

Skills

Required

  • Python
  • C++
  • CUDA
  • PyTorch
  • Megatron
  • DeepSpeed
  • data parallelism
  • model parallelism
  • pipeline parallelism
  • transformers
  • diffusion models

Nice to have

  • inference optimization

What the JD emphasized

  • AI model training optimization
  • distributed training
  • transformers and diffusion models

Other signals

  • large-scale generative AI models
  • optimizing AI model training and inference
  • distributed training/inference and acceleration