About the Team
Established in 2023, the ByteDance Seed team is dedicated to pioneering new paths toward artificial general intelligence. We aspire to advance the frontier of intelligence to drive progress for both technology and society.
With a long-term vision for the AI sector, the Seed team's research spans MLLM, GenMedia, AI for Science, and Robotics. We maintain a global presence with laboratories and career opportunities across China, Singapore, and the United States. To date, we have launched industry-leading general foundation models and cutting-edge multimodal capabilities. Our technology powers over 50 application scenarios — including Doubao, Jimeng, TRAE, Dola and Dreamnia — and serves enterprise customers through Volcano Engine and BytePlus. Third-party data shows that the Doubao App ranks first in user volume in the Chinese market, while Doubao foundation models lead the industry in average daily token consumption.
The mission of the Seed Speech team is to enrich interactive and creative processes through the application of multimodal speech technologies. The team focuses on the forefront of research and product development in speech and audio, music, natural language understanding, and multimodal deep learning.
Responsibilities
- Conduct research and development in speech/audio foundation models.
- Collaborate with cross-functional teams to identify key research areas and contribute to the development of innovative speech/audio models.
- Work with product development teams to integrate research findings into practical applications for ByteDance and other platforms.
- Collaborate on team-driven projects to address complex challenges and enhance the overall effectiveness of the research team.
Requirements
Minimum Qualifications
- Master's or PhD in computer science, mathematics, engineering, or a related field
- 3+ years of experience in one or more areas of machine learning and deep learning, including but not limited to: automatic speech recognition, automatic speech translation, speech/audio self-supervised learning and foundation models, speaker recognition and verification, speech emotion recognition, multimodal foundation models, and large language model pre-training and fine-tuning
Preferred Qualifications
- Publications in leading ML/DL venues such as NeurIPS, ICLR, ICML, and AAAI, and in speech venues such as ICASSP, ASRU, and Interspeech
- Deep understanding of large language models
- Familiar with distributed computing and large-scale model training
- Familiar with deep learning frameworks such as TensorFlow and PyTorch
- Familiar with engineering principles and best practices
- Highly competent in algorithms and programming; strong coding skills in C/C++ and Python
- Ability to work collaboratively in a fast-paced, cross-functional environment