Research Scientist - Seed Multimodal Interaction and World Model

ByteDance · Big Tech · San Jose, CA · Computer vision

Research Scientist focused on pioneering AGI through large-scale multimodal foundation models, integrating video, audio, and language with a focus on visual latent reasoning and Reinforcement Learning. The role involves developing unified modeling frameworks and exploring RL-based approaches for multimodal visual reasoning and instruction-conditioned generation, aiming for human-level understanding and interaction capabilities.

What you'd actually do

Research and develop large-scale multimodal foundation models
Develop unified modeling frameworks that integrate video, audio, and language, with a focus on visual latent reasoning
Explore Reinforcement Learning-based approaches to bridge understanding and generation for multimodal visual reasoning
Collaborate with researchers to evaluate models on tasks involving world modeling, reasoning, and instruction-conditioned generation

Skills

Required

Master's or PhD in Software Development, Computer Science, Computer Engineering, or a related technical discipline
Publications in accredited venues, such as CVPR, ECCV, ICCV, NeurIPS, ICLR, ICML, or other leading conferences
Strong research background in at least one of the following: reinforcement learning, multimodal learning, video understanding, or vision-language modeling

Nice to have

Experience with reinforcement learning in multimodal or interactive environments
Familiarity with video generation or diffusion-based generative models
Experience with large-scale model training
Solid programming and engineering skills, with experience building training or evaluation pipelines for ML models

What the JD emphasized

Publications in accredited venues, such as CVPR, ECCV, ICCV, NeurIPS, ICLR, ICML, or other leading conferences
Strong research background in at least one of the following: reinforcement learning, multimodal learning, video understanding, or vision-language modeling
Experience with reinforcement learning in multimodal or interactive environments
Experience with large-scale model training

Other signals

pioneering new paths toward artificial general intelligence
advance the frontier of intelligence
launch industry-leading general foundation models
cutting-edge multimodal capabilities
human-level multimodal understanding and interaction capabilities
multimodal assistant products

About the Team

Established in 2023, the ByteDance Seed team is dedicated to pioneering new paths toward artificial general intelligence. We aspire to advance the frontier of intelligence to drive progress for both technology and society.

With a long-term vision for the AI sector, the Seed team's research spans MLLM, GenMedia, AI for Science, and Robotics. We maintain a global presence with laboratories and career opportunities across China, Singapore, and the United States. To date, we have launched industry-leading general foundation models and cutting-edge multimodal capabilities. Our technology powers over 50 application scenarios — including Doubao, Jimeng, TRAE, Dola and Dreamnia — and serves enterprise customers through Volcano Engine and BytePlus. Third-party data shows that the Doubao App ranks first in user volume in the Chinese market, while Doubao foundation models lead the industry in average daily token consumption.

The Seed Multimodal Interaction and World Model team is dedicated to developing models that boast human-level multimodal understanding and interaction capabilities. The team also aspires to advance the exploration and development of multimodal assistant products.

Responsibilities

Research and develop large-scale multimodal foundation models
Develop unified modeling frameworks that integrate video, audio, and language, with a focus on visual latent reasoning
Explore Reinforcement Learning-based approaches to bridge understanding and generation for multimodal visual reasoning
Collaborate with researchers to evaluate models on tasks involving world modeling, reasoning, and instruction-conditioned generation

Requirements

Minimum Qualifications

Master's or PhD in Software Development, Computer Science, Computer Engineering, or a related technical discipline
Publications in accredited venues, such as CVPR, ECCV, ICCV, NeurIPS, ICLR, ICML, or other leading conferences
Strong research background in at least one of the following: reinforcement learning, multimodal learning, video understanding, or vision-language modeling

Preferred Qualifications

Experience with reinforcement learning in multimodal or interactive environments
Familiarity with video generation or diffusion-based generative models
Experience with large-scale model training
Solid programming and engineering skills, with experience building training or evaluation pipelines for ML models
