Research Scientist - Seed Multimodal Interaction and World Model

ByteDance · Big Tech · San Jose, CA · Computer vision

Research Scientist focused on pioneering AGI through large-scale multimodal foundation models that integrate video, audio, and language, with an emphasis on visual latent reasoning and reinforcement learning. The role involves developing unified modeling frameworks and exploring RL-based approaches to multimodal visual reasoning and instruction-conditioned generation, with the aim of reaching human-level understanding and interaction capabilities.

What you'd actually do

  1. Research and develop large-scale multimodal foundation models
  2. Develop unified modeling frameworks that integrate video, audio, and language, with a focus on visual latent reasoning
  3. Explore reinforcement learning-based approaches to bridge understanding and generation for multimodal visual reasoning
  4. Collaborate with researchers to evaluate models on tasks involving world modeling, reasoning, and instruction-conditioned generation

Skills

Required

  • Master's or PhD in Software Development, Computer Science, Computer Engineering, or a related technical discipline
  • Publications in accredited venues, such as CVPR, ECCV, ICCV, NeurIPS, ICLR, ICML, or other leading conferences
  • Strong research background in at least one of the following: reinforcement learning, multimodal learning, video understanding, or vision-language modeling

Nice to have

  • Experience with reinforcement learning in multimodal or interactive environments
  • Familiarity with video generation or diffusion-based generative models
  • Experience with large-scale model training
  • Solid programming and engineering skills, with experience building training or evaluation pipelines for ML models

What the JD emphasized

  • Publications in accredited venues, such as CVPR, ECCV, ICCV, NeurIPS, ICLR, ICML, or other leading conferences
  • Strong research background in at least one of the following: reinforcement learning, multimodal learning, video understanding, or vision-language modeling
  • Experience with reinforcement learning in multimodal or interactive environments
  • Experience with large-scale model training

Other signals

  • pioneering new paths toward artificial general intelligence
  • advance the frontier of intelligence
  • launch industry-leading general foundation models
  • cutting-edge multimodal capabilities
  • human-level multimodal understanding and interaction capabilities
  • multimodal assistant products