Senior Research Scientist (Multimodal Large Language Model) - Pico

ByteDance · Big Tech · San Jose, CA · R&D

Research Scientist role focused on developing multimodal large language models (MLLM) with tool-use capabilities for Mixed Reality (MR) environments. This involves optimizing model architectures, enabling tool utilization for complex tasks, and addressing challenges in long-horizon, multi-turn interactions. The role also includes applying and deploying innovative technologies in PICO's MR products and collaborating with cross-functional teams.

What you'd actually do

  1. Lead the R&D of multimodal large language models (MLLM) tailored for MR scenarios, integrating vision, point clouds, text, and other multimodal information—including model architecture optimization, cross-modal alignment, data construction, evaluation system enhancement, and end-to-end training/inference acceleration.
  2. Drive the research and implementation of MLLM tool-use capabilities in MR environments, enabling models to proficiently utilize spatial interaction and spatial computing-related professional tools, support tool calls for both single-turn and multi-turn conversations, and solve complex user tasks through interaction.
  3. Address key challenges in long-horizon, multi-turn tool-augmented tasks in MR, such as context memory management, tool selection strategy, and error correction mechanisms.
  4. Keep abreast of cutting-edge technologies in MLLM, multimodal intelligence, and tool-use research, and lead the application and deployment of innovative technologies in PICO's MR products.
  5. Collaborate with cross-functional teams (including software engineering, product design, and hardware development) to translate research outcomes into practical features that enhance user experience.
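The multi-turn tool-use loop described in items 2 and 3 can be sketched in miniature. This is a toy illustration under stated assumptions, not anything from PICO's stack: the tool names (`measure_distance`, `place_anchor`) are hypothetical stand-ins for spatial-computing tools, the "model" is omitted entirely, and the loop shows only the three pieces the JD names — context memory, tool dispatch, and a simple error-correction fallback.

```python
# Toy sketch of a multi-turn tool-use loop. All tool names are
# illustrative placeholders, not PICO internals.
from dataclasses import dataclass, field

# Hypothetical spatial-computing tools the agent can call.
TOOLS = {
    # Euclidean distance between two 3D points.
    "measure_distance": lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5,
    # Drop a spatial anchor at a point.
    "place_anchor": lambda p: {"anchor_at": p},
}

@dataclass
class AgentState:
    # Context memory: every (tool, result) pair survives across turns.
    history: list = field(default_factory=list)

def call_tool(state: AgentState, name: str, *args):
    """Dispatch one tool call, with a crude error-correction fallback."""
    try:
        result = TOOLS[name](*args)
    except KeyError:
        # Error correction: the model asked for an unknown tool, so fall
        # back to a safe default instead of aborting the conversation.
        result = TOOLS["place_anchor"](*args)
        name = "place_anchor (fallback)"
    state.history.append((name, result))
    return result

state = AgentState()
d = call_tool(state, "measure_distance", (0, 0, 0), (3, 4, 0))  # turn 1
call_tool(state, "teleport", (1, 2, 3))  # turn 2: unknown tool, corrected
```

In a real system the hard problems sit exactly where this sketch is trivial: deciding *which* tool to call (tool-selection policy), deciding *what* to keep in `history` (context memory management), and recovering from semantically wrong rather than merely missing tools.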

Skills

Required

  • Master's or Ph.D. degree in Computer Science, Electrical Engineering, Machine Learning, Artificial Intelligence, or a related quantitative field.
  • Expertise in multimodal large model pre-training, post-training, fine-tuning, or cross-modal fusion technologies, with hands-on experience in model optimization, training workflow design, and performance tuning.
  • Proven research experience in LLM tool use, reinforcement learning, LLM agents, or interactive learning, with a deep understanding of single-turn and multi-turn interaction mechanisms.
  • Proficiency in core 2D/3D computer vision tasks, including detection, segmentation, depth estimation, image matching, and 3D scene perception.
  • Skilled in Python and C++, with solid programming capabilities and experience in developing large-scale models using mainstream deep learning frameworks (PyTorch/TensorFlow).
  • Excellent problem-solving and independent research abilities, capable of addressing complex technical challenges in the integration of MR and MLLM tool use.

Nice to have

  • Publications in AI/ML/CV conferences (e.g., NeurIPS, ICML, ICLR, CVPR, ICCV, ECCV, ACL, EMNLP) focusing on multimodal large models, LLM tool use, or agent systems.
  • Hands-on experience in building large-scale MLLM training pipelines, tool-use evaluation systems, or multimodal agent platforms.
  • Familiarity with MR/AR/VR technologies, spatial computing, or 3D scene reconstruction (3DGS, NeRF, etc.) is a strong plus.
  • Experience in addressing long-horizon reasoning or asynchronous agent behavior challenges is highly valued.
  • Winners of programming or AI/ML competitions such as ACM-ICPC, NOI/IOI, TopCoder, or Kaggle are preferred.
  • Strong collaboration and communication skills, able to lead research initiatives and drive cross-team technical alignment.

What the JD emphasized

  • Expertise in multimodal large model pre-training, post-training, fine-tuning, or cross-modal fusion technologies
  • Proven research experience in LLM tool use, reinforcement learning, LLM agents, or interactive learning
  • Proficiency in core 2D/3D computer vision tasks
  • Publications in AI/ML/CV conferences (e.g., NeurIPS, ICML, ICLR, CVPR, ICCV, ECCV, ACL, EMNLP) focusing on multimodal large models, LLM tool use, or agent systems

Other signals

  • multimodal large language models
  • tool-use capabilities
  • agent systems
  • MR environments