Research Scientist

Meta · Big Tech · Pittsburgh, PA +1

Research Scientist at Meta Reality Labs focusing on multi-modal AI research, integrating vision, language, audio, and sensor data to build next-generation AI-powered interactions. The role involves leading research projects, developing and optimizing multi-modal models, and transitioning research into production with a focus on cross-modal alignment and fusion.

What you'd actually do

Lead the design, development, and optimization of multi-modal models that integrate vision, language, audio, and sensor inputs
Set technical direction for multi-modal research projects
Conduct research and experiments to improve cross-modal alignment and fusion strategies
Collaborate with cross-functional teams (engineering, HCI, product) to transition multi-modal research into production
Explore and adopt novel model optimization, quantization, and efficiency techniques

Skills

Required

PhD in Computer Science, Machine Learning, Computer Vision, or a related technical field
Expertise in multi-modal learning (architecture design, training, cross-modal alignment)
Programming experience in Python
Hands-on experience with deep learning frameworks (PyTorch)
Experience developing machine learning models at scale
5+ years of research experience with multiple modalities (vision, language, audio, sensor data)
Deep expertise in vision-language models, cross-modal attention mechanisms, or contrastive learning
First-authored publications at peer-reviewed AI conferences
Experience with on-device or edge multi-modal model optimization (quantization, sparsity, distillation)
Demonstrated software engineering experience
Experience bringing multi-modal AI products from research to production
Proven track record of developing multi-modal models that fuse vision, language, and/or audio

Nice to have

Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience

What the JD emphasized

multi-modal understanding
vision, language, audio, and sensor modalities
cross-modal alignment and fusion strategies
transition multi-modal research into production
multi-modal learning
vision-language models
First-authored publications at peer-reviewed AI conferences
on-device or edge multi-modal model optimization
bringing multi-modal AI products from research to production
developing multi-modal models that fuse vision, language, and/or audio for real-world applications

Other signals

multi-modal understanding
vision, language, audio, and sensor modalities
cross-modal alignment and fusion
research into production

Full job description

Reality Labs at Meta is seeking a Research Scientist with expertise in multi-modal understanding to advance AI-powered interactions. We're building next-generation capabilities that integrate vision, language, audio, and sensor modalities. This is a unique opportunity to conduct cutting-edge multi-modal research with direct product impact.

Responsibilities

Lead the design, development, and optimization of multi-modal models that integrate vision, language, audio, and sensor inputs
Set technical direction for multi-modal research projects
Conduct research and experiments to improve cross-modal alignment and fusion strategies
Collaborate with cross-functional teams (engineering, HCI, product) to transition multi-modal research into production
Explore and adopt novel model optimization, quantization, and efficiency techniques
Stay current with state-of-the-art advances in multi-modal learning, vision-language models, and related fields

Qualifications

Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
Currently has, or is in the process of obtaining, a PhD in Computer Science, Machine Learning, Computer Vision, or a related technical field. Degree must be completed prior to joining Meta
Demonstrated expertise in multi-modal learning, including architecture design, training, and cross-modal alignment techniques
Programming experience in Python and hands-on experience with deep learning frameworks such as PyTorch
Experience developing machine learning models at scale from inception to impact
5+ years of research experience working autonomously on ML problems involving multiple modalities (vision, language, audio, or sensor data)
Deep expertise in vision-language models, cross-modal attention mechanisms, or contrastive learning approaches
First-authored publications at peer-reviewed AI conferences (e.g., CVPR, NeurIPS, ICML, ICLR, ACL, ECCV)
Experience with on-device or edge multi-modal model optimization (quantization, sparsity, distillation)
Demonstrated software engineering experience via internship, work experience, or widely used contributions in open source repositories
Experience bringing multi-modal AI products from research to production
Proven track record of developing multi-modal models that fuse vision, language, and/or audio for real-world applications