What you'd actually do

Design and implement multi-modal understanding systems that combine vision, language, and other sensory inputs to enable richer contextual awareness

Develop algorithms for cross-modal learning, fusion, and reasoning to improve human-AI interaction

Lead the curation and management of multi-modal datasets, ensuring data quality and diversity across vision, language, and sensor modalities

Design and oversee ground truth annotation workflows and quality assurance processes for multi-modal data

Complete medium to large features spanning multiple tasks independently with minimal to no guidance

Skills

Required

C++
Python
PyTorch
TensorFlow
deep learning frameworks
cross-functional teams

Nice to have

Master's degree in Computer Science, Computer Vision, Machine Learning, or related field
vision-language models
multi-modal transformers
Publications or contributions to multi-modal understanding research
large language models
data curation
annotation tools
ground truth labeling pipelines

As a Research Engineer focused on Multi-Modal Understanding, you will develop advanced algorithms that integrate computer vision with other modalities such as language, audio, and sensor data. You will also drive the curation of multi-modal datasets and ground truth annotation pipelines to support model training and evaluation. You will work closely with our research team to bring innovative multi-modal solutions to production, bridging the gap between visual perception and holistic contextual understanding for immersive applications.

Responsibilities

Design and implement multi-modal understanding systems that combine vision, language, and other sensory inputs to enable richer contextual awareness

Develop algorithms for cross-modal learning, fusion, and reasoning to improve human-AI interaction

Lead the curation and management of multi-modal datasets, ensuring data quality and diversity across vision, language, and sensor modalities

Design and oversee ground truth annotation workflows and quality assurance processes for multi-modal data

Complete medium to large features spanning multiple tasks independently with minimal to no guidance

Collaborate with researchers and engineers across computer vision and machine learning teams to drive multi-modal innovation

Develop well-organized code with proper testing and documentation, building production-ready multi-modal systems

Qualifications

Currently has, or is in the process of obtaining a Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience. Degree must be completed prior to joining Meta

Proven experience with C++ and/or Python, including experience with modern features

Experience working with deep learning frameworks such as PyTorch and TensorFlow

Demonstrated experience working collaboratively in cross-functional teams

Master's degree in Computer Science, Computer Vision, Machine Learning, or related field

Experience with vision-language models or multi-modal transformers

Publications or contributions to multi-modal understanding research

Familiarity with large language models and their integration with visual understanding systems

Experience with data curation, annotation tools, or ground truth labeling pipelines

Research Engineer, Computer Vision
