As a Research Engineer focused on Multi-Modal Understanding, you will develop advanced algorithms that integrate computer vision with other modalities such as language, audio, and sensor data. You will also drive the curation of multi-modal datasets and ground truth annotation pipelines to support model training and evaluation. You will work closely with our research team to bring innovative multi-modal solutions to production, bridging the gap between visual perception and holistic contextual understanding for immersive applications.
Responsibilities
Design and implement multi-modal understanding systems that combine vision, language, and other sensory inputs to enable richer contextual awareness
Develop algorithms for cross-modal learning, fusion, and reasoning to improve human-AI interaction
Lead the curation and management of multi-modal datasets, ensuring data quality and diversity across vision, language, and sensor modalities
Design and oversee ground truth annotation workflows and quality assurance processes for multi-modal data
Complete medium to large features spanning multiple tasks independently with minimal to no guidance
Collaborate with researchers and engineers across computer vision and machine learning teams to drive multi-modal innovation
Develop well-organized code with proper testing and documentation, building production-ready multi-modal systems
Qualifications
Currently has, or is in the process of obtaining, a Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience. Degree must be completed prior to joining Meta
Proven experience with C++ and/or Python, including experience with modern features
Experience working with deep learning frameworks such as PyTorch and TensorFlow
Demonstrated experience working collaboratively in cross-functional teams
Master's degree in Computer Science, Computer Vision, Machine Learning, or related field
Experience with vision-language models or multi-modal transformers
Publications or contributions to multi-modal understanding research
Familiarity with large language models and their integration with visual understanding systems
Experience with data curation, annotation tools, or ground truth labeling pipelines