AI Research Scientist, Audio-visual Understanding, FAIR

Meta · Menlo Park, CA

Research Scientist position in FAIR focused on advancing AI science and developing technology toward superintelligence. The role involves developing joint audio-visual understanding systems, building and evaluating audiovisual language models for social interactions, and contributing to benchmarks. Requires a PhD and research experience in ML/AI, with a focus on computer vision, speech, and multimodal learning for embodied conversational agents.

What you'd actually do

  1. Develop joint audio-visual understanding systems that integrate visual and auditory signals for advanced perception
  2. Build and evaluate audiovisual language models for social interactions and understanding, including predicting social intent, semantic function, and reasoning from human-centric inputs
  3. Contribute to benchmarks and evaluation frameworks for visual social understanding and interactions
  4. Train and optimize state-of-the-art machine learning and neural network methodologies
  5. Conduct and collaborate on research projects within a globally distributed team

Skills

Required

  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • PhD in AI, computer science, data science, or related technical fields
  • Research background in machine learning, artificial intelligence, computational statistics, applied mathematics, or related areas
  • Research publications reflecting experience in theoretical or empirical research
  • Experience developing and debugging in Python or similar programming languages
  • Experience analyzing and collecting data from various sources
  • Demonstrated research and software engineering experience via an internship, work experience, coding competitions, or widely used contributions to open source repositories (e.g., GitHub)
  • Experience with audio-visual learning or multimodal fusion techniques
  • Experience with vision-language models (VLMs) such as LLaVA, GPT-4V, Gemini, or similar architectures

Nice to have

  • Experience in computer vision
  • Experience in speech and multimodal learning
  • Familiarity with human action recognition, social signal processing, or human-centric video understanding
  • Experience with long-form video understanding, video-language models, or streaming perception systems
  • Experience with temporal modeling, video transformers, or recurrent architectures for sequential data

What the JD emphasized

  • PhD in AI, computer science, data science, or related technical fields
  • Research publications reflecting experience in theoretical or empirical research
  • Experience with audio-visual learning or multimodal fusion techniques
  • Experience with vision-language models (VLMs) such as LLaVA, GPT-4V, Gemini, or similar architectures

Other signals

  • advancing the science of intelligence
  • developing technology toward achieving superintelligence
  • perceptual foundations for real-time embodied conversational agents