Research Scientist Intern, Multimodal AI (PhD)

Meta · Big Tech · Redmond, WA

Research Scientist Intern focused on multimodal AI, specifically in audio and visual learning for AR/VR applications. The role involves designing evaluation protocols for LLMs, developing datasets, analyzing model performance, and creating novel algorithms for audio research problems. Collaboration with AI product teams and dissemination of findings are key.

What you'd actually do

  1. Design, implement, and maintain comprehensive evaluation protocols for large language models, including both automated and human-in-the-loop assessments.
  2. Develop and curate high-quality datasets and benchmarks to measure model performance, safety, fairness, and robustness across a variety of tasks and modalities.
  3. Analyze model outputs to identify strengths, weaknesses, and failure modes, and provide actionable insights to research and engineering teams.
  4. Design and implement novel algorithms to solve audio research problems.
  5. Collaborate with teams building Meta’s language AI products.

Skills

Required

  • Python
  • MATLAB
  • PyTorch
  • TensorFlow
  • machine learning software platforms
  • building novel audio computational models
  • large language models (LLMs)
  • audio and speech quality assessment
  • multichannel audio processing
  • visual and acoustic scene analysis
  • manipulating and analyzing complex, large-scale, high-dimensional data
  • advancing AI techniques
  • core contributions to open source libraries and frameworks in computer vision or audio processing
  • theoretical and empirical research to solve problems
  • working and communicating cross-functionally in a team environment

Nice to have

  • Transformer Models
  • Generative AI
  • Computer vision
  • multimodal representation learning
  • audio visual scene analysis
  • egocentric audio visual learning
  • multi-sensory speech enhancement
  • acoustic activity localization
  • open source repositories (e.g. GitHub)

What the JD emphasized

  • PhD degree
  • first-authored publications at leading workshops or top computer vision and machine learning conferences

Other signals

  • multimodal representation learning
  • audio visual scene analysis
  • egocentric audio visual learning
  • multi-sensory speech enhancement
  • acoustic activity localization