What you'd actually do

Investigate systematic phonemic errors as causal factors in perceived speech quality degradation, and link them to human perceptual judgments

Build and curate datasets and benchmarks of speech for phoneme-level analysis

Explore and compare the capabilities of audio and video (multimodal) LLMs as tools to support this analysis

Relate findings to human perceptual data (quality preference and intelligibility) and translate them into actionable insights for research and engineering teams

Where appropriate, adapt multimodal models to the task in a supporting capacity

Skills

Required

Python
Matlab
PyTorch
TensorFlow
speech perception
psychoacoustics
acoustic phonetics
audio computational models
LLMs
software engineering
open source contributions
audio processing
speech quality assessment
multichannel audio processing
visual and acoustic scene analysis
large scale data analysis
theoretical and empirical research
cross-functional communication

Nice to have

generative AI
transformer models
computer vision
multimodal models
video LLMs

What the JD emphasized

PhD degree

3+ years experience with Python, Matlab, or similar

3+ years experience with machine learning software platforms such as PyTorch, TensorFlow, etc.

Background in speech perception, psychoacoustics, or acoustic phonetics

Experience deploying novel audio computational models and LLMs

Demonstrated software engineering experience via an internship, work experience, coding competitions, or widely used contributions in open source repositories (e.g. GitHub)

Experience in advancing AI techniques, including core contributions to open source libraries and frameworks in computer vision or audio processing

Experience with audio and speech quality assessment

Experience with multichannel audio processing

Experience in visual and acoustic scene analysis

Experience manipulating and analyzing complex, large scale, high-dimensionality data from varying sources

Proven track record of achieving significant results as demonstrated by grants, fellowships, patents, as well as first-authored publications at leading workshops or top computer vision and machine learning conferences such as ARO, ASA, NeurIPS, ICML, ICLR, ACL, EMNLP, CVPR, ICCV, ECCV, ICASSP, InterSpeech or similar

Experience in utilizing theoretical and empirical research to solve problems

Experience working and communicating cross functionally in a team environment

The Meta Reality Labs Research Team brings together a world-class team of researchers, developers, and engineers to create the future of virtual and augmented reality, which together will become as universal and essential as smartphones and personal computers are today. And just as personal computers have done over the past 45 years, AR, VR and MR will ultimately change everything about how we work, play, and connect. We are developing all the technologies needed to enable breakthrough AR glasses and VR headsets, including optics and displays, computer vision, audio, graphics, brain-computer interfaces, haptic interaction, eye/hand/face/body tracking, perception science, and true telepresence. Some of those will advance much faster than others, but they all need to happen to enable AR, VR and MR that are so compelling that they become an integral part of our lives.

In particular, the Meta Reality Labs Research audio team is focused on two goals: creating virtual sounds that are perceptually indistinguishable from reality, and redefining human hearing. See more about our work here: Inside Facebook Reality Labs Research: The future of audio and Filter Out the Noise With Conversation Focus. These two initiatives will allow us to connect people by allowing them to feel together despite being physically apart, and allow them to converse in even the most difficult listening environments.

Meta Reality Labs Research is looking for an intern who is passionate about speech perception and audio quality to investigate why processed speech sometimes sounds degraded or robotic. The project focuses on identifying systematic phonemic errors as causal factors in perceived quality degradation, and linking these errors to human quality and intelligibility judgments. A core method is to explore the capabilities of audio vs. video LLMs. This is fundamentally a speech-perception research role; multimodal/LLM methods are a supporting tool rather than the central focus.

Our internships are twelve (12) to twenty-four (24) weeks long and we have various start dates throughout the year.

Responsibilities

Investigate systematic phonemic errors as causal factors in perceived speech quality degradation, and link them to human perceptual judgments

Build and curate datasets and benchmarks of speech for phoneme-level analysis

Explore and compare the capabilities of audio and video (multimodal) LLMs as tools to support this analysis

Relate findings to human perceptual data (quality preference and intelligibility) and translate them into actionable insights for research and engineering teams

Where appropriate, adapt multimodal models to the task in a supporting capacity

Collaborate with researchers, engineers, and cross-functional partners to define goals, communicate findings, and drive improvements in speech quality

Develop tools and infrastructure to streamline and scale the analysis

Stay current with research in speech perception and audio quality and intelligibility assessment, and incorporate best practices into Meta's workflows

Disseminate results through internal reports and presentations, and, when appropriate, external publications

Qualifications

Currently has, or is in the process of obtaining, a PhD degree in the field of Speech and Hearing Science, Auditory Neuroscience, Computational Neuroscience, Computer Science, Artificial Intelligence, Generative AI, Transformer Models, Machine Learning, Signal Processing or Computer Vision

3+ years experience with Python, Matlab, or similar

3+ years experience with machine learning software platforms such as PyTorch, TensorFlow, etc.

Background in speech perception, psychoacoustics, or acoustic phonetics

Experience deploying novel audio computational models and LLMs

Must obtain work authorization in the country of employment at the time of hire, and maintain ongoing work authorization during employment

Experience building novel audio computational models and LLMs

Demonstrated software engineering experience via an internship, work experience, coding competitions, or widely used contributions in open source repositories (e.g. GitHub)

Experience in advancing AI techniques, including core contributions to open source libraries and frameworks in computer vision or audio processing

Experience with audio and speech quality assessment

Experience with multichannel audio processing

Experience in visual and acoustic scene analysis

Experience manipulating and analyzing complex, large scale, high-dimensionality data from varying sources

Proven track record of achieving significant results as demonstrated by grants, fellowships, patents, as well as first-authored publications at leading workshops or top computer vision and machine learning conferences such as ARO, ASA, NeurIPS, ICML, ICLR, ACL, EMNLP, CVPR, ICCV, ECCV, ICASSP, InterSpeech or similar

Experience in utilizing theoretical and empirical research to solve problems

Experience working and communicating cross functionally in a team environment

Intent to return to a degree program after the completion of the internship/co-op

Research Scientist Intern, Audio Quality With AI (PhD)
