AI Research Scientist - MSL FAIR Foundations

Meta · Big Tech · Menlo Park, CA

Research Scientist role focused on developing and implementing novel evaluations for frontier AI systems, shaping research direction and model development. Requires a strong ML research background, experience with LLM/multimodal evaluation, and a publication record.

What you'd actually do

  1. Curate and integrate publicly available and internal benchmarks to steer the capability development of frontier models
  2. Develop and implement evaluation environments, including environments for novel model capabilities and modalities
  3. Collaborate with external data vendors to source and prepare high-quality evaluation datasets
  4. Execute on the technical vision of research scientists who design new benchmarks and evaluations
  5. Build robust, reusable evaluation pipelines that scale across multiple model lines and product areas (a minimal sketch follows this list)
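
To make the last item concrete, here is a minimal sketch of what a reusable evaluation pipeline can look like. Every name in it (EvalTask, run_eval, the toy model) is hypothetical and illustrative, not a Meta-internal API; a production harness would add batching, caching, result logging, and multimodal inputs.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class EvalTask:
    """One benchmark: prompts paired with reference answers.

    Hypothetical structure for illustration only.
    """
    name: str
    examples: list[tuple[str, str]]  # (prompt, reference) pairs

def exact_match(prediction: str, reference: str) -> bool:
    # Normalize whitespace and case so trivial formatting
    # differences are not counted as errors.
    return prediction.strip().lower() == reference.strip().lower()

def run_eval(model: Callable[[str], str], tasks: list[EvalTask]) -> dict[str, float]:
    """Run every task through the model and report per-task accuracy.

    `model` is any prompt -> completion callable, so the same
    pipeline can wrap different model lines behind one interface.
    """
    results: dict[str, float] = {}
    for task in tasks:
        correct = sum(
            exact_match(model(prompt), reference)
            for prompt, reference in task.examples
        )
        results[task.name] = correct / len(task.examples)
    return results

if __name__ == "__main__":
    # Dummy "model" so the sketch runs end to end.
    toy_model = lambda prompt: "4" if "2 + 2" in prompt else "unknown"
    arithmetic = EvalTask("toy-arithmetic", [("What is 2 + 2?", "4")])
    print(run_eval(toy_model, [arithmetic]))  # {'toy-arithmetic': 1.0}
```

The design choice that makes this reusable is the narrow prompt-to-completion callable boundary: any model line that can be wrapped in that interface plugs into the same pipeline unchanged.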

Skills

Required

  • PhD degree in Computer Science, Machine Learning, or a related technical field
  • 3+ years of experience in machine learning engineering, machine learning research, or a related technical role
  • Proficiency in Python and experience with ML frameworks such as PyTorch
  • Experience independently identifying, designing, and completing medium-to-large technical features
  • Proven success with software engineering practices, including version control, testing, and code review
  • Publications at peer-reviewed venues (NeurIPS, ICML, ICLR, ACL, EMNLP, or similar) related to language model evaluation, benchmarking, or deep learning
  • Hands-on experience with language model post-training and deep learning systems, or with building reinforcement learning environments
  • Experience implementing or developing evaluation benchmarks for large language models and multimodal models (e.g., vision-language, audio, video)
  • Experience working with large-scale distributed systems and data pipelines
  • Familiarity with language model evaluation frameworks and metrics (one representative metric is sketched after this list)
  • Track record of open-source contributions to ML evaluation tools or benchmarks
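
As an example of the kind of metric such frameworks implement, below is the unbiased pass@k estimator from Chen et al. (2021), widely used in code-generation benchmarking. The function name is ours, but the formula is the standard one.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: number of those samples that pass the tests
    k: sampling budget; returns the probability that at least
       one of k samples drawn from the n generations passes
    """
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset
        # must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 30 correct -> pass@10 ≈ 0.81
print(round(pass_at_k(200, 30, 10), 2))
```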

Nice to have

  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience

What the JD emphasized

  • evaluations are the core of AI progress
  • novel evaluations
  • AI capability measurement
  • scientific validity
  • methodological rigor
  • measurable benchmarks
  • evaluation insights
  • frontier AI development
  • evaluation environments
  • evaluation datasets
  • evaluation pipelines
  • evaluation suites
  • language model evaluation
  • benchmarking
  • language model evaluation frameworks and metrics
  • ML evaluation tools or benchmarks

Other signals

  • shape the future of AI capability measurement
  • translate organizational priorities into measurable benchmarks
  • translate evaluation insights back into research direction