What you'd actually do

Curate and integrate publicly available and internal benchmarks to direct the capabilities of frontier model development

Develop and implement evaluation environments, including environments for novel model capabilities and modalities

Collaborate with external data vendors to source and prepare high-quality evaluation datasets

Execute on the technical vision of research scientists designing new benchmarks and evaluations

Build robust, reusable evaluation pipelines that scale across multiple model lines and product areas

Skills

Required

Python
PyTorch
ML frameworks
software engineering practices
version control
testing
code review practices
language model post-training
deep learning systems
building reinforcement learning environments
implementing or developing evaluation benchmarks for large language models and multimodal models
large-scale distributed systems
data pipelines
language model evaluation frameworks and metrics

Nice to have

Publications at peer-reviewed venues (NeurIPS, ICML, ICLR, ACL, EMNLP, or similar) related to language model evaluation, benchmarking, or deep learning
Track record of open-source contributions to ML evaluation tools or benchmarks

What the JD emphasized

evaluations are the core of AI progress

novel benchmarks

evaluation tooling at scale

Publications at peer-reviewed venues (NeurIPS, ICML, ICLR, ACL, EMNLP, or similar) related to language model evaluation, benchmarking, or deep learning

Experience implementing or developing evaluation benchmarks for large language models and multimodal models (e.g., vision-language, audio, video)

Meta is seeking Research Engineers to join the Evaluations team within Meta Superintelligence Labs. Evaluations are the core of AI progress at MSL, determining what capabilities get built, which features get prioritized, and how fast our models improve. As a Research Engineer on this team, you will curate and build the benchmarks for our most advanced AI models, across text, vision, audio, and beyond. You'll work alongside world-class researchers and engineers to collect, develop, and deploy novel benchmarks and reinforcement learning environments.

This is a highly technical role requiring solid research engineering skills and the ability to work independently on a variety of open-ended machine learning challenges with high reliability. The evaluations you build will directly impact the research direction and major model lines within MSL, making engineering reliability, rigor, and scalability paramount. You will excel by maintaining high velocity while adapting to rapidly shifting priorities as we advance the technical research frontier. You'll need to be flexible and adaptive, tackling a wide variety of problems in the evaluations space, from implementing existing benchmarks to developing novel benchmarks and environments to implementing evaluation tooling at scale.

If you are passionate about defining the capabilities that drive AI progress and thrive in fast-paced, high-impact research environments, we encourage you to apply for this exciting opportunity at the core of MSL.

Responsibilities

Curate and integrate publicly available and internal benchmarks to direct the capabilities of frontier model development

Develop and implement evaluation environments, including environments for novel model capabilities and modalities

Collaborate with external data vendors to source and prepare high-quality evaluation datasets

Execute on the technical vision of research scientists designing new benchmarks and evaluations

Build robust, reusable evaluation pipelines that scale across multiple model lines and product areas

Contribute to evaluation tooling that measures the quality and reliability of evaluation suites

Qualifications

Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience

3+ years of experience in machine learning engineering, machine learning research, or a related technical role

Proficiency in Python and experience with ML frameworks such as PyTorch

Experience identifying, designing and completing medium to large technical features independently, without guidance

Demonstrated experience in software engineering practices including version control, testing, and code review practices

Ability to work independently and adapt to rapidly changing priorities

Publications at peer-reviewed venues (NeurIPS, ICML, ICLR, ACL, EMNLP, or similar) related to language model evaluation, benchmarking, or deep learning

Hands-on experience with language model post-training and deep learning systems, or building reinforcement learning environments

Experience implementing or developing evaluation benchmarks for large language models and multimodal models (e.g., vision-language, audio, video)

Experience working with large-scale distributed systems and data pipelines

Familiarity with language model evaluation frameworks and metrics

Track record of open-source contributions to ML evaluation tools or benchmarks
