Research Engineer - MSL FAIR Foundations

Meta · Big Tech · Menlo Park, CA +2

Research Engineer role focused on building and curating benchmarks and evaluation environments for advanced AI models (text, vision, audio). The role involves developing novel benchmarks, integrating existing ones, and creating scalable evaluation pipelines and tooling to directly impact research direction and model development. Requires strong ML engineering, Python, PyTorch, and experience with LLM/multimodal evaluation.

What you'd actually do

  1. Curate and integrate publicly available and internal benchmarks that guide capability development for frontier models
  2. Develop and implement evaluation environments, including environments for novel model capabilities and modalities
  3. Collaborate with external data vendors to source and prepare high-quality evaluation datasets
  4. Execute on the technical vision of research scientists designing new benchmarks and evaluations
  5. Build robust, reusable evaluation pipelines that scale across multiple model lines and product areas

Skills

Required

  • Python
  • PyTorch
  • ML frameworks
  • software engineering practices
  • version control
  • testing
  • code review practices
  • language model post-training
  • deep learning systems
  • building reinforcement learning environments
  • implementing or developing evaluation benchmarks for large language models and multimodal models
  • large-scale distributed systems
  • data pipelines
  • language model evaluation frameworks and metrics

Nice to have

  • Publications at peer-reviewed venues (NeurIPS, ICML, ICLR, ACL, EMNLP, or similar) related to language model evaluation, benchmarking, or deep learning
  • Track record of open-source contributions to ML evaluation tools or benchmarks

What the JD emphasized

  • evaluations are the core of AI progress
  • novel benchmarks
  • evaluation tooling at scale
  • Publications at peer-reviewed venues (NeurIPS, ICML, ICLR, ACL, EMNLP, or similar) related to language model evaluation, benchmarking, or deep learning
  • Experience implementing or developing evaluation benchmarks for large language models and multimodal models (e.g., vision-language, audio, video)

Other signals

  • curate and build the benchmarks for our most advanced AI models
  • novel benchmarks and reinforcement learning environments