Data Scientist, Evaluations - Meta Superintelligence Labs

Meta · Big Tech · Menlo Park, CA

Meta is seeking a Data Scientist for its Superintelligence Labs to lead the design, validation, and analysis of novel AI evaluations and benchmarks. The role emphasizes scientific rigor in measuring frontier AI capabilities and in influencing research directions through data-driven insights and publications.

What you'd actually do

  1. Lead the design of evaluation stimuli and benchmarks, ensuring minimal bias and high construct validity when measuring frontier LLM capabilities
  2. Design and execute effective sampling strategies and experimental frameworks to measure model performance and errors accurately
  3. Perform rigorous data and model error analyses to provide deep insights into model behavior, quality gaps, and failure modes
  4. Partner closely with Research Scientists and Engineers to translate organizational priorities into measurable, scientifically sound benchmarks
  5. Drive the publication of novel evaluation research and the open-sourcing of benchmarks to influence the broader AI research community

Skills

Required

  • Bachelor's degree in Computer Science, Computer Engineering, Mathematics, Statistics, a relevant technical field, or equivalent practical experience
  • A minimum of 6 years of work experience in analytics (minimum of 4 years with a Ph.D.)
  • Experience with data querying languages (e.g., SQL), scripting languages (e.g., Python), and/or statistical/mathematical software (e.g., R)
Preferred

  • Master’s or Ph.D. in a quantitative or experimentation-heavy field (e.g., Statistics, Psychology, Economics, Quantitative Social Sciences, or a related technical field)
  • Publications at top-tier peer-reviewed venues (e.g., NeurIPS, ICML, ICLR, ACL, or field-specific journals) related to measurement, evaluation, or experimental design
  • Recognized expertise in language model evaluation, psychometrics, or the science of benchmarking
  • A track record of open-source contributions to evaluation tools, datasets, or benchmarks
  • Familiarity with language model post-training, RLHF, or the nuances of LLM failure modes
  • Demonstrated ability to integrate AI tools to optimize/redesign workflows and drive measurable impact (e.g., efficiency gains, quality improvements)
  • Experience adhering to and implementing responsible, ethical AI practices (e.g., risk assessment, bias mitigation, quality and accuracy reviews)
  • Demonstrated ongoing AI skill development (e.g., prompt/context engineering, agent orchestration) and staying current with emerging AI technologies

What the JD emphasized

  • scientific rigor
  • frontier AI benchmarks
  • novel evaluations
  • AI capability measurement
  • rigorous, unbiased measurement
  • sampling strategies
  • benchmark quality and validity
  • model failures and limitations
  • novel research
  • measurement in uncharted territories
  • publication record
  • language model evaluation
  • science of benchmarking
  • evaluation tools, datasets, or benchmarks
  • language model post-training
  • RLHF
  • LLM failure modes
  • responsible, ethical AI practices
  • emerging AI technologies

Other signals

  • designing and validating novel evaluations
  • scientific rigor behind AI benchmarks
  • shaping the future of AI capability measurement
  • technical Data Science expert bridging abstract capabilities and rigorous measurement
  • leading sampling strategies
  • critically examining benchmark quality and validity
  • deep-dive analysis on model failures and limitations
  • conducting novel research
  • measurement in uncharted territories
  • influencing the broader AI research community