AI Research Scientist, Text Data Research - MSL FAIR

Meta · Big Tech · Bellevue, WA +2

AI Research Scientist focused on building the data foundation for Meta's advanced Large Language Models, contributing to data curation across pre-training, mid-training, and post-training stages, and exploring frontier paradigms for data at scale.

What you'd actually do

  1. Collaborate with cross-functional teams to develop Meta’s next foundational models
  2. Advance our understanding of open questions in data research, such as how to overcome data walls and how best to create synthetic data
  3. Fundamentally improve our data velocity across workflows and projects by contributing to the advancement of data tooling
  4. Architect efficient and scalable data curation systems and pipelines
  5. Execute on high-priority projects in pre-training, mid-training, or post-training data curation

Skills

Required

  • PhD in Computer Science or a related technical field
  • 2+ years of industry research experience in LLM/NLP or related AI/ML models
  • Practical experience with pre-training or mid-training data curation for large foundational models
  • Experience working with organic, synthetic, agentic, or reasoning data for LLMs
  • Hands-on experience with modeling frameworks like PyTorch
  • Hands-on experience with SQL and large-scale data handling
  • Familiarity with frameworks like Spark and Hive

Nice to have

  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • Experience as a formal technical lead, leading major technical initiatives with cross-functional impact, and/or influencing strategy across multiple teams
  • Experience working on frontier-quality/state-of-the-art Large Language Models

What the JD emphasized

  • Published research in leading peer-reviewed conferences
  • Multiple first-author publications in leading peer-reviewed conferences

Other signals

  • building data foundation for LLMs
  • data curation across pre-training, mid-training, post-training
  • organic data curation, synthetic data generation, agent and interaction data
  • frontier paradigms
  • contribute to Meta's frontier models like Llama