AI Research Scientist, Text Data Resear… at Meta

What you'd actually do

Collaborate with cross-functional teams to develop Meta’s next foundational models

Advance our understanding of data research, such as how to overcome data walls and how best to create synthetic data

Fundamentally improve our data velocity across workflows and projects by contributing to the advancement of data tooling

Architect efficient and scalable data curation systems and pipelines

Execute on high-priority projects in pre-training, mid-training, or post-training data curation

Skills

Required

PhD in Computer Science or a related technical field
2+ years of industry research experience in LLM/NLP or related AI/ML models
Practical experience with pre-training or mid-training data curation for large foundational models
Experience working with organic, synthetic, agentic, or reasoning data for LLMs
Hands-on experience with modeling frameworks like PyTorch
Hands-on experience with SQL and large-scale data handling, and familiarity with frameworks like Spark and Hive

Nice to have

Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
Experience as a formal technical lead, leading major technical initiatives with cross-functional impact, and/or influencing strategy across multiple teams
Experience working on frontier-quality/state-of-the-art Large Language Models

Meta is seeking AI research scientists to help us build the data foundation for Meta's most advanced Large Language Models. We're looking for researchers with LLM expertise to join us in working with data at scale and pushing beyond the data ceiling. Our team contributes to data curation across all stages of LLM development (pre-training, mid-training, post-training) and all domains/modalities (e.g., web, code, agent, multilingual). We tackle the hardest challenges at trillion-scale, including organic data curation, synthetic data generation, agent and interaction data, and frontier paradigms that redefine what's possible. Based in Meta Superintelligence Labs (MSL) within the Fundamental AI Research Organization (FAIR), you'll directly contribute to Meta’s frontier models like Llama, while having the chance to collaborate with researchers and engineers across MSL.

Responsibilities

Collaborate with cross-functional teams to develop Meta’s next foundational models

Advance our understanding of data research, such as how to overcome data walls and how best to create synthetic data

Fundamentally improve our data velocity across workflows and projects by contributing to the advancement of data tooling

Architect efficient and scalable data curation systems and pipelines

Execute on high-priority projects in pre-training, mid-training, or post-training data curation

Apply specialized expertise in agentic data, synthetic data, reasoning data, web parser, coding data, data scaling laws, or datamix optimization

Lead complex technical projects end-to-end

Qualifications

Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
PhD in Computer Science or a related technical field
2+ years of industry research experience in LLM/NLP or related AI/ML models
Experience as a formal technical lead, leading major technical initiatives with cross-functional impact, and/or influencing strategy across multiple teams
Practical experience with pre-training or mid-training data curation for large foundational models and experience working with organic, synthetic, agentic, or reasoning data for LLMs
Published research in leading peer-reviewed conferences (e.g., NeurIPS, ICML, ICLR, ACL, EMNLP) and/or demonstrated significant industry influence in the field of AI
Experience working on frontier-quality/state-of-the-art Large Language Models
Multiple first-author publications in leading peer-reviewed conferences (e.g., NeurIPS, ICML, ICLR, ACL, EMNLP)
Hands-on experience with modeling frameworks like PyTorch
Hands-on experience with SQL and large-scale data handling, and familiarity with frameworks like Spark and Hive

AI Research Scientist, Text Data Research - MSL FAIR
