Research Engineer (Technical Leadership), FAIR Data - Meta Superintelligence Labs

Meta · Big Tech · Menlo Park, CA

Research Engineers at Meta Superintelligence Labs (FAIR) are responsible for building the data foundation for advanced Large Language and Media Models. This includes data curation across pre-training, mid-training, and post-training stages, focusing on challenges at trillion-scale such as organic and synthetic data generation, agent data, and frontier paradigms. The role involves architecting scalable data systems, improving data velocity, and applying expertise in various data modalities and domains.

What you'd actually do

Collaborate with cross-functional teams to develop Meta’s next foundational models
Architect efficient and scalable data curation systems and pipelines
Fundamentally improve our data velocity across workflows and projects by contributing to the advancement of data tooling
Execute on high priority projects in pre-training, mid-training, or post-training data curation
Apply specialized expertise in video/image perception or generation, OCR, agentic data, synthetic data, multilingual data, reasoning data, web parser, coding data, data scaling laws, or datamix optimization

Skills

Required

Python
PyTorch
Spark
SQL
large-scale data handling
data curation
pre-training
post-training
LLM expertise

Nice to have

Hive
Ray
DataFlow
video/image perception or generation
OCR
agentic data
synthetic data
multilingual data
reasoning data
web parser
coding data
data scaling laws
datamix optimization

What the JD emphasized

4+ years of industry research experience with pre/mid/post-training data curation for large language or large media models
4+ years of formal technical lead experience
Published research in leading peer-reviewed conferences (e.g., ACL, NeurIPS, ICML, ICLR, AAAI, KDD, CVPR, ICCV) and/or demonstrated significant industry influence in the field of AI
First-author publications at top peer-reviewed conferences (e.g., ACL, NeurIPS, ICML, ICLR, AAAI, KDD, CVPR, ICCV)
Experience working on frontier-quality/state-of-the-art Large Language or Large Media Models

Other signals

data curation for LLMs
trillion-scale data
organic data curation
synthetic data generation
agent and interaction data
frontier paradigms

Full job description

Meta is seeking Research Engineers to help us build the data foundation for Meta's most advanced Large Language and Media Models. We're looking for researchers with LLM expertise to join us in working with data at scale and pushing beyond the data ceiling. Our team contributes to data curation across all stages of LLM development (pre-training, mid-training, post-training) and all domains/modalities (e.g., web, code, image, video, multilingual). We tackle the hardest challenges at trillion-scale, including organic data curation, synthetic data generation, agent and interaction data, and frontier paradigms that redefine what's possible. Based in Meta Superintelligence Labs (MSL) within the Fundamental AI Research Organization (FAIR), you'll directly contribute to Meta’s frontier models, while having the chance to collaborate with researchers and engineers across MSL.

Responsibilities

Collaborate with cross-functional teams to develop Meta’s next foundational models
Architect efficient and scalable data curation systems and pipelines
Fundamentally improve our data velocity across workflows and projects by contributing to the advancement of data tooling
Execute on high priority projects in pre-training, mid-training, or post-training data curation
Apply specialized expertise in video/image perception or generation, OCR, agentic data, synthetic data, multilingual data, reasoning data, web parser, coding data, data scaling laws, or datamix optimization
Lead complex technical projects end-to-end

Qualifications

Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
4+ years of industry research experience with pre/mid/post-training data curation for large language or large media models
4+ years of formal technical lead experience
Experience leading major technical initiatives with cross-functional impact and influencing strategy across multiple teams
Published research in leading peer-reviewed conferences (e.g., ACL, NeurIPS, ICML, ICLR, AAAI, KDD, CVPR, ICCV) and/or demonstrated significant industry influence in the field of AI
Hands-on experience with SQL and large-scale data handling, and familiarity with frameworks like Spark and Hive
Programming experience in Python and hands-on experience with frameworks like PyTorch or Spark, or related distributed computing frameworks (Ray, DataFlow)
Master's degree or PhD in Computer Science or a related technical field
First-author publications at top peer-reviewed conferences (e.g., ACL, NeurIPS, ICML, ICLR, AAAI, KDD, CVPR, ICCV)
Experience working on frontier-quality/state-of-the-art Large Language or Large Media Models