Research Engineer, Media Data Research - MSL FAIR

Meta · Big Tech · Menlo Park, CA

Research Engineer role focused on building data foundations for Meta's advanced LLMs and LMMs, contributing to data curation across pre-training, mid-training, and post-training stages for various modalities. The role involves architecting scalable data systems, generating synthetic data, and working with trillion-scale datasets.

What you'd actually do

Collaborate with cross-functional teams to develop Meta’s next foundational models
Architect efficient and scalable data curation systems and pipelines
Fundamentally improve our data velocity across workflows and projects by contributing to the advancement of data tooling
Execute on high priority projects in pre-training, mid-training, or post-training data curation
Apply specialized expertise in video/image generation, video/image perception, OCR, data scaling laws, or data mixing

Skills

Required

Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
2+ years of industry research experience in LLM/NLP, computer vision, or related AI/ML models
Experience as a formal technical lead, leading major technical initiatives with cross-functional impact, and/or influencing strategy across multiple teams
Practical experience with multimodal pre-training or mid-training data curation for large media perception or generation models
Demonstrated data infrastructure and software background, and experience building data tooling and services
Published research in leading peer-reviewed conferences (e.g., ACL, NeurIPS, ICML, ICLR, AAAI, KDD, CVPR, ICCV) and/or demonstrated significant industry influence in the field of AI
Master's degree or PhD in Computer Science or a related technical field
Programming experience in Python and hands-on experience with frameworks like PyTorch or Spark, or related distributed computing frameworks (Ray, DataFlow)
Familiarity with SQL and file formats, such as Hive, Iceberg, Parquet, etc.

Nice to have

equivalent practical experience
LLM/NLP
computer vision
related AI/ML models
technical lead
cross-functional impact
strategy across multiple teams
multimodal pre-training or mid-training data curation
large media perception or generation models
data infrastructure and software background
data tooling and services
leading peer-reviewed conferences
significant industry influence
AI
Master's degree or PhD
PyTorch or Spark
distributed computing frameworks
Ray, DataFlow
SQL
Hive, Iceberg, Parquet, etc

What the JD emphasized

trillion-scale
organic data curation
synthetic data generation
agent and interaction data
frontier paradigms
pre-training
mid-training
post-training
video/image generation
video/image perception
data scaling laws
data mixing
multimodal pre-training or mid-training data curation

Full job description

Meta is seeking AI research engineers to help us build the data foundation for Meta's most advanced Large Language and Media Models. We're looking for engineers with LLM/LMM expertise to join us in working with data at scale and pushing beyond the data ceiling. Our team contributes to data curation across all stages of LLM/LMM development (pre-training, mid-training, post-training) and all domains/modalities (image, video, agent, media perception and generation). We are tackling complex challenges at trillion-scale, including organic data curation, synthetic data generation, agent and interaction data, and frontier paradigms that redefine what is possible. Based in Meta Superintelligence Labs (MSL) within the Fundamental AI Research Organization (FAIR), you'll directly contribute to Meta’s frontier models like Llama, while having the chance to collaborate with researchers and engineers across MSL.

Responsibilities

Collaborate with cross-functional teams to develop Meta’s next foundational models
Architect efficient and scalable data curation systems and pipelines
Fundamentally improve our data velocity across workflows and projects by contributing to the advancement of data tooling
Execute on high priority projects in pre-training, mid-training, or post-training data curation
Apply specialized expertise in video/image generation, video/image perception, OCR, data scaling laws, or data mixing
Lead complex technical projects end-to-end

Qualifications

Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
2+ years of industry research experience in LLM/NLP, computer vision, or related AI/ML models
Experience as a formal technical lead, leading major technical initiatives with cross-functional impact, and/or influencing strategy across multiple teams
Practical experience with multimodal pre-training or mid-training data curation for large media perception or generation models
Demonstrated data infrastructure and software background, and experience building data tooling and services
Published research in leading peer-reviewed conferences (e.g., ACL, NeurIPS, ICML, ICLR, AAAI, KDD, CVPR, ICCV) and/or demonstrated significant industry influence in the field of AI
Experience working on frontier-quality/state-of-the-art Large Language or Large Media Models
Master's degree or PhD in Computer Science or a related technical field
Programming experience in Python and hands-on experience with frameworks like PyTorch or Spark, or related distributed computing frameworks (Ray, DataFlow)
Familiarity with SQL and file formats, such as Hive, Iceberg, Parquet, etc.