Research Engineer, Text Data Research - MSL FAIR

Meta · Big Tech · Menlo Park, CA

Meta is seeking AI research engineers to build the data foundation for its advanced LLMs. The role involves working with large-scale data across the pre-training, mid-training, and post-training stages, with a focus on organic data curation, synthetic data generation, agent data, and frontier paradigms. The position sits in FAIR at Meta Superintelligence Labs (MSL) and contributes to models like Llama.

What you'd actually do

  1. Collaborate with cross-functional teams to develop Meta’s next foundational models
  2. Architect efficient and scalable data curation systems and pipelines
  3. Fundamentally improve our data velocity across workflows and projects by contributing to the advancement of data tooling
  4. Execute on high-priority projects in pre-training, mid-training, or post-training data curation
  5. Apply specialized expertise in agentic data, synthetic data, reasoning data, web parsing, coding data, data scaling laws, or datamix optimization

Skills

Required

  • LLM expertise
  • Large-scale data handling
  • Data curation systems and pipelines
  • Data tooling development
  • Agentic data
  • Synthetic data generation
  • Reasoning data
  • Web parsing
  • Coding data
  • Data scaling laws
  • Datamix optimization
  • PyTorch
  • SQL
  • Spark
  • Hive
  • Industry research experience in LLM/NLP or related AI/ML models
  • Published research in leading peer-reviewed conferences or demonstrated significant industry influence

Nice to have

  • Master's degree or PhD in Computer Science or a related technical field
  • Experience as a formal technical lead
  • Experience working on frontier-quality/state-of-the-art Large Language Models

What the JD emphasized

  • 2+ years of industry research experience in LLM/NLP or related AI/ML models
  • Published research in leading peer-reviewed conferences (e.g., NeurIPS, ICML, ICLR, ACL, EMNLP) and/or demonstrated significant industry influence in the field of AI

Other signals

  • data curation
  • LLM development
  • trillion-scale data
  • organic data curation
  • synthetic data generation
  • agent and interaction data
  • frontier paradigms