Research Engineer (Technical Leadership), FAIR Data - Meta Superintelligence Labs

Meta · Big Tech · Menlo Park, CA

Research Engineers at Meta Superintelligence Labs (FAIR) are responsible for building the data foundation for advanced Large Language and Media Models. This includes data curation across the pre-training, mid-training, and post-training stages, with a focus on trillion-scale challenges such as organic and synthetic data generation, agent data, and frontier paradigms. The role involves architecting scalable data systems, improving data velocity, and applying expertise in various data modalities and domains.

What you'd actually do

  1. Collaborate with cross-functional teams to develop Meta’s next foundational models
  2. Architect efficient and scalable data curation systems and pipelines
  3. Fundamentally improve our data velocity across workflows and projects by contributing to the advancement of data tooling
  4. Execute on high-priority projects in pre-training, mid-training, or post-training data curation
  5. Apply specialized expertise in video/image perception or generation, OCR, agentic data, synthetic data, multilingual data, reasoning data, web parser, coding data, data scaling laws, or datamix optimization

Skills

Required

  • Python
  • PyTorch
  • Spark
  • SQL
  • large-scale data handling
  • data curation
  • pre-training
  • post-training
  • LLM expertise

Nice to have

  • Hive
  • Ray
  • DataFlow
  • video/image perception or generation
  • OCR
  • agentic data
  • synthetic data
  • multilingual data
  • reasoning data
  • web parser
  • coding data
  • data scaling laws
  • datamix optimization

What the JD emphasized

  • 4+ years of industry research experience with pre/mid/post-training data curation for large language or large media models
  • 4+ years of formal technical lead experience
  • Published research in leading peer-reviewed conferences (e.g., ACL, NeurIPS, ICML, ICLR, AAAI, KDD, CVPR, ICCV) and/or demonstrated significant industry influence in the field of AI
  • First-author publications at top peer-reviewed conferences (e.g., ACL, NeurIPS, ICML, ICLR, AAAI, KDD, CVPR, ICCV)
  • Experience working on frontier-quality/state-of-the-art Large Language or Large Media Models

Other signals

  • data curation for LLMs
  • trillion-scale data
  • organic data curation
  • synthetic data generation
  • agent and interaction data
  • frontier paradigms