AI Research Scientist, Media Data Research - MSL FAIR

Meta · Big Tech · Menlo Park, CA

AI Research Scientist focused on building the data foundation for Meta's advanced Large Language and Media Models. The role involves data curation across pre-training, mid-training, and post-training stages, handling trillion-scale data challenges including organic and synthetic data generation, and contributing to frontier paradigms. Collaboration with cross-functional teams and leading complex technical projects are key.

What you'd actually do

  1. Collaborate with cross-functional teams to develop Meta’s next foundational models
  2. Advance our understanding of data research, such as how to overcome data walls and how best to create synthetic data
  3. Fundamentally improve our data velocity across workflows and projects by contributing to quality in data tooling
  4. Execute on high priority projects in pre-training, mid-training, or post-training data curation
  5. Apply specialized expertise in video/image generation, video/image perception, OCR, data scaling laws, or data mixing
  6. Lead complex technical projects end-to-end

Skills

Required

  • PhD in Computer Science or a related technical field
  • 1+ year of industry research experience in LLM/LMM, computer vision, or related AI/ML models
  • Practical experience with multimodal pre-training or mid-training data curation for large media perception or generation models
  • Programming experience in Python
  • Hands-on experience with frameworks like PyTorch or Spark, or related distributed computing frameworks (Ray, DataFlow)
  • Familiarity with SQL and file formats such as Hive, Iceberg, Parquet, etc.

Nice to have

  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • Experience owning and/or driving complex technical projects end-to-end
  • Experience working on frontier-quality/state-of-the-art Large Language or Large Media Models

What the JD emphasized

  • Published research in leading peer-reviewed conferences
  • First-author publications at top peer-reviewed conferences

Other signals

  • LLM/LMM expertise
  • data curation at scale
  • trillion-scale data challenges
  • organic data curation
  • synthetic data generation
  • agent and interaction data
  • frontier paradigms