AI Research Scientist, Media Data Research - MSL FAIR

Meta · Big Tech · Menlo Park, CA

AI Research Scientist focused on building the data foundation for Meta's advanced Large Language and Media Models. The role involves working with trillion-scale data, including organic and synthetic data generation, agent data, and frontier paradigms across pre-training, mid-training, and post-training stages. Expertise in multimodal data (image, video, agent) and collaboration with cross-functional teams are key.

What you'd actually do

  1. Collaborate with cross-functional teams to develop Meta’s next foundational models
  2. Advance our understanding of data research, such as how to overcome data walls and how best to create synthetic data
  3. Fundamentally improve our data velocity across workflows and projects by contributing to the advancement of data tooling
  4. Execute on high-priority projects in pre-training, mid-training, or post-training data curation
  5. Apply specialized expertise in video/image generation, video/image perception, OCR, data scaling laws, or data mixing

Skills

Required

  • PhD in Computer Science or a related technical field
  • 2+ years of industry research experience in LLM/NLP, computer vision, or related AI/ML models
  • Practical experience with multimodal pre-training or mid-training data curation for large media perception or generation models
  • Programming experience in Python
  • Hands-on experience with frameworks like PyTorch or Spark, or related distributed computing frameworks (Ray, DataFlow)
  • Familiarity with SQL and with table and file formats such as Hive, Iceberg, and Parquet

Nice to have

  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • Experience as a formal technical lead, leading major technical initiatives with cross-functional impact, and/or influencing strategy across multiple teams

What the JD emphasized

  • Published research in leading peer-reviewed conferences
  • First-author publications at top peer-reviewed conferences

Other signals

  • trillion-scale data curation
  • organic data curation
  • synthetic data generation
  • agent and interaction data
  • frontier paradigms