Research Engineer, Media Data Research - MSL FAIR

Meta · Big Tech · Menlo Park, CA

Research Engineer role focused on building data foundations for Meta's advanced LLMs and LMMs, contributing to data curation across pre-training, mid-training, and post-training stages for various modalities. The role involves architecting scalable data systems, generating synthetic data, and working with trillion-scale datasets.

What you'd actually do

  1. Collaborate with cross-functional teams to develop Meta’s next foundational models
  2. Architect efficient and scalable data curation systems and pipelines
  3. Fundamentally improve our data velocity across workflows and projects by contributing to the advancement of data tooling
  4. Execute on high-priority projects in pre-training, mid-training, or post-training data curation
  5. Apply specialized expertise in video/image generation, video/image perception, OCR, data scaling laws, or data mixing

Skills

Required

  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • 2+ years of industry research experience in LLM/NLP, computer vision, or related AI/ML models
  • Experience as a formal technical lead, leading major technical initiatives with cross-functional impact, and/or influencing strategy across multiple teams
  • Practical experience with multimodal pre-training or mid-training data curation for large media perception or generation models
  • Demonstrated data infrastructure and software background, and experience building data tooling and services
  • Published research in leading peer-reviewed conferences (e.g., ACL, NeurIPS, ICML, ICLR, AAAI, KDD, CVPR, ICCV) and/or demonstrated significant industry influence in the field of AI
  • Master's degree or PhD in Computer Science or a related technical field
  • Programming experience in Python
  • Hands-on experience with frameworks like PyTorch or Spark, or related distributed computing frameworks (Ray, DataFlow)
  • Familiarity with SQL and file formats, such as Hive, Iceberg, Parquet, etc.

Nice to have

  • equivalent practical experience
  • LLM/NLP
  • computer vision
  • related AI/ML models
  • technical lead
  • cross-functional impact
  • strategy across multiple teams
  • multimodal pre-training or mid-training data curation
  • large media perception or generation models
  • data infrastructure and software background
  • data tooling and services
  • leading peer-reviewed conferences
  • significant industry influence
  • AI
  • Master's degree or PhD
  • PyTorch or Spark
  • distributed computing frameworks
  • Ray, DataFlow
  • SQL
  • Hive, Iceberg, Parquet, etc

What the JD emphasized

  • trillion-scale
  • organic data curation
  • synthetic data generation
  • agent and interaction data
  • frontier paradigms
  • pre-training
  • mid-training
  • post-training
  • video/image generation
  • video/image perception
  • data scaling laws
  • data mixing
  • multimodal pre-training or mid-training data curation

Other signals

  • trillion-scale data curation
  • organic data curation
  • synthetic data generation
  • agent and interaction data
  • frontier paradigms