Research Engineer (Scaling Multimodal Data)

World Labs · AI Frontier · San Francisco, CA · Data & Model Training

Research Engineer focused on improving multimodal world models by building and refining data processing pipelines and experiments. This role involves discovering, evaluating, and acquiring training data, designing data processing systems, analyzing data quality, and closing the loop with model training and evaluation.

What you'd actually do

  1. Discover, evaluate, and acquire training data. You will find, evaluate, and integrate data from diverse sources. You will write scrapers, work with APIs, and make judgement calls about whether a source is worth pursuing before investing days of effort.
  2. Build data processing and curation systems. Design and implement data processing pipelines for filtering, deduplication, quality scoring, and curation. You will create well-abstracted systems that your teammates can pick up and extend.
  3. Look at the actual data constantly. You will sample outputs, spot distributional issues (e.g., too many screenshots, low-resolution crops, near-duplicates), and catch problems before they propagate to model training.
  4. Close the data-model-evaluation loop. You will diagnose model failures and trace them back to data issues, then design principled fixes to nip the problem in the bud.
  5. Deploy ML models for data enrichment. You will apply models for captioning, quality scoring, text embedding, segmentation, classification, etc., and evaluate whether these models actually help.
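
For a concrete sense of the filtering, deduplication, and quality-scoring work described above, here is a minimal Python sketch. It is an illustration only, not World Labs' actual pipeline: the `Sample` record, the function names, the thresholds, and the resolution-based score are all hypothetical, and exact-hash deduplication stands in for the near-duplicate detection a production system would likely need.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class Sample:
    """Hypothetical record for one image; the fields are illustrative only."""
    uri: str
    width: int
    height: int
    content: bytes


def filter_low_resolution(samples: Iterable[Sample], min_side: int = 256) -> Iterator[Sample]:
    """Drop samples whose shorter side is below min_side pixels."""
    for s in samples:
        if min(s.width, s.height) >= min_side:
            yield s


def deduplicate(samples: Iterable[Sample]) -> Iterator[Sample]:
    """Keep the first sample for each distinct byte content (exact duplicates only)."""
    seen = set()
    for s in samples:
        digest = hashlib.sha256(s.content).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield s


def quality_score(s: Sample) -> float:
    """Placeholder heuristic (an assumption): shorter side / 1024, capped at 1.0."""
    return min(1.0, min(s.width, s.height) / 1024)


def curate(samples: Iterable[Sample], min_side: int = 256, min_score: float = 0.5) -> List[Sample]:
    """Chain the stages: resolution filter, exact dedup, then a score threshold."""
    kept = deduplicate(filter_low_resolution(samples, min_side))
    return [s for s in kept if quality_score(s) >= min_score]


if __name__ == "__main__":
    samples = [
        Sample("a.png", 1024, 768, b"aaa"),
        Sample("b.png", 1024, 768, b"aaa"),  # byte-identical to a.png
        Sample("c.png", 128, 128, b"ccc"),   # too small
        Sample("d.png", 600, 400, b"ddd"),   # passes the filter, scores below 0.5
    ]
    print([s.uri for s in curate(samples)])  # prints ['a.png']
```

Each stage is a small generator so stages can be reordered or swapped independently. At scale, the same stages would typically run as steps in a distributed framework such as the ones listed under Skills, and the placeholder score would be replaced by a learned quality model.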

Skills

Required

  • Software engineering fundamentals
  • Image and video data processing at scale
  • Distributed computing (e.g., Apache Beam, Spark, Kubernetes, Ray)
  • ML model inference pipelines
  • Experimental design for data processing
  • Understanding of model training lifecycle and data impact

Nice to have

  • Columnar and large-scale data storage formats (PyArrow, Lance, Vortex, DeepMind Bagz)

What the JD emphasized

  • Strong software engineering fundamentals
  • Deep experience with image and video data at scale
  • Experience with distributed computing
  • Experience using ML models as components
  • A research-oriented approach to data decisions
  • Familiarity with the model training lifecycle
  • An overall obsession with the data-model-evaluation loop. You have a demonstrated track record of curating the best possible data to improve model performance and of proving the improvement through rigorous evaluation, over and over again. You have a special knack for turning this obsession into successful data and model work.

Other signals

  • multimodal data
  • data processing pipelines
  • model training lifecycle
  • data-model-evaluation loop