Member of Technical Staff, Data Research Engineer - MAI Superintelligence Team

Microsoft · Big Tech · London, United Kingdom +2 · Software Engineering

Seeking Data Research Engineers to join the Multimodal team, focusing on building next-generation foundation models. The role involves exploring, designing, and building high-quality multimodal datasets (vision, language, audio) for training and evaluation, collaborating with scientists and engineers, and developing scalable data pipelines. Responsibilities include data analysis, quality assessment, building tools for auditing, and ensuring datasets meet responsible AI practices.

What you'd actually do

  1. Create high-quality datasets for training and evaluation; run experiments on new datasets (data ablations) to assess their impact and determine the most effective data
  2. Develop and maintain scalable data pipelines for multimodal ingestion, pre-processing, filtering, and annotation
  3. Analyse real-world multimodal datasets to assess quality, diversity, and relevance, and to identify areas for improvement
  4. Build lightweight tools and workflows for dataset auditing, visualization, and versioning
  5. Collaborate with Safety, Ethics, and Governance teams to ensure datasets meet standards for quality, privacy, and responsible AI practices

Skills

Required

  • Bachelor's Degree in AI, Computer Science, Data Science, Statistics, Physics, Engineering, or a related technical field AND technical engineering experience with coding in languages including, but not limited to, Python and common data libraries (Pandas, NumPy, etc.)
  • Experience in data analysis or data engineering
  • Proficiency in statistics and exploratory data analysis methods
  • Ability to communicate technical findings effectively to research and product teams

Nice to have

  • Master's Degree in Computer Science or related technical field AND technical engineering experience with coding in languages including, but not limited to, Python and common data libraries (Pandas, NumPy, etc.)
  • Familiarity with data processing frameworks such as Spark, Ray, Apache Beam
  • Experience working with large-scale, real-world datasets that are unstructured or semi-structured

What the JD emphasized

  • multimodal datasets
  • responsible AI practices

Other signals

  • foundation models
  • data quality
  • data pipelines