Member of Technical Staff - Data Research Engineer - MAI Superintelligence Team

Microsoft · Big Tech · Mountain View, CA +4 · Software Engineering

Seeking Data Research Engineers to join the Multimodal team, focusing on designing and curating high-quality datasets for next-generation foundation models across vision, language, and audio. The role involves developing data collection strategies, improving dataset quality, analyzing data-driven model behaviors, and building tools for dataset auditing, all within the context of responsible AI practices.

What you'd actually do

  1. Create high-quality datasets for training and evaluation; run experiments on new datasets (data ablations) to assess their impact and determine the most effective data.
  2. Develop and maintain scalable data pipelines for multimodal ingestion, preprocessing, filtering, and annotation.
  3. Analyze real-world multimodal datasets to assess quality, diversity, and relevance, and to identify areas for improvement.
  4. Build lightweight tools and workflows for dataset auditing, visualization, and versioning.
  5. Collaborate with Safety, Ethics, and Governance teams to ensure datasets meet standards for quality, privacy, and responsible AI practices.

Skills

Required

  • Bachelor's Degree in AI, Computer Science, Data Science, Statistics, Physics, Engineering, or related technical discipline
  • 4+ years of technical engineering experience
  • Experience coding in languages including, but not limited to, Python
  • Experience with common data libraries (Pandas, NumPy, etc.)

Nice to have

  • Master's Degree in AI, Computer Science, Data Science, Statistics, Physics, Engineering, or related technical discipline
  • 8+ years of technical engineering experience
  • 2+ years of experience in data analysis or data engineering
  • Experience working with large-scale datasets that are unstructured or semi-structured
  • Proficiency in statistics and exploratory data analysis methods
  • Familiarity with data processing frameworks such as Spark, Ray, or Apache Beam
  • Ability to communicate technical findings clearly to research and product teams

What the JD emphasized

  • high-quality datasets
  • multimodal data
  • foundation models
  • frontier AI models
  • data collection strategies
  • dataset quality
  • data-driven model behaviors
  • multimodal ingestion
  • responsible AI practices
