Member of Technical Staff - Data Infra - MAI Superintelligence Team

Microsoft · Big Tech · Mountain View, CA +2 · Data Engineering

The role focuses on building and maintaining data infrastructure for large-scale AI model training, specifically ingesting and processing petabytes of multimodal data (text, audio, images, video). This includes owning data pipelines and core systems such as Spark, Ray, and vector databases, and partnering with the pretraining and post-training teams to improve the data recipe for frontier models.

What you'd actually do

  1. Design and develop data pipelines that ingest enormous amounts of multi-modal training data (text, audio, images, video).
  2. Own and maintain critical data infrastructure, including Spark, Ray, vector databases, and others.
  3. Build and maintain cutting-edge infrastructure that can store and process the petabytes of data needed to power models.
  4. Partner with the pretraining and post-training teams to improve our data recipe through rigorous and careful experimentation.

Skills

Required

  • Bachelor’s Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND 6+ years of experience in business analytics, data science, software development, data modeling or data engineering work
  • OR Master’s Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND 4+ years of experience in business analytics, data science, software development, or data engineering work
  • OR equivalent experience.

Nice to have

  • Bachelor’s Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND 8+ years of experience in business analytics, data science, software development, data modeling or data engineering work
  • OR Master’s Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND 6+ years of business analytics, data science, software development, data modeling or data engineering work experience
  • OR equivalent experience.

What the JD emphasized

  • multi-modal training data
  • petabytes of data
  • pretraining and post-training teams

Other signals

  • building data infrastructure for AI training
  • processing petabytes of multimodal data
  • partnering with pretraining and post-training teams