Member of Technical Staff - Data Infra - MAI Superintelligence Team

Microsoft · Big Tech · Mountain View, CA +3 · Data Engineering

The AI Data Infra team at Microsoft AI builds the data infrastructure that helps MAI teams generate the biggest and best training datasets. This role focuses on designing and developing data pipelines for multimodal training data and owning critical data infrastructure such as Spark, Ray, and vector databases.

What you'd actually do

  1. Design and develop data pipelines that ingest enormous amounts of multimodal training data (text, audio, images, video).
  2. Own and maintain critical data infrastructure, including Spark, Ray, vector databases, and others.
  3. Build and maintain cutting-edge infrastructure that can store and process the petabytes of data needed to power models.
  4. Partner with the pretraining and post-training teams to improve our data recipe through rigorous, careful experimentation.

Skills

Required

  • Master's Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND 6+ years experience in business analytics, data science, software development, data modeling, or data engineering OR Bachelor's Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND 8+ years experience in business analytics, data science, software development, data modeling, or data engineering OR equivalent experience.
  • 4+ years experience with data governance, data compliance and/or data security.

Nice to have

  • Master's Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND 12+ years experience in business analytics, data science, software development, data modeling, or data engineering OR Bachelor's Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND 15+ years experience in business analytics, data science, software development, data modeling, or data engineering OR equivalent experience.

What the JD emphasized

  • data governance
  • data compliance
  • data security

Other signals

  • building data infrastructure
  • training frontier AI models
  • multimodal dataset