Senior Data Engineer - AI Infrastructure

Microsoft · Big Tech · Redmond, WA +2 · Data Engineering

This role focuses on designing and implementing large-scale data pipelines and models for an AI infrastructure platform, processing terabytes to petabytes of data daily. The engineer will ensure data correctness, reliability, and usability for experimentation and analytics, working closely with data scientists and platform engineers.

What you'd actually do

  1. Design and implement large-scale data pipelines using PySpark and distributed processing frameworks
  2. Build and maintain data models that accurately represent underlying system behavior and business logic
  3. Ensure high standards of data correctness, completeness, and consistency across datasets
  4. Develop validation, monitoring, and alerting mechanisms to detect data quality issues
  5. Partner with data scientists to support experimentation and analytics use cases
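The validation and monitoring work described above (items 3 and 4) can be sketched in miniature. This is a hedged, illustrative example only, not part of the posting: it uses plain Python rather than the PySpark stack the role calls for, and every field name (`event_id`, `timestamp`, `bytes_processed`) is hypothetical.

```python
import json

# Hypothetical sketch: parse semi-structured JSON log lines and flag
# records that fail basic correctness and completeness checks.
# All field names are illustrative, not from the job description.
REQUIRED_FIELDS = {"event_id", "timestamp", "bytes_processed"}

def validate_record(raw_line: str):
    """Return (record, None) if valid, else (None, rejection_reason)."""
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError:
        return None, "malformed JSON"
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        return None, f"missing fields: {sorted(missing)}"
    if record["bytes_processed"] < 0:
        return None, "negative bytes_processed"
    return record, None

def partition_logs(lines):
    """Split raw log lines into valid records and rejection reasons."""
    valid, rejected = [], []
    for line in lines:
        record, reason = validate_record(line)
        if record is not None:
            valid.append(record)
        else:
            rejected.append(reason)
    return valid, rejected
```

At the scale the posting describes, logic like this would run as a distributed transformation (e.g., a PySpark job over partitioned log data), with rejection counts feeding the monitoring and alerting mechanisms item 4 mentions.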

Skills

Required

  • One of the following:
      • Master's Degree in Computer Science, Math, Software Engineering, Computer Engineering, or a related field AND 3+ years of experience in business analytics, data science, software development, data modeling, or data engineering
      • Bachelor's Degree in one of the above fields AND 4+ years of experience in the same areas
      • Equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check

Nice to have

  • Experience with Azure technologies such as: ADLS Gen2 (Blob Storage), Synapse Spark, Azure Data Explorer (ADX)
  • Experience working with structured and semi-structured data (e.g., JSON logs)
  • Familiarity with experimentation and analytics workflows
  • Experience with orchestration tools (e.g., Airflow)
  • Exposure to privacy, compliance, and secure data handling practices
  • 5+ years of experience in data engineering or software engineering with a strong focus on data systems
  • Strong experience with PySpark or similar distributed data processing frameworks
  • Experience building and operating large-scale data pipelines
  • Strong understanding of data modeling and schema design
  • Experience ensuring data quality and correctness in production systems
  • Proficiency in Python
  • Experience working with cloud-based data platforms (Azure, AWS, or GCP)
  • Ability to reason about data at scale, including performance and failure modes

What the JD emphasized

  • large-scale data pipelines
  • data correctness
  • data quality

Other signals

  • large-scale data platform
  • raw system logs into high-quality, structured datasets
  • terabytes to petabytes of data daily
  • foundational asset for multiple teams
  • designing and implementing data pipelines
  • ensuring correctness
  • building scalable data models
  • work closely with data scientists and platform engineers
  • data is accurate, reliable, and usable for downstream decision-making
  • data correctness, understand how systems behave at scale
  • translate complex data into well-structured, reliable datasets