Member of Technical Staff - Principal Data Infrastructure Engineer

Microsoft · Big Tech · Redmond, WA +1 · Data Engineering

This Principal Data Infrastructure Engineer role focuses on enabling large-scale data and ML pipelines and intelligent systems for consumer AI. It involves architecting and maintaining scalable, reliable, and observable big data infrastructure; championing DevOps and SRE best practices; and building a self-service big data platform. The role requires collaborating with Data Engineers, Data Scientists, AI Researchers, and Developers to deliver secure, seamless big data workflows and to optimize system performance and security. While it does not involve building AI models directly, the role is critical to supporting AI applications and infrastructure.

What you'd actually do

  1. Architect and maintain scalable, reliable, and observable Big Data Infrastructure for mission-critical AI applications.
  2. Champion DevOps and SRE best practices—automated deployments, service monitoring, and incident response.
  3. Build a self-service big data platform that empowers data and platform engineers and researchers.
  4. Develop robust CI/CD pipelines and automate infrastructure provisioning using Infrastructure as Code tools (Bicep, Terraform, ARM).
  5. Collaborate with Data Engineers, Data Scientists, AI Researchers, and Developers to deliver secure, seamless big data workflows.

Skills

Required

  • Master's Degree in Computer Science, Math, Software Engineering, Computer Engineering, or a related field AND 4+ years of experience in business analytics, data science, software development, data modeling, or data engineering; OR
  • Bachelor's Degree in Computer Science, Math, Software Engineering, Computer Engineering, or a related field AND 6+ years of experience in business analytics, data science, software development, data modeling, or data engineering; OR
  • equivalent experience.

Nice to have

  • 4+ years in Big Data Infrastructure, DevOps, SRE, or Platform Engineering.
  • 3+ years of hands-on experience managing and scaling distributed systems—from bare-metal to cloud-native environments.
  • 2+ years deploying containerized applications using Kubernetes and Helm/Kustomize.
  • Solid scripting and automation skills using Python, Bash, or PowerShell.
  • Proven success in CI/CD pipeline management, release automation, and production troubleshooting.
  • Experience working with Databricks for scalable data processing and analytics.
  • Familiarity with security practices in infrastructure environments, including IAM, OAuth, and Kerberos administration.
  • Proven experience with cloud-native infrastructure across Azure, AWS, or GCP.
  • Hands-on expertise with modern data platforms like Databricks.
  • Deep understanding of data storage and processing technologies:
      • Relational & NoSQL databases.
      • Key-value stores.
      • Spark compute engines.
      • Distributed file systems (e.g., HDFS, ADLS Gen2).
      • Messaging systems (e.g., Event Hub, Kafka, RabbitMQ).
  • Capacity planning and incident management for large-scale big data systems.
  • Solid collaboration history with Data Engineers, Data Scientists, ML Engineers, Networking, and Security teams.
  • Familiarity with modern web stacks: TypeScript, Node.js, React, and optionally PHP.
  • Exposure to agentic workflows, deep learning, or AI frameworks.
  • Practical experience integrating LLMs (e.g., GPT-based models) into daily workflows—automating documentation, code generation, reviews, and operational intelligence.
  • Solid grasp of prompt engineering techniques to design, optimize, and evaluate interactions with LLMs.
  • Demonstrated ability to troubleshoot and resolve complex performance and scalability issues across infrastructure layers.
  • Excellent interpersonal and communication skills, with a passion for mentorship and continuous learning.
  • Experience applying LLMs to DevOps workflows, enhancing incident response, and streamlining cross-functional collaboration is a strong advantage.

What the JD emphasized

  • mission-critical AI applications
  • big data platform
  • big data workflows
  • data pipelines and infrastructure
  • system performance
  • system security
  • big data infrastructure
  • distributed systems
  • data processing and analytics
  • data storage and processing technologies
  • large-scale big data systems
  • AI frameworks
  • LLMs into daily workflows
  • LLMs to DevOps workflows