Member of Technical Staff - Software Engineer (Superintelligence Team)

Microsoft · Big Tech · Mountain View, CA +4 · Software Engineering

This role focuses on building and operating the core platform infrastructure for training, evaluating, and deploying large-scale AI models within Microsoft. It involves designing scalable services for cluster orchestration, job scheduling, data pipelines, and artifact management, with a strong emphasis on production operations, cloud platforms (Azure), and enhancing developer experience for AI research and engineering teams.

What you'd actually do

  1. Design and build core platform services for scalable training and evaluation, including cluster orchestration, job scheduling, data and compute pipelines, and artifact management.
  2. Standardize containerized workflows by maintaining Docker images, CI/CD, and runtime configurations; advocate for best practices in security, reproducibility, and cost efficiency.
  3. Implement end-to-end observability and operations through metrics, tracing, logging, dashboard development, monitoring, and automated alerts for model training and platform health (using Prometheus, Grafana, OpenTelemetry).
  4. Architect and operate services on Azure cloud platforms, managing infrastructure-as-code (Terraform/Helm), secrets, networking, and storage.
  5. Enhance developer experience by creating tools, CLIs, and portals that simplify job submission, metrics analysis, and experiment management for generalist software engineering and research teams.
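To make the responsibilities above concrete, here is a minimal, illustrative sketch of the kind of job-submission tooling described in items 1 and 5: a function that renders a Kubernetes `batch/v1` Job manifest for a GPU training workload. All specific names (the image registry, job name, GPU counts) are hypothetical; a real platform CLI would submit this to the cluster API or an internal scheduler rather than print it.

```python
"""Illustrative sketch: building a Kubernetes Job manifest for a training run.

Image names, job names, and GPU counts are hypothetical examples.
"""
import json


def build_job_manifest(name: str, image: str, gpus: int, command: list[str]) -> dict:
    """Render a Kubernetes batch/v1 Job manifest as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # fail fast; retries belong to the scheduler layer
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "command": command,
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Demo only: a real tool would POST this to the cluster instead of printing.
    manifest = build_job_manifest(
        "demo-train", "example.azurecr.io/train:latest", 8, ["python", "train.py"]
    )
    print(json.dumps(manifest, indent=2))
```

In practice a wrapper CLI around a builder like this is what lets research teams submit jobs without hand-writing YAML, which is the developer-experience angle the role emphasizes.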

Skills

Required

  • Strong software engineering fundamentals
  • Python
  • Kubernetes
  • Containers
  • Platform services development
  • Production operations
  • Cloud platforms (Azure, AWS, GCP)
  • Distributed systems
  • Networking
  • Storage

Nice to have

  • ML/AI platform infrastructure
  • GPU clusters
  • HPC
  • Large batch compute systems
  • Infrastructure-as-code (Terraform, Helm)
  • Observability tooling (Prometheus, Grafana, OpenTelemetry)
  • Internal developer tooling
  • Data pipeline orchestration (Airflow, Argo)
  • Container security
  • CI/CD
  • Reproducible deployments
  • Supporting AI research or model development teams
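The observability items above (Prometheus, Grafana, automated alerts) boil down to evaluating metric samples against thresholds. A minimal, hypothetical sketch of that idea, assuming a made-up `gpu_utilization` metric; in production this logic would live in Prometheus alerting rules, not application code:

```python
"""Illustrative sketch: threshold alerting over metric samples.

The metric name and threshold are hypothetical examples.
"""
from dataclasses import dataclass


@dataclass
class Sample:
    metric: str   # e.g. "gpu_utilization"
    value: float  # normalized 0.0 .. 1.0
    node: str     # node the sample came from


def evaluate_alerts(samples: list[Sample], metric: str, below: float) -> list[str]:
    """Return one alert message per node whose metric falls below the threshold."""
    return [
        f"ALERT {s.node}: {s.metric}={s.value:.2f} < {below:.2f}"
        for s in samples
        if s.metric == metric and s.value < below
    ]


if __name__ == "__main__":
    samples = [
        Sample("gpu_utilization", 0.95, "node-a"),
        Sample("gpu_utilization", 0.40, "node-b"),
    ]
    for alert in evaluate_alerts(samples, "gpu_utilization", below=0.60):
        print(alert)
```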

What the JD emphasized

  • Large-scale production systems
  • Hands-on ownership of production infrastructure
  • Python as a primary language in production systems
  • Kubernetes and containers: deploying, operating, and supporting production workloads
  • Production operations ownership: monitoring, logging, alerting, and incident response
  • Cloud platforms (Azure preferred; AWS/GCP acceptable)

Other signals

  • Building infrastructure for training and evaluation
  • Deploying AI models at scale
  • Supporting product teams shipping AI