Member of Technical Staff - Software Engineer (Superintelligence Team)

Microsoft · Big Tech · Mountain View, CA +4 · Software Engineering

This role focuses on building and operating the core platform infrastructure for training, evaluating, and deploying large-scale AI models within Microsoft. It involves designing scalable services for cluster orchestration, job scheduling, data pipelines, and artifact management, with a strong emphasis on production operations, cloud platforms (Azure), and enhancing developer experience for AI research and engineering teams.

What you'd actually do

  1. Design and build core platform services for scalable training and evaluation, including cluster orchestration, job scheduling, data and compute pipelines, and artifact management.
  2. Standardize containerized workflows by maintaining Docker images, CI/CD, and runtime configurations; advocate for best practices in security, reproducibility, and cost efficiency.
  3. Implement end-to-end observability and operations through metrics, tracing, logging, dashboard development, monitoring, and automated alerts for model training and platform health (using Prometheus, Grafana, OpenTelemetry).
  4. Architect and operate services on Azure cloud platforms, managing infrastructure-as-code (Terraform/Helm), secrets, networking, and storage.
  5. Enhance developer experience by creating tools, CLIs, and portals that simplify job submission, metrics analysis, and experiment management for generalist software engineering and research teams.
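To make the responsibilities above concrete, here is a minimal, illustrative sketch of the kind of job-submission tooling described in items 1 and 5: a function that renders a Kubernetes `batch/v1` Job manifest for a GPU training workload. All specific names (the image registry, job name, GPU counts) are hypothetical; a real platform CLI would submit this to the cluster API or an internal scheduler rather than print it.

```python
"""Illustrative sketch: building a Kubernetes Job manifest for a training run.

Image names, job names, and GPU counts are hypothetical examples.
"""
import json


def build_job_manifest(name: str, image: str, gpus: int, command: list[str]) -> dict:
    """Render a Kubernetes batch/v1 Job manifest as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # fail fast; retries belong to the scheduler layer
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "command": command,
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Demo only: a real tool would POST this to the cluster instead of printing.
    manifest = build_job_manifest(
        "demo-train", "example.azurecr.io/train:latest", 8, ["python", "train.py"]
    )
    print(json.dumps(manifest, indent=2))
```

In practice a wrapper CLI around a builder like this is what lets research teams submit jobs without hand-writing YAML, which is the developer-experience angle the role emphasizes.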

Skills

Required

  • Strong software engineering fundamentals
  • Python
  • Kubernetes
  • Containers
  • Platform services development
  • Production operations
  • Cloud platforms (Azure, AWS, GCP)
  • Distributed systems
  • Networking
  • Storage

Nice to have

  • ML/AI platform infrastructure
  • GPU clusters
  • HPC
  • Large batch compute systems
  • Infrastructure-as-code (Terraform, Helm)
  • Observability tooling (Prometheus, Grafana, OpenTelemetry)
  • Internal developer tooling
  • Data pipeline orchestration (Airflow, Argo)
  • Container security
  • CI/CD
  • Reproducible deployments
  • Supporting AI research or model development teams
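The observability items above (Prometheus, Grafana, automated alerts) boil down to evaluating metric samples against thresholds. A minimal, hypothetical sketch of that idea, assuming a made-up `gpu_utilization` metric; in production this logic would live in Prometheus alerting rules, not application code:

```python
"""Illustrative sketch: threshold alerting over metric samples.

The metric name and threshold are hypothetical examples.
"""
from dataclasses import dataclass


@dataclass
class Sample:
    metric: str   # e.g. "gpu_utilization"
    value: float  # normalized 0.0 .. 1.0
    node: str     # node the sample came from


def evaluate_alerts(samples: list[Sample], metric: str, below: float) -> list[str]:
    """Return one alert message per node whose metric falls below the threshold."""
    return [
        f"ALERT {s.node}: {s.metric}={s.value:.2f} < {below:.2f}"
        for s in samples
        if s.metric == metric and s.value < below
    ]


if __name__ == "__main__":
    samples = [
        Sample("gpu_utilization", 0.95, "node-a"),
        Sample("gpu_utilization", 0.40, "node-b"),
    ]
    for alert in evaluate_alerts(samples, "gpu_utilization", below=0.60):
        print(alert)
```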

What the JD emphasized

  • Large-scale production systems
  • Hands-on ownership of production infrastructure
  • Python as a primary language in production systems
  • Kubernetes and containers: deploying, operating, and supporting production workloads
  • Production operations ownership: monitoring, logging, alerting, and incident response
  • Cloud platforms (Azure preferred; AWS/GCP acceptable)

Other signals

  • Building infrastructure for training and evaluation
  • Deploying AI models at scale
  • Supporting product teams shipping AI