MTS - Site Reliability Engineer

Microsoft · Big Tech · Redmond, WA +2 · Software Engineering

This role is for a Site Reliability Engineer (SRE) focused on ensuring the reliability, availability, and efficiency of large-scale distributed AI infrastructure. The SRE will work with ML researchers, data engineers, and product developers to operate platforms for training, fine-tuning, and serving generative AI models. Key responsibilities include maintaining uptime, designing observability systems, optimizing performance, building automation for deployments and incident response, and ensuring security and compliance in hybrid cloud/on-prem CPU+GPU environments. The role requires strong experience in SRE/DevOps, Kubernetes, CI/CD, public cloud platforms, monitoring tools, and programming languages like Python or Go, with a preference for experience with large-scale GPU clusters and HPC.

What you'd actually do

  1. Ensure uptime, resiliency, and fault tolerance of AI model training and inference systems.
  2. Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into model serving pipelines and infra.
  3. Analyze system performance and scalability, optimize resource utilization (compute, GPU clusters, storage, networking).
  4. Build automation for deployments, incident response, scaling, and failover in hybrid cloud/on-prem CPU+GPU environments.
  5. Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements.
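To give a concrete flavor of the monitoring and alerting work in items 2 and 5, here is a minimal, illustrative sketch of an SLO burn-rate check in Python. This is not from the posting; the function names, the 99.9% SLO target, and the 14.4x fast-burn threshold are all hypothetical example values:

```python
# Illustrative SLO burn-rate check -- the kind of alerting logic an SRE
# on this team might write. All names and thresholds are hypothetical.

def burn_rate(errors: int, total: int, slo_target: float = 0.999) -> float:
    """Ratio of the observed error rate to the SLO's allowed error budget.

    A value > 1.0 means the service is burning its error budget faster
    than the SLO allows.
    """
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target          # e.g. 0.1% allowed errors
    observed_error_rate = errors / total
    return observed_error_rate / error_budget

def should_page(errors: int, total: int, threshold: float = 14.4) -> bool:
    """Page on-call when the fast-burn threshold is exceeded.

    14.4x is a commonly cited fast-burn multiplier (spending ~2% of a
    30-day budget in one hour), used here purely as an example.
    """
    return burn_rate(errors, total) > threshold
```

For example, 2 errors in 1,000 requests against a 99.9% SLO gives a burn rate of 2.0x, which stays below the 14.4x paging threshold; 20 errors in the same window (20x) would page.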

Skills

Required

  • 4+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.
  • Kubernetes, Docker, and container orchestration
  • CI/CD pipelines for inference and ML model deployment
  • Public cloud platforms (Azure, AWS, or GCP)
  • Infrastructure-as-code
  • Monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.)
  • Python, Go, and Bash
  • Distributed systems, networking, and storage

Nice to have

  • Experience running large-scale GPU clusters for ML/AI workloads
  • Familiarity with ML training/inference pipelines
  • Experience with high-performance computing (HPC) and workload schedulers (e.g., Kubernetes operators)
  • Background in capacity planning & cost optimization for GPU-heavy environments
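The capacity planning and cost optimization bullet above comes down to back-of-envelope arithmetic like the following sketch. The function name, prices, and utilization figures are invented for illustration; the point is that effective cost scales inversely with utilization:

```python
# Back-of-envelope cost model for a GPU fleet, illustrating the kind of
# capacity-planning arithmetic the posting mentions. All prices and
# utilization figures are made up for the example.

def cost_per_useful_gpu_hour(hourly_rate: float, utilization: float) -> float:
    """Effective cost of one hour of *productive* GPU time.

    A $2.00/hr GPU running at 50% utilization really costs $4.00 per
    useful hour -- raising utilization is often the cheapest capacity.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_rate / utilization
```

Under these toy numbers, raising fleet utilization from 50% to 80% drops the effective cost from $4.00 to $2.50 per useful GPU-hour without buying any hardware.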

What the JD emphasized

  • large-scale GPU clusters for ML/AI workloads
  • capacity planning & cost optimization for GPU-heavy environments

Other signals

  • large-scale distributed AI infrastructure
  • training, fine-tuning, and serving generative AI models
  • ensure uptime, resiliency, and fault tolerance of AI model training and inference systems
  • monitoring, alerting, and logging systems to provide real-time visibility into model serving pipelines and infra
  • optimize resource utilization (compute, GPU clusters, storage, networking)
  • automation for deployments, incident response, scaling, and failover in hybrid cloud/on-prem CPU+GPU environments