Member of Technical Staff, HPC Operations Engineering Manager

Microsoft · Big Tech · Mountain View, CA +1 · Software Engineering

This role manages a team of Site Reliability Engineers responsible for the reliability and efficiency of large-scale distributed AI infrastructure, specifically for training, fine-tuning, and serving generative AI models. The focus is on leading operations, observability, automation, incident management, and security within hybrid cloud/on-prem CPU+GPU environments, collaborating closely with ML engineers and platform teams.

What you'd actually do

  1. Lead a team of experienced SREs to ensure uptime, resiliency, and fault tolerance of AI model training and inference systems.
  2. Design and help maintain monitoring, alerting, and logging systems to provide real-time visibility into model serving pipelines and infra.
  3. Lead the development of automation for deployments, incident response, scaling, and failover in hybrid cloud/on-prem CPU+GPU environments.
  4. Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements.
  5. Ensure data privacy, compliance, and secure operations across model training and serving environments.

Skills

Required

  • Site Reliability Engineering
  • DevOps
  • Infrastructure Engineering Leadership
  • Kubernetes
  • Docker
  • Container orchestration
  • Python
  • Go
  • Bash
  • People management experience

Nice to have

  • public cloud platforms such as Azure/AWS/GCP
  • infrastructure-as-code
  • monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.)
  • CI/CD pipelines for inference and ML model deployment
  • distributed systems
  • networking
  • storage
  • large-scale GPU clusters for ML/AI workloads
  • ML training/inference pipelines
  • high-performance computing (HPC)
  • workload schedulers (Kubernetes operators)
  • capacity planning
  • cost optimization for GPU-heavy environments

What the JD emphasized

  • AI model training and inference systems
  • model serving pipelines and infra
  • hybrid cloud/on-prem CPU+GPU environments
  • ML training/inference pipelines
  • GPU-heavy environments

Other signals

  • leading a team of SREs
  • operating large-scale distributed AI infrastructure
  • partnering with ML engineers and platform teams