Member of Technical Staff, Site Reliability Engineer (HPC) - MAI Superintelligence Team

Microsoft · Big Tech · Mountain View, CA +1 · Software Engineering

This role is for a Site Reliability Engineer (SRE) focused on High Performance Computing (HPC) infrastructure for AI model training and inference. The engineer will ensure the reliability, availability, and efficiency of large-scale distributed AI systems, including GPU clusters, and will be involved in monitoring, automation, incident management, and security.

What you'd actually do

  1. Ensure uptime, resiliency, and fault tolerance of HPC clusters powering MAI model training and inference.
  2. Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into all aspects of HPC systems, including GPU clusters, storage, and networking.
  3. Build automation for deployments, incident response, scaling, and failover in CPU+GPU environments.
  4. Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements.
  5. Ensure data privacy, compliance, and secure operations across model training and serving environments.

Skills

Required

  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering, OR
  • Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering, OR
  • Equivalent experience

Nice to have

  • Kubernetes, Docker, and container orchestration
  • CI/CD pipelines for inference and ML model deployment
  • Public cloud platforms (Azure, AWS, GCP) and infrastructure-as-code
  • Monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.)
  • Python, Go, or Bash
  • Distributed systems, networking, and storage
  • Large-scale GPU clusters for ML/AI workloads
  • ML training/inference pipelines
  • High-performance computing (HPC) and workload schedulers (e.g., Kubernetes operators)
  • Capacity planning & cost optimization for GPU-heavy environments

What the JD emphasized

  • large-scale GPU clusters
  • model training and inference

Other signals

  • large-scale distributed AI infrastructure
  • AI model training and inference
  • GPU clusters