Member of Technical Staff, High Performance Computing Engineer - MAI Superintelligence Team

Microsoft · Big Tech · London, United Kingdom +2 · Software Engineering

This role focuses on building and scaling the high-performance computing (HPC) infrastructure required for training frontier AI models and powering AI products like Copilot. The engineer will design, operate, and maintain large-scale HPC environments, including schedulers and core domains like GPU compute, storage, and networking. Responsibilities include developing automation tools, supporting researchers and engineers, and troubleshooting cluster issues to ensure efficient job scheduling and performance.

What you'd actually do

  1. Design, operate, and maintain large-scale HPC environments, drawing on hands-on engineering experience in production settings.
  2. Own the deployment, configuration, and day-to-day operation of HPC schedulers (e.g., SLURM, Kubernetes), ensuring reliable and efficient job scheduling at scale.
  3. Serve as a technical owner for at least one core HPC domain (GPU compute, high-performance storage, networking, or similar), including ongoing maintenance, performance tuning, and troubleshooting of massive clusters.
  4. Develop and maintain automation and tooling using Bash and/or Python to improve cluster reliability, observability, and operational efficiency.
  5. Partner closely with researchers and engineers to support their workloads, troubleshoot cluster usage issues, and triage failed or underperforming jobs to resolution.
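The triage work in the last point can be sketched in Python. This is a minimal, illustrative example, not Microsoft's tooling: it assumes accounting records in the shape produced by `sacct --parsable2 --noheader --format=JobID,State,ExitCode,Elapsed`, and the sample records below are made up for demonstration.

```python
# Minimal sketch: triage failed or underperforming SLURM jobs from
# accounting output. Assumes pipe-delimited fields in the order
# JobID|State|ExitCode|Elapsed; the sample data is illustrative only.

SAMPLE_SACCT = """\
1001|COMPLETED|0:0|02:13:44
1002|FAILED|1:0|00:00:12
1003|OUT_OF_MEMORY|0:125|01:02:03
1004|TIMEOUT|0:0|24:00:00
"""

def triage(sacct_output: str) -> list[dict]:
    """Return records for jobs that did not complete successfully."""
    bad_states = {"FAILED", "OUT_OF_MEMORY", "TIMEOUT", "NODE_FAIL"}
    problems = []
    for line in sacct_output.strip().splitlines():
        job_id, state, exit_code, elapsed = line.split("|")
        if state in bad_states:
            problems.append({
                "job_id": job_id,
                "state": state,
                "exit_code": exit_code,
                "elapsed": elapsed,
            })
    return problems

if __name__ == "__main__":
    for rec in triage(SAMPLE_SACCT):
        print(f"{rec['job_id']}: {rec['state']} (exit {rec['exit_code']})")
```

In practice a script like this would be fed live `sacct` output and wired into alerting or observability dashboards, which is the kind of Bash/Python automation the responsibilities above describe.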

Skills

Required

  • Bachelor’s degree in Computer Science or a related technical field
  • 4+ years of technical engineering experience deploying or operating on-premises or cloud high-performance clusters
  • 4+ years of experience working with high-scale training clusters (e.g., frameworks/tools such as NVIDIA InfiniBand clusters, SLURM, Kubernetes, Ray)
  • 4+ years of experience building scalable services on public cloud infrastructure such as Azure, AWS, or GCP

Nice to have

  • Master's degree in Computer Science or a related technical field
  • 6+ years of technical engineering experience deploying or operating on-premises or cloud high-performance clusters
  • 6+ years of experience working with high-scale training clusters (e.g., frameworks/tools such as NVIDIA InfiniBand clusters, SLURM, Kubernetes, Ray)
  • 6+ years of experience building scalable services on public cloud infrastructure such as Azure, AWS, or GCP
  • Experience with LLM training clusters
  • Experience working with AI platforms, frameworks, and APIs
  • Experience using machine learning frameworks, including experience using, deploying, and scaling large language models, either personally or professionally
  • Experience working with large-scale HPC or GPU systems (e.g., NVIDIA H100/GB200 or equivalent)
  • Ability to identify, analyze, and resolve complex technical issues, ensuring optimal performance, scalability, and user experience.
  • Dedication to writing clean, maintainable, and well-documented code with a focus on application quality, performance, and security.
  • Demonstrated interpersonal skills and ability to work closely with cross-functional teams, including product managers and designers

What the JD emphasized

  • high-scale training clusters
  • large-scale training clusters
  • LLM training clusters

Other signals

  • frontier models
  • large scale supercomputers
  • HPC environments
  • LLM training clusters