Member of Technical Staff, Hardware Health - MAI Superintelligence Team

Microsoft · Big Tech · London, United Kingdom +2 · Software Engineering

This role focuses on ensuring the reliability, performance, and availability of Microsoft's large-scale AI training infrastructure, which spans tens of thousands of GPUs and advanced networking. Responsibilities include transport and fabric architecture design, telemetry, observability, and automated troubleshooting for these clusters. The role also covers AI training and inference cluster bring-up, performance benchmarking, and root-cause analysis, with the goal of developing predictive health models and autonomous remediation systems.

What you'd actually do

  1. Advanced RoCE transport design, congestion control, and ECN/WRED/DCTCP tuning
  2. Fabric architecture, topology planning, network modeling, and scaling strategy
  3. Telemetry, observability, reliability engineering, and automated troubleshooting
  4. Develop, deploy, and tune novel routing techniques to improve reliability in large networks
  5. AI training + inference cluster bring-up, performance benchmarking, and root-cause analysis

Skills

Required

  • C
  • C++
  • C#
  • Java
  • JavaScript
  • Python
  • telemetry
  • observability
  • reliability engineering
  • automated troubleshooting
  • network design
  • congestion control
  • routing techniques

Nice to have

  • Master's Degree
  • hardware health
  • AI training infrastructure
  • GPU clusters
  • NVLink/NVSwitch networks
  • liquid-cooling systems
  • RoCE
  • ECN
  • WRED
  • DCTCP
  • fabric architecture
  • network modeling
  • scaling strategy
  • AI inference

What the JD emphasized

  • sustained reliability, performance, and availability
  • predictive health models, failure detection frameworks, and autonomous remediation systems
  • AI training + inference cluster bring-up, performance benchmarking, and root-cause analysis

Other signals

  • AI training + inference cluster bring-up
  • reliability, performance, and availability across exascale-class deployments
  • predictive health models, failure detection frameworks, and autonomous remediation systems