Member of Technical Staff, Hardware Health - MAI Superintelligence Team

Microsoft · Big Tech · Mountain View, CA +4 · Software Engineering

This role focuses on ensuring the reliability, performance, and availability of large-scale AI training infrastructure, specifically GPU clusters. It involves designing and developing hardware health monitoring and diagnostic frameworks, building predictive analytics pipelines on telemetry data, and leading incident triage for hardware anomalies. The goal is to drive automation in health management and to partner with cross-functional teams to improve hardware design for reliability.

What you'd actually do

  1. Design and develop next-generation hardware health monitoring and diagnostic frameworks for large GPU clusters (NVL16/NVL72/GB200+ scale).
  2. Build predictive analytics pipelines leveraging telemetry, power, and thermal data to anticipate hardware degradation and systemic issues (see the sketch after this list).
  3. Collaborate with silicon, firmware, and datacenter engineers to identify root causes and remediate large-scale hardware anomalies.
  4. Define system health KPIs (e.g., NIS/RIS, MTBF, failure domain analysis) and integrate them into real-time observability platforms.
  5. Lead incident triage for high-impact GPU, network, and cooling issues across distributed clusters.
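
As a rough illustration of the predictive-analytics work in item 2 (and the kind of per-device signal that would feed the KPIs in item 4), here is a minimal sketch of rolling z-score anomaly detection over GPU temperature telemetry. Everything in it — the `GpuHealthTracker` name, field names, 60-sample window, and 3-sigma threshold — is an assumption for illustration, not taken from the posting.

```python
# Hypothetical sketch: flag GPUs whose temperature drifts from their own
# recent baseline. Names, window size, and threshold are assumptions.
from collections import deque
from dataclasses import dataclass
from statistics import mean, stdev

@dataclass
class GpuHealthTracker:
    window: int = 60          # rolling history kept per GPU, in samples
    z_threshold: float = 3.0  # flag readings this many std devs from baseline

    def __post_init__(self):
        self._history: dict[str, deque] = {}

    def observe(self, gpu_id: str, temp_c: float) -> bool:
        """Record one temperature sample; return True if it looks anomalous."""
        hist = self._history.setdefault(gpu_id, deque(maxlen=self.window))
        anomalous = False
        if len(hist) >= 10:  # wait for a minimal baseline before judging
            mu, sigma = mean(hist), stdev(hist)
            anomalous = sigma > 0 and abs(temp_c - mu) / sigma > self.z_threshold
        hist.append(temp_c)
        return anomalous

tracker = GpuHealthTracker()
for t in range(100):
    # Steady telemetry around 65 C, then a sudden thermal excursion.
    reading = 65.0 + (t % 3) * 0.5 if t < 99 else 95.0
    if tracker.observe("gpu-0", reading):
        print(f"sample {t}: anomalous reading {reading} C on gpu-0")
```

A production pipeline would swap the in-memory deque for a streaming store and learn per-device baselines, but the core idea — compare each reading against a fleet or per-device baseline and alert on large deviations — is the same.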

Skills

Required

  • Bachelor's Degree in Computer Science or a related technical field AND 6+ years of technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience.

Nice to have

  • Master's Degree in Computer Science or a related technical field AND 8+ years of technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python; OR Bachelor's Degree in Computer Science or a related technical field AND 12+ years of such experience; OR equivalent experience.
  • Experience working with large-scale HPC or GPU systems (NVIDIA H100/GB200 or equivalent).
  • Deep understanding of GPU architecture, high-speed interconnects (NVLink, InfiniBand, RoCE), and large datacenter topologies.
  • Proficiency with hardware telemetry, diagnostics, or failure-analysis tools.
  • Experience with exascale-class systems or cloud-scale AI clusters.
  • Familiarity with reliability modeling, machine learning-based anomaly detection, or predictive maintenance (a toy MTBF example follows this list).
  • Contributions to large-scale infrastructure operations, supercomputing centers, or AI hardware design.
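
For the reliability-modeling bullet above, a toy worked example of the simplest fleet KPI, MTBF: total device operating hours divided by observed failures. The fleet size, window, and failure count below are hypothetical, not from the posting.

```python
def estimate_mtbf(total_device_hours: float, failures: int) -> float:
    """Fleet MTBF estimate: cumulative operating hours per observed failure."""
    if failures == 0:
        raise ValueError("no failures in window; MTBF is unbounded here")
    return total_device_hours / failures

# Hypothetical fleet: 10,000 GPUs over a 30-day window with 24 failures.
hours = 10_000 * 30 * 24          # 7,200,000 device-hours
print(f"MTBF ~ {estimate_mtbf(hours, 24):,.0f} device-hours")  # 300,000
```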

What the JD emphasized

  • large-scale GPU systems
  • exascale-class systems
  • AI clusters
  • hardware health monitoring
  • diagnostic frameworks
  • predictive analytics
  • failure analysis

Other signals

  • AI training infrastructure
  • GPU clusters
  • predictive health models
  • failure detection frameworks
  • autonomous remediation systems