Senior Software Engineer

Microsoft · Big Tech · United States · Software Engineering

This role is for a Senior Software Engineer on the Microsoft Azure High Performance Computing & AI Engineering team, responsible for managing the core platform and fleet of AI High Performance Computing products. The role involves designing and developing capabilities to monitor and operate supercomputers at scale, diagnosing and troubleshooting large-scale systems, and creating data pipelines for telemetry and alerts. While the role supports AI customers and uses AI HPC products, the core craft is infrastructure engineering and operations, not direct AI/ML model development.

What you'd actually do

  1. Contribute to improving key metrics such as Job Mean Time to Interrupt, Nodes in Service, and Mean Time to Resolve on flagship supercomputers.
  2. Manage operations of supercomputers by responding quickly to mitigate issues.
  3. Implement systemic solutions and mitigations for more complex issues impacting the performance or functionality of supercomputers.
  4. Review and write incident postmortems and present insights that drive changes to reduce or eliminate incidents.
  5. Independently improve troubleshooting guides (TSGs), wikis, tests, and telemetry, adding comprehensive observability and monitoring capabilities.

Skills

Required

  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • Ability to meet Microsoft, customer and/or government security screening requirements

Nice to have

  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • Master's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • Experience diagnosing and troubleshooting GPU-based systems such as H100 or A100, or networking technologies such as InfiniBand or Ethernet.
  • Experience with large-scale data pipelines using tools such as Prometheus and Grafana.

What the JD emphasized

  • live-site first
  • metrics-driven culture
  • diagnose and troubleshoot the largest scale supercomputing systems
  • create end-to-end data pipelines that process and synthesize large volumes of telemetry, log files and other data sources to create actionable alerts
  • observability