Software Engineer II

Microsoft · Big Tech · United States · Software Engineering

Software Engineer II role on the Microsoft Azure High Performance Computing & AI Engineering team, focusing on managing and operating the core platform and fleet of AI and High Performance Computing products. The role involves designing and developing capabilities to monitor and efficiently operate supercomputers at scale, creating data pipelines for telemetry and logs to generate alerts, and implementing systemic solutions to improve performance and reliability. The position emphasizes a live-site first, metrics-driven culture with a focus on customer experience and incident management.

What you'd actually do

  1. Contribute to improving key metrics such as Job Mean Time to Interrupt, Nodes in Service, and Mean Time to Resolve on flagship supercomputers.
  2. Manage supercomputer operations by responding quickly to mitigate issues.
  3. Implement systemic solutions and mitigations for more complex issues affecting the performance or functionality of supercomputers.
  4. Review and write incident postmortems, and present insights that drive changes to reduce or eliminate incidents.
  5. Independently improve troubleshooting guides (TSGs), wikis, tests, and telemetry, adding comprehensive observability and monitoring capabilities.

Skills

Required

  • Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • Ability to meet Microsoft, customer and/or government security screening requirements

Nice to have

  • Bachelor's Degree in Computer Science
  • 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • Master's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python

What the JD emphasized

  • live-site first
  • metrics-driven culture
  • customer experience
  • critical incidents
  • actionable alerts
  • performance
  • functionality
  • observability
  • reliability
  • efficiency