Principal Software Engineer

Microsoft · Big Tech · United States · Software Engineering

This role focuses on managing and developing the core platform and fleet of AI High Performance Computing products for Microsoft Azure. The engineer will design and develop high-volume, low-latency telemetry pipelines to provide insights into customer-facing issues across the infrastructure stack, from datacenter events to hardware and networking subsystems. The goal is to improve job reliability and reduce job interrupts on flagship supercomputers used by top-tier AI customers. The role requires expertise in large-scale HPC & GPU systems, cloud computing, and high-performance data processing infrastructure.

What you'd actually do

  1. Architect, design, and develop high-volume, low-latency, end-to-end event pipelines that provide first-to-know insights into events that cause job interrupts and affect job reliability
  2. Conduct analysis of existing event pipelines to evaluate fidelity, granularity and latency of critical events
  3. Contribute to improving key metrics such as Job Mean Time to Interrupt, Nodes in Service, and Mean Time to Resolve on flagship supercomputers by enabling data scientists and domain experts to use the telemetry to identify events and issues at the intersection of datacenter and hardware, develop hypotheses, conduct A/B tests, and synthesize results
  4. Partner with cross-organizational teams to evaluate available telemetry and its latency, and drive the architecture, design, development, and deployment of end-to-end solutions to manage core infrastructure, including current and next-generation datacenter, IT hardware, power, and cooling technologies
  5. Drive engineering and operational excellence, using issues and learnings from strategic customers' usage scenarios to improve product features and capabilities

Skills

Required

  • C
  • C++
  • C#
  • Java
  • JavaScript
  • Python
  • AzPubSub
  • Event Hubs
  • Azure Stream Analytics
  • Kafka
  • Grafana
  • Prometheus
  • AI/HPC system management
  • High-Speed Networks
  • HPC Storage
  • Cloud Infrastructure

Nice to have

  • operating AI/HPC systems
  • developing and running AI/HPC applications on clusters
  • operating Cloud Infrastructure
  • Datacenter technologies: power, cooling, IT hardware, telemetry

What the JD emphasized

  • high volume low latency
  • customer facing issues
  • job interrupts
  • job reliability
  • datacenter events
  • hardware and networking subsystem events
  • large-scale High-Performance Computing & GPU systems
  • cloud computing platforms
  • high-performance data processing infrastructure
  • managing the largest scale of supercomputers