Principal Software Engineer, CoreAI

Microsoft · Big Tech · Redmond, WA +1 · Software Engineering

Principal Engineer on the AI Core Infrastructure team, responsible for the large-scale GPU management infrastructure and the inference/training platforms powering Microsoft's AI workloads. The role involves setting roadmaps, designing backend services, and providing insights that help customers monitor, troubleshoot, and scale AI training workloads on supercomputers. Focus areas: ML infrastructure, distributed systems, and observability.

What you'd actually do

  1. Set the roadmap and drive execution of the training infrastructure built for AI workloads at supercomputer scale.
  2. Design, develop, and ship the backend services that power those AI workloads.
  3. Deliver deep insights that empower customers to troubleshoot and optimize their large-scale AI workloads.
  4. Collaborate closely with engineers and data scientists across Microsoft’s internal research teams building models to shape the infrastructure.
  5. Leverage production telemetry to influence next-generation infrastructure design, boosting efficiency, reliability, and performance.

Skills

Required

  • Bachelor's Degree in Computer Science or related technical field
  • 6+ years of technical engineering experience coding in languages such as C, C++, C#, Java, JavaScript, or Python, or equivalent experience.
  • Ability to meet Microsoft, customer and/or government security screening requirements

Nice to have

  • Excellent analytical and problem-solving skills
  • Expertise with distributed observability technologies (e.g., Prometheus, OpenTelemetry, Grafana)
  • 2+ years of experience designing or scaling telemetry pipelines for high-throughput production systems.
  • Advanced, hands-on experience with production ML systems, large-scale training infrastructure, and NCCL and CUDA libraries and tools.
  • 6+ years of experience building or operating distributed systems, with a strong focus on reliability, scalability, and performance.
  • Understanding of Docker, Kubernetes, scalable architectures, and automation for production systems.

What the JD emphasized

  • large-scale
  • supercomputer scale
  • training infrastructure
  • inference and training platforms
  • monitor, troubleshoot, and scale their AI training workloads

Other signals

  • AI Core Infrastructure team
  • GPU management infrastructure
  • inference and training platforms
  • large-scale pre-training, post-training, and fine-tuning