Senior Principal Engineering Manager

Microsoft · Big Tech · Redmond, WA +1 · Software Engineering

Lead and grow a team building and operating world-class research compute infrastructure, including large-scale GPU clusters and agentic development tools, for Microsoft Research globally.

What you'd actually do

  1. Lead, mentor, and grow the engineering team that builds MSR’s AI research infrastructure.
  2. Recruit and develop exceptional engineering talent and build a diverse team, covering hiring, onboarding, career development, and performance management.
  3. Drive execution across the team by setting clear goals, tracking milestones, managing dependencies, and ensuring accountability for delivering complex infrastructure projects on time and at high quality.
  4. Lead team culture and process changes, cultivating an AI-first mentality that accelerates our progress through agentic coding, automation, and skills development.
  5. Provide technical vision and judgment on the team's architecture, strategy, and roadmap — spanning supercomputer GPU clusters, high performance networking, workload optimization, researcher tools, and agentic workflows — while empowering engineers to own deep technical details.

Skills

Required

  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.

Nice to have

  • 5+ years of people management experience leading software engineering teams, including managing principal engineers.
  • Experience building or operating infrastructure for large-scale distributed systems, cloud platforms, or artificial intelligence (AI)/machine learning (ML) workloads.
  • Track record of driving execution on complex, multi-workstream infrastructure projects with clear milestones and accountability.
  • Technical fluency in one or more of: large-scale compute clusters, GPU infrastructure, scheduling and orchestration (Kubernetes, Volcano), or high-performance computing (HPC) environments.
  • Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch.
  • Expertise in networking (InfiniBand, NVLink), storage systems, or distributed training parallelism strategies.
  • A track record of strong cross-functional partnerships, including the ability to align on strategic direction, deliver joint accountabilities, and develop relationships with staff members with widely varied expertise.
  • Experience scaling engineering teams through significant growth phases (hiring, onboarding, and integrating new engineers into a high-performing team).
  • Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 15+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.

What the JD emphasized

  • AI research infrastructure
  • agentic development
  • agentic coding
  • agentic workflows

Other signals

  • AI infrastructure
  • GPU clusters
  • agentic development
  • research compute