Principal Software Engineer, CoreAI

Microsoft · Redmond, WA +2 · Software Engineering

This role focuses on building and operating the foundational GPU-accelerated infrastructure for large-scale AI training and inference across Azure. It involves designing systems for GPU management, scheduling, isolation, and sharing, as well as optimizing the performance, reliability, and utilization of GPU fleets. The role also requires driving end-to-end platform features, including observability and diagnostics, and influencing platform architecture.

What you'd actually do

  1. Design and build GPU-accelerated infrastructure for training and inference workloads, spanning bare metal, virtual machines, and containerized environments.
  2. Develop systems for GPU device management, scheduling, isolation, and sharing (e.g., partial GPU allocation, multitenant usage).
  3. Build and operate advanced orchestration and resource governance scenarios using platforms such as AKS, Dynamic Resource Allocation (DRA), and related Kubernetes ecosystem capabilities to enable fair sharing, isolation, and efficient utilization of accelerated resources.
  4. Build and evolve virtualization and container stacks to support modern AI workloads, including secure and confidential compute scenarios.
  5. Optimize performance, reliability, and utilization across large GPU fleets, including scale-up and scale-out configurations.
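The orchestration and sharing model in item 3 can be sketched as a pair of Kubernetes manifests. This is a minimal sketch, assuming a cluster with Dynamic Resource Allocation enabled (beta as of Kubernetes 1.32, API group `resource.k8s.io/v1beta1`) and a vendor DRA driver installed; the DeviceClass name `gpu.example.com` and the container image are placeholders, not a real deployment.

```yaml
# A ResourceClaim asking the DRA driver for one device from a GPU
# DeviceClass that the (assumed) vendor driver publishes.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.example.com   # placeholder DeviceClass name
---
# A Pod that consumes the claim. The scheduler only places the Pod on a
# node where the claim can be allocated, which is what enables the fair
# sharing and isolation the role describes.
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
spec:
  resourceClaims:
  - name: gpu
    resourceClaimName: gpu-claim
  containers:
  - name: worker
    image: example.com/inference:latest  # placeholder image
    resources:
      claims:
      - name: gpu                        # binds this container to the claim
```

Whether this yields partial-GPU or multitenant sharing depends on what the driver's DeviceClass exposes (e.g., MIG partitions or time-sliced slices), which varies by vendor.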

Skills

Required

  • Bachelor's degree in Computer Science or a related technical field
  • 6+ years of technical engineering experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, and Python, or equivalent experience
  • Experience designing and operating large-scale production infrastructure with high reliability and performance requirements
  • Strong problem-solving skills and the ability to debug complex, cross-layer systems issues
  • Demonstrated technical leadership, including mentoring engineers and driving cross-team architectural alignment
  • Hands-on experience with virtualization and/or container platforms (e.g., VMs, Kubernetes, container runtimes)
  • Strong collaboration and communication skills, with the ability to work across organizational boundaries

Nice to have

  • Familiarity with distributed training and inference stacks (e.g., NCCL-style collectives, model/data parallelism)
  • Experience building or operating multitenant AI platforms in cloud environments
  • Familiarity with high-performance networking and low-latency communication stacks
  • Familiarity with GPU-accelerated computing (e.g., CUDA, GPU drivers, device plugins, or runtime integration)
  • Familiarity with GPU virtualization, passthrough, or partitioning technologies
  • Knowledge of confidential computing, trusted execution environments, or hardware-backed isolation

What the JD emphasized

  • large-scale AI training and inference
  • multitenant AI systems
  • high reliability and performance requirements
  • complex, cross-layer systems issues
  • large GPU fleets

Other signals

  • GPU infrastructure
  • AI training and inference
  • large-scale AI systems
  • multitenant AI platforms