Principal Software Engineer, CoreAI Workload Engines

Microsoft · Big Tech · Redmond, WA · Software Engineering

Principal Software Engineer focused on building and optimizing the foundational inference engines and APIs that power large-scale AI inference across Azure. The role drives production-grade serving improvements for OpenAI and open-source LLMs, targeting latency, throughput, availability, and cost efficiency. Responsibilities include hands-on engine changes, building experimentation capabilities, and designing inference serving architectures that support multitenant AI systems at global scale.

What you'd actually do

  1. Optimize inference engines for OpenAI and open-source models by implementing and shipping performance/efficiency improvements across runtime, scheduling, and serving paths (latency, throughput, utilization, availability, and cost).
  2. Run experiments end-to-end: formulate hypotheses, implement engine changes (including Python/PyTorch integration points where relevant), analyze results, and ship improvements behind guardrails.
  3. Build and use experimentation capabilities for large-scale AI inference (experiment lifecycle, tracking, metric modeling, comparability standards, automated analysis) so the team can iterate quickly and safely.
  4. Own serving availability and efficiency for Azure OpenAI Service workloads through tiered experimentation, lean segmentation, and multi-modal utilization across heterogeneous fleets, turning findings into shipped engine improvements.
  5. Design and evolve inference serving architectures to improve utilization and latency using techniques such as disaggregated serving, multi-token prediction, KV offload/retrieval, and quantization, validated via staged rollouts and production guardrails.
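Much of the measurement work behind items 1 and 2 comes down to benchmarking the serving path with warmup, percentile latencies, and throughput. A minimal sketch of that loop, where `fake_generate` and `benchmark` are illustrative stand-ins (not any Azure or OpenAI API), a real harness would call the engine under test:

```python
import statistics
import time

def fake_generate(prompt: str) -> str:
    # Stand-in for a real inference call; the sleep simulates compute.
    time.sleep(0.001)
    return prompt[::-1]

def benchmark(fn, prompts, warmup: int = 3):
    """Return (p50_ms, p95_ms, requests_per_sec) over the prompt set."""
    for p in prompts[:warmup]:  # warm up caches/allocators before timing
        fn(p)
    latencies = []
    start = time.perf_counter()
    for p in prompts:
        t0 = time.perf_counter()
        fn(p)
        latencies.append((time.perf_counter() - t0) * 1000.0)
    elapsed = time.perf_counter() - start
    latencies.sort()
    p50 = statistics.median(latencies)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    return p50, p95, len(prompts) / elapsed

p50, p95, rps = benchmark(fake_generate, ["hello"] * 50)
print(f"p50={p50:.2f}ms p95={p95:.2f}ms throughput={rps:.1f} req/s")
```

In production the same skeleton would run behind guardrails (staged rollout, comparability checks across runs) rather than as a one-off script.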

Skills

Required

  • Inference engines
  • LLM serving
  • Performance optimization
  • Systems software
  • Cloud infrastructure
  • Experimentation frameworks
  • GPU optimization
  • Distributed systems
  • API design
  • Benchmarking
  • Profiling
  • Debugging

Nice to have

  • Python/PyTorch integration
  • Networking (RDMA/InfiniBand)
  • Quantization
  • Multi-modal utilization

What the JD emphasized

  • production-grade inference serving improvements
  • experimentation capabilities
  • large scale inferencing
  • production guardrails
  • serving availability and efficiency
  • production improvements

Other signals

  • LLM inference engines
  • large scale AI inference
  • GPU inference
  • OpenAI and OSS models
  • performance optimization
  • experimentation capabilities