Principal Software Engineer

Microsoft · Big Tech · Redmond, WA · Software Engineering

Principal Software Engineer focused on optimizing GPU inference for large-scale deep learning models (LLMs/SLMs) within Microsoft's AI-native monetization platform, serving ads, shopping, and Copilot.

What you'd actually do

  1. Serve as the technological core of Microsoft's rapidly expanding digital advertising business.
  2. Accelerate Microsoft's large-scale deep learning inference for Ads, Shopping, Copilot, and other surfaces, covering both offline and online applications that serve OpenAI LLM models and next-generation LLMs/SLMs.
  3. Bridge state-of-the-art GPU and deep learning technologies with critical business applications.

Skills

Required

  • Bachelor's Degree in Computer Science or related technical field AND 8+ years of technical engineering experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python

Nice to have

  • Master's Degree in Computer Science or related technical field AND 12+ years of technical engineering experience
  • Bachelor's Degree in Computer Science or related technical field AND 15+ years of technical engineering experience
  • Solid experience in GPU inference optimization (CUDA, TensorRT, Triton, or custom GPU kernels)
  • Proficiency in profiling tools (Nsight, TensorBoard, PyTorch profiler) and ability to identify CPU/GPU bottlenecks
  • Deep understanding of LLM/SLM architectures (attention, embeddings, MoE, decoders)
  • Experience optimizing latency‑critical online services
  • Experience with model compression (quantization, distillation, SVD, low‑rank methods)
  • Experience in building high‑throughput inference serving stacks (continuous batching, KV‑cache optimizations, routing)
  • Familiarity with Microsoft’s DLIS, Talon routing, Triton/TensorRT‑LLM stack, and Azure/H100/A100 GPU environments
  • Publications, competition wins, or real‑world deployments related to model efficiency
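The "continuous batching" technique named in the serving-stack bullet can be sketched with a toy scheduler. This is an illustrative model only (the `Request` class, request lengths, and batch size are made up for the sketch), not Microsoft's actual serving stack:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int                # request id
    remaining_tokens: int   # decode steps still needed
    generated: int = 0      # tokens produced so far

def continuous_batching(requests, max_batch):
    """Toy iteration-level scheduler: every step decodes one token for each
    active request; a finished request frees its slot immediately so a waiting
    request can join mid-flight (unlike static batching, which waits for the
    whole batch to drain before admitting new work)."""
    waiting = deque(requests)
    active, completed, steps = [], [], 0
    while waiting or active:
        # Admit waiting requests into any free batch slots.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # One decode iteration over the current batch.
        steps += 1
        for r in active:
            r.generated += 1
            r.remaining_tokens -= 1
        # Retire finished requests immediately.
        completed.extend(r.rid for r in active if r.remaining_tokens == 0)
        active = [r for r in active if r.remaining_tokens > 0]
    return steps, completed

reqs = [Request(0, 2), Request(1, 5), Request(2, 3), Request(3, 1)]
steps, order = continuous_batching(reqs, max_batch=2)
# Continuous batching finishes in 6 decode steps; static batches of two
# ([2,5] then [3,1]) would need max(2,5) + max(3,1) = 8 steps.
```

The throughput win comes from the early-retirement step: short sequences stop occupying batch slots (and KV-cache memory) the moment they finish.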

What the JD emphasized

  • GPU inference optimization
  • LLM/SLM architecture
  • latency-critical online services
  • high-throughput inference serving stacks
