Senior Researcher - Efficient AI

Microsoft · Big Tech · Redmond, WA +1 · Research Sciences

Applied research role focused on advancing efficiency across the AI stack for large-scale generative AI experiences in Microsoft 365. The role involves optimizing AI serving systems at every level, from algorithms and systems down to hardware and kernels, with end-to-end ownership from research to production deployment.

What you'd actually do

  1. Formulate, develop, and evaluate new algorithmic and system-level approaches for end-to-end AI serving, using analytical modeling and large-scale measurement to study token-level latency, tail latency (p95/p99), throughput-per-dollar, cold-start behavior, warm pool strategies, and capacity planning under multi-tenant SLOs and variable sequence lengths.
  2. Design and experimentally evaluate endpoint configuration and execution policies, including batching, routing, and scheduling strategies, tensor and pipeline parallelism, quantization and precision profiles, speculative decoding, and chunked or streaming generation, and drive the most promising approaches through robust rollout and validation into production.
  3. Perform hardware- and kernel-aware optimization by collaborating closely with model, kernel, compiler, and hardware teams to align serving algorithms with attention/KV innovations and accelerator capabilities.
  4. Build and benchmark experimental prototypes, and run large-scale measurements, to validate research ideas and drive them toward production readiness; produce clear technical documentation, design reviews, and operational playbooks.
  5. Publish research results, file patents, and, where appropriate, contribute to open-source systems and serving frameworks.

Skills

Required

  • Doctorate in relevant field OR Master's Degree in relevant field AND 3+ years related research experience OR Bachelor's Degree in relevant field AND 4+ years related research experience OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements

Nice to have

  • Designing and optimizing efficient inference systems
  • algorithmic optimization
  • parallel computing
  • request orchestration
  • attention and KV‑cache optimizations
  • batching and scheduling strategies
  • cost‑aware deployment
  • machine learning frameworks (e.g., PyTorch, TensorFlow)
  • inference serving frameworks (e.g., vLLM, Triton Inference Server, TensorRT-LLM, ONNX Runtime, Ray Serve, DeepSpeed-MII)
  • GPU programming and optimization
  • CUDA
  • ROCm
  • Triton
  • PTX
  • CUTLASS
  • C++
  • Python for high-performance systems
  • code quality and profiling/debugging skills
  • Research impact through publications and/or patents
  • hands‑on experience taking research ideas through execution and delivery in production

What the JD emphasized

  • end-to-end AI serving
  • hardware- and kernel-level optimizations
  • production readiness
  • production

Other signals

  • end-to-end AI serving
  • algorithmic and systems optimization
  • hardware and kernel-level optimizations
  • production deployment