Senior Researcher - Efficient AI

Microsoft · Big Tech · Bengaluru, KA, IN · Research Sciences

Senior Researcher focused on advancing efficiency across the AI stack for generative AI serving systems, spanning models, ML frameworks, cloud infrastructure, and hardware. The role involves algorithmic and systems optimization for latency, throughput, and cost, with a strong emphasis on driving research ideas through prototyping, validation, and production deployment.

What you'd actually do

  1. Formulate, develop, and evaluate new algorithmic and system-level approaches for end-to-end AI serving, using analytical modeling and large-scale measurement to study token-level latency, tail latency (p95/p99), throughput-per-dollar, cold-start behavior, warm pool strategies, and capacity planning under multi-tenant SLOs and variable sequence lengths.
  2. Design and experimentally evaluate endpoint configuration and execution policies, including batching, routing, and scheduling strategies, tensor and pipeline parallelism, quantization and precision profiles, speculative decoding, and chunked or streaming generation, and drive the most promising approaches through robust rollout and validation into production.
  3. Perform hardware- and kernel-aware optimization by collaborating closely with model, kernel, compiler, and hardware teams to align serving algorithms with attention/KV innovations and accelerator capabilities.
  4. Build and benchmark experimental prototypes and large-scale measurements to validate research ideas and drive them toward production readiness; produce clear technical documentation, design reviews, and operational playbooks.
  5. Publish research results, file patents, and, where appropriate, contribute to open-source systems and serving frameworks.
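The metrics named in item 1 can be made concrete with a small sketch. The percentile helper and the workload, cost, and throughput figures below are illustrative assumptions, not values from this posting:

```python
import random

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, int(round(p / 100 * len(s))) - 1))
    return s[k]

# Hypothetical per-request latencies (seconds) for one serving endpoint.
random.seed(0)
latencies = [random.expovariate(1 / 0.25) for _ in range(10_000)]

p95 = percentile(latencies, 95)  # tail latency targets from the JD
p99 = percentile(latencies, 99)

# Throughput-per-dollar: tokens served per hour / hourly instance cost.
tokens_per_s = 5_000    # assumed sustained decode throughput
cost_per_hour = 12.0    # assumed cost of one multi-GPU VM
tokens_per_dollar = tokens_per_s * 3600 / cost_per_hour

print(f"p95={p95:.3f}s p99={p99:.3f}s tokens/$={tokens_per_dollar:,.0f}")
```

Capacity planning in this style compares such tail-latency and cost curves across candidate endpoint configurations before rollout.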

Skills

Required

  • Doctorate in relevant field OR Master's Degree in relevant field AND 3+ years related research experience OR Bachelor's Degree in relevant field AND 4+ years related research experience OR equivalent experience.
  • Demonstrated expertise in algorithmic optimization, parallel computing, queuing and scheduling theory, and practical request orchestration under strict SLO constraints.
  • Strong understanding of GPU architecture and memory hierarchies.
  • Proficiency in C++ and Python for high-performance systems, with strong code quality and profiling/debugging skills.
  • Proven record of research impact through publications and/or patents, and experience carrying ideas through to systems that operate at scale in real production environments.
  • Ability to meet Microsoft, customer, and/or government security screening requirements.
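The queuing-theory expertise asked for above can be illustrated with the classic M/M/1 result for mean time in system. The arrival rate, service rate, and SLO figures are assumptions for the sketch, not values from this posting:

```python
def mm1_latency(arrival_rate, service_rate):
    """Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda).
    Valid only when service_rate > arrival_rate (utilization < 1)."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: utilization >= 1")
    return 1.0 / (service_rate - arrival_rate)

# Illustrative numbers: 80 req/s arriving, server completes 100 req/s.
lam, mu = 80.0, 100.0
w = mm1_latency(lam, mu)   # 0.05 s mean time in system
util = lam / mu            # 0.8 utilization
slo = 0.2                  # hypothetical 200 ms latency SLO
print(f"utilization={util:.2f} mean_latency={w * 1000:.0f}ms slo_ok={w < slo}")
```

Note how mean latency diverges as utilization approaches 1, which is why SLO-constrained orchestration must hold headroom rather than run servers at full load.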

Nice to have

  • Deep understanding of transformer inference efficiency techniques such as sharding strategies, attention optimizations, paged KV caches, speculative decoding, LoRA, sequence packing, continuous batching, and quantization.
  • 3+ years of experience with machine learning frameworks (e.g., PyTorch, TensorFlow) and inference serving frameworks (e.g., vLLM, Triton Inference Server, TensorRT-LLM, ONNX Runtime, Ray Serve, DeepSpeed-MII).
  • 3+ years of experience in GPU programming and optimization, with expert knowledge of CUDA, ROCm, Triton, PTX, CUTLASS, or similar GPU programming frameworks.
  • Background in cost and performance modeling, autoscaling, and multi-region deployment or disaster recovery.
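Among the inference techniques listed above, continuous batching can be sketched in a few lines. This is a toy model of the scheduling idea only (no attention, KV cache, or real tokens); the request lengths and batch size are invented for illustration:

```python
from collections import deque

def continuous_batching(requests, max_batch):
    """Toy continuous batching: at each decode step, finished sequences
    leave the running batch and queued requests join immediately, so the
    batch stays full instead of waiting for the whole batch to drain.
    `requests` maps request id -> number of tokens to generate."""
    queue = deque(requests.items())
    running = {}  # request id -> tokens remaining
    steps = 0
    while queue or running:
        # Admit queued requests into any free batch slots.
        while queue and len(running) < max_batch:
            rid, n = queue.popleft()
            running[rid] = n
        # One decode step advances every running sequence by one token.
        steps += 1
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]  # finished: slot frees up this step
    return steps

# Three short requests and one long one, batch size 2: short requests
# backfill the slot freed by each finished sequence (8 steps total,
# versus 10 for static batching of {a,b} then {c,d}).
print(continuous_batching({"a": 2, "b": 8, "c": 2, "d": 2}, max_batch=2))
```

Production schedulers (e.g., in systems like vLLM) add admission control, KV-cache memory accounting, and preemption on top of this basic fill-the-batch loop.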

What the JD emphasized

  • end-to-end AI serving
  • production
  • hardware- and kernel-aware optimization
  • production readiness
  • production environments

Other signals

  • end-to-end AI serving optimization
  • hardware/software co-design
  • production deployment