Research Intern - Systems for Efficient AI

Microsoft · Redmond, WA +1 · Applied Sciences

Research intern focused on optimizing AI inference systems, including LLM inference, KV caching, request scheduling, and GPU orchestration, to improve latency, throughput, and cost-efficiency.

What you'd actually do

  1. Research Interns put inquiry and theory into practice.
  2. Alongside fellow doctoral candidates and some of the world's best researchers, Research Interns learn, collaborate, and build professional networks that last a career.
  3. Research Interns advance their own careers while contributing to exciting research and development efforts.
  4. During the 12-week internship, Research Interns are paired with mentors and are expected to collaborate with other Research Interns and researchers, present their findings, and contribute to the vibrant life of the community.
  5. Research internships are available in all areas of research and are offered year-round, though they typically begin in the summer.

Skills

Required

  • Accepted or currently enrolled in a PhD program in Computer Science, Software Engineering, Electrical Engineering, or a related STEM field

Nice to have

  • LLM architectures
  • Systems for LLM inference
  • AI hardware
  • GPUs
  • CUDA/ROCm frameworks
  • Computer systems
  • Networks
  • Conducting research
  • Writing peer-reviewed publications
  • Proficient written and verbal communication skills
  • Experience working in a cross-functional, multi-disciplinary setting across research and product
  • Proficient software development skills
  • C++
  • Python

What the JD emphasized

  • PhD program in Computer Science, Software Engineering, Electrical Engineering, or a related STEM field
  • Experience with LLM architectures, systems for LLM inference, and/or AI hardware
  • Experience with GPUs and understanding of CUDA/ROCm frameworks
  • Experience in conducting research and writing peer-reviewed publications

Other signals

  • AI inference optimization
  • LLM inference
  • KV caching optimizations
  • request scheduling/batching mechanisms
  • GPU fleet orchestration
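
The signals above name concrete serving techniques. As a rough, hypothetical sketch of why one of them, KV caching, matters for LLM inference latency (an operation-count toy model, not any production system or Microsoft code):

```python
# Toy illustration: why a KV cache speeds up autoregressive decoding.
# Without caching, each decode step re-processes the entire prefix;
# with cached keys/values, only the newly generated token is processed.
# The function names and counts here are illustrative assumptions.

def decode_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Count per-token 'encode' operations when the whole sequence
    is re-processed at every decode step (naive decoding)."""
    ops = 0
    seq_len = prompt_len
    for _ in range(new_tokens):
        ops += seq_len + 1   # re-encode the full prefix plus the new token
        seq_len += 1
    return ops

def decode_with_kv_cache(prompt_len: int, new_tokens: int) -> int:
    """Count operations when keys/values for past tokens are cached:
    one prefill pass over the prompt, then one op per generated token."""
    ops = prompt_len         # prefill: encode the prompt once
    ops += new_tokens        # decode: process only the new token each step
    return ops

naive = decode_without_cache(prompt_len=128, new_tokens=64)
cached = decode_with_kv_cache(prompt_len=128, new_tokens=64)
print(naive, cached)  # work grows quadratically vs. linearly in sequence length
```

The same accounting intuition underlies the other signals: request batching amortizes weight loads across concurrent sequences, and paged KV-cache management keeps GPU memory from fragmenting as sequences grow.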