Research Intern - Reliability of Cloud and AI Systems

Microsoft · Big Tech · Redmond, WA +1 · Applied Sciences

Research intern role focused on applying LLM and Agentic technology to improve the reliability of large-scale cloud and AI systems. The role involves analyzing production data, designing and building novel tools for monitoring and troubleshooting, and validating solutions on real Microsoft services.

What you'd actually do

  1. Push the boundaries: Apply cutting-edge Large Language Model (LLM) and Agentic technology to solve reliability challenges in cloud and AI systems.
  2. Innovate in failure diagnosis and prevention: Build novel tools for monitoring, logging, and troubleshooting at scale.
  3. Validate your ideas in the wild: Integrate and evaluate your solutions on real Microsoft services and incidents.

Skills

Required

  • Enrollment in a PhD program in Computer Science or a related STEM field
  • Experience building scalable and reliable systems
  • Ability to develop an original research agenda
  • Ability to collaborate effectively with other researchers and product development teams

Nice to have

  • Strong interpersonal skills, including cross-group and cross-cultural collaboration
  • Ability to think unconventionally to derive creative and innovative solutions

What the JD emphasized

  • cutting-edge Large Language Model (LLM) and Agentic technology
  • reliability challenges in cloud and AI systems
  • novel tools for monitoring, logging, and troubleshooting at scale
  • real Microsoft services and incidents

Other signals

  • applying LLM and Agentic technology to reliability challenges
  • innovate in failure diagnosis and prevention
  • integrate and evaluate solutions on real Microsoft services