Research Intern - AIP AI Knowledge, Multimodal AI

Microsoft · Big Tech · Redmond, WA +1 · Applied Sciences

This Research Intern role focuses on multimodal AI, specifically the synergy between vision and language. The intern will explore LLMs, SLMs, and VLMs for tasks such as video understanding, document analysis, and multi-page question answering, gaining hands-on experience in leveraging LLMs for document understanding, grounding, and retrieval-based generation. The role involves prototyping, experimentation, and publishing research.

What you'd actually do

  1. Research Interns put inquiry and theory into practice.
  2. Alongside fellow doctoral candidates and some of the world’s best researchers, Research Interns learn, collaborate, and network for life.
  3. Research Interns not only advance their own careers but also contribute to exciting research and development strides.
  4. During the 12-week internship, Research Interns are paired with mentors and expected to collaborate with other Research Interns and researchers, present findings, and contribute to the vibrant life of the community.
  5. Research internships are available in all areas of research, and are offered year-round, though they typically begin in the summer.

Skills

Required

  • Currently enrolled in a PhD program in Computer Vision, Natural Language Processing, Deep Learning, Machine Learning, AI, or a related field
  • At least one year of experience in LLM, NLP, computer vision, deep learning, or multimodal research, with hands-on deep learning experience

Nice to have

  • Proficient algorithmic problem solving and software development skills (Python, C/C++)
  • Experience with open-source tools such as PyTorch
  • Publication(s) in top-tier conferences or journals in related fields (e.g., ACL, CVPR, ECCV, ICCV, EMNLP, NAACL, NeurIPS, ICML, ICLR, IJCV, PAMI, etc.)

What the JD emphasized

  • multimodal AI
  • Large Language Models (LLMs)
  • Small Language Models (SLMs)
  • Vision-Language Models (VLMs)
  • video understanding
  • document layout analysis
  • chart interpretation
  • multi-page question answering
  • LLMs for document understanding
  • grounding
  • retrieval-based generation
  • PhD program
  • at least 1 year of experience in LLM, NLP, computer vision, deep learning, or multimodal research with hands-on deep learning experience
