Member of Technical Staff, LLM Inference - MAI Superintelligence Team

Microsoft · Big Tech · Mountain View, CA (+4 locations) · Software Engineering

This role focuses on building and maintaining tools and systems for LLM inference, optimizing compute efficiency, and enabling researchers to run models across research and production workloads. It involves working with inference frameworks, GPU kernel programming, and distributed systems to improve model performance.

What you'd actually do

  1. Work alongside researchers and engineers to implement frontier AI research ideas.
  2. Introduce new systems, tools, and techniques to improve model inference performance.
  3. Build tools to help debug performance bottlenecks, numeric instabilities, and distributed systems issues (a brief sketch of one such tool follows this list).
  4. Build tools and establish processes to enhance the team’s collective productivity.
  5. Find ways to overcome roadblocks and deliver your work to users quickly and iteratively.
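
To give item 3 a concrete flavor, here is a minimal sketch of the kind of debugging tool it describes: a PyTorch forward hook that flags numeric instabilities (NaN/Inf activations) per module. Everything here is illustrative; the function name and toy model are placeholders, not anything from the posting.

```python
import torch
import torch.nn as nn

def attach_nan_guards(model: nn.Module) -> None:
    """Register forward hooks that flag non-finite activations per module."""
    def make_hook(name: str):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                print(f"[nan-guard] non-finite output in {name!r} ({module.__class__.__name__})")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Usage: the guards fire during any forward pass, e.g. while reproducing an instability.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU())
attach_nan_guards(model)
_ = model(torch.full((1, 8), float("nan")))  # deliberately triggers the guards
```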

Skills

Required

  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience

Nice to have

  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience
  • Experience with generative AI
  • Experience with distributed computing
  • Expertise in Python and its ecosystem (e.g., uv, pybind/nanobind, FastAPI)
  • Experience with large scale production inference
  • Experience with GPU kernel programming
  • Experience benchmarking, profiling, and optimizing PyTorch generative AI models (a minimal profiling sketch follows this list)
  • Experience with open source inference frameworks like vLLM and SGLang
  • Working familiarity with the material in the JAX scaling book
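
As an illustration of the benchmarking/profiling bullet above, a minimal torch.profiler sketch; the model and shapes are placeholder assumptions, since a real run would profile an actual LLM under production batch sizes.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder stand-in for a generative model; illustrative only.
model = torch.nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True).eval()
x = torch.randn(4, 128, 256)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
    activities.append(ProfilerActivity.CUDA)

with torch.no_grad(), profile(activities=activities, record_shapes=True) as prof:
    for _ in range(10):  # a few iterations so the averages are meaningful
        model(x)

# Top operators by self time: the usual starting point for kernel-level optimization.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```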

What the JD emphasized

  • implement frontier AI research ideas
  • improve model inference performance
  • debug performance bottlenecks
  • distributed systems issues
  • optimize models for inference
  • internals of open-source inference frameworks like vLLM and SGLang
  • large scale production inference
  • GPU kernel programming
  • benchmarking, profiling, and optimizing PyTorch generative AI models

Other signals

  • optimizing compute efficiency
  • enabling cutting-edge research and production deployment
  • vertically integrated, owning everything from kernels to architecture co-design to distributed systems to profiling and testing tools