Principal Software Engineer - Performance

Microsoft · Big Tech · Mountain View, CA +1 · Software Engineering

Principal Software Engineer focused on optimizing the performance of AI model inference, particularly LLMs, across various hardware platforms (GPUs, Microsoft silicon). The role involves deep technical work on the AI software stack, from fundamental abstractions to system-level optimizations, aiming to improve efficiency and reduce costs for large-scale AI deployments, including those for Azure OpenAI service.

What you'd actually do

  1. Identify and drive improvements to the end-to-end inference performance of OpenAI and other state-of-the-art LLMs
  2. Measure and benchmark performance on Nvidia/AMD GPUs and first-party Microsoft silicon
  3. Optimize and monitor LLM performance and build software tooling that surfaces performance opportunities from the model level down to the systems and silicon level, helping reduce the computing fleet's footprint and meet Azure AI capex goals
  4. Enable fast time-to-market for LLMs and their deployments at scale by building software tools that speed porting of models to new Nvidia and AMD GPUs and Maia silicon
  5. Design, implement, and test functions or components for our AI/DNN/LLM frameworks and tools
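Item 2 above is about measuring and benchmarking inference performance. As a minimal, hardware-agnostic sketch of that kind of latency harness (the function name and the stand-in workload are illustrative, not part of any Microsoft tooling):

```python
import statistics
import time

def benchmark_latency(fn, warmup=5, iters=50):
    """Time a callable and report p50/p95 latency in milliseconds.

    Warmup iterations are discarded so one-time costs (allocation,
    JIT compilation, cache fills) do not skew the measurement.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * (len(samples) - 1))],
    }

# Stand-in workload; a real harness would invoke the model's forward
# pass and synchronize the accelerator before stopping the clock.
result = benchmark_latency(lambda: sum(i * i for i in range(10_000)))
print(sorted(result))
```

A GPU-side version of this would additionally need device synchronization around each timed call, since kernel launches are asynchronous.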

Skills

Required

  • C
  • C++
  • C#
  • Java
  • JavaScript
  • Python
  • software design and development skills
  • solving technical problems
  • building a full end-to-end AI stack

Nice to have

  • high performance applications
  • performance debugging and optimization on CPUs/GPUs
  • software engineering principles
  • computer architecture
  • GPU architecture
  • HW neural net acceleration
  • end-to-end performance analysis and optimization of state-of-the-art LLMs
  • HPC applications
  • GPU profiling tools
  • DNN/LLM inference
  • PyTorch
  • TensorFlow
  • ONNX Runtime
  • CUDA
  • ROCm
  • Triton

What the JD emphasized

  • performance
  • optimize performance
  • performance opportunities
  • performance debug and optimization
  • end-to-end performance analysis and optimization

Other signals

  • running AI models everywhere
  • inference performance of OpenAI and other state-of-the-art LLMs
  • trillions of inferences per day
  • large scale training and inferencing of models
  • optimize performance at all levels of abstraction including kernel, model, algorithm and system level