Principal Software Engineer, Performance Tooling

Microsoft Microsoft · Big Tech · Redmond, WA +2 · Software Engineering

Principal Software Engineer focused on AI performance tooling and validation for LLMs, including defining technical strategy, architecting benchmarking systems, leading performance investigations, and influencing stakeholders across Microsoft and OpenAI to optimize inference performance and hardware efficiency.

What you'd actually do

  1. Define and drive the technical strategy for AI performance tooling and validation across multiple layers of the AI software stack, including programming models, compilers, runtimes, libraries, and model-serving APIs.
  2. Architect scalable benchmarking and regression-detection systems for OpenAI and other state-of-the-art LLMs across GPUs, Microsoft accelerators, and emerging AI hardware platforms.
  3. Lead deep performance investigations across model architecture, kernels, runtimes, networking, scheduling, and hardware behavior, and guide teams toward durable optimizations for production-scale training and inference.
  4. Establish performance quality bars, readiness signals, and operating mechanisms that enable faster model and hardware bring-up while reducing regressions, deployment risk, and total hardware footprint.
  5. Influence and align senior engineers, researchers, product leaders, and hardware partners across Microsoft and OpenAI to deliver high-impact, production-ready AI performance improvements.

Skills

Required

  • C++
  • Python
  • distributed systems
  • AI inference/training workloads
  • accelerator-backed compute platforms

Nice to have

  • PyTorch
  • TensorFlow
  • ONNX Runtime
  • CUDA
  • ROCm
  • Triton
  • computer architecture
  • GPU/accelerator architecture
  • hardware/software co-design
  • profiling tools
  • tracing tools
  • observability tools

What the JD emphasized

  • performance tooling
  • performance validation
  • performance investigations
  • performance insights
  • performance improvements
  • performance analysis
  • performance

Other signals

  • Owns inference performance of OpenAI and other state of the art large language model (LLM) models
  • Work directly with OpenAI on the models hosted on the Azure OpenAI service
  • Lead the design of benchmarking and performance tooling systems used to evaluate OpenAI and other frontier LLMs
  • Identify systemic bottlenecks, and translate performance insights into production-ready improvements
  • Reduce time-to-deploy, improve hardware efficiency, and directly support Microsoft Azure's capex goals