Member of Technical Staff, AI Systems Engineer - Microsoft Superintelligence

Microsoft · Big Tech · Zürich, ZH, Switzerland · Software Engineering

The role focuses on making custom AI silicon a first-class backend for AI inference frameworks like SGLang: optimizing LLM inference performance and developing custom operators. It involves working with hardware accelerators and potentially non-CUDA ecosystems, with the goal of improving AI workload efficiency.

What you'd actually do

  1. Architect and develop the backend integration to make our custom AI chip a first-class citizen in SGLang.
  2. Write custom C++ / PyTorch extensions that map SGLang’s primitive operations (e.g., RadixAttention, FlashAttention, matrix multiplications) to our custom chip's proprietary software layer.
  3. Profile and optimize end-to-end LLM inference latency, throughput, and memory utilization (e.g., PagedAttention-style KV-cache management) on our hardware.
  4. Work closely with our hardware architecture and compiler teams to provide feedback on our custom software stack and silicon design based on framework-level bottlenecks.
  5. Build robust testing pipelines to validate model accuracy and performance parity against standard GPU baselines.
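Item 5 above — validating accuracy parity against GPU baselines — typically reduces to elementwise tolerance checks on model outputs. A minimal sketch, mirroring `torch.allclose`-style semantics (pass when `|ref - cand| <= atol + rtol * |ref|`) but using plain Python lists so it is self-contained; the logit values and tolerances are hypothetical:

```python
def allclose(reference, candidate, rtol=1e-3, atol=1e-5):
    """Parity check in the spirit of torch.allclose:
    pass when |ref - cand| <= atol + rtol * |ref| for every element."""
    if len(reference) != len(candidate):
        return False  # shape mismatch is an automatic failure
    return all(abs(r - c) <= atol + rtol * abs(r)
               for r, c in zip(reference, candidate))

# Hypothetical logits: a GPU baseline vs. the custom accelerator backend.
gpu_logits = [0.12, -1.5, 3.2, 0.0]
chip_logits = [0.1201, -1.4995, 3.1998, 1e-6]

ok = allclose(gpu_logits, chip_logits)  # True under these tolerances
```

In practice a pipeline like this would run over full model outputs (logits, hidden states, sampled tokens) and report per-tensor worst-case error, with tolerances tuned per dtype and operator.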

Skills

Required

  • Systems programming
  • ML infrastructure
  • AI compilers
  • Python (memory management, concurrent programming)
  • LLM Inference Engines (SGLang, vLLM, DeepSpeed-FastGen, or TensorRT-LLM)
  • PyTorch C++ extensions
  • Custom operators
  • Integrating machine learning workloads with hardware accelerators (GPUs, TPUs, NPUs) using custom SDKs, APIs, or low-level drivers
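The last required skill — routing framework-level operations to an accelerator's SDK — can be sketched as a small dispatch layer: ops the chip supports are offloaded, everything else falls back to a reference implementation. Every name here (`FakeSDK`, the op names, the fallback table) is hypothetical, standing in for a real vendor SDK:

```python
class FakeSDK:
    """Stub for a proprietary accelerator SDK; here it supports only matmul."""
    supported = {"matmul"}

    def run(self, op, *args):
        if op == "matmul":
            a, b = args
            # Naive nested-list matmul, standing in for an offloaded kernel.
            return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
                    for row in a]
        raise NotImplementedError(op)

def reference_add(a, b):
    """CPU fallback for an op the accelerator does not support."""
    return [x + y for x, y in zip(a, b)]

class Dispatcher:
    """Route each op to the SDK when supported, else to a CPU fallback."""
    def __init__(self, sdk):
        self.sdk = sdk
        self.fallbacks = {"add": reference_add}

    def call(self, op, *args):
        if op in self.sdk.supported:
            return self.sdk.run(op, *args)  # offload to the accelerator
        return self.fallbacks[op](*args)    # stay on the host

d = Dispatcher(FakeSDK())
d.call("matmul", [[1, 2], [3, 4]], [[5, 6], [7, 8]])  # offloaded path
d.call("add", [1, 2], [3, 4])                          # fallback path
```

Real backends (e.g., SGLang or vLLM hardware plugins) follow the same shape at much larger scale: a capability query against the SDK, then per-op routing with graceful fallback.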

Nice to have

  • Non-CUDA software ecosystems (AMD ROCm, AWS Neuron, Google XLA)
  • AI compilers and intermediate representations (MLIR, Apache TVM, OpenAI Triton)
  • LLM architectures (Transformers, MoE)
  • Attention algorithms (FlashAttention v2/v3)
  • AI silicon startup experience
  • Custom accelerators (Google TPU, AWS Trainium)

What the JD emphasized

  • custom AI silicon
  • AI inference frameworks
  • SGLang
  • LLM inference latency, throughput, and memory utilization
  • hardware accelerators
  • non-CUDA software ecosystems

Other signals

  • foundational AI infrastructure
  • large-scale training and inference
  • custom AI chip's proprietary SDK