Member of Technical Staff, AI Networking - MAI Superintelligence Team

Microsoft · Big Tech · London, United Kingdom +2 · Software Engineering

This role focuses on designing, scaling, and optimizing high-performance networks for AI training and inference clusters. The engineer will work on the end-to-end networking architecture, from the link layer to fabric-wide systems, connecting thousands of GPUs. Responsibilities include benchmarking, profiling, debugging, and tuning AI workloads, engineering ultra-low-latency networks, and designing congestion-free transport mechanisms. The goal is to build networking systems that directly accelerate Microsoft's frontier AI models and support the development of advanced AI systems.

What you'd actually do

  1. Advanced RoCE transport design, congestion control, and ECN/WRED/DCTCP tuning
  2. Fabric architecture, topology planning, network modeling, and scaling strategy
  3. Telemetry, observability, reliability engineering, and automated troubleshooting
  4. Develop and tune the deployment of novel routing techniques to achieve reliability in large networks
  5. Work with world-class network designers like NVIDIA, Broadcom, and in-house silicon/network co-design teams

Skills

Required

  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience
  • Experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python

Nice to have

  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience
  • Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience
  • Experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python

What the JD emphasized

  • scale the world’s most advanced high-performance networks
  • enables multi-gigawatt AI supercomputers
  • supports the training of the most sophisticated AI models on the planet
  • design, bring up, and scale the distributed Ethernet and InfiniBand fabrics that connect hundreds of thousands of GPUs
  • AI training + inference cluster bring-up, performance benchmarking, and root-cause analysis
  • develop the pretraining compute roadmap

Other signals

  • building the fabric that connects frontier-class datacenters
  • gather data and insights to develop the pretraining compute roadmap