Member of Technical Staff, Pre-training Infrastructure - MAI Superintelligence Team

Microsoft · Mountain View, CA · Software Engineering

This role focuses on building and optimizing the software stack behind massive GPU clusters and high-throughput storage systems in support of cutting-edge AI research. You will work closely with model scientists to scale up the latest research recipes, implement new forms of distributed training parallelism, and ensure the reliability and performance of thousands of GPUs across our supercomputing fleet. Profiling, benchmarking, debugging, and fine-grained optimization are core to this role, demanding both engineering rigor and creativity.
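The profiling and benchmarking work described above usually starts with a disciplined timing harness. Below is a minimal, dependency-free sketch; the function names and toy workload are illustrative, and a real GPU benchmark would additionally synchronize the device (e.g. `torch.cuda.synchronize()`) before reading the clock:

```python
import statistics
import time

def benchmark(fn, *args, warmup=5, iters=50):
    """Time fn, discarding warmup runs; return (median, p99) seconds.

    Device synchronization before each clock read is omitted here to
    keep the sketch dependency-free, but is essential on GPUs, where
    kernel launches are asynchronous.
    """
    for _ in range(warmup):              # warm caches and allocators
        fn(*args)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()      # monotonic, high resolution
        fn(*args)
        samples.append(time.perf_counter() - start)
    samples.sort()
    return statistics.median(samples), samples[min(iters - 1, int(iters * 0.99))]

# Toy workload standing in for a kernel launch or collective call.
def toy_workload(n):
    return sum(i * i for i in range(n))

median_s, p99_s = benchmark(toy_workload, 10_000)
```

Reporting a tail percentile alongside the median matters at cluster scale: one straggling rank stalls every collective it participates in.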

What you'd actually do

  1. Design, implement, test, and optimize distributed training infrastructure in Python and C++ for large-scale GPU clusters.
  2. Profile, benchmark, and debug performance bottlenecks across compute, memory, networking, and storage subsystems.
  3. Optimize collective communication libraries (e.g., NCCL) for emerging NVLink and InfiniBand topologies.
  4. Collaborate with hardware teams to optimize for next-generation accelerators (NVIDIA, AMD, and beyond).
  5. Gather data and insights to develop the pretraining compute roadmap.
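For the collective-optimization work in duty 3, the conventional way to compare all-reduce performance across cluster sizes is the bus-bandwidth metric reported by nccl-tests: a ring all-reduce moves each byte 2·(N−1)/N times over every link, so busbw rescales the application-visible bandwidth by that factor. A small sketch (the function name and the example payload/timing numbers are illustrative):

```python
def allreduce_bus_bandwidth(bytes_per_rank, seconds, n_ranks):
    """Ring all-reduce bandwidth estimate, following nccl-tests.

    algbw is the application-visible bandwidth (payload / time);
    busbw rescales it by 2*(N-1)/N, the number of times a ring
    all-reduce moves each byte over a link, which makes results
    comparable across different rank counts.
    """
    algbw = bytes_per_rank / seconds               # bytes per second
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw

# Illustrative numbers: 1 GiB per rank reduced across 8 GPUs in 20 ms.
algbw, busbw = allreduce_bus_bandwidth(1 << 30, 0.020, 8)
```

Because busbw approaches a constant as N grows while algbw does not, it is the number to watch when judging whether a topology change (e.g. a new NVLink or InfiniBand fabric) is actually saturating its links.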

Skills

Required

  • Bachelor's Degree in Computer Science or related technical field AND 6+ years of technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python

Nice to have

  • Master's Degree in Computer Science or related technical field AND 8+ years of technical engineering experience, OR Bachelor's Degree in Computer Science or related technical field AND 12+ years of technical engineering experience, with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • Experience in distributed computing and large-scale systems.
  • Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch.
  • Proven ability to profile, benchmark, and optimize performance-critical systems.
  • Experience in leading technical projects and supporting architectural decisions with data.
  • Experience building infrastructure for large-scale machine learning or generative AI workloads.
  • Experience in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms.
  • Track record of contributing to high-performance computing or large-scale AI infrastructure projects.

What the JD emphasized

  • Pre-Training Infrastructure
  • frontier-scale models
  • training at unprecedented scale
  • massive GPU clusters
  • high-throughput storage systems
  • cutting-edge AI research
  • distributed training parallelism
  • thousands of GPUs
  • profiling, benchmarking, debugging, and fine-grained optimization
  • engineering rigor and creativity
  • core engineering group
  • architectural changes
  • roadmap for relevant software and hardware components
  • large-scale machine learning or generative AI workloads
  • networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
  • high-performance computing or large-scale AI infrastructure projects

Other signals

  • building and optimizing the software stack for massive GPU clusters
  • scale up the latest research recipes
  • implement new forms of distributed training parallelism
  • reliability and performance of thousands of GPUs
  • contributing member of the core engineering group
  • drive architectural changes
  • influence the roadmap for relevant software and hardware components