Product Manager II

Microsoft · Big Tech · United States · Product Management

Product Manager II on the Azure HPC/AI team, focusing on networking for large-scale AI training supercomputers. The role involves working with AI compute product managers, AI model developers, and datacenter experts to ensure operational uptime and workload throughput for AI supercomputers. The goal is to support the training of advanced AI models for consumer and enterprise services.

What you'd actually do

  1. Drive, track, and publish success criteria for backend networking of ultra-large-scale AI supercomputers. Your primary objective, shared with colleagues and partner teams, is to maximize operational uptime and AI workload throughput of some of the largest supercomputers on the planet.
  2. Identify leading and/or unique points of failure affecting your primary goal and associated KPIs, and drive remediations and roadmap changes to address those issues.
  3. Work across and build trust among a V-team of supercomputing product groups, datacenter site operators, quality control specialists, vendors, business leaders, and customers to achieve your objectives.

Skills

Required

  • 2+ years of experience in product/service/project/program management or software development
  • Ability to meet Microsoft, customer and/or government security screening requirements

Nice to have

  • 5 years of experience operating production supercomputers
  • 5 years of experience improving product metrics for a product, feature, or experience in a market
  • Familiarity with RoCE v2, InfiniBand, UCX, MPI, NCCL, RCCL, and distributed memory compute workloads
  • Ability to work overlapping hours with East Coast teams (EST)