Principal AI Network Architect

Microsoft · Big Tech · Redmond, WA +4 · Software Engineering

This role focuses on the network architecture for AI accelerator platforms, specifically the high-bandwidth, low-latency networks critical to AI GPU clusters. The Principal AI Network Architect will evaluate, design, and optimize the network stack from hardware to software kernels, influencing Azure product roadmaps and working with state-of-the-art networking labs. The role requires deep expertise in networking technologies and familiarity with AI model execution pipelines.

What you'd actually do

  1. Spearhead architecture definition and evaluation of AI accelerator platforms, with a focus on high-bandwidth, low-latency networks.
  2. Drive end-to-end optimization of the stack, from hardware to software kernels.
  3. Partner with silicon and platform design teams to co-design infrastructure that meets performance, reliability and deployment goals.
  4. Frame decisions in terms of TCO, performance, flexibility, and scalability.
  5. Work with a state-of-the-art networking lab to prototype new network architectures.

Skills

Required

  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Ability to meet Microsoft, customer and/or government security screening requirements

Nice to have

  • Master’s or Doctoral degree in Electrical Engineering, Computer Engineering, or related fields and 10+ years of technical experience in the domain.
  • Deep expertise with Ethernet networking, RDMA (RoCE, InfiniBand), congestion control, and layer 2/3 switching.
  • Experience architecting scale-out/backend networks for AI GPU clusters
  • Familiarity with scale-up networks such as NVLink and UALink.
  • Experience with high-radix Ethernet switches
  • Familiarity with AI model execution pipelines, including the ability to analyze communication flows and their impact on model performance.
  • Prior contributions to standards committees and experience with hyperscale network deployments would be an added benefit
  • Skilled in partnering and influencing architects, hardware engineers, and software leads
  • Ability to manage through ambiguity, bringing clarity and results orientation to engage and energize collaborators and stakeholders
  • Collaboration skills, teamwork, and sense of presumed responsibility
  • Verbal and written communication skills, and ability to articulate and engage with both technical and non-technical stakeholders at all levels.
  • Experience leading and driving complex projects with respect and integrity, including those with multiple workstreams spanning different business and technical disciplines.
  • Intellectual curiosity and passion for learning and deploying new technologies.
  • Problem-solving skills, analytical capabilities, and attention to detail

What the JD emphasized

  • high bandwidth, low latency networks
  • AI GPU clusters
  • AI model execution pipelines

Other signals

  • AI accelerator platforms
  • Azure product roadmaps