Principal Product Manager/Architect - Foundry Inference Platform (CoreAI)

Microsoft · Big Tech · Redmond, WA +1 · Product Management

The Principal Product Manager/Architect will define and guide the technical architecture of Microsoft Foundry, an AI inferencing platform focused on reliability, scalability, and efficiency for large-scale GPU fleets. The role involves setting product direction for reliability, GPU fleet efficiency, capacity management, and engaging with strategic customers. Success metrics include platform reliability, GPU utilization, and customer outcomes.

What you'd actually do

  1. Own the product direction for Microsoft Foundry inference, with a primary mandate to make the platform the most reliable enterprise inferencing service available.
  2. Set the product direction for GPU fleet efficiency and capacity management, guiding platform-level design decisions that maximize utilization, minimize fragmentation, and accelerate time to monetization of new hardware and models.
  3. Act as a senior technical advisor and architect for Foundry’s most innovative and strategic customers, particularly those pushing the boundaries of scale, reliability, or model complexity.
  4. Serve as a unifying architectural voice across product management, engineering, infrastructure, and partner teams.

Skills

Required

  • Bachelor's Degree AND 10+ years of experience in product/service/program management or software development OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements

Nice to have

  • Proven technical leadership with deep experience designing and operating planet-scale distributed systems, preferably in cloud, AI, or high-performance compute platforms.
  • Proven track record owning end-to-end architecture for mission-critical services with strong availability, resilience, and operational guarantees.
  • Deep understanding of GPU-backed inference systems, capacity management, scheduling

What the JD emphasized

  • end-to-end accountability for the product direction
  • deeply engaged in near-term execution
  • primary mandate to make the platform the most reliable enterprise inferencing service available
  • architectural standards for global serving, multi-region resiliency, automated failover, and platform-managed disaster recovery
  • evolve the system from customer-managed resilience to platform-managed global reliability
  • architectural alignment across global routing, capacity pooling, observability, and control plane abstractions
  • reliability targets, SLAs, SLOs and recovery objectives are designed into the platform by default
  • maximize utilization, minimize fragmentation, and accelerate time to monetization of new hardware and models
  • architecture for global capacity pooling, intelligent scheduling, fungibility across workloads, automated demand forecasting, and software-defined allocation
  • deep optimization learnings into durable platform primitives, enabling sustained efficiency gains rather than one-off wins
  • influence architectural investments across inference utilization, model serving, and hardware/system performance
  • senior technical advisor and architect for Foundry’s most innovative and strategic customers
  • pushing the boundaries of scale, reliability, or model complexity
  • deep technical challenges, including large-scale model migrations, reliability-sensitive production deployments, and advanced serving architectures
  • articulating Foundry’s architectural advantages, turning bespoke requests into scalable features
  • customer feedback meaningfully influences platform roadmap and architectural priorities
  • operate at CTO/Chief Architect level with customers
  • unifying architectural voice across product management, engineering, infrastructure, and partner teams
  • Drive alignment on long-term technical direction, resolve architectural tradeoffs
  • connect technical design choices to business outcomes, including cost efficiency, customer trust, and platform differentiation
  • Proven technical leadership with deep experience designing and operating planet-scale distributed systems, preferably in cloud, AI, or high-performance compute platforms.
  • Proven track record owning end-to-end architecture for mission-critical services with strong availability, resilience, and operational guarantees.

Other signals

  • AI inferencing platform
  • large-scale GPU fleet management
  • reliability, efficiency, and customer trust at global scale
  • planet-scale distributed systems