Senior Product Manager - Tech, Infrastructure Reliability

Amazon · Big Tech · Austin, TX · Project/Program/Product Management--Technical

Senior Product Manager for an AI-powered infrastructure reliability platform within Amazon's Fulfillment Technologies & Robotics team. The role owns the roadmap for a platform that uses LLMs and multi-agent systems to prevent, detect, and resolve incidents across thousands of fulfillment sites. It requires hands-on technical contributions, including coding proof-of-concepts, and a deep understanding of AI reasoning techniques, agent architecture, and ML fundamentals to ensure operational reliability and accelerate the move from idea to production.

What you'd actually do

  1. Own and drive the multi-year product roadmap for the Infrastructure Reliability AI-Ops platform, spanning three strategic programs: zero-touch incident resolution, associate-directed work tooling, and predictive failure prevention.
  2. Go beyond traditional product management by writing code and delivering working proof-of-concepts that validate technical hypotheses before committing engineering resources.
  3. Apply deep knowledge of machine learning fundamentals to shape how the platform detects, consolidates, and reasons about failures.
  4. Apply your interpretation of AI reasoning techniques (including chain-of-thought prompting, retrieval-augmented generation, confidence calibration, and evidence accumulation) to define how the platform builds progressive confidence about incident severity and failure origin rather than making binary selections from rigid thresholds.
  5. Define the multi-agent architecture that orchestrates detection, investigation, consolidation, diagnosis, and remediation as a coordinated system rather than isolated capabilities.
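
To make the "progressive confidence" idea in item 4 concrete, here is a minimal sketch of evidence accumulation, the kind of proof-of-concept item 2 describes. It is an illustration only, not the platform's actual method: it assumes each signal can be summarized as a likelihood ratio and that signals are independent (a naive-Bayes simplification). The function name `accumulate`, the prior, and the example ratios are all hypothetical.

```python
import math


def logit(p: float) -> float:
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def accumulate(prior: float, likelihood_ratios: list[float]) -> list[float]:
    """Return the running confidence after each piece of evidence.

    Assumption: signals are independent, so each likelihood ratio
    (P(signal | severe) / P(signal | not severe)) simply adds its
    logarithm to the running log-odds.
    """
    log_odds = logit(prior)
    trail = []
    for lr in likelihood_ratios:
        log_odds += math.log(lr)  # ratio > 1 raises confidence, < 1 lowers it
        trail.append(sigmoid(log_odds))
    return trail


if __name__ == "__main__":
    # Hypothetical signals for one incident: three supporting, one contradicting.
    ratios = [3.0, 4.0, 0.5, 6.0]
    for step, confidence in enumerate(accumulate(0.05, ratios), start=1):
        print(f"after signal {step}: confidence {confidence:.3f}")
```

Unlike a single fixed threshold on one metric, the confidence here moves up or down with each signal, so a downstream agent can decide when enough evidence has built up to escalate or remediate.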

Skills

Required

  • Product roadmap ownership
  • Technical product management
  • Machine learning fundamentals
  • AI reasoning techniques (chain-of-thought, RAG, confidence calibration, evidence accumulation)
  • Multi-agent system architecture
  • LLM application
  • Anomaly detection
  • Incident management
  • Cross-functional leadership
  • Data analysis and metrics tracking
  • Ability to write code and deliver proof-of-concepts

Nice to have

  • Experience in fulfillment technologies or robotics
  • Experience with infrastructure reliability platforms
  • Familiarity with observability tools

What the JD emphasized

  • write code
  • deliver proof-of-concepts
  • hands-on technical contributions
  • multi-agent systems
  • LLMs
  • AI-powered
  • self-governing
  • zero-touch
  • fully self-governing

Other signals

  • AI-powered infrastructure reliability platform
  • LLMs, multi-agent systems, and machine learning applied to operational platforms
  • Hands-on technical contributions to accelerate team's ability to move from idea to production
  • Define the multi-agent architecture that orchestrates detection, investigation, consolidation, diagnosis, and remediation
  • Prototype a multi-agent reasoning pipeline
  • Explore a new anomaly detection approach
  • Stress-testing an LLM prompt chain against real incident data
  • Apply your interpretation of AI reasoning techniques — including chain-of-thought prompting, retrieval-augmented generation, confidence calibration, and evidence accumulation
  • Define how LLMs are applied to diagnostic summarization, resolution suggestion, and automated stakeholder communication
  • Define agent roles, communication protocols, handoff conditions, and safety boundaries