Senior Product Manager - Tech, Infrastructure Reliability

Amazon · Big Tech · Austin, TX · Project/Program/Product Management--Technical

Product Manager for an AI-powered infrastructure reliability platform that uses LLMs and multi-agent systems to prevent, detect, and resolve incidents in Amazon's fulfillment network. The role involves defining product roadmaps, writing code for proof-of-concepts, and collaborating with data scientists and engineers on ML model applications, agent architecture, and evaluation frameworks.

What you'd actually do

  1. Own and drive the multi-year product roadmap for the Infrastructure Reliability AI-Ops platform, spanning three strategic programs: zero-touch incident resolution, associate-directed work tooling, and predictive failure prevention.
  2. Go beyond traditional product management by writing code and delivering working proof-of-concepts that validate technical hypotheses before committing engineering resources.
  3. Apply deep knowledge of machine learning fundamentals to shape how the platform detects, consolidates, and reasons about failures.
  4. Apply your understanding of AI reasoning techniques (including chain-of-thought prompting, retrieval-augmented generation, confidence calibration, and evidence accumulation) to define how the platform builds progressive confidence about incident severity and failure origin rather than making binary decisions based on rigid thresholds.
  5. Define the multi-agent architecture that orchestrates detection, investigation, consolidation, diagnosis, and remediation as a coordinated system rather than isolated capabilities.

Skills

Required

  • Product management
  • Technical leadership
  • Machine learning fundamentals
  • LLM applications
  • Multi-agent systems
  • Infrastructure reliability
  • Incident management
  • Data analysis
  • Prototyping
  • Roadmap development
  • Cross-functional collaboration

Nice to have

  • Coding experience
  • Proof-of-concept development
  • Chain-of-thought prompting
  • Retrieval-augmented generation
  • Confidence calibration
  • Evidence accumulation
  • Agent orchestration
  • Observability
  • Predictive failure prevention

What the JD emphasized

  • write code
  • delivering proof-of-concepts
  • hands-on technical contributions
  • writing code and delivering working proof-of-concepts
  • use your technical skills
  • deep knowledge of machine learning fundamentals
  • understanding not just what a model produces but why, and whether that reasoning can be trusted in a production environment where self-governing remediation choices carry real operational risk
  • AI reasoning techniques
  • multi-agent architecture
  • self-governing agents act with appropriate confidence and escalate appropriately when uncertainty is high
  • prioritized backlog, making clear tradeoffs between feature depth, platform scalability, and autonomous site readiness milestones
  • measure platform performance against key metrics including auto-detection rate, false positive rate, consolidation accuracy, and remediation success rate, iterating rapidly based on data
  • cross-functional alignment
  • executive-level planning and prioritization
  • executive-level reviews

Other signals

  • AI-powered infrastructure reliability platform
  • LLMs, multi-agent systems, and machine learning
  • zero-touch incident resolution
  • predictive failure prevention
  • self-governing remediation orchestration