What you'd actually do

Own and drive the multi-year product roadmap for the Infrastructure Reliability AI-Ops platform, spanning three strategic programs: zero-touch incident resolution, associate-directed work tooling, and predictive failure prevention.

Go beyond traditional product management by writing code and delivering working proof-of-concepts that validate technical hypotheses before committing engineering resources.

Bring deep knowledge of machine learning fundamentals and apply that knowledge to shape how the platform detects, consolidates, and reasons about failures.

Apply your understanding of AI reasoning techniques — including chain-of-thought prompting, retrieval-augmented generation, confidence calibration, and evidence accumulation — to define how the platform builds progressive confidence about incident severity and failure origin rather than making binary selections from rigid thresholds.

Define the multi-agent architecture that orchestrates detection, investigation, consolidation, diagnosis, and remediation as a coordinated system rather than isolated capabilities.

Skills

Required

Product management
Technical leadership
Machine learning fundamentals
LLM applications
Multi-agent systems
Infrastructure reliability
Incident management
Data analysis
Prototyping
Roadmap development
Cross-functional collaboration

Nice to have

Coding experience
Proof-of-concept development
Chain-of-thought prompting
Retrieval-augmented generation
Confidence calibration
Evidence accumulation
Agent orchestration
Observability
Predictive failure prevention

What the JD emphasized

write code

delivering proof-of-concepts

hands-on technical contributions

writing code and delivering working proof-of-concepts

use your technical skills

deep knowledge of machine learning fundamentals

understanding not just what a model produces but why, and whether that reasoning can be trusted in a production environment where self-governing remediation choices carry real operational risk

AI reasoning techniques

multi-agent architecture

self-governing agents act with appropriate confidence and escalate appropriately when uncertainty is high

prioritized backlog, making clear tradeoffs between feature depth, platform scalability, and autonomous site readiness milestones

measure platform performance against key metrics including auto-detection rate, false positive rate, consolidation accuracy, and remediation success rate, iterating rapidly based on data

cross-functional alignment

executive-level planning and prioritization

executive-level reviews

Join Amazon's Fulfillment Technologies & Robotics (FTR) team to spearhead the product vision for a platform that ensures Amazon's fulfillment network never stops — even as we move toward fully self-governing, zero-touch operations. You'll own the roadmap for an AI-powered infrastructure reliability platform that prevents, detects, and resolves incidents across thousands of fulfillment sites globally.

This is a rare opportunity for a technically deep product leader who can write code, deliver proof-of-concepts, and engage as a peer with data scientists and engineers. You will shape how LLMs, multi-agent systems, and machine learning are applied to one of the most operationally critical platforms Amazon has ever built — and your hands-on technical contributions will directly accelerate the team's ability to move from idea to production.

Key job responsibilities

Own and drive the multi-year product roadmap for the Infrastructure Reliability AI-Ops platform, spanning three strategic programs: zero-touch incident resolution, associate-directed work tooling, and predictive failure prevention. This means defining the vision, strategy, and success metrics for AI-powered progressive detection, incident consolidation, self-governing remediation orchestration, and cross-domain observability capabilities that serve thousands of fulfillment sites globally.
Go beyond traditional product management by writing code and delivering working proof-of-concepts that validate technical hypotheses before committing engineering resources. Whether prototyping a multi-agent reasoning pipeline, exploring a new anomaly detection approach, or stress-testing an LLM prompt chain against real incident data, you will use your technical skills to compress the distance between idea and validated direction.
Bring deep knowledge of machine learning fundamentals and apply that knowledge to shape how the platform detects, consolidates, and reasons about failures. You will engage meaningfully with data scientists on model architecture selections, feature engineering tradeoffs, and evaluation frameworks — understanding not just what a model produces but why, and whether that reasoning can be trusted in a production environment where self-governing remediation choices carry real operational risk.
Apply your understanding of AI reasoning techniques — including chain-of-thought prompting, retrieval-augmented generation, confidence calibration, and evidence accumulation — to define how the platform builds progressive confidence about incident severity and failure origin rather than making binary selections from rigid thresholds. You will shape how LLMs are applied to diagnostic summarization, resolution suggestion, and automated stakeholder communication.
Define the multi-agent architecture that orchestrates detection, investigation, consolidation, diagnosis, and remediation as a coordinated system rather than isolated capabilities. You will work with engineering to define agent roles, communication protocols, handoff conditions, and safety boundaries ensuring that self-governing agents act with appropriate confidence and escalate appropriately when uncertainty is high.
Translate complex operational and technical requirements into a prioritized backlog, making clear tradeoffs between feature depth, platform scalability, and autonomous site readiness milestones. You will serve as the voice of Incident Managers, domain engineers, and Operations Control Center stakeholders, deeply understanding their daily workflows and advocating for their needs during executive-level planning and prioritization.
Define and track the business case across all three programs — including mean time to resolve improvements, lost labor hour reduction, and first page resolution improvement — to secure continued investment. You will establish mechanisms to measure platform performance against key metrics including auto-detection rate, false positive rate, consolidation accuracy, and remediation success rate, iterating rapidly based on data.
Drive cross-functional alignment across Fulfillment Technologies, Robotics, Network Engineering, Application teams, and Operations to ensure the platform's cross-domain orchestration model is well understood and adopted. You will lead executive-level reviews of program progress, risks, and investment cases, communicating clearly about the path from near-term detection improvements to longer-term autonomous site readiness.

A day in the life

You spend most of your time at the intersection of product strategy and hands-on technical work. A typical day might start by pulling incident data into a notebook to test a new detection signal, then jumping into a whiteboard session with engineers debating multi-agent handoff reasoning. You might prototype a diagnostic flow in the afternoon just to prove a concept is worth building. And occasionally you will find yourself in the operations center watching real operators work through a network failure, because staying grounded in how people actually experience the platform is what separates good product decisions from great ones.

Amazon offers a full range of benefits that support you and eligible family members, including domestic partners and their children. Benefits can vary by location, the number of regularly scheduled hours you work, length of employment, and job status such as seasonal or temporary employment.

The benefits that generally apply to regular, full-time employees include:

Medical, Dental, and Vision Coverage
Maternity and Parental Leave Options
Paid Time Off (PTO)
401(k) Plan

If you are not sure that every qualification on the list above describes you exactly, we'd still love to hear from you!

At Amazon, we value people with unique backgrounds, experiences, and skillsets. If you’re passionate about this role and want to make an impact on a global scale, please apply!

About the team

The Infrastructure Reliability team sits within Amazon's Robotics organization, operating as the cross-domain orchestration layer for a fulfillment network that processes customer orders continuously across thousands of sites. Our mission is simple and purposeful: operations never stop, no matter what breaks. We do not own any single domain; instead, we build the platform that sees across all of them, identifying failures that cascade across team boundaries and coordinating the capabilities that domain teams have built to resolve those failures faster than any single team could alone. We are now building the AI-powered platform that applies machine learning, reasoning, and multi-agent orchestration to take our results from promising to industry-defining. We value expert rigor, customer obsession, and hands-on technical depth. The ideal teammate is as comfortable writing a proof-of-concept as they are writing a product strategy document. If you want to work on a problem that is technically fascinating, operationally critical, and commercially enormous, this is the team for you.

Basic Qualifications

Bachelor's degree
Experience owning/driving roadmap strategy and definition
Experience with feature delivery and tradeoffs of a product
Experience contributing to engineering discussions around technology decisions and strategy related to a product
Experience managing technical products or online services
Experience in representing and advocating for a variety of critical customers and stakeholders during executive-level prioritization and planning

Preferred Qualifications

Experience in using analytical tools, such as Tableau, QlikView, QuickSight
Experience in building and driving adoption of new tools

Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.

Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit https://amazon.jobs/content/en/how-we-hire/accommodations for more information. If the country/region you’re applying in isn’t listed, please contact your Recruiting Partner.

The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at https://amazon.jobs/en/benefits.

USA, MA, North Reading - 151,200.00 - 204,600.00 USD annually
USA, TN, Nashville - 143,700.00 - 194,300.00 USD annually
USA, TX, Austin - 151,200.00 - 204,600.00 USD annually
USA, VA, Arlington - 151,200.00 - 204,600.00 USD annually

Senior Product Manager - Tech, Infrastructure Reliability
