build, deploy, and manage applications with unparalleled flexibility and efficiency.
Join our dynamic team, where we apply agentic and machine-learning solutions to one of the hardest problems in the fleet: returning broken servers to production when there is no deterministic signal of what is wrong. You will build a learning, agent-driven decision engine on top of fleet telemetry and repair history, and you will ship it as a production software service that operates across millions of servers in every region, serving every EC2 line of business from core servers to accelerators and UltraServers. Sentinel is a direct lever on unsellable capacity and on the cost of running the fleet, and we are evolving it from human-authored decision rules into a system that recovers capacity on its own.
We are looking for an experienced Software Development Manager (SDM) to lead this team. The ideal candidate has led teams, thoroughly understands the design, development, and debugging of large-scale distributed systems, and is excited to apply ML and agentic techniques to real hardware-recovery problems. In this role, the manager will work with a broad group of technical teams across hardware, firmware, vetting, and provisioning.
Key job responsibilities
- Lead and inspire a team of engineers, providing guidance, mentorship, and support to foster their professional growth.
- Own the recovery decision engine that returns broken servers to sellable capacity, driving down the unsellable rate and the time a host stays stuck. Take on the failures that have no deterministic signal, and evolve the engine from static, human-authored signatures into an agentic, ML-driven system that infers the right repair from fleet outcomes and improves with every recovery.
- Build and operate this as a reliable, secure, and observable production software service that runs across millions of servers in every region, rather than as a set of offline models or scripts.
- Debug complex, system-level, multi-component failures across hardware, firmware, BMC, and the provisioning and vetting stack, and turn that diagnosis into automated, repeatable recovery.
- Collaborate with hardware engineering, firmware, component owners, vetting, and provisioning teams to expand recovery coverage across platforms and drive failures upstream to their root cause so they stop recurring.
- Raise the bar on the safety of autonomous action on production-bound capacity, holding a high security and operational standard for a service that runs across all regions, including restricted environments.
- Champion best practices in software engineering, including code quality, testing, automation, and continuous integration and delivery (CI/CD).
Basic Qualifications
- 3+ years of engineering team management experience
- 7+ years of experience working directly within engineering teams
- 3+ years of experience designing or architecting new and existing systems (design patterns, reliability, and scaling)
- 8+ years of experience leading the definition and development of multi-tier web services
- Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations
- Experience partnering with product or program management teams
Preferred Qualifications
- Experience in communicating with users, other technical teams, and senior leadership to collect requirements, describe software product features, technical designs, and product strategy
- Experience in recruiting, hiring, mentoring/coaching, and managing teams of Software Engineers to improve their skills and make them more effective product software engineers
Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.
Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit https://amazon.jobs/content/en/how-we-hire/accommodations for more information. If the country/region you’re applying in isn’t listed, please contact your Recruiting Partner.
The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at https://amazon.jobs/en/benefits.
USA, WA, Seattle - 184,900.00 - 250,200.00 USD annually