System Development Manager, AWS Resilience, AWS Incident Response

Amazon · Big Tech · Seattle, WA · Systems, Quality, & Security Engineering

System Development Manager for the AWS Incident Response (AIR) team in Seattle. Leads engineers in operational excellence and tooling delivery for incident response, detection, and observability. Owns incident response leadership, detection and observability improvements, cross-site coordination, post-incident analysis, and team development. Focuses on systems engineering and operational leadership to improve AWS resiliency.

What you'd actually do

  1. Own the operational readiness of your team for high-severity incident response. Ensure your engineers can lead calls effectively: triaging, coordinating resolvers, communicating clearly under pressure, and driving incidents to mitigation.
  2. Drive improvements in how AIR detects and responds to AWS health events. Use data and learnings from real incidents to improve detection speed, accuracy, and coverage. Explore new approaches, including generative AI, to find leaps in detection and response capabilities. Make the important obvious and automate the routine.
  3. Coordinate with peer managers in Sydney and Dublin and partner with alarming, detection, and incident tooling teams. Establish clear communication channels and feedback loops across the incident management ecosystem.
  4. Lead root-cause analysis and post-incident reviews. Ensure learnings drive corrective actions and that service teams follow through, closing the loop so each incident makes AWS stronger.
  5. Own all facets of performance and career management for your team. Grow engineers at all levels, manage operational load, and maintain a high bar for hiring.

Skills

Required

  • 1+ years of engineering team management experience
  • 5+ years of experience in systems engineering, systems development, or infrastructure operations
  • Strong understanding of distributed systems, networking fundamentals, and infrastructure failure modes
  • Excellent communication skills, particularly the ability to convey technical complexity clearly and quickly under pressure

Nice to have

  • Experience hiring, developing, and promoting engineering talent
  • Experience using data to drive root cause elimination and process improvement
  • Experience managing communication with geographically distributed teams
  • Experience with operational best practices: monitoring, alerting, and post-incident analysis

What the JD emphasized

  • operational excellence
  • tooling delivery
  • incident response
  • observability
  • detection
  • response speed
  • accuracy
  • resiliency
  • systems engineering
  • operational leadership
  • high-severity incident response
  • lead calls effectively
  • triaging
  • coordinating resolvers
  • communicating clearly under pressure
  • driving incidents to mitigation
  • tooling and automation
  • detection speed
  • coverage
  • generative AI
  • detection and response capabilities
  • make the important obvious
  • automate the routine
  • peer managers
  • alarming
  • incident tooling teams
  • communication channels
  • feedback loops
  • incident management ecosystem
  • root-cause analysis
  • post-incident reviews
  • learnings
  • corrective actions
  • service teams
  • closing the loop
  • performance
  • career management
  • grow engineers
  • operational load
  • high bar for hiring