Software Development Manager (EC2 Nitro), EC2 Core Provisioning

Amazon · Big Tech · Seattle, WA · Software Development

Software Development Manager to lead a team building a learning, agent-driven decision engine on top of fleet telemetry and repair history to return broken servers to production. The system will operate across millions of servers and evolve from human-authored rules to an ML-driven system that recovers capacity autonomously.

What you'd actually do

  1. Lead and inspire a team of engineers, providing guidance, mentorship, and support to foster their professional growth.
  2. Own the recovery decision engine that returns broken servers to sellable capacity, driving down the unsellable rate and the time a host stays stuck. Take on the failures that have no deterministic signal, and evolve the engine from static, human-authored signatures into an agentic, ML-driven system that infers the right repair from fleet outcomes and improves with every recovery.
  3. Build and operate this as a production software service — reliable, secure, and observable — running across millions of servers in every region, not a set of offline models or scripts.
  4. Debug complex, system-level, multi-component failures across hardware, firmware, BMC, and the provisioning and vetting stack, and turn that diagnosis into automated, repeatable recovery.
  5. Collaborate with hardware engineering, firmware, component owners, vetting, and provisioning teams to expand recovery coverage across platforms and drive failures upstream to their root cause so they stop recurring.

Skills

Required

  • engineering team management
  • designing or architecting new and existing systems
  • leading the definition and development of multi-tier web services
  • engineering practices and patterns for the full software/hardware/networks development life cycle
  • code quality
  • testing
  • automation
  • continuous integration and delivery (CI/CD)

Nice to have

  • communicating with users, other technical teams, and senior leadership
  • collecting requirements
  • describing software product features, technical designs, and product strategy
  • recruiting
  • hiring
  • mentoring/coaching
  • managing teams of Software Engineers to improve their skills

What the JD emphasized

  • agentic
  • ML-driven system
  • production software service
  • millions of servers
  • hardest problems in the fleet
  • no deterministic signal

Other signals

  • learning, agent-driven decision engine
  • evolve the engine from static, human-authored signatures into an agentic, ML-driven system
  • ship it as a production software service that operates across millions of servers