Software Development Engineer II - Amazon MSK, Managed Streaming Kafka (MSK), MSK Infrastructure Management

Amazon · Big Tech · Seattle, WA · Software Development

Software Development Engineer II role focused on building and operating automation that maintains a large-scale fleet of stateful Apache Kafka hosts. The role involves designing and implementing systems for patching, host remediation, and rollout/rollback, along with end-to-end service ownership. The emphasis is on reliability and automation, so that infrastructure maintenance stays invisible to customers. The team is increasingly using generative AI to strengthen these mechanisms.

What you'd actually do

  1. Design, build, and operate automation that patches and maintains hundreds of thousands of stateful hosts, keeping fleet maintenance invisible to customers.
  2. Build systems that automatically detect unhealthy hosts and remediate them, balancing fast recovery against avoiding needless disruption.
  3. Develop rollout and rollback mechanisms that keep the blast radius of any change small at fleet scale, and that let changes be tested before they reach customers and reversed if something goes wrong.
  4. Own your services end to end: take part in on-call, debug production issues, and continually reduce the manual effort needed to operate the fleet.
  5. Write design documents, collaborate with engineers across MSK, and raise the engineering bar through design and code reviews.

Skills

Required

  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship experience in the design or architecture (design patterns, reliability, and scaling) of new and existing systems
  • Experience programming in at least one programming language

Nice to have

  • 3+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Bachelor's degree in computer science or equivalent

What the JD emphasized

  • stateful hosts
  • automation
  • fleet scale
  • end to end