Software Development Engineer (Elastic Kubernetes Service), EKS Scalability & Performance

Amazon · Big Tech · Seattle, WA · Software Development

Software Development Engineer role focused on the scalability and performance of Amazon EKS control planes, centered on the Vertical Auto-Scaling Service (VAS) and its successor, VAS 2.0. The role involves designing, building, and operating the systems that keep control planes reliable and performant across workloads, including large-scale AI/ML and generative AI. Responsibilities include operating the autoscaling services, maintaining the SLA measurement pipeline, and contributing to control plane architecture for very large clusters. The role also involves engaging with the upstream Kubernetes community and may extend to other areas such as workload identity or capacity management.

What you'd actually do

  1. You will build and operate the Vertical Auto-Scaling Service (VAS) and its next-generation successor (VAS 2.0), which dynamically right-sizes EKS control planes by evaluating CPU/memory utilization, etcd throttle rates, node-count thresholds, and network utilization simultaneously.
  2. You will work on the SLA measurement pipeline (MinutelySLA → DailySLA → MonthlySLA) that enforces EKS's uptime commitments, investigating breaching clusters weekly and building automation to detect and mitigate degradation before customers notice.
  3. You will contribute to the control plane architecture for EKS Ultraclusters, defining how the API server, etcd, and associated components scale to support 100,000-node clusters running generative AI workloads.
  4. You will maintain and extend version release qualification scale tests that gate every new Kubernetes version before it reaches customers.
  5. You will engage with the upstream Kubernetes community, driving KEPs that work backwards from EKS customer requirements around performance, scale, and resiliency.
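
For a concrete sense of item 2, the MinutelySLA → DailySLA → MonthlySLA rollup can be pictured roughly as below. This is only an illustrative sketch, not the team's actual pipeline: the function names `roll_up` and `breaching`, the boolean per-minute health sample, and the default `target` value are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def roll_up(minutely):
    """Aggregate per-minute health samples into daily and monthly availability.

    `minutely` is an iterable of (datetime, bool) pairs, where the bool says
    whether that minute passed a (hypothetical) control plane health check.
    """
    per_day = defaultdict(lambda: [0, 0])  # date -> [good minutes, total minutes]
    for ts, ok in minutely:
        bucket = per_day[ts.date()]
        bucket[0] += int(ok)
        bucket[1] += 1

    daily = {day: good / total for day, (good, total) in per_day.items()}

    # Monthly figures are built from minute counts, not by averaging the
    # daily ratios, so days with fewer samples are not over-weighted.
    per_month = defaultdict(lambda: [0, 0])
    for day, (good, total) in per_day.items():
        key = (day.year, day.month)
        per_month[key][0] += good
        per_month[key][1] += total

    monthly = {month: good / total for month, (good, total) in per_month.items()}
    return daily, monthly


def breaching(monthly, target=0.9995):
    """Return the months whose availability fell below `target` (assumed value)."""
    return sorted(month for month, availability in monthly.items()
                  if availability < target)


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    # Two days of samples with three unhealthy minutes on the second day.
    samples = [(start + timedelta(minutes=i), i not in (1500, 1501, 1502))
               for i in range(2880)]
    daily, monthly = roll_up(samples)
    print(daily)
    print(monthly)
    print(breaching(monthly))
```

In a real system the per-minute signal would come from monitoring data rather than an in-memory list, and the weekly investigation of breaching clusters would start from output like that of `breaching`.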

Skills

Required

  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship experience in the design or architecture (design patterns, reliability, and scaling) of new and existing systems
  • 1+ years of software development engineer or related occupational experience
  • 1+ years of experience designing and developing large-scale, multi-tiered, multi-threaded, embedded, or distributed software applications, tools, systems, and services using C#, C++, Java, or Perl
  • 1+ years of Object Oriented Design experience
  • Bachelor's degree or foreign equivalent in Computer Science, Engineering, Mathematics, or a related field
  • Experience programming with at least one software programming language

Nice to have

  • 3+ years of full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations experience
  • Bachelor's degree in computer science or equivalent

What the JD emphasized

  • hardest distributed systems problems
  • full stack
  • own systems end-to-end
  • measured by customer outcomes
  • scale to support 100,000-node clusters