Principal Software Development Engineer (Kubernetes, AWS)

Expedia · Hospitality · CA

Expedia Group is seeking a Principal Software Development Engineer to lead the architecture, design, and build-out of a Kubernetes-based compute runtime platform. The role involves evolving the Kubernetes-based environment, architecting for scale and reliability, improving the developer control plane with CI/CD and GitOps, automating the infrastructure lifecycle with IaC, and providing technical leadership and mentorship. The engineer will also debug production issues and collaborate with other teams to deliver platform capabilities.

What you'd actually do

  1. Design and Implement Core Platform Components: Evolve our Kubernetes-based environment, focusing on areas like multi-tenancy, network policy, resource management, and service mesh integration (e.g., Istio, Linkerd).
  2. Architect for Scale and Reliability: Lead the technical design for scaling our control plane and data plane to handle a 10x increase in services and traffic. Define and implement SLOs for the platform itself.
  3. Improve the Developer Control Plane: Design and build the next generation of our CI/CD pipelines and GitOps workflows. Drive the strategy for our internal developer portal (e.g., Backstage) to unify tooling, documentation, and service lifecycle management.
  4. Automate Infrastructure Lifecycle: Author and maintain production-grade Infrastructure as Code (IaC) using Terraform and/or Crossplane. Eliminate manual toil by automating cluster provisioning, node lifecycle, and dependency upgrades.
  5. Technical Leadership and Mentorship: Act as a force multiplier. Mentor senior engineers on the team, lead architecture review sessions, and author RFCs to build consensus on significant technical decisions. Your influence will extend beyond the team to application developers and SREs.

Skills

Required

  • infrastructure automation
  • configuration management
  • container orchestration
  • Java
  • Go
  • Python
  • Ruby
  • cloud computing
  • Amazon Web Services (AWS)
  • Docker
  • Kubernetes/EKS

Nice to have

  • Stateless and stateful workloads
  • Service Mesh
  • Service Discovery
  • Monitoring
  • Alerting
  • Logging
  • security development principles
  • token management
  • encryption
  • certificates
  • Continuous Integration tools
  • Jenkins
  • self-service technology platform capabilities
  • container compute
  • traffic management
  • API management
  • mentoring other engineers
  • establishing standards for operational excellence
  • code quality

What the JD emphasized

  • Evolve our Kubernetes-based environment
  • scaling our control plane and data plane to handle a 10x increase in services and traffic
  • next generation of our CI/CD pipelines and GitOps workflows
  • Automate Infrastructure Lifecycle
  • production incidents that involve the underlying platform, from kernel-level issues to CNI bugs to distributed system failures