Principal Software Development Engineer - Cloud Platform

Expedia · Hospitality · CA

Expedia is seeking a Principal Software Development Engineer to architect its Cloud Platform. This role focuses on evolving the platform to handle increasing code volume and service complexity, ensuring reliability, optimizing cloud economics, and improving developer experience. Key responsibilities include leading architectural evolution towards a Cell-Based Architecture, modernizing Kubernetes and infrastructure, hardening reliability and observability, optimizing cloud economics through FinOps, and supporting developer workflows with agent-friendly infrastructure and standardized Dev Containers. The role requires deep expertise in cloud-native distributed systems, Kubernetes, observability, and Infrastructure as Code, with a preference for experience integrating AI/ML solutions into platform services.

What you'd actually do

  1. Lead Architectural Evolution: You’ll own the move toward a Cell-Based Architecture. We need to move away from fragile, monolithic clusters and toward isolated, predictable failure domains that allow us to scale horizontally with confidence.
  2. Modernize Kubernetes & Infrastructure: You’ll define our K8s strategy, focusing on multi-cluster management, service mesh, and automated scaling. You need to ensure our "Golden Path" makes it easy for engineers to do the right thing by default.
  3. Harden Reliability & Observability: You’ll set the standards for SRE across the org. This means moving beyond basic dashboards to causal observability, automated incident response, and rigorous SLO/SLI management. You’ll help us engineer out the root causes of systemic instability.
  4. Optimize Cloud Economics: You’ll lead our FinOps technical strategy. You need to build the tooling and visibility that allow us to understand cost-per-service and ensure our infrastructure spend is directly tied to business value.
  5. Support the Developer Workflow: While we are embracing AI tools, your job is to build the underlying "agent-friendly" infrastructure. This includes standardized Dev Containers and ephemeral environments that allow for fast, isolated iteration without clobbering shared state.

Skills

Required

  • Extensive professional software development experience designing, building, and operating large-scale, cloud-native distributed systems and platform services on Kubernetes.
  • Proven ownership of critical services or multi-service platforms, including responsibility for system design (LLD), API design, data modeling, deployment, and ongoing operational health.
  • Deep expertise with at least one major public cloud provider and core platform technologies (compute, networking, storage, service discovery, security, observability, and CI/CD).
  • Demonstrated ability to make high-impact architectural decisions, navigate complex trade-offs, and guide multiple teams toward coherent, long-term technical direction.
  • Familiarity with AI-driven systems, tools, or workflows, and experience applying AI/ML concepts to real-world products within cloud or platform environments.
  • Deep knowledge of observability patterns (OpenTelemetry, Prometheus, distributed tracing).
  • Expert-level understanding of Infrastructure as Code (Terraform, Pulumi) and CI/CD at scale.
  • Proficiency in Go, Rust, or similar languages used in modern platform engineering.

Nice to have

  • Track record of defining and evolving multi-year technical strategies for cloud and developer platform ecosystems, and successfully driving adoption of shared platforms across many teams.
  • Experience designing and operating highly available, globally distributed systems at internet scale, including capacity planning, performance optimization, and robust failure handling.
  • Experience safely integrating and operating AI/ML‑enabled solutions that improve outcomes, such as intelligent routing, predictive scaling, or automated remediation embedded in platform services, with appropriate safeguards.
  • Advanced experience applying AI/ML techniques to cloud and platform problems (for example, cost optimization, anomaly detection, or performance tuning) and partnering with data/ML teams to productionize these capabilities.
  • A Systems Architect: You understand the deep plumbing of the cloud (AWS/GCP, K8s, networking). You think in terms of failure domains, latencies, and unit economics.
  • Reliability-First: You’ve carried a pager for global-scale systems. You have a healthy "paranoia" about state, consistency, and cascading failures.
  • Hands-on: You still love to build. You can prototype a complex infrastructure change in a weekend to prov

What the JD emphasized

  • primary architect of our technical future
  • explosion in code volume and service complexity
  • handle this growth without sacrificing reliability or skyrocketing our cloud bill
  • agentic developer tools
  • efficient Kubernetes footprint
  • observability stack provides signals, not just noise
  • Cell-Based Architecture
  • multi-cluster management
  • automated scaling
  • causal observability
  • automated incident response
  • rigorous SLO/SLI management
  • FinOps technical strategy
  • understanding cost-per-service
  • infrastructure spend is directly tied to business value
  • agent-friendly infrastructure
  • standardized Dev Containers
  • ephemeral environments
  • applying AI/ML concepts to real-world products within cloud or platform environments
  • Safely integrates and operates AI/ML‑enabled solutions that improve outcomes, such as intelligent routing, predictive scaling, or automated remediation embedded in platform services, with appropriate safeguards.
  • Advanced experience applying AI/ML techniques to cloud and platform problems (for example, cost optimization, anomaly detection, or performance tuning) and partnering with data/ML teams to productionize these capabilities.