Senior Machine Learning Engineer

Expedia · Hospitality · Bangalore, India

Expedia Group is hiring a Senior Machine Learning Engineer to build and operate ML systems for the TravelAds platform, with a focus on high-throughput, low-latency inference, ML lifecycle automation, and agentic AI workflows. The role involves designing and owning ML systems, building ML infrastructure, accelerating the ML lifecycle, developing LLM/RAG-powered workflows, and implementing ML observability and guardrails.

What you'd actually do

  1. Design and own high-throughput, low-latency ML systems (2000+ RPS) for TravelAds, including multi-service training and serving architectures, auction and ranking models, and real-time inference services that meet strict sub-100ms SLAs.
  2. Build and evolve ML infrastructure and data foundations – feature stores, online/offline feature pipelines, embedding and vector services, and data lineage and versioning – that power ad relevance, bidding optimization, experimentation, and model evaluation at scale.
  3. Accelerate the end-to-end ML lifecycle by automating training, validation, deployment, shadow testing, A/B testing, and retraining using orchestrated workflows (e.g., Flyte, Airflow) and robust quality gates.
  4. Develop agentic AI and LLM/RAG-powered workflows that automate ML operations (training, deployment, validation, monitoring, calibration) and enable AI-assisted dataset creation, operational analysis, and decision support.
  5. Define and implement ML observability, reliability, and cost guardrails through drift and feature-freshness monitoring, health dashboards, SLO/SLI definitions, incident response, and resilience-focused improvements.

Skills

Required

  • Python
  • Java/Kotlin/Scala
  • distributed systems
  • data structures
  • performance optimization
  • system design (HLD/LLD)
  • serving stacks
  • monitoring and observability
  • rollbacks
  • operational rigor
  • technical design for multi-quarter ML projects
  • partnering with Product and business stakeholders

Nice to have

  • real-time ML inference at high throughput
  • Spark
  • Hive
  • Databricks
  • Airflow
  • Flyte
  • AWS SageMaker
  • EKS
  • EMR
  • Docker
  • CI/CD for ML
  • automated training pipelines
  • deployment orchestration
  • data lineage and versioning
  • drift detection
  • feature-freshness monitoring
  • model health dashboards
  • offline/online parity validation
  • incident response
  • root cause analysis
  • LLM productionization
  • RAG architectures
  • agentic AI workflows

What the JD emphasized

  • high-throughput, low-latency ML systems
  • sub-100ms SLAs
  • end-to-end ML lifecycle
  • agentic AI and LLM/RAG-powered workflows
  • ML observability, reliability, and cost guardrails
  • proven track record of designing, building, and operating production ML or large-scale distributed systems
  • real-time ML inference at high throughput (1000+ RPS) and strict latency SLAs
  • LLM productionization, RAG architectures, or agentic AI workflows

Other signals

  • ML systems behind TravelAds
  • automating the end-to-end lifecycle
  • agentic AI workflows
  • LLM-powered solutions for ad relevance
  • live inference at scale