Principal Software Development Engineer - Gen AI

Expedia · Hospitality · Bangalore, India

Expedia is hiring a Principal GenAI Engineer to build production-grade AI systems that accelerate business workflows. The role spans designing and implementing cloud-native GenAI architectures; creating shared capabilities such as model routing and retrieval services; optimizing production performance and resilience; implementing guardrails and trust mechanisms; building RAG pipelines; designing agentic workflows and tool registries for agents; collaborating with ML engineers; defining evaluation strategies; and ensuring governance, security, and responsible AI practices. It requires a strong background in distributed systems and platform architecture, with at least 3 years of experience in AI/ML or building production LLM systems.

What you'd actually do

  1. Platform Architecture & Scalability: Design and implement cloud-native, cost-efficient GenAI architectures (services, APIs, data paths, and infrastructure) that are production-ready, observable, and resilient.
  2. GenAI Platform Enablement: Create shared capabilities such as model routing/gateways, prompt/config management, retrieval services, evaluation harnesses, and libraries so multiple product teams can ship consistently.
  3. Production Performance & Resilience: Tackle deployment realities including latency/throughput optimization, caching strategies, rate limiting, multi-tenant isolation, failure handling, and “safe fallbacks” when models or dependencies degrade.
  4. Reliability, Guardrails & Trust: Implement techniques to reduce hallucinations and variability via grounding, structured outputs, tool use, and robust guardrails. Ensure systems are testable, measurable, and maintainable over time.
  5. Retrieval & Knowledge Systems (RAG): Build ingestion and retrieval pipelines (chunking, embeddings, metadata, hybrid retrieval, reranking) so LLMs can answer with evidence/citations and predictable quality.

Skills

Required

  • cloud-native, cost-efficient GenAI architecture design and implementation
  • model routing/gateways
  • prompt/config management
  • retrieval services
  • evaluation harnesses
  • latency/throughput optimization
  • caching strategies
  • rate limiting
  • multi-tenant isolation
  • failure handling
  • grounding
  • structured outputs
  • tool use
  • guardrails
  • ingestion and retrieval pipelines
  • chunking
  • embeddings
  • metadata
  • hybrid retrieval
  • reranking
  • multi-step and multi-agent workflows
  • state management
  • error recovery
  • secure tool registry
  • integrations
  • human approval flows
  • ML models (forecasting, anomaly detection, classification, ranking)
  • offline and online evaluation strategies
  • golden datasets
  • regression testing
  • LLM-as-judge
  • safety/robustness testing
  • agent identity
  • least-privilege access
  • secret handling
  • data classification
  • audit trails
  • responsible AI
  • success metrics
  • end-to-end system instrumentation (quality, latency, cost, adoption)
  • proof-of-concepts
  • hardened, supported products
  • distributed systems
  • platform architecture
  • Kubernetes
  • AWS
  • microservices
  • Python

Nice to have

  • Java
  • Kotlin
  • TensorFlow
  • n8n
  • Temporal
  • AWS Step Functions

What the JD emphasized

  • production-grade AI systems
  • scalable, secure, and observable systems
  • production-ready
  • production performance
  • production LLM systems

Other signals

  • using LLMs, retrieval (RAG), agentic workflows, and ML
  • set technical direction
  • define golden paths
  • raise engineering standards for how GenAI is built and operated