Senior Staff Machine Learning Engineer, Data & Eval

Airbnb · Consumer · United States · Software Engineering

Senior Staff ML Engineer focused on setting technical direction and leading execution for ML evaluation and the end-to-end data flywheel powering CSxAI products. This role defines how quality is measured, how feedback is turned into learning signals, and how models and products are improved continuously, safely, and efficiently.

What you'd actually do

  1. Define evaluation strategy and success metrics for GenAI systems, aligning offline evaluation with online business and customer experience outcomes.
  2. Build and scale evaluation frameworks (golden sets, synthetic data, automated regressions, rubric-based grading, LLM-as-judge where appropriate) with strong controls for bias, drift, and reliability.
  3. Design the data flywheel: instrumentation, feedback collection, data quality checks, labeling strategy, dataset versioning, and governance to support continuous improvement.
  4. Lead cross-functional quality initiatives across product, ops, and engineering, driving clarity on what “good” looks like and how teams act on evaluation results.
  5. Develop and productionize pipelines for dataset creation, model monitoring, evaluation-at-scale, and continuous testing (pre-deploy and post-deploy).

Skills

Required

  • PhD in Computer Science, Mathematics, Statistics, or related technical field (or equivalent practical experience)
  • 10+ years building, testing, and shipping ML/AI systems end-to-end
  • 2+ years of experience with GenAI/LLM systems in production
  • 5+ years leading large, ambiguous technical initiatives as a senior IC
  • Deep expertise in evaluation methodology (offline/online alignment, metric design, human-in-the-loop evaluation, A/B testing, power analysis, regression testing)
  • Hands-on experience with GenAI systems, including orchestration, retrieval, tool calling, and memory
  • Experience building data pipelines and quality systems (labeling workflows, dataset curation, versioning, monitoring, and governance)
  • Solid ML fundamentals and best practices (model selection, training/serving, monitoring, reliability, and model lifecycle management)

Nice to have

  • Customer Support Systems: Experience applying ML/AI to customer support workflows (e.g., agent assist, classification/routing, resolution recommendation, QA)
  • Infrastructure & Quality at Scale: Experience building robust evaluation platforms for agent behavior validation, safety/guardrails, and continuous improvement
  • Agile Practice for Applied AI: Proven ability to take evaluation and data flywheel work from incubation to production, iterating quickly while maintaining scientific rigor

What the JD emphasized

  • end-to-end
  • evaluation
  • data flywheel
  • quality
  • GenAI
  • LLM
  • evaluation methodology
  • GenAI systems
  • data pipelines
  • model lifecycle management
  • evaluation platforms
  • applied AI

Other signals

  • GenAI
  • LLM
  • evaluation
  • data flywheel
  • quality