Principal Data Scientist - Agent Builder

Elastic · Enterprise · Netherlands · Enterprise Search - Workchat

The Principal Data Scientist will lead the technical direction for evaluating, improving, and scaling chat quality within Elastic's conversational and agentic platform. The role involves defining evaluation strategies, designing quality metrics for RAG, agents, and tools, and partnering with engineering to productionize evaluation pipelines and guardrails. The focus is applied leadership: prototyping, evaluating, and influencing the roadmap for AI-driven product improvements.

What you'd actually do

  1. Define the evaluation strategy for conversational and agentic search, including offline and online evaluation, golden datasets, rubrics, LLM-as-judge calibration, groundedness and citation checks, and A/B testing.
  2. Lead the design of quality metrics and decision frameworks for RAG, agents, tools, model selection, agent routing, prompt behavior, and cost/latency trade-offs.
  3. Build, compare, and guide improvements across retrieval and re-ranking approaches, including sparse and dense retrieval, vector search, query understanding, semantic rewrites, and context enrichment.
  4. Turn experimental results into product and business decisions: which models to use, how to route requests efficiently, which tools should be exposed, and how agents should be customized for different Elastic use cases.
  5. Partner with engineering to productionize evaluation pipelines, telemetry, dashboards, CI guardrails, and regression detection for chat quality, helpfulness, deflection, latency, and cost.

Skills

Required

  • Python
  • PyTorch/Transformers
  • Pandas
  • notebooks
  • reproducible experiments
  • versioned datasets
  • clean, reviewable code
  • IR
  • NLP
  • ranking
  • semantic search
  • RAG
  • LLM-powered product experiences
  • evaluation for production AI/ML systems
  • offline metrics
  • online experimentation
  • LLM-as-judge approaches
  • groundedness
  • citation quality
  • model comparison
  • retrieval systems
  • dense retrieval
  • sparse retrieval
  • re-ranking
  • vector search
  • query understanding
  • evaluation metrics (nDCG, MRR, Recall@k, precision)
  • latency/cost trade-offs
  • telemetry design
  • dashboards
  • CI guardrails
  • quality regression tracking
  • collaboration with engineering teams
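
For readers unfamiliar with the ranking metrics named in the list above (nDCG, MRR, Recall@k, precision), here is a minimal sketch using their standard textbook definitions. It is not taken from the posting or from any Elastic codebase. The function names, document ids, and relevance grades are hypothetical and chosen only for illustration.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def precision_at_k(ranked, relevant, k):
    """Fraction of the top k positions occupied by relevant ids."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is found.
    MRR is the mean of this value over a set of queries."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, gains, k):
    """DCG of the ranking divided by the DCG of the ideal ordering.
    `gains` maps doc id -> graded relevance; missing ids count as 0."""
    dcg = sum(
        gains.get(doc, 0) / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical query: system ranking plus graded judgments (illustrative only).
ranked = ["d3", "d1", "d7", "d2"]
gains = {"d1": 3, "d2": 2, "d5": 1}
relevant = set(gains)

print("Recall@3:   ", recall_at_k(ranked, relevant, 3))
print("Precision@3:", precision_at_k(ranked, relevant, 3))
print("RR:         ", reciprocal_rank(ranked, relevant))
print("nDCG@3:     ", round(ndcg_at_k(ranked, gains, 3), 4))
```

In practice these per-query values are averaged over a golden dataset and tracked over time, which is where the regression tracking and CI guardrails in the list come in.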

Nice to have

  • Elasticsearch
  • ES|QL

What the JD emphasized

  • core quality layer for RAG, agents and tools
  • evaluate, improve, and scale chat quality
  • define the evaluation strategy
  • productionize evaluation pipelines
  • applied leadership role
  • prototype, evaluate, influence roadmap direction
  • deep expertise in IR, NLP, ranking, semantic search, RAG, or LLM-powered product experiences
  • Strong track record defining and leading evaluation for production AI/ML systems
  • Experience influencing product and technical strategy through data
  • Hands-on ability with Python, PyTorch/Transformers, Pandas, notebooks, reproducible experiments, versioned datasets, and clean, reviewable code
  • Strong understanding of retrieval systems
  • Experience collaborating closely with engineering teams to move from prototype to production

Other signals

  • building agentic platforms
  • evaluating and improving chat quality
  • productionizing evaluation pipelines
  • influencing roadmap for AI products