Member of Technical Staff (Data Scientist, Evals)

Perplexity · AI Frontier · London, United Kingdom · AI

This role focuses on building and maintaining automated evaluation pipelines that assess the quality of LLM-generated answers in Perplexity's search engine and other products. It involves designing evaluation sets, developing solutions based on vision-language models (VLMs) to assess how answers render visually, and adapting public benchmarks. The evaluation metrics this role produces directly shape product changes.

What you'd actually do

  1. Architect and maintain automated evaluation pipelines to assess answer quality across Perplexity's products, ensuring high standards for accuracy and helpfulness
  2. Design evaluation sets and methods specifically to measure the impact of tool calls (particularly web search retrieval) on the final answer's quality
  3. Develop VLM-based solutions to programmatically evaluate how final answers render visually across different platforms and devices
  4. Continuously review public benchmarks and academic evaluations for their applicability to the Perplexity product, adapting and incorporating them into our regular performance measurements
  5. Operate within a small, high-impact team where your evaluation metrics directly shape product changes, collaborating closely with technical leadership to measure and improve Answer Quality

Skills

Required

  • Python
  • SQL
  • AWS
  • Databricks
  • data science
  • machine learning

Nice to have

  • LLMs at scale
  • LLM-as-a-judge setups
  • customer-facing web products
  • consumer apps
  • research methods
  • evaluation metrics
  • factual consistency
  • hallucination rate
  • retrieval precision
  • ground truth datasets

What the JD emphasized

  • specialized evals to improve answer quality
  • measure the impact of tool calls
  • VLM-based solutions
  • evaluation metrics directly shape product changes

Other signals

  • evaluating LLM answer quality
  • building evaluation pipelines
  • measuring impact of tool calls
  • VLM-based evaluation