Machine Learning Engineer, LLM Evals & Observability

Glean · Enterprise · San Francisco, CA · Engineering

This Machine Learning Engineer role focuses on LLM evals and observability for Glean's enterprise AI platform. The work covers designing evaluation datasets, building large-scale evaluation pipelines, creating LLM-powered judges, evaluating new models before launch, and building observability infrastructure for AI agents. The goal is to ensure the reliability and quality of Glean's AI Assistant and Agents.

What you'd actually do

  1. Design and curate evaluation datasets: sampling strategies, query diversity, and golden sets that give reliable, representative coverage of real assistant behavior.
  2. Build and maintain large-scale evaluation pipelines that measure assistant quality across thousands of real user queries.
  3. Build LLM-powered judges that score metrics like correctness, completeness, and response quality, and align them against human judgment.
  4. Evaluate new models and product changes before they ship, providing the quality signal that gates launches and prevents regressions.
  5. Build observability infrastructure for AI agents: trace enrichment, data pipelines, and dashboards that make assistant behavior inspectable.

Skills

Required

  • 2+ years of software engineering experience
  • Strong coding skills
  • Strong backend fundamentals in Go and Python
  • Comfortable with distributed data pipelines
  • Experience working with LLM evaluation
  • Reinforcement learning from human feedback
  • Natural language processing
  • Large systems involving machine learning
  • Analytically rigorous
  • Customer-focused
  • Works well in a tight-knit, cross-functional environment
  • Team player
  • Willing to take on whatever is most impactful for the company
  • Cares about quality

What the JD emphasized

  • evaluating new models and product changes before they ship
  • quality signal that gates launches
  • prevent regressions
  • eval results
  • customer feedback
  • automated prompt iteration

Other signals

  • LLM evaluation
  • agent observability
  • quality measurement
  • gating launches