Machine Learning Engineer, LLM Evals & Observability

Glean · Enterprise · San Francisco, CA · Engineering

Machine Learning Engineer focused on LLM Evals & Observability for Glean's Work AI platform, responsible for designing evaluation datasets, building large-scale evaluation pipelines, developing LLM-powered judges, and creating observability infrastructure for AI agents.

What you'd actually do

  1. Design and curate evaluation datasets – sampling strategies, query diversity, and golden sets that give reliable, representative coverage of real assistant behavior.
  2. Build and maintain large-scale evaluation pipelines that measure assistant quality across thousands of real user queries.
  3. Build LLM-powered judges that score metrics like correctness, completeness, and response quality, and align them against human judgment.
  4. Evaluate new models and product changes before they ship – providing the quality signal that gates launches and prevents regressions.
  5. Build observability infrastructure for AI agents: trace enrichment, data pipelines, and dashboards that make assistant behavior inspectable.
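To make item 3 concrete, here is a minimal sketch of what an LLM-powered judge with human-alignment checking might look like. Everything here is an illustrative assumption, not Glean's implementation: `call_model` stands in for any LLM completion API, the prompt and the 1–5 correctness scale are invented, and `exact_agreement` is just one simple way to compare judge scores against human labels.

```python
# Hypothetical LLM-judge sketch. `call_model` is a placeholder for any
# LLM completion function (prompt in, text out) -- an assumption here.

JUDGE_PROMPT = (
    "Rate the following answer for correctness on a 1-5 scale.\n"
    "Query: {query}\nAnswer: {answer}\n"
    "Reply with a single digit."
)

def judge_score(query: str, answer: str, call_model) -> int:
    """Ask the judge model for a 1-5 correctness score."""
    raw = call_model(JUDGE_PROMPT.format(query=query, answer=answer))
    score = int(raw.strip()[0])   # naive parse; real pipelines need validation/retries
    return min(max(score, 1), 5)  # clamp to the valid range

def exact_agreement(judge: list[int], human: list[int]) -> float:
    """Fraction of items where judge and human gave the same score."""
    assert len(judge) == len(human) and judge
    return sum(j == h for j, h in zip(judge, human)) / len(judge)

if __name__ == "__main__":
    fake_model = lambda prompt: "4"   # stub standing in for a real LLM call
    print(judge_score("What is 2+2?", "4", fake_model))
    print(exact_agreement([4, 3, 5], [4, 2, 5]))
```

In practice this is the kind of loop the role describes: run the judge over an eval set, compare against human golden labels, and iterate on the prompt until agreement is high enough to trust the judge as a launch-gating signal.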

Skills

Required

  • 2+ years of software engineering experience
  • Strong coding skills
  • Strong backend fundamentals in Go and Python
  • Comfortable with distributed data pipelines
  • Experience with LLM evaluation, reinforcement learning from human feedback, natural language processing, or other large-scale machine learning systems
  • Analytically rigorous

Nice to have

  • Customer-focused
  • Comfortable in a tight-knit, cross-functional environment
  • Team player
  • Willing to take on whatever is most impactful for the company

What the JD emphasized

  • evaluation datasets
  • evaluation pipelines
  • LLM-powered judges
  • quality signal
  • observability infrastructure
  • eval results
  • LLM evaluation
  • quality

Other signals

  • LLM Evals & Observability
  • quality eval-sets
  • agent observability