AI Evaluations Lead

Cresta · Vertical AI · United States · Remote · Customer Success

Lead the AI evaluation strategy and systems for Cresta's AI Agent product line, focusing on ensuring reliability, accuracy, and empathy at scale. This role involves designing test plans for LLMs, building automated testing environments, conducting manual UAT, and leading a team of QA analysts. The ideal candidate has strong systems thinking, operational leadership, and a "QA nose" for edge cases, with experience in AI or high-growth SaaS environments.

What you'd actually do

  1. Architect & Scale AI Evaluation Systems: Design and oversee the end-to-end framework for testing AI agent systems at scale. You will leverage LLM-driven methodologies—including automated simulations, "LLM-on-LLM" rubrics, and adversarial red-teaming—to ensure reliability, policy adherence, and logic across complex, multi-turn conversational flows.
  2. Drive Deployment Excellence: Partner with Forward Deployed Engineers and PMs to triage issues, identify bottlenecks, and create new test cases on the fly to address real-world deployment challenges.
  3. Be the Customer’s Voice: Conduct manual UAT and voice-call testing to represent the end-customer experience. You take it personally when an agent lacks empathy or clarity, and you excel at articulating these nuances to the engineering team and clients.
  4. Lead and Scale the Team: Lead a pod of QA analysts and partners. You will define the best practices, communication loops, and shared knowledge base that allow the QA function to scale alongside our rapidly growing product line.
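
For readers unfamiliar with the "LLM-on-LLM" rubric approach named in item 1, here is a minimal sketch of the idea: a multi-turn conversation is rendered as a transcript and each rubric question is put to a judge. This is an illustration only, not Cresta's system. All names (Turn, RubricItem, evaluate, keyword_judge) are hypothetical, and the judge is a simple keyword stand-in so the example runs without any model; in practice that function would call an LLM.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Turn:
    role: str  # "customer" or "agent"
    text: str


@dataclass
class RubricItem:
    name: str      # short key used in the results
    question: str  # what the judge is asked about the transcript


# A judge receives (rubric question, transcript) and returns pass/fail.
Judge = Callable[[str, str], bool]


def render(conversation: list[Turn]) -> str:
    """Flatten a multi-turn conversation into one 'role: text' line per turn."""
    return "\n".join(f"{t.role}: {t.text}" for t in conversation)


def evaluate(conversation: list[Turn], rubric: list[RubricItem], judge: Judge) -> dict[str, bool]:
    """Ask the judge every rubric question about the same transcript."""
    transcript = render(conversation)
    return {item.name: judge(item.question, transcript) for item in rubric}


def keyword_judge(question: str, transcript: str) -> bool:
    """Hypothetical stand-in for an LLM judge; looks only at agent turns."""
    agent_text = " ".join(
        line.split(": ", 1)[1].lower()
        for line in transcript.splitlines()
        if line.startswith("agent: ")
    )
    q = question.lower()
    if "empathy" in q:
        return "sorry" in agent_text
    if "refund" in q:
        # Assumed policy for this example: the agent must not mention refunds.
        return "refund" not in agent_text
    return False


rubric = [
    RubricItem("empathy", "Does the agent show empathy?"),
    RubricItem("no_refund_promise", "Does the agent avoid promising a refund?"),
]

conversation = [
    Turn("customer", "My order arrived broken."),
    Turn("agent", "I'm sorry to hear that. I can arrange a replacement."),
    Turn("customer", "Can I get a refund instead?"),
    Turn("agent", "I'll connect you with a specialist who can review that."),
]

results = evaluate(conversation, rubric, keyword_judge)
print(results)
```

Keeping the judge as a plain function argument means the same rubric can be run against a keyword stub in unit tests and against a real model in a simulation run.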

Skills

Required

  • 5+ years of experience in Quality Engineering, Deployments, or Technical QA
  • Systems Thinking
  • Technical intuition and curiosity about LLMs
  • Comfort navigating technical concepts like LLM, RAG, prompt logic, and multi-turn conversational flows
  • Operational Leadership
  • Ability to lead large E2E technical projects through partners
  • Experience building processes that improve efficiency between QA, Engineering, and Product
  • Strong ability to find edge cases and bias toward action
  • Ability to anticipate bottlenecks and deliver solutions with urgency
  • High Empathy and consultative mindset
  • Ability to represent the 'human element' of customer support
  • Startup Agility and ability to turn ambiguity into execution

Nice to have

  • Experience in an AI or high-growth SaaS environment
  • Experience with CCaaS (Contact Center as a Service), telephony, or STT/TTS (Speech-to-Text/Text-to-Speech) technologies
  • Background in Conversation Design or SDET roles
  • Experience leading a team with direct reports

What the JD emphasized

  • ultimate guardian of the customer experience
  • end-to-end quality strategy
  • designing complex test plans for non-deterministic LLMs
  • building automated and scalable testing environments
  • You aren't just looking for bugs; you are building the framework that allows Cresta to deploy world-class AI agents for the world's largest enterprises with total confidence.
  • Architect & Scale AI Evaluation Systems
  • LLM-driven methodologies
  • automated simulations
  • LLM-on-LLM rubrics
  • adversarial red-teaming
  • ensure reliability, policy adherence, and logic
  • multi-turn conversational flows
  • Drive Deployment Excellence
  • triage issues
  • identify bottlenecks
  • create new test cases on the fly
  • real-world deployment challenges
  • Be the Customer’s Voice
  • manual UAT
  • voice-call testing
  • represent the end-customer experience
  • agent lacks empathy or clarity
  • articulating these nuances
  • Lead and Scale the Team
  • lead a pod of QA analysts and partners
  • define the best practices
  • communication loops
  • shared knowledge base
  • scale alongside our rapidly growing product line
  • 5+ years of experience
  • Quality Engineering
  • Deployments
  • Technical QA
  • AI or high-growth SaaS environment
  • Systems Thinking
  • strong technical intuition
  • curiosity about how LLMs work
  • comfortable navigating technical concepts
  • LLM
  • RAG
  • prompt logic
  • multi-turn conversational flows
  • Operational Leadership
  • Proven ability to lead large E2E technical projects
  • building processes that improve efficiency
  • QA, Engineering, and Product
  • The "QA Nose"
  • uncanny ability to find the edge case
  • bias toward action
  • anticipate bottlenecks
  • deliver solutions with urgency
  • High Empathy
  • consultative mindset
  • represent the "human element"
  • customer support interaction
  • Startup Agility
  • fast-paced environments
  • turning ambiguity into execution
  • comfortable "rolling up your sleeves" to build

Other signals

  • AI Evaluations Lead
  • own the end-to-end quality strategy
  • designing complex test plans for non-deterministic LLMs
  • building automated and scalable testing environments
  • architect & scale AI evaluation systems
  • LLM-driven methodologies
  • automated simulations
  • LLM-on-LLM rubrics
  • adversarial red-teaming
  • ensure reliability, policy adherence, and logic
  • multi-turn conversational flows
  • lead and scale the team
  • systems thinking
  • technical intuition and curiosity about how LLMs work
  • comfortable navigating technical concepts like LLM, RAG, prompt logic, and multi-turn conversational flows