AI Evaluations Lead

Cresta · Vertical AI · United States · Remote · Customer Success

Lead the AI evaluation strategy and systems for Cresta's AI Agent product line, focusing on ensuring reliability, accuracy, and empathy at scale. This role involves designing test plans for LLMs, building automated testing environments, conducting manual UAT, and leading a team of QA analysts. The ideal candidate has strong systems thinking, operational leadership, and a "QA nose" for edge cases, with experience in AI or high-growth SaaS environments.

What you'd actually do

Architect & Scale AI Evaluation Systems: Design and oversee the end-to-end framework for testing AI agent systems at scale. You will leverage LLM-driven methodologies—including automated simulations, "LLM-on-LLM" rubrics, and adversarial red-teaming—to ensure reliability, policy adherence, and logic across complex, multi-turn conversational flows.
Drive Deployment Excellence: Partner with Forward Deployed Engineers and PMs to triage issues, identify bottlenecks, and create new test cases on the fly to address real-world deployment challenges.
Be the Customer’s Voice: Conduct manual UAT and voice-call testing to represent the end-customer experience. You take it personally when an agent lacks empathy or clarity, and you excel at articulating these nuances to the engineering team and clients.
Lead and Scale the Team: Lead a pod of QA analysts and partners. You will define the best practices, communication loops, and shared knowledge base that allow the QA function to scale alongside our rapidly growing product line.

Skills

Required

5+ years of experience in Quality Engineering, Deployments, or Technical QA
Systems Thinking
Technical intuition and curiosity about LLMs
Comfort navigating technical concepts like LLM, RAG, prompt logic, and multi-turn conversational flows
Operational Leadership
Ability to lead large E2E technical projects through partners
Experience building processes that improve efficiency between QA, Engineering, and Product
Strong ability to find edge cases and bias toward action
Ability to anticipate bottlenecks and deliver solutions with urgency
High Empathy and consultative mindset
Ability to represent the 'human element' of customer support
Startup Agility and ability to turn ambiguity into execution

Nice to have

Experience in an AI or high-growth SaaS environment
Experience with CCaaS (Contact Center as a Service), telephony, or STT/TTS (Speech-to-Text/Text-to-Speech) technologies
Background in Conversation Design or SDET roles
Experience leading a team with direct reports

What the JD emphasized

ultimate guardian of the customer experience
end-to-end quality strategy
designing complex test plans for non-deterministic LLMs
building automated and scalable testing environments
You aren't just looking for bugs; you are building the framework that allows Cresta to deploy world-class AI agents for the world's largest enterprises with total confidence.
Architect & Scale AI Evaluation Systems
LLM-driven methodologies
automated simulations
LLM-on-LLM rubrics
adversarial red-teaming
ensure reliability, policy adherence, and logic
multi-turn conversational flows
Drive Deployment Excellence
triage issues
identify bottlenecks
create new test cases on the fly
real-world deployment challenges
Be the Customer’s Voice
manual UAT
voice-call testing
represent the end-customer experience
agent lacks empathy or clarity
articulating these nuances
Lead and Scale the Team
lead a pod of QA analysts and partners
define the best practices
communication loops
shared knowledge base
scale alongside our rapidly growing product line
5+ years of experience
Quality Engineering
Deployments
Technical QA
AI or high-growth SaaS environment
Systems Thinking
strong technical intuition
curiosity about how LLMs work
comfortable navigating technical concepts
LLM
RAG
prompt logic
multi-turn conversational flows
Operational Leadership
Proven ability to lead large E2E technical projects
building processes that improve efficiency
QA, Engineering, and Product
The "QA Nose"
uncanny ability to find the edge case
bias toward action
anticipate bottlenecks
deliver solutions with urgency
High Empathy
consultative mindset
represent the "human element"
customer support interaction
Startup Agility
fast-paced environments
turning ambiguity into execution
comfortable "rolling up your sleeves" to build

Cresta is on a mission to turn every customer conversation into a competitive advantage by unlocking the true potential of the contact center. Our platform combines the best of AI and human intelligence to help contact centers discover customer insights and behavioral best practices, automate conversations and inefficient processes, and empower every team member to work smarter and faster. Cresta was born from the prestigious Stanford AI Lab. Our co-founder and chairman is Sebastian Thrun, the genius behind Google X, Waymo, Udacity, and more. Our leadership also includes CEO Ping Wu, co-founder of Google Contact Center AI and the Vertex AI platform, and co-founder Tim Shi, an early member of OpenAI.

We’ve assembled a world-class team of AI and ML experts, go-to-market leaders, and top-tier investors including Andreessen Horowitz, Greylock Partners, Sequoia, and former AT&T CEO John Donovan. Our valued customers include brands like Intuit, Cox Communications, Hilton, and CarMax, and we’ve been recognized by Forbes and Bain Consulting as one of the top private AI companies in the world.

Join us on this thrilling journey to revolutionize the workforce with AI. The future of work is here, and it's at Cresta.

About the Role:

At Cresta, shipping AI is only half the story. Ensuring that AI interacts with humans reliably, accurately, and empathetically at scale is where the real challenge lies.

As the AI Evaluations Lead, you will be the ultimate guardian of the customer experience for our AI Agent product line. This role is perfect for a strategic quality expert who loves the intersection of human psychology and machine logic. You will own the end-to-end quality strategy, from designing complex test plans for non-deterministic LLMs to building automated and scalable testing environments using Cresta's proprietary no-code test and evaluation tools.

You aren't just looking for bugs; you are building the framework that allows Cresta to deploy world-class AI agents for the world's largest enterprises with total confidence.

What You’ll Do:

Architect & Scale AI Evaluation Systems: Design and oversee the end-to-end framework for testing AI agent systems at scale. You will leverage LLM-driven methodologies—including automated simulations, "LLM-on-LLM" rubrics, and adversarial red-teaming—to ensure reliability, policy adherence, and logic across complex, multi-turn conversational flows.
Drive Deployment Excellence: Partner with Forward Deployed Engineers and PMs to triage issues, identify bottlenecks, and create new test cases on the fly to address real-world deployment challenges.
Be the Customer’s Voice: Conduct manual UAT and voice-call testing to represent the end-customer experience. You take it personally when an agent lacks empathy or clarity, and you excel at articulating these nuances to the engineering team and clients.
Lead and Scale the Team: Lead a pod of QA analysts and partners. You will define the best practices, communication loops, and shared knowledge base that allow the QA function to scale alongside our rapidly growing product line.

What We’re Looking For:

5+ years of experience in Quality Engineering, Deployments, or Technical QA, ideally within an AI or high-growth SaaS environment.
Systems Thinking: A strong technical intuition and curiosity about how LLMs work. While you don't need to code, you must be comfortable navigating technical concepts like LLM, RAG, prompt logic, and multi-turn conversational flows.
Operational Leadership: Proven ability to lead large E2E technical projects through partners, and a passion for building processes that improve efficiency between QA, Engineering, and Product.
The "QA Nose": An uncanny ability to find the edge case and a bias toward action. You anticipate bottlenecks before they happen and deliver solutions with urgency.
High Empathy: A consultative mindset with the ability to represent the "human element" of a customer support interaction.
Startup Agility: You thrive in fast-paced environments, excel at turning ambiguity into execution, and are comfortable "rolling up your sleeves" to build.

Bonus Points:

Experience with CCaaS (Contact Center as a Service), telephony, or STT/TTS (Speech-to-Text/Text-to-Speech) technologies.
Background in Conversation Design or SDET roles.
Experience leading a team with direct reports.

Perks & Benefits:

We offer a comprehensive and people-first benefits package to support you at work and in life:

Comprehensive medical, dental, and vision coverage with plans to fit you and your family
Flexible PTO to take the time you need, when you need it
Paid parental leave for all new parents welcoming a new child
Retirement savings plan to help you plan for the future
Remote work setup budget to help you create a productive home office
Monthly wellness and communication stipend to keep you connected and balanced
In-office meal program and commuter benefits provided for onsite employees

Compensation at Cresta:

Cresta’s approach to compensation is simple: recognize impact, reward excellence, and invest in our people. We offer competitive, location-based pay that reflects the market and what each individual brings to the table.

Compensation for this position includes a Base salary + Bonus + Equity.

Actual base salaries will be based on candidate-specific factors, including experience, skillset, and location, and local minimum pay requirements as applicable. Your recruiter can provide further details. In addition, total compensation includes a comprehensive benefits package for you and your family.

We have noticed a rise in recruiting impersonations across the industry, where scammers attempt to access candidates' personal and financial information through fake interviews and offers. All Cresta recruiting email communications will always come from the @cresta.ai domain. Any outreach claiming to be from Cresta via other sources should be ignored. If you are uncertain whether you have been contacted by an official Cresta employee, reach out to recruiting@cresta.ai.