About Mistral

Mistral provides full-stack AI solutions: from frontier models to developer tools, applications, and compute. We partner with enterprises tackling the hardest problems—across high-stakes industries like finance, manufacturing, defense, healthcare, and the public sector—co-creating customized AI systems that they can run on their terms.

We are a dynamic, collaborative team passionate about AI and its potential to transform society. Our diverse workforce thrives in competitive environments and is committed to driving innovation. Our teams are distributed across Europe, North America, Asia and the Middle East. We are creative, low-ego and team-spirited.

About The Job

The Applied AI team is Mistral's customer-facing technical organization. We work directly with enterprise clients from pre-sales through implementation to deploy cutting-edge AI solutions that deliver measurable business impact. Our team combines deep ML expertise with strong customer engagement skills, operating like startup CTOs who own end-to-end project execution.

However, the AI graveyard is full of great ideas nobody could measure and prototypes that never made it to production. As our first Evaluation Engineer, you'll design the methodology, build the infrastructure, and define what "ready for production" means across verticals and use cases.

You will design and implement evaluation systems that help our customers understand model performance across their specific use cases, build robust evaluation infrastructure, and work closely with both research and customer-facing teams.

Research builds evals for frontier capabilities, but customers don't care about MMLU scores. In Applied AI, we need evals and frameworks built for customer reality: domain-specific, risk-aware, and production-grade. The kind that tell you whether your medical summarization model will hallucinate drug interactions, or whether your legal assistant will invent case citations.

This role sits at the intersection of research, engineering, and solutions. You will play a critical cross-functional role in measuring, understanding, and improving the capabilities of our models for our enterprise customers.

What you will do

- Design and implement comprehensive evaluation frameworks to measure LLM capabilities across diverse customer use cases, including text generation, reasoning, code, and domain-specific applications

- **Build scalable evaluation infrastructure and pipelines** that enable rapid, reproducible assessment of model performance

- **Develop novel evaluation methodologies** to assess emerging capabilities or verticalized use cases (cybersecurity, finance, healthcare, etc.) and enable the Solutions teams (Deployment Strategists and Applied AI) on these topics

- **Create custom evaluation suites** tailored to enterprise customers' specific needs, working closely with them to understand their requirements and success criteria

- **Collaborate with research teams** to translate evaluation insights into model improvements and training decisions

- Partner with product teams to continuously improve our evaluation tooling based on customer feedback

How We Work in Applied AI

We care about people and outputs.
What matters is what you ship, not the time you spend on it.
Bureaucracy is where urgency goes to vanish. You talk to whoever you need to talk to. The best idea wins, whether it comes from a principal engineer or someone in their first week.
Always ask why. The best solutions come from deep understanding, not from copying what worked before.
We say what we mean. Feedback is direct, timely, and given because we care.
No politics. Low ego, high standards.
We embrace an unstructured environment and find joy in it.

About you

You are fluent in English
You have 3+ years of experience in ML evaluation and benchmarking for LLM or agentic systems
You have proven experience implementing AI or machine learning products with APIs and back-end systems
You have a deep understanding of the concepts and algorithms underlying machine learning and LLMs
You have strong technical coding skills in Python
You have strong communication skills and can explain complex technical concepts in simple terms to both technical and non-technical audiences

Ideally you have:

Contributions to open-source evaluation frameworks (e.g., LM Eval Harness, OpenAI Evals) or published research on LLM evaluation
Experience as a Customer Engineer, Forward Deployed Engineer, Sales Engineer, Solutions Architect or Technical Product Manager
Experience with ML frameworks (PyTorch, HuggingFace Transformers)

What we offer

We offer a comprehensive benefits package designed to support your well-being, growth, and work-life balance. Benefits vary by country and may include healthcare coverage, parental leave, retirement plans, relocation support, wellness programs, meal and transportation allowances, and other location-specific perks.

For the most up-to-date details on benefits available in your location, please refer to our Benefits page.

Privacy Policy

Your privacy matters to us. You can learn more about how we handle your personal data in our Applicant Privacy Policy.

Applied AI, Evaluation Engineer