Applied AI, Evaluation Engineer

Mistral AI · AI Frontier · Paris, France · Solutions

Mistral AI is seeking an Applied AI Evaluation Engineer to design and implement evaluation systems for LLMs and agentic systems for enterprise clients. This role focuses on creating production-grade, domain-specific evaluations to ensure model performance and reliability, bridging the gap between research and customer needs.

What you'd actually do

  1. Design and implement comprehensive evaluation frameworks to measure LLM capabilities across diverse customer use cases, including text generation, reasoning, code, and domain-specific applications
  2. Build scalable evaluation infrastructure and pipelines that enable rapid, reproducible assessment of model performance
  3. Develop novel evaluation methodologies to assess emerging capabilities or verticalized use cases (cybersecurity, finance, healthcare, etc.) and enable the Solutions team (Deployment Strategists and Applied AI) on these topics
  4. Create custom evaluation suites tailored to enterprise customers' specific needs, working closely with them to understand their requirements and success criteria
  5. Collaborate with research teams to translate evaluation insights into model improvements and training decisions

Skills

Required

  • 3+ years of experience in ML evaluation or benchmarking for LLM or agentic systems
  • Proven experience in AI or machine learning product implementation with APIs and back-end systems
  • Deep understanding of concepts and algorithms underlying machine learning and LLMs
  • Strong technical coding skills in Python
  • Strong communication skills, with the ability to explain complex technical concepts in simple terms to both technical and non-technical audiences

Nice to have

  • Contributions to open-source evaluation frameworks (e.g., LM Eval Harness, OpenAI Evals) or published research on LLM evaluation
  • Experience as a Customer Engineer, Forward Deployed Engineer, Sales Engineer, Solutions Architect or Technical Product Manager
  • Experience with ML frameworks (PyTorch, HuggingFace Transformers)

What the JD emphasized

  • customer-facing
  • evaluation systems
  • production-grade
  • domain-specific
  • risk-aware
  • LLM evaluation
  • benchmarking for LLM or agentic systems
  • AI or machine learning product implementation

Other signals

  • customer-facing
  • evaluation systems
  • production-grade AI