Applied AI, Evaluation Engineer

Mistral AI · AI Frontier · Paris, France · Solutions

This role focuses on designing and implementing evaluation systems and infrastructure for LLMs, specifically for enterprise clients. The goal is to measure model performance across customer-specific use cases, moving beyond general benchmarks to domain-specific, risk-aware evaluations. The role involves building scalable pipelines, developing new methodologies, and tailoring evaluations to customer needs, bridging research, engineering, and customer-facing teams.

What you'd actually do

  1. Design and implement comprehensive evaluation frameworks to measure LLM capabilities across diverse customer use cases, including text generation, reasoning, code, and domain-specific applications
  2. Build scalable evaluation infrastructure and pipelines that enable rapid, reproducible assessment of model performance
  3. Develop novel evaluation methodologies to assess emerging capabilities or verticalized use cases (cybersecurity, finance, healthcare, etc.) and enable the Solutions team (Deployment Strategists and Applied AI) on these topics
  4. Create custom evaluation suites tailored to enterprise customers' specific needs, working closely with them to understand their requirements and success criteria
  5. Collaborate with research teams to translate evaluation insights into model improvements and training decisions

Skills

Required

  • Python
  • ML evaluation
  • benchmarking for LLM or agentic systems
  • AI or machine learning product implementation with APIs and back-end systems
  • deep understanding of concepts and algorithms underlying machine learning and LLMs
  • strong communication skills

Nice to have

  • Contributions to open-source evaluation frameworks (e.g., LM Eval Harness, OpenAI Evals)
  • published research on LLM evaluation
  • Experience as a Customer Engineer, Forward Deployed Engineer, Sales Engineer, Solutions Architect or Technical Product Manager
  • ML frameworks (PyTorch, HuggingFace Transformers)

What the JD emphasized

  • customer-facing technical organization
  • design the methodology, build the infrastructure, and define what "ready for production" means
  • design and implement evaluation systems
  • build robust evaluation infrastructure
  • customer reality: domain-specific, risk-aware, production-grade
  • 3+ years of experience in ML evaluation, benchmarking for LLM or agentic systems
  • proven experience in AI or machine learning product implementation

Other signals

  • design and implement evaluation frameworks
  • build scalable evaluation infrastructure
  • develop novel evaluation methodologies
  • create custom evaluation suites
  • ML evaluation, benchmarking for LLM or agentic systems