Research Engineer, Data Infrastructure

Mistral AI · AI Frontier · Palo Alto, CA · Research

This role focuses on building and operating the next generation of data infrastructure at Mistral AI, including massive compute fleets and storage systems built for high performance and scalability. It involves designing and scaling big data compute and storage platforms, ensuring secure and governed data access for MLOps and research, and taking full lifecycle ownership of pipelines and critical training jobs.

What you'd actually do

  1. Build & Scale: Help us reach our goal of operating massive distributed compute and storage systems.
  2. Global Orchestration: Architect and maintain multi-cluster orchestration layers to optimize workload placement across diverse hardware and regions.
  3. Design Future-Proof Storage: Architect our transition to modern storage formats to handle fine-tuning datasets at a scale that anticipates exabyte growth.
  4. Platform Engineering: Contribute to the development of our internal training platform, ensuring seamless model training and fine-tuning capabilities across Kubernetes- and SLURM-based environments.
  5. Metadata & Lineage: Implement and manage systems to provide clear visibility and lineage as our data and model pipelines grow in complexity.

Skills

Required

  • 4+ years of experience in Data Infrastructure, MLOps, or Infrastructure Engineering
  • Python
  • Kubernetes-native tooling
  • building and operating scalable, reliable, and secure systems

Nice to have

  • experience or a strong interest in supporting foundational compute and storage platforms
  • modern, columnar storage standards
  • debugging large-scale distributed systems across multi-cluster environments
  • comfort with ambiguity and the challenges of building high-scale infrastructure in a rapid-growth AI environment

What the JD emphasized

  • massive distributed compute and storage systems
  • exabyte growth
  • Kubernetes-native tooling
  • large-scale distributed systems

Other signals

  • data infrastructure
  • MLOps
  • training data
  • exabyte growth
  • Kubernetes