Mistral Cloud - Site Reliability Engineer

Mistral AI · AI Frontier · Paris, France · Engineering & Infra

Mistral AI is seeking experienced Site Reliability Engineers (SREs) to ensure the reliability, scalability, and performance of its Cloud platform and customer-facing applications. Responsibilities include designing and maintaining highly available infrastructure, operating production systems, implementing monitoring and alerting, managing CI/CD and orchestration tools, and collaborating with engineering and security teams. The role also involves driving infrastructure automation, supporting model-training experiments, and contributing to the cloud platform's abstraction layer.

What you'd actually do

  1. Design, build, and maintain scalable, highly available, and fault-tolerant infrastructure
  2. Operate systems and troubleshoot issues in production environments (interrupts, on-call responses, user administration, data extraction, infrastructure scaling, etc.)
  3. Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime
  4. Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging and alerting systems) for both our customer-facing APIs and large training runs
  5. Participate occasionally in on-call rotations to respond to incidents and perform root cause analysis to prevent future occurrences

Skills

Required

  • Master’s degree in Computer Science, Engineering or a related field
  • 5+ years of experience in a DevOps/SRE role
  • Strong experience with bare metal infrastructure and highly available distributed systems
  • Exposure to site reliability issues in critical environments (root cause analysis, in-production troubleshooting, on-call rotations...)
  • Experience working against reliability KPIs (observability, alerting, SLAs)
  • Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes...)
  • Knowledge of monitoring, logging, alerting and observability tools (Prometheus, Grafana, ELK Stack, Datadog...)
  • Familiarity with infrastructure-as-code tools like Terraform or CloudFormation
  • Proficiency in scripting languages (Python, Go, Bash...)
  • Knowledge of software development best practices
  • Strong understanding of networking, security, and system administration concepts
  • Excellent problem-solving and communication skills
  • Self-motivated and able to work well in a fast-paced startup environment

Nice to have

  • Experience in an AI/ML environment
  • Experience with high-performance computing (HPC) systems and workload managers (Slurm)
  • Experience with modern AI-oriented solutions (Fluidstack, CoreWeave, Vast...)

What the JD emphasized

  • highly experienced Site Reliability Engineers
  • highly available
  • fault-tolerant infrastructures
  • production environments
  • monitoring, alerting, and incident response systems
  • customer-facing APIs
  • large training runs
  • on-call rotations
  • infrastructure automation
  • model-training experiments
  • reliability, availability and performance
  • best security practices and compliance requirements
  • 5+ years of experience in a DevOps/SRE role
  • bare metal infrastructure
  • highly available distributed systems
  • site reliability issues in critical environments
  • reliability KPIs
  • observability, alerting, SLAs
  • CI/CD, containerization and orchestration tools
  • monitoring, logging, alerting and observability tools
  • infrastructure-as-code tools
  • scripting languages (Python, Go, Bash...)
  • networking, security, and system administration concepts
  • problem-solving and communication skills