Software Engineer, Internal Infrastructure (Europe & UK)

Cohere · AI Frontier · United Kingdom · Product

Software Engineer on the internal infrastructure team responsible for building and operating Kubernetes GPU superclusters across multiple clouds to support AI workloads for training foundational models. This role involves partnering with cloud providers, designing scalable systems, and ensuring stability and efficiency for research teams.

What you'd actually do

  1. Build and operate Kubernetes compute superclusters across multiple clouds
  2. Partner with cloud providers to optimize infrastructure costs, performance, and reliability for AI workloads
  3. Work closely with research teams to understand their infrastructure needs and identify ways to improve stability, performance, and efficiency of novel model training techniques
  4. Design and build resilient, scalable systems for training AI models, focusing on intuitive user interfaces that let researchers troubleshoot and resolve problems on their own
  5. Encourage software best practices across the company and participate in team processes such as knowledge sharing, reviews, and on-call

Skills

Required

  • running Kubernetes clusters at scale
  • scaling and troubleshooting Cloud Native infrastructure
  • Infrastructure as Code
  • Go or Python

Nice to have

  • ML training infrastructure
  • GPU workloads
  • RDMA networking
  • supporting and troubleshooting low-level Linux systems
  • collaborating with research teams or machine learning engineers

What the JD emphasized

  • participating in a 24x7 on-call rotation
  • deep experience running Kubernetes clusters at scale
  • scaling and troubleshooting Cloud Native infrastructure
  • Infrastructure as Code

Other signals

  • building and operating Kubernetes GPU superclusters
  • optimizing infrastructure costs, performance, and reliability for AI workloads
  • designing and building resilient, scalable systems for training AI models