Software Engineer, Internal Infrastructure (North America)

Cohere · AI Frontier · Toronto, ON · Product

Software Engineer focused on building and operating internal infrastructure for training, evaluating, and serving foundational AI models. This includes managing Kubernetes GPU superclusters, optimizing cloud infrastructure for AI workloads, and designing scalable systems for model training, with a strong emphasis on stability, scalability, and observability.

What you'd actually do

  1. Build and operate Kubernetes compute superclusters across multiple clouds
  2. Partner with cloud providers to optimize infrastructure costs, performance, and reliability for AI workloads
  3. Work closely with research teams to understand their infrastructure needs and identify ways to improve stability, performance, and efficiency of novel model training techniques
  4. Design and build resilient, scalable systems for training AI models, with intuitive user interfaces that let researchers troubleshoot and resolve problems on their own
  5. Encourage software best practices across our company and participate in team processes such as knowledge sharing, reviews, and on-call

Skills

Required

  • running Kubernetes clusters at scale
  • scaling and troubleshooting cloud-native infrastructure
  • Infrastructure as Code
  • Go or Python
  • contributing to open-source solutions

Nice to have

  • ML training infrastructure
  • GPU workloads
  • RDMA networking
  • low-level Linux systems
  • collaborating with research teams or machine learning engineers

What the JD emphasized

  • participating in a 24x7 on-call rotation

Other signals

  • building and operating Kubernetes GPU superclusters
  • optimizing infrastructure costs, performance, and reliability for AI workloads
  • designing and building resilient, scalable systems for training AI models