Staff Software Engineer, GPU Infrastructure (hpc)

Cohere Cohere · AI Frontier · Canada · Product

Staff Software Engineer focused on building and scaling ML-optimized HPC infrastructure, including Kubernetes-based GPU/TPU superclusters, optimizing for AI/ML training cost efficiency, reliability, and performance, and enabling researchers with self-service tools.

What you'd actually do

  1. Build and scale ML-optimized HPC infrastructure
  2. Optimize for AI/ML training
  3. Troubleshoot and resolve complex issues
  4. Enable researchers with self-service tools
  5. Drive innovation in ML infrastructure

Skills

Required

  • Python
  • Go
  • Kubernetes
  • GPU/TPU clusters
  • distributed training frameworks (JAX, PyTorch, TensorFlow)
  • HPC environments
  • Linux internals
  • RDMA networking
  • performance optimization for ML workloads

Nice to have

  • open-source contributions

What the JD emphasized

  • Kubernetes-based GPU/TPU superclusters
  • cost efficiency, reliability, and performance
  • RDMA, NCCL, and high-speed interconnects
  • monitor, debug, and optimize
  • JAX, PyTorch, distributed training
  • observability, automation, and infrastructure-as-code (IaC)
  • GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and high-performance computing (HPC) environments
  • Kubernetes at scale
  • Python
  • Go
  • Linux internals, RDMA networking, and performance optimization
  • AI researchers or ML engineers
  • identify bottlenecks, propose solutions, and drive impact

Other signals

  • building and operating superclusters across multiple clouds
  • support their AI workload needs on the cutting edge
  • stability, scalability, and observability