Senior ML Systems Engineer, Frameworks & Tooling

Cohere · AI Frontier · London, United Kingdom · Modeling

Cohere is seeking a Senior ML Systems Engineer to build, maintain, and evolve the training framework for its frontier-scale language models. The role focuses on large-scale training, distributed systems, and HPC infrastructure: designing core components for fast, reliable, and scalable model training, and developing tooling that connects research to GPU clusters. The position emphasizes full-stack ownership of ML systems and their impact.

What you'd actually do

  1. Build and own the training framework responsible for large-scale LLM training.
  2. Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing).
  3. Improve training throughput and stability on multi-node clusters (e.g., GB200/GB300, AMD, H200/H100).
  4. Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics.
  5. Collaborate closely with infra teams to ensure the cluster, container environments, and hardware configurations support high-performance training.

Skills

Required

  • large-scale distributed training
  • HPC systems
  • JAX internals
  • distributed training libraries
  • custom kernels/fused ops
  • multi-node cluster orchestration
  • CUDA/NCCL debugging
  • networking debugging
  • IO debugging
  • data pipeline debugging
  • containerized environments
  • ML systems stack performance debugging
  • reproducible systems
  • debuggable systems
  • collaboration skills

Nice to have

  • training LLMs
  • transformer architectures
  • ML framework contributions
  • evaluation frameworks
  • serving frameworks
  • data pipeline optimization
  • sharded datasets
  • caching strategies
  • performance engineering
  • profiling
  • low-level systems
  • papers at top-tier venues

What the JD emphasized

  • Strong engineering experience in large-scale distributed training or HPC systems.
  • Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops.
  • Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines.

Other signals

  • training framework
  • frontier-scale language models
  • distributed systems
  • HPC infrastructure
  • thousands of GPUs