Member of Technical Staff, Training Performance Engineer

Cohere Cohere · AI Frontier · London, United Kingdom · Modeling

Cohere is seeking a Performance Engineer for their Pre-Training team to optimize the performance of advanced language models and systems, focusing on improving training throughput and accelerator utilization. The role involves designing and writing high-performance software, including low-level CUDA/Triton kernels, and researching ideas on supercompute and data infrastructure.

What you'd actually do

  1. Design and write high-performant and scalable software for training.
  2. Understand architectural modifications and design choices and their effects on training throughput and quality.
  3. Write low-level CUDA, triton kernels to squeeze every last bit of performance from our accelerators.
  4. Research, implement, and experiment with ideas on our supercompute and data infrastructure.
  5. Learn from and work with the best researchers in the field.

Skills

Required

  • Extremely strong software engineering skills.
  • Proficiency in Python and related ML frameworks such as JAX, Pytorch and XLA/MLIR.
  • Experience writing kernels for GPUs using CUDA, triton, etc
  • Experience using large-scale distributed training strategies.
  • Familiarity with autoregressive sequence models, such as Transformers.

Nice to have

  • paper at top-tier venues (such as NeurIPS, ICML, ICLR, AIStats, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP)

What the JD emphasized

  • improving key model training metrics
  • high accelerator utilization
  • low-level CUDA, triton kernels

Other signals

  • optimizing performance of advanced language models
  • improving key model training metrics
  • high accelerator utilization
  • low-level CUDA, triton kernels