Member of Technical Staff, Training Infra Engineer

Cohere · AI Frontier · Paris, France · Modeling

Cohere is seeking a Member of Technical Staff, Training Infra Engineer to design and write high-performance, scalable software for model training. The role involves improving the training setup, building tools that speed up training cycles, and researching and implementing ideas on supercompute and data infrastructure. The ideal candidate has strong software engineering skills, proficiency in Python and ML frameworks such as JAX and PyTorch, and experience with distributed training infrastructure and large-scale distributed training strategies.

What you'd actually do

  1. Design and write high-performance, scalable software for training.
  2. Improve our training setup from an infrastructure and codebase performance standpoint.
  3. Craft and implement tools to speed up our training cycles and improve the overall efficacy of our training infrastructure.
  4. Research, implement, and experiment with ideas on our supercompute and data infrastructure.

Skills

Required

  • Extremely strong software engineering skills.
  • Proficiency in Python and related ML frameworks such as JAX, PyTorch and XLA/MLIR.
  • Experience with distributed training infrastructure (Kubernetes, Slurm) and associated frameworks (Ray).
  • Experience using large-scale distributed training strategies.
  • Hands-on experience training large models at scale, including contributions to the tooling and/or setup of the training infrastructure.

Nice to have

  • Papers published at top-tier venues (such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP).

What the JD emphasized

  • Hands-on experience training large models at scale, including contributions to the tooling and/or setup of the training infrastructure.

Other signals

  • training infra
  • large-scale distributed training
  • supercompute