Staff Research Engineer, Model Efficiency

Cohere · AI Frontier · New York, NY · Modeling

Cohere is seeking a Staff Research Engineer, Model Efficiency, to push the limits of LLM inference efficiency. The role involves exploring and shipping breakthroughs in model architecture, MoE routing optimization, decoding algorithms, software/hardware co-design for GPU acceleration, and performance optimization without compromising model quality. The goal is to improve how fast and efficiently Cohere's foundation models run in production.

What you'd actually do

  1. Develop, prototype, and deploy techniques that materially improve how fast and efficiently our models run in production.
  2. Explore and ship breakthroughs across the model execution stack, including model architecture and MoE routing optimization, decoding and inference-time algorithm improvements, software/hardware co-design for GPU acceleration, and performance optimization without compromising model quality.

Skills

Required

  • PhD in Machine Learning or a related field
  • Understanding of LLM architecture
  • Experience optimizing LLM inference given resource constraints
  • Significant experience with techniques that enhance model efficiency
  • Strong software engineering skills
  • Publications at top-tier conferences and venues (ICLR, ACL, NeurIPS)

Nice to have

  • Appetite for working in a fast-paced, high-ambiguity start-up environment
  • Passion for mentoring others

What the JD emphasized

  • LLM inference efficiency
  • model architecture
  • optimize LLM inference
  • enhance model efficiency
  • performance optimization

Other signals

  • LLM inference efficiency
  • model architecture optimization
  • decoding and inference-time algorithm improvements
  • software/hardware co-design for GPU acceleration
  • performance optimization