Member of Technical Staff, Model Efficiency

Cohere · AI Frontier · New York, NY · Modeling

Cohere is seeking an engineer to improve LLM inference efficiency by optimizing model execution, reducing latency and increasing throughput. This role involves deep dives into model execution, identifying bottlenecks, and developing optimizations across the inference stack, including GPU/CUDA and kernel-level improvements.

What you'd actually do

  1. build reliable ML systems and push the boundaries of LLM inference efficiency
  2. develop techniques that improve how models execute in production, driving lower latency, higher throughput, and consistent quality across diverse workloads
  3. work across the inference stack to improve core performance metrics by diving deep into model execution, identifying bottlenecks, and developing innovative optimizations
  4. collaborate closely with modeling and systems teams to experiment, measure, and ship improvements that meaningfully accelerate inference
  5. build expertise in advanced performance techniques, including GPU/CUDA optimizations, kernel-level improvements, and model execution strategies for MoE and large-scale architectures

Skills

Required

  • C++
  • Python
  • large language models
  • LLM inference ecosystem
  • performance bottlenecks
  • model execution stack

Nice to have

  • Rust
  • Go
  • GPU programming
  • CUDA
  • low-level systems optimization
  • language modeling with transformers
  • MoE
  • speculative decoding
  • KV-cache optimizations
  • scaling performance-critical distributed systems

What the JD emphasized

  • 5+ years of experience writing high-performance, production-quality code
  • Strong programming skills in C++ or Python
  • Experience working with large language models and familiarity with the LLM inference ecosystem (e.g., vLLM, SGLang)
  • Ability to diagnose and resolve performance bottlenecks across the model execution stack
  • A strong bias for action: you ship fast, measure impact, and iterate

Other signals

  • LLM inference efficiency
  • lower latency
  • higher throughput
  • model execution
  • GPU/CUDA optimizations
  • kernel-level improvements