Staff Software Engineer, Inference Infrastructure

Cohere · AI Frontier · San Francisco, CA · Inference

Cohere is seeking a Staff Software Engineer to join their Model Serving team. This role focuses on developing, deploying, and operating the AI platform that delivers Cohere's large language models via API endpoints. The engineer will optimize NLP models for low latency, high throughput, and high availability, working with distributed systems, Kubernetes, and GPU workloads. Experience with cloud platforms and high-performance languages is required.

What you'd actually do

  1. Develop, deploy, and operate the AI platform that delivers Cohere's large language models through easy-to-use API endpoints
  2. Work closely with many teams to deploy optimized NLP models to production in low-latency, high-throughput, and high-availability environments
  3. Interface with customers and create customized deployments to meet their specific needs

Skills

Required

  • 5+ years of engineering experience running production infrastructure at a large scale
  • Experience designing large, highly available distributed systems with Kubernetes, including GPU workloads on those clusters
  • Experience with Kubernetes dev and production coding and support
  • Experience with GCP, Azure, AWS, OCI, multi-cloud on-prem / hybrid serving
  • Experience in designing, deploying, supporting, and troubleshooting in complex Linux-based computing environments
  • Experience in compute/storage/network resource and cost management
  • Excellent collaboration and troubleshooting skills to build mission-critical systems, and ensure smooth operations and efficient teamwork
  • Familiarity with computational characteristics of accelerators (GPUs, TPUs, and/or custom accelerators), especially how they influence latency and throughput of inference
  • Strong understanding of or working experience with distributed systems
  • Experience in Golang, C++, or other languages designed for high-performance scalable servers

Nice to have

  • The grit and adaptability to solve complex technical challenges that evolve day to day

What the JD emphasized

  • running production infrastructure at a large scale
  • highly available distributed systems
  • Kubernetes
  • GPU workloads
  • Kubernetes dev and production coding and support
  • GCP, Azure, AWS, OCI, multi-cloud on-prem / hybrid serving
  • complex Linux-based computing environments
  • compute/storage/network resource and cost management
  • computational characteristics of accelerators (GPUs, TPUs, and/or custom accelerators), especially how they influence latency and throughput of inference
  • distributed systems
  • Golang, C++ or other languages designed for high-performance scalable servers

Other signals

  • Deploying and operating the AI platform
  • Large language models
  • API endpoints
  • Optimized NLP models
  • Low latency, high throughput, high availability environments