Engineering Manager, Model Routing & Inference

Cursor · Coding AI · San Francisco, CA · Engineering

Engineering Manager to lead the Model Routing & Inference team, which owns the inference platform that powers all AI interactions in the product. Responsibilities include setting technical direction for cluster management, inference optimization, and traffic egress; managing GPU utilization and capacity planning; and designing routing mechanisms. The role also involves hiring, leading, and coaching a team of engineers.

What you'd actually do

  1. Build and evolve our inference gateway, a single abstraction over every provider's API semantics, so that onboarding a model becomes a config change.
  2. Build the systems that dynamically select the best model for each request based on cost, latency, and quality.
  3. Manage GPU cluster utilization and capacity planning across providers, optimizing for cost and performance.
  4. Design routing backpressure and admission control so traffic spikes don't cascade into providers.
  5. Hire and grow the team: source, interview, and close top inference and systems talent, and develop your engineers through coaching, mentorship, and high-leverage project assignments.

Skills

Required

  • leading engineering teams
  • building high-throughput, low-latency distributed systems
  • inference serving
  • traffic routing
  • real-time data pipelines
  • reasoning about cost/performance tradeoffs at scale
  • GPU utilization
  • provider economics
  • capacity planning
  • strong software engineering fundamentals
  • shipping production systems
  • operating at a scale of millions of requests

Nice to have

  • model serving frameworks (vLLM, TensorRT-LLM, TGI)
  • load balancing
  • building resilient multi-provider architectures

What the JD emphasized

  • high-throughput, low-latency distributed systems
  • inference serving
  • traffic routing
  • cost/performance tradeoffs at scale
  • millions of requests
  • model serving frameworks
  • resilient multi-provider architectures
  • reliability, cost, latency, and user experience

Other signals

  • inference platform
  • model routing
  • GPU cluster utilization
  • millions of daily requests
  • low-latency distributed systems