Engineering Manager, Model Serving

Together AI · San Francisco, CA · Engineering

Engineering Manager for Together AI's Model Serving platform, responsible for delivering world-class inference and fine-tuning through public APIs and dedicated customer deployments. Responsibilities include owning SLAs; improving testing, deployment, and monitoring practices; building self-serve tooling; defining configuration best practices for inference engines; leading incident response; and mentoring team members. Requires 5+ years operating production ML inference or training systems at scale, 2+ years in senior IC or tech lead roles, and deep expertise in Kubernetes, multi-cluster orchestration, and ML serving frameworks.

What you'd actually do

  1. Own availability and performance SLAs for production inference and fine-tuning services across serverless and dedicated deployments
  2. Own and improve testing, deployment, configuration management, and monitoring practices for multi-cluster ML infrastructure, partnering closely with Infra SREs
  3. Build tooling and automation that reduce operational toil and enable self-serve offerings
  4. Define and enforce configuration best practices for inference engines (SGLang, TRT-LLM, vLLM, etc.) to prevent runtime issues; see the configuration sketch after this list
  5. Lead incident response, conduct postmortems, and drive reliability improvements
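
As a concrete illustration of what codified engine configuration can look like (item 4 above), here is a minimal sketch using vLLM's Python API. The model name and every parameter value are illustrative assumptions, not settings taken from the posting; SGLang and TRT-LLM expose analogous knobs.

```python
# Sketch of a codified inference-engine configuration (vLLM shown).
# All values below are illustrative assumptions, not recommendations
# from the job posting.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model choice
    tensor_parallel_size=2,       # shard across 2 GPUs; must match the node's GPU count
    gpu_memory_utilization=0.90,  # headroom below 1.0 guards against OOM at peak load
    max_model_len=8192,           # cap context length to bound KV-cache memory
    max_num_seqs=256,             # limit concurrent sequences to keep p99 latency predictable
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Hello, world"], params)
print(outputs[0].outputs[0].text)
```

Keeping settings like these in version-controlled configuration, rather than ad-hoc launch flags, is one common way to prevent the runtime issues this bullet refers to.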

Skills

Required

  • 5+ years operating production ML inference or training systems at scale
  • 2+ years in senior IC or tech lead roles, with demonstrated mentorship and technical leadership experience
  • Deep expertise with Kubernetes, multi-cluster orchestration, and ML serving frameworks
  • Experience with multi-tenant SaaS platforms
  • Proven track record of SLA ownership with specific metrics (e.g., 99.9% uptime, p99 latency targets); see the monitoring sketch after this list
  • Customer escalation and incident communication experience
  • Experience with LLM inference serving systems (SGLang, vLLM, TRT-LLM, or similar)
  • Ability to influence cross-functional teams and make deployment/architecture decisions
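
To make the SLA bullet above concrete, here is a minimal sketch of a script that checks a p99 latency target and a 99.9% availability target against Prometheus. The endpoint URL, metric names (request_latency_seconds_bucket, http_requests_total), and threshold values are hypothetical assumptions for illustration; only the PromQL and HTTP API usage are standard.

```python
# Sketch of an SLO check against Prometheus. The Prometheus URL, metric
# names, and targets are hypothetical; the PromQL syntax itself is standard.
import requests

PROM_URL = "http://prometheus.internal:9090/api/v1/query"  # hypothetical endpoint

QUERIES = {
    # p99 request latency over the last 5 minutes, from a latency histogram
    "p99_latency_s": (
        "histogram_quantile(0.99, "
        "sum(rate(request_latency_seconds_bucket[5m])) by (le))"
    ),
    # fraction of non-5xx responses over the last 30 days (availability)
    "availability": (
        'sum(rate(http_requests_total{code!~"5.."}[30d]))'
        " / sum(rate(http_requests_total[30d]))"
    ),
}

TARGETS = {"p99_latency_s": 2.0, "availability": 0.999}  # illustrative SLOs

def check_slos() -> None:
    for name, query in QUERIES.items():
        resp = requests.get(PROM_URL, params={"query": query}, timeout=10)
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        value = float(result[0]["value"][1]) if result else float("nan")
        # Latency must stay at or below target; availability at or above it.
        ok = value <= TARGETS[name] if name == "p99_latency_s" else value >= TARGETS[name]
        print(f"{name}: {value:.4f} (target {TARGETS[name]}) -> {'OK' if ok else 'BREACH'}")

if __name__ == "__main__":
    check_slos()
```

A check like this would typically run as a scheduled job or feed an alerting rule, so SLA breaches page the on-call rather than surfacing first in customer escalations.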

Nice to have

  • Experience building internal developer platforms or self-serve tooling
  • Background in cost optimization for GPU infrastructure
  • Contributions to open-source ML infrastructure projects
  • Experience building or scaling teams

What the JD emphasized

  • operating production ML inference or training systems at scale
  • SLA ownership with specific metrics (99.9% uptime, p99 latency targets)
  • LLM inference serving systems

Other signals

  • ML API offerings
  • inference and fine-tuning
  • production scale
  • multi-cluster deployment