Director, Support Engineering

Together AI · Data AI · San Francisco, CA · Customer Success

This role leads and scales the customer support function for Together AI, focusing on both API support (serverless/dedicated inference, billing) and GPU support (large-scale training infrastructure). It's a player-coach position requiring hands-on involvement in complex escalations, managing support engineers, defining KPIs, and improving support workflows and tooling. The role requires strong technical depth in AI infrastructure, distributed systems, and experience with SLA-driven operations.

What you'd actually do

  1. Directly manage and develop a team of support engineers and technical account specialists across API Support and GPU Support functions.
  2. Assess and overhaul support workflows, SLA frameworks, and escalation playbooks.
  3. Jump into complex, active GPU infrastructure issues alongside your team. Investigate NCCL and InfiniBand failures, SSH connection stalls, Kubelet TLS misconfigurations, GPU/RDMA provisioning timeouts, NFS RDMA mount failures, VAST storage failures, network fabric degradation, etc.
  4. Own the support surface for Together AI’s API platform: serverless inference, dedicated inference endpoints (self-serve and managed), billing, rate limits, model upload (BYOM), and API authentication.
  5. Be the escalation point for your team’s highest-severity customer issues — triage fast, communicate clearly to customers and internal stakeholders, and drive to resolution.

Skills

Required

  • 10+ years of support engineering or technical support leadership experience
  • At least 3 years managing a team
  • Demonstrated experience leading infrastructure support or cloud operations
  • Working knowledge of AI infrastructure
  • Ability to guide engineers through root cause analysis
  • Experience running SLA-driven support operations
  • Strong communication skills, especially under pressure
  • Startup mindset

Nice to have

  • Familiarity with Pylon or equivalent support ticketing platforms (Zendesk, etc.) and PagerDuty-style alerting systems

What the JD emphasized

  • high-stakes SLAs
  • technical depth to be a credible player-coach
  • SLA-driven support operations
  • startup mindset