What you'd actually do

Directly manage and develop a team of support engineers and technical account specialists across API Support and GPU Support functions.

Assess and overhaul support workflows, SLA frameworks, and escalation playbooks.

Jump into complex, active GPU infrastructure issues alongside your team. Investigate NCCL and InfiniBand failures, SSH connection stalls, Kubelet TLS misconfigurations, GPU/RDMA provisioning timeouts, NFS RDMA mount failures, VAST storage failures, network fabric degradation, etc.

Own the support surface for Together AI’s API platform: serverless inference, dedicated inference endpoints (self-serve and managed), billing, rate limits, model upload (BYOM), and API authentication.

Be the escalation point for your team’s highest-severity customer issues — triage fast, communicate clearly to customers and internal stakeholders, and drive to resolution.

Skills

Required

10+ years of support engineering or technical support leadership experience
At least 3 years managing a team
Demonstrated experience leading infrastructure support or cloud operations
Working knowledge of AI infrastructure
Ability to guide engineers through root cause analysis
Experience running SLA-driven support operations
Strong communication skills, especially under pressure
Startup mindset

Nice to have

Familiarity with Pylon or equivalent support ticketing platforms (Zendesk, etc.) and PagerDuty-style alerting systems.

About the Role

We’re hiring a Support Leader to own and scale Together AI’s customer support function across two distinct, technically demanding domains: API Support (billing, serverless inference, and dedicated inference) and GPU Support (large-scale GPU infrastructure for model training workloads). You’ll work closely with Together AI’s VP of Customer Experience and partner tightly with SRE, Inference Platform, and Engineering to represent customers internally and drive resolution at speed. This is a player-coach role: you’ll be hands-on in escalations.

Our support operation runs 24/7. Our GPU infrastructure customers hold us to high-stakes SLAs on training workloads. Our API customer base spans thousands of PLG and enterprise accounts relying on our serverless and dedicated inference endpoints. Both domains need a leader who can keep pace technically and build the operational muscle to scale.

Responsibilities

Team Leadership and Mentorship

Directly manage and develop a team of support engineers and technical account specialists across API Support and GPU Support functions.
Establish clear performance expectations, career growth paths, and a coaching culture that identifies skill gaps and builds training programs to close them.
Run structured 1:1s, team reviews, and escalation retrospectives.

Operationalization and Scaling

Assess and overhaul support workflows, SLA frameworks, and escalation playbooks.
Build triage, prioritization, and handoff protocols that allow the team to scale with customer growth without proportional headcount growth.
Define and own support KPIs: SLA attainment, time-to-resolution, escalation rate, and CSAT.

GPU Infrastructure Support (Hands-On)

Jump into complex, active GPU infrastructure issues alongside your team. Investigate NCCL and InfiniBand failures, SSH connection stalls, Kubelet TLS misconfigurations, GPU/RDMA provisioning timeouts, NFS RDMA mount failures, VAST storage failures, network fabric degradation, etc.
Manage high-stakes SLA obligations with GPU cloud customers running multi-thousand-GPU training workloads.
Coordinate closely with SRE and infrastructure engineering on hardware-level issues and cluster bring-up.

API and Inference Support (Hands-On)

Own the support surface for Together AI’s API platform: serverless inference, dedicated inference endpoints (self-serve and managed), billing, rate limits, model upload (BYOM), and API authentication.
Represent the team on complex cases: dedicated endpoint startup failures, safetensors validation errors, NFS/storage performance issues on inference clusters, billing disputes and negative-balance enforcement, and rate limit escalations.
Work with the Inference Platform, Commerce, and Product teams to surface patterns and drive fixes upstream.

Escalation and Cross-Functional Partnership

Be the escalation point for your team’s highest-severity customer issues — triage fast, communicate clearly to customers and internal stakeholders, and drive to resolution.
Partner with SRE, Engineering, and Sales on shared priorities. Represent the support team’s perspective in cross-functional planning.
Own the relationship with support tooling vendors and drive improvements to alerting, SLA tracking, and ticket routing.

Customer Feedback Loop

Systematically analyze ticket patterns and surface product and infrastructure gaps to Engineering and Product. Turn support signal into actionable roadmap input.
Build documentation and self-service resources that reduce inbound volume over time.

Requirements

10+ years of support engineering or technical support leadership experience, with at least 3 years managing a team.
Demonstrated experience leading infrastructure support or cloud operations. You understand how large-scale workloads behave on distributed systems.
Working knowledge of AI infrastructure. You know how APIs work, can reason about latency and throughput issues, and understand the operational surface of a managed inference platform.
Technical depth to be a credible player-coach. You can guide engineers through root cause analysis and bring credibility to customer-facing escalations.
Experience running SLA-driven support operations with real accountability. Familiarity with Pylon or equivalent support ticketing platforms (Zendesk, etc.) and PagerDuty-style alerting systems.
Strong communication skills, especially under pressure. You can write a clear, concise customer-facing update in the middle of a live incident and distill a complex infrastructure issue into a crisp internal escalation.
Startup mindset. You’re comfortable building process where none exists, and you thrive in environments where priorities shift fast.

About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next generation of AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work. The US base salary range for this full-time position is $290,000 - $310,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our Privacy Policy at https://www.together.ai/privacy


Director, Support Engineering
