Engineering Manager, Site Reliability Engineering

Together AI · Data AI · San Francisco, CA · Engineering

Together AI is hiring an Engineering Manager for Site Reliability Engineering (SRE) to lead a team of ~10 engineers responsible for the company's production infrastructure, including bare-metal GPU compute, public-cloud Kubernetes for inference, and Kubernetes with virtualization for virtual clusters. The role mixes management (50-60%) with hands-on technical work (40-50%) and focuses on shifting the team from reactive, manual operations to systemic, automation-first work, improving incident response, and developing engineers.

What you'd actually do

  1. Lead and develop a team of ~10 SRE engineers across multiple functional areas, partnering with technical leads on direction.
  2. Drive the team's shift from manual operations to systemic, automated, scalable infrastructure, including making toil visible, capping it, and prioritizing engineering work that reduces it.
  3. Stay hands-on: code, review architecture, lead incidents, and participate meaningfully in technical decisions.
  4. Build coaching and feedback rhythms that develop engineers over time, especially around incident leadership, on-call habits, and systemic problem-solving.
  5. Strengthen on-call practices and incident response, including blameless postmortems that produce real engineering follow-through.

Skills

Required

  • Prior experience managing SRE, infrastructure, or platform engineering teams
  • Deep technical credibility in at least one of: bare-metal infrastructure with Ansible-based config management, Kubernetes on public cloud, or Kubernetes with virtualization
  • Strong Kubernetes and Terraform fundamentals, with hands-on production experience
  • Genuine player-coach orientation
  • Experience leading teams through serious production incidents and on-call rotations
  • Track record of coaching engineers and shifting team culture through engineering systems
  • Comfort operating in a matrix structure
  • Adaptability

Nice to have

  • Time spent leading a team through a reliability or culture turnaround

What the JD emphasized

  • shift from reactive, manual operations to systemic, automation-first work
  • leading through a reliability or culture turnaround
  • coaching engineers and shifting team culture through engineering systems