Lead/Manager, Site Reliability Engineering Team (Amsterdam)

Together AI · Data AI · Europe · Engineering

Lead a team of Site Reliability Engineers (SREs) responsible for keeping user-facing services and production systems running smoothly. The role involves managing, developing, and coaching the SRE team, building and running infrastructure using Ansible, Terraform, and Kubernetes, implementing monitoring systems, designing operational processes, debugging production issues, and planning infrastructure growth. The company is an AI research company, but this role focuses on the underlying infrastructure and operations rather than on direct AI/ML model development or research.

What you'd actually do

  1. Join an on-call rotation (PagerDuty) to respond to incidents that impact availability
  2. Manage, develop, and coach the SRE team
  3. Build and run our infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users
  4. Build monitoring systems to ensure the highest-quality service for our customers
  5. Design and implement operational processes (such as deployments and upgrades)

Skills

Required

  • 7+ years of professional SRE or related experience
  • Ideally 2 years as a Lead SRE
  • Bachelor's degree in Computer Science or a related field, or equivalent work experience
  • Expert knowledge of Ansible (roles, playbooks), Terraform, and Kubernetes
  • Proficiency in programming/scripting languages
  • Direct experience in monitoring and observability practices
  • Advanced knowledge of cloud services
  • Ability to thrive in a collaborative environment involving different stakeholders and subject matter experts