AI Infrastructure Engineer (SRE) Amsterdam

Together AI · Data AI · Europe · Engineering

The AI Infrastructure Engineer (SRE) is responsible for keeping user-facing services and production systems running smoothly, specializing in systems, availability, reliability, and scalability. The role involves building and running infrastructure with Ansible, Terraform, and Kubernetes, implementing monitoring and observability, and debugging production issues.

What you'd actually do

  1. Join an on-call rotation (PagerDuty) to respond to incidents that impact availability
  2. Build and run our infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users
  3. Build monitoring systems to ensure the highest quality service for our customers
  4. Design and implement operational processes (such as deployments and upgrades)
  5. Debug production issues across all services and levels of the stack

Skills

Required

  • Ansible
  • Terraform
  • Kubernetes
  • programming/scripting languages
  • monitoring
  • observability
  • cloud services

Nice to have

  • SRE
  • systems (operating systems, storage subsystems, networking)
  • algorithms
  • distributed systems

What the JD emphasized

  • 7+ years of professional SRE or related experience