Senior Site Reliability Engineer

Stability AI · AI Frontier · Remote · Technical

Stability AI is looking for a Senior Site Reliability Engineer to join its Engineering Operations team, focusing on improving and shaping the company's cloud infrastructure. The role involves working with various teams to drive innovation and reliability while building out and improving a maturing cloud landscape.

What you'd actually do

  1. Develop and enforce SRE best practices and standards across the organization.
  2. Architect and manage scalable systems in AWS and other cloud environments, focusing on high availability and resilience.
  3. Implement and maintain infrastructure as code using Terraform.
  4. Set up and refine monitoring, logging, and alerting systems.
  5. Drive incident management and root cause analysis to improve system reliability.

Skills

Required

  • SRE best practices
  • scalable systems
  • AWS
  • Terraform
  • monitoring, logging, and alerting
  • incident management
  • CI/CD pipelines
  • Kubernetes
  • software development or automation scripting
  • Grafana, ELK stack, or similar tools
  • cloud security

Nice to have

  • mentoring junior team members