Staff Site Reliability Engineer

Snyk · Enterprise · Lisbon, Portugal

This Staff Site Reliability Engineer role supports Snyk's API & Web by building scalable, reliable, and secure cloud infrastructure. Responsibilities include ensuring high availability, leading architectural discussions, driving infrastructure improvements, and collaborating with development teams. The role requires experience with AWS, Kubernetes, security services, IaC tools, CI/CD, and scripting.

What you'd actually do

  1. Ensure high availability, scalability, and disaster recovery across all systems.
  2. Lead architectural discussions and make strategic decisions related to scalability, security, and availability.
  3. Drive continuous improvement of infrastructure, deployment, and monitoring processes.
  4. Collaborate with development and operations teams to improve deployment processes and infrastructure resiliency.
  5. Act as a subject-matter expert for the SRE team and cross-functional engineering groups.

Skills

Required

  • Experience with AWS
  • Deep understanding of Kubernetes architecture and day-to-day cluster management
  • Experience with security services or internet infrastructure providers, e.g. Cloudflare
  • Proficiency in alerting and monitoring tools
  • Proficiency with Infrastructure as Code tools (Terraform, Kustomize and Helm)
  • Experience with CI/CD pipelines and GitOps practices
  • Strong scripting and automation skills in Bash and/or Python
  • Solid knowledge of networking principles
  • A proactive mindset with the ability to work in a fast-paced environment

Nice to have

  • Familiarity with incident management practices (on-call, runbooks, postmortems, disaster recovery)
  • Understanding of Zero Trust security models and security best practices in cloud environments
  • Exposure to service mesh (Istio, Linkerd) and container networking
  • Experience with cost optimisation and cloud spend monitoring
  • Knowledge of managing permission models in distributed systems