Site Reliability Engineer - Multicloud Platform

Workday · Enterprise · Dublin, Ireland

Site Reliability Engineer focused on ensuring the reliability, availability, and scalability of a cloud-native platform using Kubernetes, Istio, and other cloud technologies. Responsibilities include reducing operational toil through automation, implementing observability, and managing multi-stage deployment automation with SLO gating.

What you'd actually do

  1. Be a key member of a team of versatile SREs responsible for both software engineering and operations, with an emphasis on reducing operational toil.
  2. Develop and launch effective SLIs, and ensure that SLOs are achieved, by building an extendable observability architecture, automating runbooks, and establishing new processes.
  3. Partner with platform service teams to craft and implement a range of SRE standards that their respective services must meet.
  4. Define benchmarks and automation to qualify services to move to production environments.
  5. Automate, operate, and improve innovative cloud-native service platforms.

Skills

Required

  • 3 years of experience handling and solving problems in distributed systems in a public cloud
  • 3+ years of SRE experience in a distributed systems environment.
  • Experience with AWS, GCP, or Azure
  • Strong experience with Kubernetes
  • Experience with Linux
  • Proficiency in a programming language such as Go, Python, or Ruby (Go preferred)
  • Experience with standard software development practices such as code management, CI/CD, and testing

Nice to have

  • Kubernetes experience is a big plus
  • Passion for automation, with a track record of referenceable examples.
  • Able to work independently, with the mindset that everything can be automated.
  • Skills to operate, maintain, support and sustain the platform.
  • Energised by working in a fast-paced environment.
  • Experience collaborating with multi-functional global and remote teams with a diverse set of backgrounds.
  • Excellent documentation skills, with experience developing detailed runbooks and processes

What the JD emphasized

  • reduce operational toil
  • SLO gated multi-stage deployment automation
  • improve platform reliability, observability
  • building an extendable Observability architecture
  • runbook automation
  • qualify services to move to production environments
  • Kubernetes experience is a big plus
  • automation is the key to operating large-scale systems