Site Reliability Engineer

Peloton · Consumer · Headquarters, NY · Software

Site Reliability Engineer role with an operations focus: build and maintain a monitorable, performant, reliable, and highly scalable deployment platform. The role hosts critical infrastructure running tens of thousands of pods across multiple clusters, provides a platform for machine learning workloads, and promotes best practices for building and operating highly reliable systems. Experience with Kubernetes, observability, monitoring, security, CI/CD, Infrastructure as Code, and a programming language such as Python or Golang is required.

What you'd actually do

  1. Provide fast, automatic scaling for live rides and large special events
  2. Host critical infrastructure, running tens of thousands of pods across multiple clusters, that ensures our members have the best possible experience
  3. Provide a platform for machine learning (and other awesome workloads)
  4. Promote best practices for building and operating highly reliable systems
  5. Serve as a domain expert in observability and monitoring

Skills

Required

  • Experience maintaining scalable and stable Kubernetes clusters
  • Knowledge of best practices for the observability and monitoring required to run Kubernetes at scale
  • Knowledge of best practices for securing a Kubernetes cluster and its deployments at scale
  • Experience with CI/CD systems such as Jenkins, ArgoCD, Harness, or Tekton
  • Experience deploying infrastructure using Infrastructure as Code tools such as Terraform or Pulumi
  • Ability to judge when to triage and when to dig into a root-cause analysis
  • Experience with a programming language such as Python, Golang, Java, or C

Nice to have

  • A passion for helping development teams make the transition to a container-native world
  • A passion for reliable, scalable, observable software, with a strong sense of ownership

What the JD emphasized

  • Experience maintaining scalable and stable Kubernetes clusters
  • Knowledge of best practices for the observability and monitoring required to run Kubernetes at scale
  • Knowledge of best practices for securing a Kubernetes cluster and its deployments at scale