Sr. Production Engineer

Pinterest · Consumer · Toronto, ON · Infrastructure and SRE

This role is for a Sr. Site Reliability Engineer on the Compute SRE team, responsible for ensuring compute workloads run smoothly on Pinterest's Kubernetes infrastructure. The role involves designing and building systems, platforms, and tools to ensure reliability at scale. A significant aspect of the role involves using AI tools for development, incident analysis, and operational improvements, as well as critically evaluating AI-assisted work.

What you'd actually do

  1. Tackle project challenges on EKS, such as implementing Karpenter. This work affects how every developer codes, tests, and improves their work
  2. Collaborate across various teams to drive projects forward using open-source tools
  3. Build a deep understanding of how Pinterest’s systems behave, scale, interact and fail, and use that insight to identify risks and opportunities for remediation
  4. Build tools and automation to eliminate toil and reduce operational overhead. Create frameworks, processes and best practices to be used across Pinterest Engineering
  5. Build meaningful, insightful and actionable SLIs
  6. Automate critical portions of Pinterest’s engineering processes, to minimize risk and maximize the speed of innovation
  7. Manage capacity and performance to help scale our infrastructure on both public and private clouds around the world
  8. Use AI to analyze incidents, operational signals, and system behaviors to help identify patterns, generate plans, and propose remediation approaches
  9. Leverage AI to speed development of runbooks, automation workflows, and reliability tooling by drafting, iterating on, and refining approaches

Skills

Required

  • Kubernetes (EKS)
  • Python or Golang
  • Project management
  • AI-assisted development tools (Cursor, GitHub Copilot, Claude)
  • Prompt engineering for LLMs
  • Evaluating AI-assisted work
  • Terraform
  • Buildkite
  • ArgoCD

Nice to have

  • EKS implementation (Karpenter)

What the JD emphasized

  • AI-assisted development tools
  • write effective prompts
  • use AI to improve speed and quality
  • critical evaluation and verification of AI-assisted work
  • avoid over-reliance on AI