Sr. Production Engineer, Solutions Engineering

Pinterest · Consumer · Chicago, IL · Infrastructure and SRE

Senior Production Engineer role focused on designing and building AI agents, platforms, and tools that improve the reliability and operability of Pinterest's large-scale distributed systems. The work includes building AI agents for service health analysis, using AI to modernize infrastructure, and turning consulting patterns into scalable, AI-assisted platforms. It also emphasizes building the knowledge infrastructure behind operational agents and developing software for system reliability.

What you'd actually do

  1. Design and build AI agents that augment production reliability work - Develop agents that assist engineers with service health analysis, reliability recommendations, migration playbook generation, and risk identification, enabling faster decision-making while keeping humans in the loop for critical judgment calls
  2. Drive large-scale infrastructure modernization with AI-accelerated execution - Lead Kubernetes adoption and platform transitions using AI to generate automation, accelerate delivery, and create patterns that enable self-service adoption for standard use cases while tackling novel architecture challenges
  3. Transform consulting patterns into scalable platforms - Execute scoped reliability engagements with engineering teams, then encode successful approaches into AI-assisted tools, automation, and self-serve documentation that enable teams to handle similar problems independently while escalating complex challenges to experts
  4. Build the knowledge infrastructure that powers Pinterest's operational agent ecosystem - Create migration playbooks, operational runbooks, incident patterns, and best practices that democratize reliability expertise and raise the baseline capabilities of all Pinterest engineers
  5. Develop software solutions to enable reliability and operability of large-scale distributed systems - Build a deep understanding of how Pinterest's systems behave, scale, interact, and fail, and use that insight to identify risks and opportunities for remediation through automation

Skills

Required

  • Python or Go
  • Linux/Unix internals
  • open source infrastructure (MySQL, Kafka, Envoy, Hadoop, etc.)
  • Infrastructure as Code (Terraform, Puppet, Chef, Ansible, Docker, Kubernetes)
  • cloud infrastructure (AWS, GCP, or Azure)
  • distributed, service-oriented architecture

Nice to have

  • Experience developing AI agents for infrastructure automation, operational decision-making, or reliability workflows
  • AI/ML infrastructure experience (LLM-based systems, model serving, agentic workflows)
  • Technical consulting or embedded SRE experience with cross-functional engineering teams

What the JD emphasized

  • AI agents
  • production reliability
  • large-scale distributed systems
  • AI-accelerated execution
  • AI-assisted tools
  • operational agent ecosystem
  • reliability expertise

Other signals

  • AI agents for production reliability
  • AI-accelerated infrastructure modernization
  • AI-assisted tools and automation