About Pinterest:

Millions of people around the world come to our platform to find creative ideas, dream about new possibilities and plan for memories that will last a lifetime. At Pinterest, we’re on a mission to bring everyone the inspiration to create a life they love, and that starts with the people behind the product.

Discover a career where you ignite innovation for millions, transform passion into growth opportunities, celebrate each other’s unique experiences and embrace the flexibility to do your best work. Creating a career you love? It’s Possible.

At Pinterest, AI isn't just a feature; it's a powerful partner that augments our creativity and amplifies our impact, and we’re looking for candidates who are excited to be a part of that. To get a complete picture of your experience and abilities, we’ll explore your foundational skills and how you collaborate with AI.

Through our interview process, what matters most is that you can always explain your approach, showing us not just what you know, but how you think. You can read more about our AI interview philosophy and how we use AI in our recruiting process here.

The Production Engineering organization at Pinterest is accountable for ensuring overall Pinterest availability as well as enhancing Engineering teams' capability to design, build and operate robust systems at scale. Pinterest's applications and infrastructure handle billions of monthly page views and petabytes of data as Pinterest continues to grow and scale.

As a Senior Production Engineer on Solutions Engineering, you will design and build AI agents, platforms, tools, frameworks and methodologies to assure the reliability of our large-scale distributed systems serving hundreds of millions of monthly active users, handling hundreds of thousands of requests per second, and managing tens of petabytes of data. You'll lead infrastructure modernization initiatives, build intelligent automation that eliminates operational toil and amplifies engineering productivity, and transform successful consulting patterns into reusable platforms that democratize reliability expertise across Pinterest's 2500+ engineers.

What you’ll do

Design and build AI agents that augment production reliability work - Develop agents that assist engineers with service health analysis, reliability recommendations, migration playbook generation, and risk identification, enabling faster decision-making while keeping humans in the loop for critical judgment calls
Drive large-scale infrastructure modernization with AI-accelerated execution - Lead Kubernetes adoption and platform transitions using AI to generate automation, accelerate delivery, and create patterns that enable self-service adoption for standard use cases while tackling novel architecture challenges
Transform consulting patterns into scalable platforms - Execute scoped reliability engagements with engineering teams, then encode successful approaches into AI-assisted tools, automation, and self-serve documentation that enable teams to handle similar problems independently while escalating complex challenges to experts
Build the knowledge infrastructure that powers Pinterest's operational agent ecosystem - Create migration playbooks, operational runbooks, incident patterns, and best practices that democratize reliability expertise and raise the baseline capabilities of all Pinterest engineers
Develop software solutions to enable reliability and operability of large-scale distributed systems - Build a deep understanding of how Pinterest's systems behave, scale, interact and fail, and use that insight to identify risks and opportunities for remediation through automation
Build tools and automation to eliminate toil and reduce operational overhead - Create frameworks, processes and best practices that encode reliability expertise into software, making operational excellence accessible to all engineers while freeing experts to tackle harder problems
Build meaningful, insightful and actionable SLIs - Develop service level indicators that provide clear signals of system health and enable data-driven reliability decisions across Pinterest Engineering
Automate critical portions of Pinterest's engineering processes - Build automation that minimizes risk and maximizes the speed of innovation, enabling safe, rapid deployment and operational changes at scale
Manage capacity and performance to help scale our infrastructure - Partner with teams to plan and optimize capacity across public and private clouds around the world, ensuring efficient resource utilization as Pinterest grows

What we’re looking for:

5+ years of industry experience building and operating large-scale, high-performance distributed systems
Bachelor's degree in Computer Science or related field, or equivalent experience
Strong programming skills in Python or Go - ability to build production-grade platforms, agents, and automation
Deep knowledge of Linux/Unix internals and experience with open source infrastructure (MySQL, Kafka, Envoy, Hadoop, etc.)
Infrastructure as Code experience (Terraform, Puppet, Chef, Ansible, Docker, Kubernetes)
Experience deploying web applications to cloud infrastructure (AWS, GCP, or Azure) and working with distributed, service-oriented architecture
Preferred:
- Experience developing AI agents for infrastructure automation, operational decision-making, or reliability workflows
- AI/ML infrastructure experience (LLM-based systems, model serving, agentic workflows)
- Technical consulting or embedded SRE experience with cross-functional engineering teams

In-Office Requirement Statement:

We let the type of work you do guide the collaboration style. That means we're not always working in an office, but we continue to gather for key moments of collaboration and connection.
This role will need to be in the office for in-person collaboration 1-2 times every 6 months and therefore can be situated anywhere in the country.

Relocation Statement:

This position is not eligible for relocation assistance. Visit our PinFlex page to learn more about our working model.

At Pinterest we believe the workplace should be equitable, inclusive, and inspiring for every employee. In an effort to provide greater transparency, we are sharing the base salary range for this position. The position is also eligible for equity. Final salary is based on a number of factors including location, travel, relevant prior experience, or particular skills and expertise.

Information regarding the culture at Pinterest and benefits available for this position can be found here.

US-based applicants only

$139,764–$287,749 USD

Our Commitment to Inclusion:

Pinterest is an equal opportunity employer and makes employment decisions on the basis of merit. We want to have the best qualified people in every job. All qualified applicants will receive consideration for employment without regard to race, color, ancestry, national origin, religion or religious creed, sex (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender, gender identity, gender expression, age, marital status, status as a protected veteran, physical or mental disability, medical condition, genetic information or characteristics (or those of a family member) or any other consideration made unlawful by applicable federal, state or local laws. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you require a medical or religious accommodation during the job application process, please complete this form for support.

By submitting this application, I certify that all information submitted in my application and throughout the hiring process is true, accurate, and complete to the best of my knowledge. I understand that any false statement, omission, or misrepresentation may disqualify me from employment consideration or result in termination if discovered after hire.