Principal Engineer, Compute Platform

Pinterest · Consumer · San Francisco, CA · Infrastructure and SRE

Principal Engineer to lead and scale the consolidation and modernization of Pinterest's compute platform (PinCompute), focusing on large-scale stateful and GPU-heavy AI workloads. The role involves designing and building around Kubernetes, handling stateful systems, optimizing workload utilization, and evolving the platform towards multi-cloud capabilities. Emphasis on delivering GPU resources for AI workloads and leveraging AI tools for migration, operations, and development velocity.

What you'd actually do

  1. Solving the challenges of replacing isolated pools of dedicated compute resources with a very-large-scale shared compute platform, shifting from machine-based designs to container-based designs.
  2. Working with leads across various platforms, especially stateful and data platforms, to build the right features and migration paths that work for them.
  3. Owning and driving up utilization on the shared compute platform by designing and implementing workload stacking, optimization and bin packing, safe oversubscription, and similar techniques.
  4. Working with multiple customers with unique requirements to make sure the platform addresses their needs and is not only a viable but a desirable solution for running their workloads.
  5. Leading a group of engineers on design topics, execution, trade-offs, migration paths, observability, performance, and operability for the platform.

Skills

Required

  • 12+ years of relevant industry experience with large-scale production distributed systems
  • 5+ years of experience with Kubernetes in production
  • Experience working across SWE and SRE or Production Engineering teams to deliver robust production systems
  • Ability to work with cross-functional partners across multiple organizations

Nice to have

  • Experience with running distributed data systems and migrating them to Kubernetes is highly preferred
  • Passion for automation, reducing toil, and building proper tooling for getting the job done

What the JD emphasized

  • GPU-heavy AI workloads
  • large scale shared compute platform
  • Kubernetes in production
  • distributed systems
  • stateful systems
  • data-intensive workloads
  • GPU resources