Sr. Engineering Manager, Platform

Vercel · Enterprise · United States · Remote · Engineering

Vercel is seeking a Sr. Engineering Manager for its Platform organization to lead two teams: Reliability & Resilience and Developer Productivity. The role owns strategy, execution, and people leadership for core platform and operational domains, including local developer environments, CI/CD, Kubernetes, API gateway, storage and caching, observability, secrets management, cost and capacity management, and SaaS vendor relationships. The ideal candidate has experience managing engineers, technical depth in Kubernetes production operations and CI/CD, and a track record of owning key platform dependencies and reliability programs.

What you'd actually do

  1. Lead and grow highly independent engineers across the Reliability & Resilience and Developer Productivity teams; set org structure, hiring plan, and delivery goals.
  2. Own the platform roadmap and execution for improvements in development velocity, iteration speed, platform availability, and deployment safety.
  3. Build an industry-leading reliability practice: manage SLOs and error budgets, run incident response and postmortems, and prioritize resilience work across critical services.
  4. Operate and evolve core platform services including API gateway, storage and caching infrastructure, secrets management, and observability.
  5. Manage capacity and cost: forecasting, right-sizing, tuning, and spend governance tied to workload and growth plans.

Skills

Required

  • Kubernetes production operations
  • CI/CD systems
  • API gateways
  • caches
  • petabyte-scale KV stores and databases
  • SLOs
  • error budgets
  • incident response
  • postmortems
  • large-scale distributed systems

Nice to have

  • managing managers
  • Vercel platform

What the JD emphasized

  • 3+ years managing engineers
  • Hands-on technical depth in Kubernetes production operations and CI/CD systems
  • Track record owning key platform dependencies such as API gateways, caches, and petabyte-scale KV stores and databases
  • Demonstrated ownership of reliability programs: SLOs, error budgets, incident response, postmortems, and measurable reductions in downtime
  • 8+ years building and operating large-scale distributed systems