Staff Technical Program Manager, Site Reliability Engineering

MongoDB · Enterprise · San Francisco, CA · PTO · Site Reliability Engineering

This role is for a Staff Technical Program Manager (TPM) focused on Site Reliability Engineering (SRE) at MongoDB. The TPM will partner with SRE leaders and engineers to scale the platform underpinning MongoDB's cloud products. Key responsibilities include driving program planning and execution, strengthening production reliability practices, leading cross-functional coordination with teams like Security and Compliance, and building scalable systems and processes. The role requires strong knowledge of production change management, software development lifecycle, and reliability metrics (SLOs/SLIs), with the ability to interpret metrics and logs. While the company mentions AI and its database platform for the AI era, this specific role is focused on the underlying platform reliability and SRE practices, not direct AI/ML model development or research.

What you'd actually do

  1. Drive Program Planning & Execution – Define program scope, milestones, and success criteria with SRE engineers and leaders. Manage dependencies across platform teams, keep work clearly tracked in Jira, and deliver on time.
  2. Strengthen Production Reliability – Lead change management and launch readiness programs. Partner with SREs and product teams to define and operationalize SLOs/SLIs, and use incident data, metrics, and capacity signals to drive prioritization and continuous improvement.
  3. Lead Cross-Functional Coordination – Align SRE with Security, Compliance, Cloud platform, and other engineering teams. Coordinate cross-team incident response, ensure clear follow-through, and build trust as the go-to driver of complex, multi-team efforts.
  4. Build Scalable Systems & Processes – Design lightweight frameworks and communication patterns that help SRE deliver reliably at scale. Work yourself out of the "hero" role by leaving teams better-equipped to execute independently.

Skills

Required

  • 8+ years in technical program management, engineering management, or a comparable technical role partnering with software engineering teams
  • Proven track record leading large-scale, cross-team platform initiatives through ambiguity and change
  • Strong knowledge of production change management, software development lifecycle, and reliability metrics (SLOs, SLIs)
  • Skilled at shaping roadmaps and managing dependencies
  • Able to query and interpret metrics, logs, or other data sources to inform decisions and communicate risk
  • Excellent communicator—clear, concise, and calm—across engineers, cross-functional partners, and executives
  • Low-ego, highly collaborative, and motivated by ownership of hard problems end to end

Nice to have

  • Hands-on or close-partner experience with Kubernetes, cloud networking, or observability stacks (metrics, logs, tracing, alerting)
  • Prior experience working with or alongside SRE teams
  • Background in large-scale cloud infrastructure or platform engineering
  • Familiarity with MongoDB Atlas or other modern cloud database platforms

What the JD emphasized

  • scale the platform
  • production reliability
  • cross-functional efforts
  • SLOs/SLIs
  • incident data
  • capacity signals
  • continuous improvement
  • incident response
  • scalable systems
  • scalable processes
  • large-scale, cross-team platform initiatives through ambiguity and change
  • production change management
  • software development lifecycle
  • reliability metrics (SLOs, SLIs)
  • shaping roadmaps
  • managing dependencies
  • query and interpret metrics, logs, or other data sources
  • communicate risk
  • hard problems end to end