Technical Program Manager, Quality and Reliability

Harvey · AI Frontier · San Francisco, CA · Engineering

This role is for a Technical Program Manager focused on Quality and Reliability for an enterprise AI product. The TPM will own release management, introduce change safety standards, lead reliability initiatives, define and report on reliability metrics, identify systemic process gaps, drive issue resolution, and manage the incident lifecycle. They will also oversee vendor reliability and SLA compliance. The role requires experience in technical program or release management, an understanding of engineering workflows, and experience partnering with engineering and product leadership. Familiarity with incident management and monitoring tools is a bonus.

What you'd actually do

  1. Own release management end-to-end, ensuring on-time, high-quality product releases through coordination across all teams in a fast-paced environment.
  2. Introduce and enforce change safety standards, such as risk assessments, rollback procedures, feature flags, and bug bashes, to reduce regressions and customer impact.
  3. Lead horizontal reliability initiatives focused on improving test coverage, observability, and incident response readiness.
  4. Define, measure, and report on reliability metrics (e.g., change failure rate, MTTR, SLIs), and drive accountability for sustained improvement.
  5. Identify systemic gaps in release processes, testing, monitoring, and incident response; convert findings into structured improvement plans with clear owners and timelines.
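
For illustration, two of the metrics named in item 4 can be computed from simple deployment and incident records. This is a minimal sketch, not Harvey's tooling: the `Deployment` and `Incident` record types and both function names are hypothetical, and it assumes change failure rate means the share of deployments that caused an incident and MTTR means the mean time from incident start to resolution.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deployment:
    # Hypothetical record: one production release.
    deployed_at: datetime
    caused_incident: bool


@dataclass
class Incident:
    # Hypothetical record: one customer-impacting incident.
    started_at: datetime
    resolved_at: datetime


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Fraction of deployments that led to an incident (0.0 if there were none)."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d.caused_incident)
    return failed / len(deployments)


def mean_time_to_restore(incidents: list[Incident]) -> timedelta:
    """Average duration from incident start to resolution (zero if there were none)."""
    if not incidents:
        return timedelta(0)
    total = sum((i.resolved_at - i.started_at for i in incidents), timedelta(0))
    return total / len(incidents)
```

In practice the inputs would likely come from a deployment log and an incident tracker, and the definitions (for example, what counts as a failed change) would need to be agreed with engineering before reporting on them.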

Skills

Required

  • 5+ years of experience in technical program management or release management
  • Prior experience working as a software QA or test engineer
  • Strong understanding of engineering workflows, including CI/CD, release cycles, and infrastructure planning
  • Experience partnering with engineering and product leadership to achieve cross-team quality and reliability objectives
  • Excellent communication skills
  • A track record of building systems and processes that scale with growth
  • Comfort in ambiguity and eagerness to build structure where there is none

Nice to have

  • Familiarity with incident management tooling (PagerDuty, Incident.io)
  • Familiarity with monitoring stacks (Datadog, Prometheus, Grafana)
  • Familiarity with test automation frameworks (Playwright, Cypress, Selenium)

What the JD emphasized

  • ultimate owner of Harvey’s product quality
  • building repeatable, scalable reliability guardrails