Product Manager, Public Sector GenAI Test & Evaluation (T&E)

Scale AI · Data AI · Washington, DC · Public Sector Product

Product Manager for GenAI Test & Evaluation (T&E) on the Public Sector team at Scale AI. The role focuses on defining the vision and roadmap for evaluation capabilities and owning the T&E tech stack used to measure and improve agentic applications. It requires strong engineering depth, experience with evaluation systems, problem distillation, comfort with ambiguity, cross-functional leadership, and operational execution. Experience with GenAI implementation and public sector work, as well as a security clearance, is preferred.

What you'd actually do

  1. Define the vision and own the roadmap for the team's evaluation capabilities.
  2. Serve as primary owner of the T&E tech stack: the robust infrastructure required to continuously measure, improve, and prove the superiority and sustained performance of Scale's agentic applications.
  3. Traverse multiple engineering organizations across Scale to identify bottlenecks, distill technical friction into actionable plans, and drive execution.
  4. Work across Scale's commercial and public sector teams to define requirements, ensuring evaluation services are robust enough for the most demanding government use cases.
  5. Refine the tech stack that allows ML teams to hill-climb, and surface critical performance information to stakeholders.

Skills

Required

  • 3+ years of experience in software engineering, systems architecture, or highly technical program management.
  • Ability to read code, understand system architecture, and participate in technical design reviews.
  • Experience designing, owning the roadmap for, or operating evaluation infrastructure for AI applications.
  • Experience distilling vague problems into technical roadmaps and measurable success metrics.
  • Experience taking projects from undefined to shipped in high-pressure environments.
  • Experience leading cross-functional projects involving multiple engineering organizations.
  • Experience using technical project management frameworks for reporting delivery velocity and blockers.

Nice to have

  • Active Secret, Top Secret, or TS/SCI clearance.
  • Practical experience developing or evaluating features built specifically on LLMs, RAG, or autonomous agent workflows.
  • Advanced degree in Computer Science, Engineering, or a related field.
  • 2+ years of experience working with DoD, IC, or Civil agencies on mission-critical software deployments.

What the JD emphasized

  • Proven experience designing, owning the roadmap for, or operating the infrastructure required to continuously measure, improve, and show the performance of AI applications.
  • Demonstrated experience taking a vaguely defined problem (e.g., "our evaluation cycles are too slow") and delivering a technical roadmap, resource requirements, and measurable success metrics within a narrow time window.
  • Proven track record of taking a project from "stalled/undefined" to "shipped" in a high-pressure environment. You can point to at least two instances where you inherited a failing project and saw it through to production.
  • Led multiple projects that required direct alignment between at least three distinct engineering organizations (e.g., Infrastructure, ML Research, and Product).