Technical Program Manager, AI Infrastructure

Character AI · AI Frontier · Redwood City, CA · Safety

The Technical Program Manager for AI Infrastructure at Character.AI leads programs for model development and serving systems at scale. The role involves shaping infrastructure strategy, aligning roadmaps, and driving execution across training, evaluation, and inference, so that systems stay reliable, efficient, and scalable for millions of users. The TPM partners with engineering, research, and product teams, manages complex initiatives, tracks key metrics, and improves developer velocity.

What you'd actually do

  1. Lead planning and execution of major AI infrastructure initiatives spanning training pipelines, data systems, model evaluation, and inference/serving
  2. Build structures that keep teams aligned: scopes, goals, requirements, timelines, risks, and success metrics
  3. Partner with engineering, research, and product to translate model and product needs into infrastructure roadmaps and priorities
  4. Drive cross-functional accountability and communication across teams working on tightly coupled systems
  5. Track key infrastructure metrics (e.g., reliability, latency, throughput, cost efficiency) and define reporting that surfaces progress and risk

Skills

Required

  • 5–8 years of experience in program management, technical operations, or product execution in a fast-moving, technical environment
  • Proven ability to lead complex, multi-team initiatives spanning engineering and infrastructure systems
  • Strong communication skills with the ability to bridge technical and non-technical stakeholders
  • Experience working closely with engineering teams on distributed systems, cloud infrastructure, or platform development
  • Ability to break down ambiguous problem spaces into clear plans and drive alignment across teams
  • Analytical mindset with experience using data to inform decisions and measure impact
  • Strong ownership and execution mindset, with the ability to manage multiple priorities simultaneously
  • Technical understanding of GPU clusters, AI accelerator chips, and cloud providers

Nice to have

  • Experience with AI cloud providers

What the JD emphasized

  • AI infrastructure
  • model development
  • serving systems
  • training pipelines
  • model evaluation
  • inference
  • scale
  • reliability
  • efficiency
  • rapid iteration
  • technically complex environments
  • ambiguous problem spaces
  • cross-functional teams
  • intersection of research and production
  • tradeoffs between speed, quality, and reliability
  • influencing without authority
  • anticipating risks early
  • large, interdependent efforts
  • distributed systems
  • cloud infrastructure
  • platform development
  • data to inform decisions
  • measure impact
  • ownership and execution mindset
  • multiple priorities simultaneously
