Junior Technical Program Manager — Infrastructure Operations

Together AI · Data AI · San Francisco, CA · Product

This role focuses on the operational management of a large GPU fleet, ensuring nodes are online, GPUs are performing, and datacenter transitions are smooth. It involves owning the end-to-end node lifecycle, driving remediation, managing project timelines for new datacenter bring-ups, diagnosing utilization loss, and building dashboards for visibility and accountability. The environment is fast-paced and requires figuring things out alongside engineers building at the frontier.

What you'd actually do

  1. Own the end-to-end node lifecycle, from failure through repair, return, and re-integration, across provider ticketing, internal tooling, and the state machine that governs each stage
  2. Drive node remediation to resolution with urgency, eliminating gaps in ownership at every handoff
  3. Manage project timelines for new datacenter bring-ups, coordinating across internal teams and external providers to keep milestones on track
  4. Identify and diagnose GPU utilization loss across the fleet, working with engineering leads to drive resolution
  5. Build dashboards and tracking processes that make efficiency gaps visible and ensure they get closed

Skills

Required

  • TPM role experience
  • owning programs end-to-end
  • driving cross-functional resolution
  • managing external dependencies
  • technical background or demonstrated experience in a highly technical environment
  • bias toward action
  • resilience in a fast-paced, sometimes chaotic environment
  • strong organizational instincts
  • ability to zoom out

Nice to have

  • GPU knowledge

What the JD emphasized

  • genuinely high-stakes
  • genuinely novel
  • figuring things out alongside engineers who are building at the frontier
  • technical background or demonstrated experience in a highly technical environment
  • bias toward action
  • resilience in a fast-paced, sometimes chaotic environment