Staff Technical Program Manager

Crusoe · Data AI · Tel Aviv, IL · Product and Design

Staff Technical Program Manager for Crusoe's Managed Inference platform, focusing on end-to-end program delivery, model onboarding, inference optimization, and production readiness for LLM workloads. The role requires deep familiarity with LLM serving, optimization, and evaluation in production, coordinating across model engineering, IaaS, and operations.

What you'd actually do

  1. End-to-end program delivery: Own multi-quarter release planning, dependency governance, and executive communication across the Managed Inference platform.
  2. Complex, high-risk program management: Drive model version rollouts, inference optimization campaigns, SLA readiness for new GPU hardware, and multi-tenant capacity planning from kickoff through delivery.
  3. Cross-functional alignment: Coordinate across Model Engineering, IaaS, Cloud Foundations, Data Center Operations, and external model providers to keep programs on track and unblocked.
  4. Proactive risk identification: Surface risks across model serving, reliability, capacity constraints, and vendor timelines before they become program-level problems.
  5. Execution frameworks and dashboards: Build lightweight, scalable TPM frameworks suited to Crusoe's pace; maintain real-time execution dashboards and deliver crisp, data-driven executive updates.

Skills

Required

  • 7+ years of experience as a Technical Program Manager in fast-paced technical environments, with a track record of owning complex programs end-to-end across engineering and product organizations.
  • LLM inference and model serving knowledge: Working familiarity with batching strategies, quantization approaches, and the tradeoffs that govern latency, throughput, and cost at production scale.
  • Multi-tenant systems experience: Familiarity with isolation, quota management, and SLA enforcement across concurrent workloads.
  • Fine-tuning and alignment awareness: Sufficient familiarity with fine-tuning and alignment workflows to govern program timelines, identify technical risks, and coordinate across the teams that own them.
  • Low-structure execution: Proven ability to build execution models in environments where the process did not yet exist, and make them stick with teams that didn't ask for them.
  • Executive communication: Exceptional written and verbal communication for delivering clear, data-driven, decision-oriented updates to executive stakeholders.
  • AI tool integration: Active, daily use of AI tools to improve program execution, risk detection, and communication -- not just personal productivity.
  • Cross-functional influence: Proven ability to drive alignment across engineering, product, and infrastructure leadership without direct authority, including with highly technical stakeholders.

Nice to have

  • Experience working with teams building platforms or services for AI inference and/or training.
  • Direct experience governing model onboarding programs across GPU generations, including firmware, driver, and stack validation.
  • Experience coaching or mentoring junior TPMs in a high-growth technical environment.
  • Exposure to multi-site or globally distributed engineering teams.
  • Background at a Series D to Series F company, or on a high-performing team within a hyperscaler, focused on AI infrastructure.

What the JD emphasized

  • Deep familiarity with the model layer -- including how LLMs are served, optimized, and evaluated in production -- is essential to being effective in this role.
  • LLM inference and model serving knowledge: Working familiarity with batching strategies, quantization approaches, and the tradeoffs that govern latency, throughput, and cost at production scale.
  • Fine-tuning and alignment awareness: Sufficient familiarity with fine-tuning and alignment workflows to govern program timelines, identify technical risks, and coordinate across the teams that own them.
  • Low-structure execution: Proven ability to build execution models in environments where the process did not yet exist, and make them stick with teams that didn't ask for them.

Other signals

  • Managed Inference platform
  • production LLM workloads
  • model onboarding
  • inference optimization
  • production readiness