Staff Technical Program Manager

Crusoe · Data AI · Tel Aviv, IL · Product and Design

Staff Technical Program Manager for Crusoe's Managed Inference platform, focusing on end-to-end program delivery, model onboarding, inference optimization, and production readiness for LLM workloads. The role requires deep familiarity with LLM serving, optimization, and evaluation in production, coordinating across model engineering, IaaS, and operations.

What you'd actually do

End-to-end program delivery: Own multi-quarter release planning, dependency governance, and executive communication across the Managed Inference platform.
Complex, high-risk program management: Drive model version rollouts, inference optimization campaigns, SLA readiness for new GPU hardware, and multi-tenant capacity planning from kickoff through delivery.
Cross-functional alignment: Coordinate across Model Engineering, IaaS, Cloud Foundations, Data Center Operations, and external model providers to keep programs on track and unblocked.
Proactive risk identification: Surface risks across model serving, reliability, capacity constraints, and vendor timelines before they become program-level problems.
Execution frameworks and dashboards: Build lightweight, scalable TPM frameworks suited to Crusoe's pace; maintain real-time execution dashboards and deliver crisp, data-driven executive updates.

Skills

Required

7+ years of experience as a Technical Program Manager in fast-paced technical environments, with a track record of owning complex programs end-to-end across engineering and product organizations.
LLM inference and model serving knowledge: Working familiarity with batching strategies, quantization approaches, and the tradeoffs that govern latency, throughput, and cost at production scale.
Multi-tenant systems experience: Familiarity with isolation, quota management, and SLA enforcement across concurrent workloads.
Fine-tuning and alignment awareness: Sufficient familiarity with fine-tuning and alignment workflows to govern program timelines, identify technical risks, and coordinate across the teams that own them.
Low-structure execution: Proven ability to build execution models in environments where the process did not yet exist, and make them stick with teams that didn't ask for them.
Executive communication: Exceptional written and verbal communication for delivering clear, data-driven, decision-oriented updates to executive stakeholders.
AI tool integration: Active, daily use of AI tools to improve program execution, risk detection, and communication -- not just personal productivity.
Cross-functional influence: Proven ability to drive alignment across engineering, product, and infrastructure leadership without direct authority, including with highly technical stakeholders.

Nice to have

Experience working with teams building platforms or services for AI inference and/or training.
Direct experience governing model onboarding programs across GPU generations, including firmware, driver, and stack validation.
Experience coaching or mentoring junior TPMs in a high-growth technical environment.
Exposure to multi-site or globally distributed engineering teams.
Background at a Series D to Series F company or a high-performing team within a hyperscaler focused on AI infrastructure.

What the JD emphasized

Deep familiarity with the model layer -- including how LLMs are served, optimized, and evaluated in production -- is essential to being effective in this role.
LLM inference and model serving knowledge: Working familiarity with batching strategies, quantization approaches, and the tradeoffs that govern latency, throughput, and cost at production scale.
Fine-tuning and alignment awareness: Sufficient familiarity with fine-tuning and alignment workflows to govern program timelines, identify technical risks, and coordinate across the teams that own them.
Low-structure execution: Proven ability to build execution models in environments where the process did not yet exist, and make them stick with teams that didn't ask for them.

Other signals

Managed Inference platform
production LLM workloads
model onboarding
inference optimization
production readiness

Crusoe is on a mission to accelerate the abundance of energy and intelligence. As the only vertically integrated AI infrastructure company built from the ground up, we own and operate each layer of the stack — from electrons to tokens — to power the world's most ambitious AI workloads. When you join Crusoe, you join a team that is building the future, faster.

We're in the midst of the greatest industrial revolution of our time. The demand for AI compute is boundless, and power is a bottleneck. We're solving that — with an energy-first approach that makes AI infrastructure better for the world and faster for the people innovating with AI.

We're looking for problem-solving, opportunity-finding teammates with a sense of urgency, who believe in the scale of our ambition and thrive on a path not fully paved — people who want to grow their careers alongside a team of experts across energy, manufacturing, data center construction, and cloud services.

If you want to do the most meaningful work of your career, help our customers and partners advance their AI strategies, and be part of a high-performing team that believes in each other, come build with us at Crusoe.

About This Role:

Crusoe is the world's first vertically integrated, sustainable AI cloud. We build and operate GPU infrastructure powered by clean energy, from data center design through IaaS products to managed inference at scale, enabling AI-native companies to run demanding workloads without compromising on sustainability or reliability. Crusoe Cloud is 1,400 people and growing, and the TPM frameworks are still being built -- which means there is a genuine opportunity to shape how the function operates rather than inherit how it already works.

The Managed Inference platform is where customers run production LLM workloads without managing low-level infrastructure, and it is one of Crusoe's fastest-growing product areas. The Staff TPM for Managed Inference connects model engineering, IaaS, product, and data center operations to deliver a reliable, scalable inference platform. You will own end-to-end program delivery across multi-quarter roadmaps, model onboarding, inference optimization, and production readiness for new model versions. Deep familiarity with the model layer -- including how LLMs are served, optimized, and evaluated in production -- is essential to being effective in this role.

What You'll Be Working On:

End-to-end program delivery: Own multi-quarter release planning, dependency governance, and executive communication across the Managed Inference platform.
Complex, high-risk program management: Drive model version rollouts, inference optimization campaigns, SLA readiness for new GPU hardware, and multi-tenant capacity planning from kickoff through delivery.
Cross-functional alignment: Coordinate across Model Engineering, IaaS, Cloud Foundations, Data Center Operations, and external model providers to keep programs on track and unblocked.
Proactive risk identification: Surface risks across model serving, reliability, capacity constraints, and vendor timelines before they become program-level problems.
Execution frameworks and dashboards: Build lightweight, scalable TPM frameworks suited to Crusoe's pace; maintain real-time execution dashboards and deliver crisp, data-driven executive updates.
Phase 0 planning for model onboarding: Own pre-launch planning for model onboarding on new GPU generations, including firmware and driver readiness, CUDA and ROCm stack validation, and commissioning criteria for inference workloads.
Stakeholder leadership: Drive alignment and push back effectively across engineering, product, and operations leadership -- including highly technical stakeholders who have not previously worked with a TPM.

What You'll Bring to the Team:

7+ years of experience as a Technical Program Manager in fast-paced technical environments, with a track record of owning complex programs end-to-end across engineering and product organizations.
LLM inference and model serving knowledge: Working familiarity with batching strategies, quantization approaches, and the tradeoffs that govern latency, throughput, and cost at production scale.
Multi-tenant systems experience: Familiarity with isolation, quota management, and SLA enforcement across concurrent workloads.
Fine-tuning and alignment awareness: Sufficient familiarity with fine-tuning and alignment workflows to govern program timelines, identify technical risks, and coordinate across the teams that own them.
Low-structure execution: Proven ability to build execution models in environments where the process did not yet exist, and make them stick with teams that didn't ask for them.
Executive communication: Exceptional written and verbal communication for delivering clear, data-driven, decision-oriented updates to executive stakeholders.
AI tool integration: Active, daily use of AI tools to improve program execution, risk detection, and communication -- not just personal productivity.
Cross-functional influence: Proven ability to drive alignment across engineering, product, and infrastructure leadership without direct authority, including with highly technical stakeholders.

Bonus Points:

Experience working with teams building platforms or services for AI inference and/or training.
Direct experience governing model onboarding programs across GPU generations, including firmware, driver, and stack validation.
Experience coaching or mentoring junior TPMs in a high-growth technical environment.
Exposure to multi-site or globally distributed engineering teams.
Background at a Series D to Series F company or a high-performing team within a hyperscaler focused on AI infrastructure.

Benefits:

Crusoe also offers a competitive benefits package designed to support financial security, health, and overall well-being. Our benefits are tailored to local market standards and include core offerings such as pension contributions and additional perks to support work-life balance.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
