What you'd actually do

Manage, coach, and grow a team of production engineers across shifts and time zones. Run structured 1:1s focused on career development, deliver candid performance feedback, and build a team culture grounded in ownership and continuous improvement.

Partner with engineering leadership and recruiting to grow the team — owning the full hiring lifecycle from interview design to offer. Build and continuously improve onboarding and training programs that ramp new engineers quickly and effectively.

Serve as an escalation point for high-severity incidents. Lead postmortems with a focus on systemic fixes, ensure action items are tracked and completed, and drive down MTTR over time.

Define, monitor, and report on SLIs, SLOs, and SLAs across Crusoe's production systems. Surface trends proactively and partner with engineering teams to address reliability gaps before they become customer issues.

Oversee the design and maintenance of alerting and observability systems across bare-metal and cloud infrastructure, ensuring the team has the signal it needs to detect and respond to issues fast.

Skills

Required

6+ years of experience managing 24/7 technical operations or SRE teams
Strong Linux and infrastructure fundamentals, including hands-on experience with containerization, Kubernetes, and virtualization in production environments
Observability and monitoring expertise
Familiarity with messaging and workflow systems such as RabbitMQ, Kafka, NATS, or Temporal
Working proficiency in Golang or Python
Demonstrated people management skills
SLA/SLO ownership experience
A track record of influencing cross-functional strategy and driving alignment across engineering leadership on operational priorities

Nice to have

Experience with GPU infrastructure, HPC, or AI/ML cloud environments
Familiarity with infrastructure-as-code tooling such as Terraform or Ansible
Experience scaling an operations team and function

What the JD emphasized

accelerate the abundance of energy and intelligence

AI infrastructure

energy-first approach

scale of our ambition

path not fully paved

meaningful work of your career

advance their AI strategies

high-performing team

Production Engineering

GPU infrastructure

reliability and operational health

deep technical leadership

organizational impact

reliability strategy

building a high-performing team

complex systems reliable at scale

significant ownership

strategic impact

critical moment

Team Leadership & Development

Incident Management

Reliability & SLO Ownership

Monitoring & Alerting

Automation & Toil Reduction

Cross-Functional Partnership

Operational Cadence

6+ years of experience managing 24/7 technical operations or SRE teams

demonstrated success developing senior engineers

building organizational capability

improving operational outcomes at scale

Strong Linux and infrastructure fundamentals

hands-on experience with containerization, Kubernetes, and virtualization in production environments

Observability and monitoring expertise

Prometheus, VictoriaMetrics, and custom exporters

bare-metal endpoints

messaging and workflow systems

RabbitMQ, Kafka, NATS, or Temporal

distributed production environments

Golang or Python

review production code

technical design discussions

support your engineers' work

Demonstrated people management skills

structured performance management

individualized coaching

building or improving onboarding and training programs

SLA/SLO ownership experience

customer-facing environment

influencing cross-functional strategy

driving alignment across engineering leadership

operational priorities

HPC

AI/ML cloud environments

infrastructure-as-code tooling

Terraform or Ansible

scaling an operations team and function

Crusoe is on a mission to accelerate the abundance of energy and intelligence. As the only vertically integrated AI infrastructure company built from the ground up, we own and operate each layer of the stack — from electrons to tokens — to power the world's most ambitious AI workloads. When you join Crusoe, you join a team that is building the future, faster.

We're in the midst of the greatest industrial revolution of our time. The demand for AI compute is boundless, and power is a bottleneck. We're solving that — with an energy-first approach that makes AI infrastructure better for the world and faster for the people innovating with AI.

We're looking for problem-solving, opportunity-finding teammates with a sense of urgency, who believe in the scale of our ambition and thrive on a path not fully paved — people who want to grow their careers alongside a team of experts across energy, manufacturing, data center construction, and cloud services.

If you want to do the most meaningful work of your career, help our customers and partners advance their AI strategies, and be part of a high-performing team that believes in each other, come build with us at Crusoe.

About This Role:

Crusoe is building the cloud infrastructure that powers the next generation of AI, and we're looking for a Senior Engineering Manager, Production Engineering to lead the team that keeps it running. This is a senior people management role reporting to the Director of Production Engineering — sitting at the intersection of deep technical leadership and organizational impact, with direct ownership over the reliability and operational health of Crusoe's production GPU infrastructure. You'll lead and develop a 24/7 team responsible for incident response, monitoring and alerting, automation, and continuous system improvement across a fast-scaling, high-stakes environment, while also shaping the broader strategy, culture, and structure of the function.

The ideal candidate is a seasoned technical leader who has built, scaled, and managed on-call operations teams in complex environments — someone who brings both rigor and vision to SLOs and postmortems, takes coaching and performance management seriously, and can drive alignment across engineering leadership on reliability strategy. If you're energized by the challenge of building a high-performing team while keeping complex systems reliable at scale, this role offers significant ownership and strategic impact at a critical moment in Crusoe's growth.

What You'll Be Working On:

Team Leadership & Development: Manage, coach, and grow a team of production engineers across shifts and time zones. Run structured 1:1s focused on career development, deliver candid performance feedback, and build a team culture grounded in ownership and continuous improvement.
Hiring & Onboarding: Partner with engineering leadership and recruiting to grow the team — owning the full hiring lifecycle from interview design to offer. Build and continuously improve onboarding and training programs that ramp new engineers quickly and effectively.
Incident Management: Serve as an escalation point for high-severity incidents. Lead postmortems with a focus on systemic fixes, ensure action items are tracked and completed, and drive down MTTR over time.
Reliability & SLO Ownership: Define, monitor, and report on SLIs, SLOs, and SLAs across Crusoe's production systems. Surface trends proactively and partner with engineering teams to address reliability gaps before they become customer issues.
Monitoring & Alerting: Oversee the design and maintenance of alerting and observability systems across bare-metal and cloud infrastructure, ensuring the team has the signal it needs to detect and respond to issues fast.
Automation & Toil Reduction: Identify and prioritize opportunities to automate repetitive operational work, improving team efficiency and system resilience over time.
Cross-Functional Partnership: Collaborate with infrastructure, platform engineering, product, and customer success teams to align on technical escalations, customer impact, and engineering priorities.
Operational Cadence: Own the team's day-to-day operational rhythm — stand-ups, on-call rotations, incident reviews, and sprint planning — ensuring the team runs smoothly across time zones.

What You'll Bring to the Team:

6+ years of experience managing 24/7 technical operations or SRE teams in cloud or data center environments, including demonstrated success developing senior engineers, building organizational capability, and improving operational outcomes at scale.
Strong Linux and infrastructure fundamentals, including hands-on experience with containerization, Kubernetes, and virtualization in production environments.
Observability and monitoring expertise, including experience with Prometheus, VictoriaMetrics, and custom exporters — ideally against bare-metal endpoints.
Familiarity with messaging and workflow systems such as RabbitMQ, Kafka, NATS, or Temporal, and an understanding of how they function in distributed production environments.
Working proficiency in Golang or Python — enough to review production code, contribute meaningfully to technical design discussions, and support your engineers' work.
Demonstrated people management skills, including experience with structured performance management, individualized coaching, and building or improving onboarding and training programs.
SLA/SLO ownership experience — you've set them, measured them, reported on them, and held teams accountable to them in a customer-facing environment.
A track record of influencing cross-functional strategy and driving alignment across engineering leadership on operational priorities.

Bonus Points:

Experience with GPU infrastructure, HPC, or AI/ML cloud environments.
Familiarity with infrastructure-as-code tooling such as Terraform or Ansible.
Experience scaling an operations team and function through a period of rapid headcount or infrastructure growth.
Background in data center operations, including familiarity with physical infrastructure, hardware lifecycle, and network fundamentals.

Benefits:

Crusoe offers a competitive benefits package designed to support financial security, health, and overall well-being, including pension contributions, private health and dental insurance, income protection, life assurance, and more.

Compensation:

Compensation will be paid as a salary or an hourly wage. It will be determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.