What you'd actually do

Own the reliability and scalability of Crusoe's cloud infrastructure — compute, storage, and networking — defining SLOs, leading incident response, and driving systemic improvements that reduce toil and raise the bar across the platform

Build and mature the observability and tooling layer — from network fabric telemetry and storage health to control plane instrumentation and on-call tooling — so the team can detect, diagnose, and resolve issues faster than customers notice them

Drive platform reliability improvements across the full cloud stack, partnering closely with software, hardware, and network engineering teams to influence architecture decisions early, before they become operational debt

Act as a trusted advisor to senior leadership, bringing perspective on observability trends and advocating for the right long-term technology investments

Set the technical standards for how Crusoe's production engineering organization builds, operates, and scales — defining on-call culture, incident frameworks, and reliability practices that grow with the company

Skills

Required

15+ years of experience in infrastructure, networking, or production engineering
Strong systems fundamentals: Linux, distributed systems, storage, compute scheduling
Hands-on data center experience
Ability to write code
Excellent analytical and problem-solving skills
Strong incident command

Nice to have

Deep networking expertise: BGP, OSPF, ECMP, load balancing, and low-latency network design in production
Experience with HPC infrastructure: GPU cluster operations, job schedulers (Slurm, Kubernetes), high-bandwidth interconnects (InfiniBand, RoCE)
Prior principal or staff IC role where you influenced org-level technical strategy
Exposure to sustainability-focused or energy-constrained compute environments

Crusoe is on a mission to accelerate the abundance of energy and intelligence. As the only vertically integrated AI infrastructure company built from the ground up, we own and operate each layer of the stack — from electrons to tokens — to power the world's most ambitious AI workloads. When you join Crusoe, you join a team that is building the future, faster.

We're in the midst of the greatest industrial revolution of our time. The demand for AI compute is boundless, and power is a bottleneck. We're solving that — with an energy-first approach that makes AI infrastructure better for the world and faster for the people innovating with AI.

We're looking for problem-solving, opportunity-finding teammates with a sense of urgency, who believe in the scale of our ambition and thrive on a path not fully paved — people who want to grow their careers alongside a team of experts across energy, manufacturing, data center construction, and cloud services.

If you want to do the most meaningful work of your career, help our customers and partners advance their AI strategies, and be part of a high-performing team that believes in each other, come build with us at Crusoe.

About This Role:

Crusoe is building the AI factory: a vertically integrated company spanning power generation, purpose-built data centers, and the cloud platform that frontier AI runs on. We are looking for a Principal Engineer on our Production Engineering team, someone who will own the reliability, scalability, and operational excellence of the cloud infrastructure that sits on top of it all: compute, storage, networking, and the platform and tooling that tie it together. The systems you'll be responsible for are the reason that compute translates into usable cloud, and at the rate Crusoe is growing, the scope of this role expands with every quarter. This is a high-ownership, high-autonomy position where you will set technical direction, drive observability and reliability standards across the organization, and be the kind of engineer who makes the people around them meaningfully better. The problems are novel, the scale is real, and the impact is immediate.

What You'll Be Working On:

Own the reliability and scalability of Crusoe's cloud infrastructure — compute, storage, and networking — defining SLOs, leading incident response, and driving systemic improvements that reduce toil and raise the bar across the platform
Build and mature the observability and tooling layer — from network fabric telemetry and storage health to control plane instrumentation and on-call tooling — so the team can detect, diagnose, and resolve issues faster than customers notice them
Drive platform reliability improvements across the full cloud stack, partnering closely with software, hardware, and network engineering teams to influence architecture decisions early, before they become operational debt
Act as a trusted advisor to senior leadership, bringing perspective on observability trends and advocating for the right long-term technology investments
Set the technical standards for how Crusoe's production engineering organization builds, operates, and scales — defining on-call culture, incident frameworks, and reliability practices that grow with the company
Mentor senior and staff engineers, elevate the team's collective technical depth, and be the person others seek out when the problem is genuinely hard

What You'll Bring to the Team:

15+ years of experience in infrastructure, networking, or production engineering — with meaningful time at companies operating at internet scale (cloud providers, CDNs, large-scale social/media platforms, or similar)
Strong systems fundamentals: Linux, distributed systems, storage, compute scheduling — you understand the full stack from hardware up
Hands-on data center experience: you've done physical infra, understand power and thermal constraints, and can reason about reliability at the facility level, not just the server level
The ability to write code — not necessarily full-time, but enough to automate what shouldn't be manual, instrument what isn't observable, and build tooling your team will actually use
Excellent analytical and problem-solving skills, including the ability to synthesize ambiguous customer and system signals into clear problem statements and designs
Strong incident command: you lead calmly under pressure, communicate clearly during outages, and run blameless retrospectives that actually improve systems

Bonus Points:

Deep networking expertise: BGP, OSPF, ECMP, load balancing, and low-latency network design in production — you can debug a routing issue and design a fabric, sometimes in the same incident
Experience with HPC infrastructure: GPU cluster operations, job schedulers (Slurm, Kubernetes), high-bandwidth interconnects (InfiniBand, RoCE)
Prior principal or staff IC role where you influenced org-level technical strategy, not just project-level execution
Exposure to sustainability-focused or energy-constrained compute environments

Benefits:

Industry-competitive pay
Restricted Stock Units in a fast-growing, well-funded technology company
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to HSA accounts
Paid Parental Leave
Paid life insurance, short-term and long-term disability
Teladoc
401(k) with a 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement
Tuition reimbursement
Subscription to the Calm app
MetLife Legal
Company-paid commuter benefit: $300 per month

Compensation:

Compensation will be paid in the range of $261,000 - $326,000 + Bonus. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.

Principal Production Engineer
