What you'd actually do

Collaborate with cross-functional teams to define and evolve availability metrics for Crusoe’s cloud platform, including establishing, measuring, and improving SLIs and SLOs

Participate in production incident response, diagnosing and resolving service disruptions while contributing to post-incident reviews and root cause analysis

Build, operate, and improve observability across Crusoe’s infrastructure using tools such as Prometheus, Grafana, Alertmanager, and OpenTelemetry

Identify reliability risks, performance bottlenecks, and early indicators of potential production issues across distributed systems

Develop automation and tooling that reduces operational toil, improves recovery times, and enables self-healing infrastructure

Skills

Required

5+ years of experience in Production Engineering, SRE, or large-scale infrastructure operations
Experience supporting GPU workloads, HPC environments, or latency/throughput-sensitive distributed systems
Strong knowledge of Linux/Unix systems, including debugging complex issues across kernel and user space
Previous experience in infrastructure roles building or managing compute, storage, or networking platforms
Understanding of modern cloud infrastructure fundamentals including Kubernetes, distributed systems, virtualization, and cloud platforms (AWS/GCP)
Familiarity with incident management practices and reliability frameworks (SRE, ITIL, or similar)
Experience with monitoring and observability tools such as Prometheus and Grafana, or a strong desire to deepen expertise in this area
Familiarity with infrastructure-as-code and configuration management tools such as Terraform or Ansible
Scripting or programming experience with languages such as Go, Python, C, or C++
Strong communication skills and the ability to collaborate across engineering teams
Ability to remain calm and effective while troubleshooting complex issues in high-impact production environments

Nice to have

Experience working with Kubernetes or container orchestration platforms at scale
Exposure to change management processes, operational readiness reviews, or structured root cause analysis
Experience designing self-healing systems, automated remediation, or event-driven operational tooling
Interest in scaling AI or HPC infrastructure and solving reliability challenges in GPU-heavy environments
Passion for mentorship, learning, and developing deeper expertise in Production Engineering

Crusoe is on a mission to accelerate the abundance of energy and intelligence. As the only vertically integrated AI infrastructure company built from the ground up, we own and operate each layer of the stack — from electrons to tokens — to power the world's most ambitious AI workloads. When you join Crusoe, you join a team that is building the future, faster.

We're in the midst of the greatest industrial revolution of our time. The demand for AI compute is boundless, and power is a bottleneck. We're solving that — with an energy-first approach that makes AI infrastructure better for the world and faster for the people innovating with AI.

We're looking for problem-solving, opportunity-finding teammates with a sense of urgency, who believe in the scale of our ambition and thrive on a path not fully paved — people who want to grow their careers alongside a team of experts across energy, manufacturing, data center construction, and cloud services.

If you want to do the most meaningful work of your career, help our customers and partners advance their AI strategies, and be part of a high-performing team that believes in each other, come build with us at Crusoe.

About This Role:

Crusoe is building the most reliable, energy-efficient, AI-optimized cloud platform — and Production Engineering sits at the heart of that mission. As a Production Engineer focused on Operational Excellence, you will help ensure the reliability, scalability, and performance of Crusoe’s GPU cloud that powers next-generation AI workloads.

This role is ideal for engineers who enjoy solving complex production problems, improving large-scale distributed systems, and building automation that keeps infrastructure running smoothly. You’ll play a key role in strengthening the operational foundation of Crusoe’s cloud while helping scale infrastructure that supports demanding AI and HPC workloads.

You’ll partner closely with Production Engineers, infrastructure teams, and platform engineers to improve system reliability, reduce operational toil, and drive continuous improvements across Crusoe’s rapidly growing GPU cloud.

What You’ll Be Working On:

Collaborate with cross-functional teams to define and evolve availability metrics for Crusoe’s cloud platform, including establishing, measuring, and improving SLIs and SLOs
Participate in production incident response, diagnosing and resolving service disruptions while contributing to post-incident reviews and root cause analysis
Build, operate, and improve observability across Crusoe’s infrastructure using tools such as Prometheus, Grafana, Alertmanager, and OpenTelemetry
Identify reliability risks, performance bottlenecks, and early indicators of potential production issues across distributed systems
Develop automation and tooling that reduces operational toil, improves recovery times, and enables self-healing infrastructure
Partner with compute, networking, storage, and platform teams to strengthen service resilience and disaster recovery capabilities
Contribute to improving operational processes, knowledge sharing, and reliability best practices across the engineering organization
Continue growing technical depth through mentorship, training, and hands-on work operating large-scale AI infrastructure

What You’ll Bring to the Team:

5+ years of experience in Production Engineering, SRE, or large-scale infrastructure operations
Experience supporting GPU workloads, HPC environments, or latency/throughput-sensitive distributed systems
Strong knowledge of Linux/Unix systems, including debugging complex issues across kernel and user space
Previous experience in infrastructure roles building or managing compute, storage, or networking platforms
Understanding of modern cloud infrastructure fundamentals including Kubernetes, distributed systems, virtualization, and cloud platforms (AWS/GCP)
Familiarity with incident management practices and reliability frameworks (SRE, ITIL, or similar)
Experience with monitoring and observability tools such as Prometheus and Grafana, or a strong desire to deepen expertise in this area
Familiarity with infrastructure-as-code and configuration management tools such as Terraform or Ansible
Scripting or programming experience with languages such as Go, Python, C, or C++
Strong communication skills and the ability to collaborate across engineering teams
Ability to remain calm and effective while troubleshooting complex issues in high-impact production environments
A growth mindset and strong interest in reliability engineering, automation, and operational excellence

Bonus Points:

Experience working with Kubernetes or container orchestration platforms at scale
Exposure to change management processes, operational readiness reviews, or structured root cause analysis
Experience designing self-healing systems, automated remediation, or event-driven operational tooling
Interest in scaling AI or HPC infrastructure and solving reliability challenges in GPU-heavy environments
Passion for mentorship, learning, and developing deeper expertise in Production Engineering

Benefits:

Industry-competitive pay
Restricted Stock Units in a fast-growing, well-funded technology company
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to HSA accounts
Paid Parental Leave
Paid life insurance, short-term and long-term disability
Teladoc
401(k) with a 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement
Tuition reimbursement
Subscription to the Calm app
MetLife Legal
Company-paid commuter benefit: $300 per month

Compensation:

Compensation will be paid in the range of $172,000 – $209,000 + Bonus. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.

Senior Production Engineer, Operational Excellence
