Principal Production Engineer

Crusoe · Data AI · San Francisco, CA - US · Cloud Engineering

Crusoe is seeking a Principal Production Engineer to own the reliability, scalability, and operational excellence of its cloud infrastructure, which powers AI workloads. The role involves defining SLOs, leading incident response, building observability tools, driving platform reliability improvements, and setting technical standards for the production engineering organization. The ideal candidate has extensive experience in infrastructure and data center operations, a strong understanding of distributed systems, and the ability to write code for automation and tooling.

What you'd actually do

  1. Own the reliability and scalability of Crusoe's cloud infrastructure — compute, storage, and networking — defining SLOs, leading incident response, and driving systemic improvements that reduce toil and raise the bar across the platform
  2. Build and mature the observability and tooling layer — from network fabric telemetry and storage health to control plane instrumentation and on-call tooling — so the team can detect, diagnose, and resolve issues faster than customers notice them
  3. Drive platform reliability improvements across the full cloud stack, partnering closely with software, hardware, and network engineering teams to influence architecture decisions early, before they become operational debt
  4. Act as a trusted advisor to senior leadership, bringing perspective on observability trends and advocating for the right long-term technology investments
  5. Set the technical standards for how Crusoe's production engineering organization builds, operates, and scales — defining on-call culture, incident frameworks, and reliability practices that grow with the company

Skills

Required

  • 15+ years of experience in infrastructure, networking, or production engineering
  • Strong systems fundamentals: Linux, distributed systems, storage, compute scheduling
  • Hands-on data center experience
  • Ability to write code
  • Excellent analytical and problem-solving skills
  • Strong incident command skills

Nice to have

  • Deep networking expertise: BGP, OSPF, ECMP, load balancing, and low-latency network design in production
  • Experience with HPC infrastructure: GPU cluster operations, job schedulers (Slurm, Kubernetes), high-bandwidth interconnects (InfiniBand, RoCE)
  • Prior principal or staff IC role where you influenced org-level technical strategy
  • Exposure to sustainability-focused or energy-constrained compute environments

What the JD emphasized

  • 15+ years of experience in infrastructure, networking, or production engineering
  • Meaningful time at companies operating at internet scale
  • Hands-on data center experience
  • The ability to write code
  • Deep networking expertise
  • Experience with HPC infrastructure