Staff Production Engineer (Operational Excellence)

Crusoe · Data AI · San Francisco, CA - US · Cloud Engineering

Crusoe is an AI infrastructure company building a GPU cloud for AI workloads. This role focuses on Production Engineering and Operational Excellence to ensure the reliability, scalability, and performance of Crusoe's GPU cloud platform. The engineer will lead efforts to define availability metrics, drive incident response, architect observability, identify reliability risks, and develop automation for large-scale distributed systems supporting AI and HPC workloads.

What you'd actually do

  1. Lead cross-functional efforts to define and evolve availability metrics for Crusoe's cloud platform, including establishing, measuring, and improving SLIs and SLOs
  2. Drive production incident response, diagnosing and resolving service disruptions while leading post-incident reviews and root cause analysis
  3. Architect, operate, and improve observability across Crusoe's infrastructure using tools such as Prometheus, Grafana, Alertmanager, and OpenTelemetry
  4. Identify reliability risks, performance bottlenecks, and early indicators of potential production issues across distributed systems
  5. Design and develop automation and tooling that reduces operational toil, improves recovery times, and enables self-healing infrastructure

Skills

Required

  • Production Engineering
  • SRE
  • large-scale infrastructure operations
  • supporting GPU workloads
  • HPC environments
  • latency/throughput-sensitive distributed systems
  • infrastructure roles building or managing compute, storage, or networking platforms
  • Linux/Unix systems
  • Kubernetes
  • distributed systems
  • virtualization
  • cloud platforms (AWS/GCP)
  • incident management practices
  • reliability frameworks (SRE, ITIL, or similar)
  • monitoring and observability tools such as Prometheus and Grafana
  • infrastructure-as-code and configuration management tools such as Terraform or Ansible
  • scripting or programming with languages such as Go, Python, C, or C++
  • communication skills
  • troubleshooting complex issues in high-impact production environments

Nice to have

  • leading Kubernetes or container orchestration platforms at scale
  • change management processes
  • operational readiness reviews
  • structured root cause analysis
  • designing self-healing systems
  • automated remediation
  • event-driven operational tooling
  • scaling AI or HPC infrastructure
  • solving reliability challenges in GPU-heavy environments
  • mentorship
  • growing teams
  • developing deep expertise in Production Engineering

What the JD emphasized

  • reliability
  • scalability
  • performance
  • GPU workloads
  • HPC environments
  • latency/throughput-sensitive distributed systems
  • large-scale distributed systems
  • operational excellence
  • reliability engineering
  • automation