Senior Production Engineer, Operational Excellence

Crusoe · Data AI · San Francisco, CA - US · Cloud Engineering

This role focuses on the reliability, scalability, and performance of Crusoe's GPU cloud platform, which powers AI and HPC workloads. The Production Engineer defines and evolves availability metrics, participates in incident response, builds and improves observability tools, identifies reliability risks, and develops automation that reduces operational toil and improves recovery times. The role also involves partnering with infrastructure teams to strengthen service resilience and disaster recovery capabilities.

What you'd actually do

  1. Collaborate with cross-functional teams to define and evolve availability metrics for Crusoe’s cloud platform, including establishing, measuring, and improving SLIs and SLOs
  2. Participate in production incident response, diagnosing and resolving service disruptions while contributing to post-incident reviews and root cause analysis
  3. Build, operate, and improve observability across Crusoe’s infrastructure using tools such as Prometheus, Grafana, Alertmanager, and OpenTelemetry
  4. Identify reliability risks, performance bottlenecks, and early indicators of potential production issues across distributed systems
  5. Develop automation and tooling that reduces operational toil, improves recovery times, and enables self-healing infrastructure

Skills

Required

  • 5+ years of experience in Production Engineering, SRE, or large-scale infrastructure operations
  • Experience supporting GPU workloads, HPC environments, or latency/throughput-sensitive distributed systems
  • Strong knowledge of Linux/Unix systems, including debugging complex issues across kernel and user space
  • Previous experience in infrastructure roles building or managing compute, storage, or networking platforms
  • Understanding of modern cloud infrastructure fundamentals including Kubernetes, distributed systems, virtualization, and cloud platforms (AWS/GCP)
  • Familiarity with incident management practices and reliability frameworks (SRE, ITIL, or similar)
  • Experience with monitoring and observability tools such as Prometheus and Grafana, or a strong desire to deepen expertise in this area
  • Familiarity with infrastructure-as-code and configuration management tools such as Terraform or Ansible
  • Scripting or programming experience with languages such as Go, Python, C, or C++
  • Strong communication skills and the ability to collaborate across engineering teams
  • Ability to remain calm and effective while troubleshooting complex issues in high-impact production environments

Nice to have

  • Experience working with Kubernetes or container orchestration platforms at scale
  • Exposure to change management processes, operational readiness reviews, or structured root cause analysis
  • Experience designing self-healing systems, automated remediation, or event-driven operational tooling
  • Interest in scaling AI or HPC infrastructure and solving reliability challenges in GPU-heavy environments
  • Passion for mentorship, learning, and developing deeper expertise in Production Engineering

What the JD emphasized

  • GPU workloads
  • latency/throughput-sensitive distributed systems
  • complex production problems
  • large-scale distributed systems
  • AI infrastructure