Engineering Manager, Production Engineering

Crusoe · Data AI · San Francisco, CA - US · Cloud Engineering

Engineering Manager for the Production Engineering team at Crusoe, an AI infrastructure company. The role focuses on leading SREs to ensure the reliability and scalability of AI infrastructure offerings like Kubernetes, Inference, and AutoClusters for enterprise customers. Responsibilities include driving reliability improvements, managing incident response, and working hands-on as an individual contributor on tooling and automation.

What you'd actually do

  1. Leading and growing a team of SREs embedded within Crusoe's AI product areas, setting technical direction and fostering a culture of ownership and continuous improvement
  2. Contributing as an IC — reviewing code, building tooling, and driving automation to reduce toil and improve the reliability and scalability of production services
  3. Owning SLA/SLO performance, incident response, and on-call health for service offerings; leading blameless post-mortems and driving systemic remediation
  4. Partnering with embedded product and platform engineering teams to influence infrastructure design, observability strategy, and operational readiness for new and existing services
  5. Defining and tracking reliability, performance, and operational maturity metrics across the team; translating data into prioritized roadmap investments

Skills

Required

  • 5+ years of software or infrastructure engineering experience
  • 1–2 years in an engineering management or tech lead role
  • SRE or production engineering background
  • incident management
  • SLO frameworks
  • runbooks
  • on-call operations
  • coding ability in Go, Python, or similar languages
  • building tooling and automation
  • working with or embedding into cross-functional product teams
  • influencing engineering decisions
  • container orchestration
  • cloud-native infrastructure
  • Kubernetes
  • distributed systems
  • cloud service architectures
  • communication skills

Nice to have

  • GPU infrastructure
  • AI/ML workloads
  • inference serving platforms
  • HPC orchestration tools (Slurm, Ray)
  • cloud provider, AI infrastructure company, or hyperscaler experience
  • Crusoe's infrastructure stack

What the JD emphasized

  • production health
  • enterprise customers
  • reliability and scalability
  • operational excellence
  • automation
  • customer experience
  • SLA/SLO performance
  • incident response
  • observability strategy
  • operational readiness
  • reliability, performance, and operational maturity metrics