Incident Manager

Crusoe · Data AI · San Francisco, CA - US · Cloud Go-To-Market (GTM)

This role is responsible for managing incidents and customer escalations related to AI compute infrastructure, focusing on service reliability and customer trust. The Incident Manager will lead crisis responses, utilize data analytics to improve system resiliency, and collaborate with engineering teams to resolve complex technical issues. The role also involves developing preventative strategies and educating customers on optimizing their HPC infrastructure.

What you'd actually do

  1. Handle the "Storm": Lead incident responses for high-visibility issues, ensuring minimal disruption to customer operations. You will act as the calm anchor during crises, managing communication and strategy to maintain customer trust during outages or critical failures.
  2. Analytics & Reliability: Use data analytics to identify incident trends, and turn those insights into actionable strategies that improve system resiliency and reliability.
  3. Preventative Strategy: Develop robust incident response strategies and designs. Focus on the "preventative piece" by conducting deep post-incident reviews so that root causes are addressed and incidents do not recur.
  4. Troubleshoot and Resolve: Diagnose and resolve complex technical issues related to InfiniBand, containerization, and distributed training.
  5. Collaborate Internally: Work closely with internal engineering and product teams to provide valuable customer feedback. You will act as a key technical resource, helping our Customer Support Engineers (CSEs) and Customer Success Managers (CSMs) understand and resolve complex product issues.

Skills

Required

  • Linux
  • Virtualization
  • Kubernetes
  • Customer incidents
  • TCP/IP stack
  • Infrastructure-as-Code (IaC)
  • Crisis management
  • Problem-solving
  • Communication skills

Nice to have

  • NVIDIA certifications
  • Linux certifications
  • Kubernetes certifications
  • Programming skills

What the JD emphasized

  • customer trust
  • service reliability
  • customer experience
  • system resiliency
  • reliability
  • complex technical issues