Senior Staff Engineer, Cloud Site Operations

Crusoe · Data AI · San Francisco, CA - US · Data Center Operations (DIG)

This role is for a Senior Staff Engineer in Data Center Operations at Crusoe, an AI infrastructure company. The engineer will own the technical architecture and strategic direction of data center operations, focusing on operational maturity, technical governance, and the systems that support AI workloads. Key responsibilities include overseeing the technical health of the global ticket queue, developing KPI dashboards, defining fleet supportability and tooling, mapping power topology, architecting business continuity and disaster recovery plans, providing technical advisory, and serving as a final technical escalation point. The role requires deep expertise in data center operations, hardware architecture (especially NVIDIA GPUs), and data-driven leadership.

What you'd actually do

  1. Oversee the technical health of our global ticket queue. Partner with internal teams to develop real-time dashboards and track the KPIs/SLAs (MTTR, fleet availability, sparing accuracy) that measure our operational maturity.
  2. Partner with the Fleet Engineering team to define the software access, diagnostic hooks, and physical tooling required for maximum repair efficiency. Act as the primary advocate for "serviceability" within the white space.
  3. Lead the initiative to map end-to-end "Power Strings," from main distribution down to cabinet PDUs. Lead the Build vs. Buy analysis to determine whether we develop internal mapping tools or procure a third-party solution.
  4. Architect the framework for our Business Continuity (BCP) and Disaster Recovery (DR) plans. Define the technical protocols for hardware recovery and site-level failovers to ensure minimal disruption to our AI Cloud customers.
  5. Provide expert guidance and architectural "sign-off" to the internal Documentation Committee. Ensure all break-fix SOPs and technical playbooks are accurate, safe, and optimized for global scale.
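
For illustration only, the KPIs named in item 1 (MTTR and fleet availability) could be computed from ticket data roughly as follows. This is a minimal sketch, not Crusoe's actual tooling: the `RepairTicket` record, its field names, and the sample numbers are all hypothetical, and it assumes each open ticket takes exactly one node fully out of service with no overlapping tickets per node.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class RepairTicket:
    """Hypothetical break-fix ticket; field names are assumptions."""
    node_id: str
    opened: datetime
    resolved: Optional[datetime]  # None while the ticket is still open


def mttr(tickets: List[RepairTicket]) -> Optional[timedelta]:
    """Mean time to repair over resolved tickets; None if none are resolved."""
    durations = [t.resolved - t.opened for t in tickets if t.resolved is not None]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def fleet_availability(tickets: List[RepairTicket], fleet_size: int,
                       window_start: datetime, window_end: datetime) -> float:
    """Fraction of node-time in the window not covered by a ticket."""
    downtime = timedelta()
    for t in tickets:
        # Clip each outage to the reporting window; open tickets run to window end.
        start = max(t.opened, window_start)
        end = min(t.resolved or window_end, window_end)
        if end > start:
            downtime += end - start
    return 1 - downtime / ((window_end - window_start) * fleet_size)


# Made-up sample data: a 10-node fleet over a 72-hour window.
tickets = [
    RepairTicket("n1", datetime(2025, 1, 1, 0), datetime(2025, 1, 1, 4)),
    RepairTicket("n2", datetime(2025, 1, 2, 0), datetime(2025, 1, 2, 8)),
    RepairTicket("n3", datetime(2025, 1, 3, 0), None),  # still open
]
window_start, window_end = datetime(2025, 1, 1), datetime(2025, 1, 4)

print(mttr(tickets))  # 6:00:00
print(round(fleet_availability(tickets, 10, window_start, window_end), 4))  # 0.95
```

In practice a dashboard tool such as Grafana or Tableau would query these figures from the ticketing system; the sketch only shows the arithmetic behind the two metrics.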

Skills

Required

  • 10+ years in Data Center Operations, Systems Engineering, or HPC hardware
  • Expert-level understanding of x86/GPU server architecture and electrical distribution
  • Proven experience in hardware maintenance at scale
  • Deep familiarity with high-density AI infrastructure, including current NVIDIA H200 and Blackwell (GB200) systems
  • Expert proficiency in defining operational KPIs and building dashboards (e.g., Tableau, Grafana)
  • Experience performing Build vs. Buy analyses for technical tools and infrastructure software
  • Exceptional ability to distill complex technical risks, ticket-queue trends, and infrastructure hurdles into clear, actionable strategies for senior leadership

Nice to have

  • Ability to architect strategies for the transition to GB300 and Rubin platforms

What the JD emphasized

  • H200 and Blackwell (GB200)
  • GB300 and Rubin
  • Build vs. Buy