Senior Staff Cloud Support Engineer

Crusoe · Data AI · San Francisco, CA - US · Cloud Go-To-Market (GTM)

This Senior Staff Cloud Support Engineer role focuses on supporting and improving AI/ML infrastructure, including GPU clusters, distributed training, and inference. It combines technical leadership, incident response, reliability architecture, and customer-facing authority, with a strong emphasis on Kubernetes, networking (InfiniBand, RDMA, RoCE), and Linux systems.

What you'd actually do

  1. Serve as the highest-level escalation point for complex P1/P0 incidents.
  2. Lead cross-functional root cause investigations involving compute, networking (IB/RDMA/RoCE), storage, and orchestration layers.
  3. Partner with SRE and Software teams (Storage, Networking, Compute, Kubernetes) to design systemic fixes rather than rely on recurring workarounds.
  4. Design and improve node validation, burn-in processes, performance baselining, and release readiness.
  5. Influence Kubernetes architecture, workload orchestration (Slurm, Terraform), and AI/ML cluster stability.

Skills

Required

  • 8+ years of experience in SRE, DevOps, HPC, or Cloud Infrastructure roles.
  • Advanced Linux systems expertise.
  • Deep Kubernetes operational experience (CKA-level or higher).
  • Strong networking knowledge: InfiniBand, RDMA, RoCE, SDN.
  • Experience supporting AI/ML workloads at scale (GPU clusters).
  • Proven track record of resolving multi-layer, distributed system failures.
  • Strong customer communication and executive-facing presence.

What the JD emphasized

  • AI/ML infrastructure
  • complex P1/P0 incidents
  • distributed system failures
  • AI/ML cluster stability
  • AI workloads at scale

Other signals

  • AI/ML infrastructure
  • GPU clusters
  • distributed training
  • inference
  • Kubernetes
  • InfiniBand
  • RDMA
  • RoCE