Staff Cloud Hypervisor R&D Engineer

Crusoe · Data AI · San Francisco, CA, US · Cloud Engineering

We are seeking a Staff Cloud Hypervisor R&D Engineer to architect and implement a next-generation virtualization stack optimized for massive-scale GPU fleets, eliminating the virtualization tax and delivering bare-metal performance for AI/ML workloads.

What you'd actually do

  1. Lead the R&D and implementation of core hypervisor components (KVM, QEMU, or custom Rust-based solutions) specifically optimized for massive-scale GPU fleets.
  2. Develop and refine advanced hardware pass-through and abstraction techniques (SR-IOV, VFIO, mdev) so that NVIDIA GPUs and BlueField DPUs operate with near-zero virtualization overhead in a multi-tenant environment.
  3. Solve high-stakes technical challenges such as live migration of AI workloads with 80GB+ of VRAM and optimization of PCIe peer-to-peer communication between virtualized accelerators.
  4. Conduct deep-dive bottleneck analysis across the entire stack—from CPU microarchitecture and MMU virtualization to guest OS scheduling—to minimize jitter and maximize throughput.
  5. Actively contribute to and maintain upstream open-source virtualization projects, positioning Crusoe as a thought leader in the Linux kernel and virtualization communities.

Skills

Required

  • Hypervisor internals
  • Kernel development
  • Low-level systems programming
  • CPU virtualization (Intel VT-x, AMD-V)
  • Memory virtualization (EPT/NPT, HugePages)
  • C
  • C++
  • VirtIO
  • vhost-user
  • Hardware-accelerated I/O paths
  • Technical leadership

Nice to have

  • Rust for modern systems programming
  • QEMU
  • KVM
  • Linux kernel debugging tools (perf, ftrace, eBPF)
  • Specialized AI hardware (GPUs, InfiniBand/RoCE NICs, SmartNICs/DPUs)

What the JD emphasized

  • GPU fleets
  • NVIDIA GPUs
  • BlueField DPUs
  • 80GB+ VRAM
  • PCIe peer-to-peer communication
  • AI/ML workloads
  • virtualization tax
  • bare-metal performance