Senior Virtualization Validation Engineer

Crusoe · Data AI · San Francisco, CA - US · Cloud Engineering

This role focuses on validating large-scale, multi-node GPU clusters that run on virtualization technologies such as QEMU and Cloud Hypervisor. The engineer tests high-performance interconnects (NVLink, InfiniBand) and collective communication libraries (NCCL/RCCL) to ensure efficient scaling and low-latency communication for AI and HPC applications. Responsibilities include designing and executing scaling tests, validating hypervisor configurations (PCIe passthrough, IOMMU), benchmarking communication libraries, and developing automation frameworks for testing and performance analysis.

What you'd actually do

  1. Multi-Node Scaling Validation
  2. Interconnect & Fabric Testing
  3. Hypervisor & GPU Virtualization
  4. Collective Communication Benchmarking
  5. Network Stack Validation

Skills

Required

  • 5+ years of experience
  • Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related technical field
  • QEMU/KVM
  • Cloud Hypervisor
  • NVIDIA (CUDA/NCCL)
  • AMD (ROCm/RCCL)
  • RDMA
  • RoCE
  • InfiniBand
  • Linux kernel internals
  • PCIe topology
  • VFIO
  • memory management
  • HugePages
  • IOMMU
  • Python
  • Bash

Nice to have

  • MNNVL (Multi-Node NVLink)
  • specialized AI fabric architectures
  • hardware-level debugging tools
  • performance profilers (e.g., NVIDIA Nsight, AMD Omniperf)
  • Kubernetes with specialized device plugins

What the JD emphasized

  • high-performance GPU virtualization
  • QEMU
  • Cloud Hypervisor
  • distributed workloads
  • multi-node virtualized clusters
  • interconnect fabric
  • collective communication libraries
  • NCCL/RCCL
  • NVLink
  • Infinity Fabric
  • InfiniBand
  • RoCE
  • PCIe passthrough (VFIO)
  • IOMMU
  • direct device assignment
  • nccl-tests
  • rccl-tests
  • AllReduce
  • AllGather
  • SR-IOV
  • RDMA
  • Python
  • Go
  • Linux kernel
  • memory management
  • HugePages