Principal Systems Software Engineer

Crusoe · Data AI · San Francisco, CA - US · Cloud Engineering

This principal-level role centers on designing and leading the development of next-generation AI infrastructure, specifically the I/O path for massive-scale AI workloads. It involves unifying Bare-Metal-as-a-Service, Intelligent IaaS, and Elastic CaaS, optimizing hardware-software co-design, and leading R&D teams that ship production-grade kernel and orchestration code. The role requires deep expertise in the Linux kernel, virtualization, and high-performance networking, plus experience from hyperscale environments.

What you'd actually do

  1. Architect systems that deliver raw GPU throughput over low-latency InfiniBand/RDMA fabrics for massive-scale training.
  2. Design highly optimized, thin virtualization layers using KVM or custom micro-VMs to provide enterprise-grade isolation without the "virtualization tax."
  3. Build a high-performance container substrate (on Kubernetes or Slurm) that lets AI workloads burst and scale across heterogeneous GPU nodes.
  4. Lead the architectural design of our internal cloud fabric, drawing on experience from top-tier hyperscalers to drive the technical roadmap for SR-IOV, RDMA, and virtualized GPU scheduling.
  5. Lead elite workstreams to prototype and productionize novel methods for managing memory, networking, and compute that don't yet exist in standard cloud distributions.
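The scheduling responsibility above (placing workloads onto heterogeneous GPU nodes) can be illustrated with a minimal sketch. This is a toy first-fit placer, not anything from the posting; the node names, capacities, and job list are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A GPU node with a fixed capacity and a running free-GPU count."""
    name: str
    gpus_total: int
    gpus_free: int = field(init=False)

    def __post_init__(self):
        self.gpus_free = self.gpus_total

def schedule(jobs, nodes):
    """First-fit placement of GPU jobs onto heterogeneous nodes.

    jobs: list of (job_name, gpus_needed) tuples.
    Returns {job_name: node_name}; jobs that fit nowhere are omitted.
    """
    placement = {}
    for job, need in jobs:
        for node in nodes:
            if node.gpus_free >= need:
                node.gpus_free -= need
                placement[job] = node.name
                break
    return placement

# Hypothetical cluster: one 8-GPU node, one 4-GPU node.
nodes = [Node("h100-a", 8), Node("a100-b", 4)]
jobs = [("train-llm", 8), ("eval", 2), ("finetune", 4)]
print(schedule(jobs, nodes))
# → {'train-llm': 'h100-a', 'eval': 'a100-b'} ('finetune' no longer fits)
```

A production scheduler would also weigh topology (NVLink domains, RDMA locality) and preemption, which is exactly the "virtualized GPU scheduling" roadmap work the role describes.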

Skills

Required

  • Linux kernel
  • virtualization internals (KVM, QEMU, Firecracker)
  • high-performance networking (RoCE v2, InfiniBand)
  • hardware-software co-design
  • NVIDIA/AMD GPUs
  • high-speed NICs
  • R&D leadership
  • distributed systems
  • memory-mapped I/O
  • GPU clusters
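"Memory-mapped I/O" in the list above ultimately refers to mapping device memory (e.g. GPU BARs) into an address space; the kernel-level work the role targets can't be shown portably, but the basic mapping mechanism can be sketched with Python's stdlib `mmap` against a regular file:

```python
import mmap
import os
import tempfile

# Back a one-page mapping with a temporary file.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"\x00" * 4096)
    # Shared, writable mapping: stores through mm hit the file's pages.
    with mmap.mmap(fd, 4096) as mm:
        mm[0:5] = b"hello"        # write through the mapping...
    print(os.pread(fd, 5, 0))     # ...then read back via the fd: b'hello'
finally:
    os.close(fd)
    os.remove(path)
```

The same `mmap`/page-table machinery, applied to PCIe BARs rather than file pages, is what lets userspace drivers and RDMA stacks touch NIC and GPU memory without syscalls on the hot path.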

Nice to have

  • Patents related to network virtualization, GPU scheduling, or distributed file systems
  • Maintainer status or significant contributions to the Linux Kernel, Kubernetes, or specialized HPC projects
  • Direct experience optimizing infrastructure for Large Language Model (LLM) training and inference at scale

What the JD emphasized

  • industry-recognized expert
  • seen the movie at hyperscale
  • lead elite R&D teams in shipping production-grade kernel and orchestration code
  • push massive-scale training workloads to the theoretical limits of hardware
  • Hyperscale Provenance: 12+ years of experience designing and shipping core infrastructure at a major hyperscaler
  • Authoritative knowledge of the Linux kernel, virtualization internals (KVM, QEMU, Firecracker), and high-performance networking (RoCE v2, InfiniBand)
  • Proven ability to design software that maximizes the performance of NVIDIA/AMD GPUs and high-speed NICs
  • Experience leading cross-functional teams through high-ambiguity projects and delivering production-ready, mission-critical systems
