Systems Engineer II, Compute

Crusoe · Data AI · San Francisco, CA - US · Cloud Engineering

Crusoe is an AI infrastructure company building and operating compute platforms for AI workloads. This Systems Engineer II role focuses on designing, developing, and optimizing Crusoe's virtualized compute platform for AI workloads. Responsibilities include managing the virtualization stack across thousands of servers, integrating with AI hardware, optimizing performance for AI/ML workloads, and troubleshooting complex system issues. The role requires strong Linux systems knowledge, hardware integration experience, distributed systems design, and software development skills.

What you'd actually do

  1. Design highly reliable and performant Linux applications used to manage our virtualization stack across thousands of AI compute servers in multiple global datacenters.
  2. Integrate Crusoe applications with a wide variety of hardware and software AI chip-vendor stacks. Build solutions to optimize and monitor virtualized hardware (GPUs, InfiniBand/RoCE NICs, ephemeral storage, etc.) in cutting-edge AI/HPC environments.
  3. Work side by side with our Linux Kernel and Hypervisor teams to ensure our Crusoe applications are seamlessly integrated with a variety of kernels and hypervisors.
  4. Analyze and enhance the performance of the entire virtualization stack, from the hypervisor to the virtualized guest OS, with a specific focus on optimizing AI/ML workloads. This includes profiling, bottleneck identification, and implementing low-level optimizations.
  5. Diagnose and resolve complex system issues across our virtualization stack (drivers, kernel, hypervisor, guest OS, and Crusoe applications). Work closely with kernel and hypervisor teams to debug and resolve integration challenges.

Skills

Required

  • Linux kernel
  • virtualization
  • hardware tuning
  • distributed systems
  • object-oriented programming
  • low-level systems programming
  • Linux systems
  • device drivers
  • memory management
  • process scheduling
  • GPUs
  • CPUs
  • InfiniBand
  • Ethernet NICs
  • ephemeral disks
  • PCI Express
  • distributed applications
  • highly-scalable systems design
  • communications protocols (gRPC, REST, TCP/IP, etc.)
  • databases (Postgres, Redis)
  • messaging and streaming systems (Pub/Sub, Kafka)
  • application languages (Go, Java, Python)
  • systems languages (C, C++, Rust)
  • clean, maintainable code
  • unit-test driven mindset
  • excellent communication skills
  • rapid and agile learning
  • virtualization concepts
  • hypervisors
  • virtual machine lifecycles
  • Linux KVM tooling
  • CI/CD
  • GitLab or GitHub CI/CD pipelines

Nice to have

  • virtualization specifically for AI/ML workloads
  • GPU virtualization
  • debugging or contributing to kernel or hypervisor code
  • configuring thousands of live compute nodes in a bare-metal production environment

What the JD emphasized

  • critical to this role
  • must
  • highly reliable and performant
  • cutting-edge AI/HPC environments
  • seamlessly integrated
  • specific focus on optimizing AI/ML workloads
  • complex system issues
  • integration challenges
  • highest level of software quality, reliability, and security
  • cohesive and integrated product development
  • technical excellence
  • Linux Systems Familiarity
  • Solid understanding of hardware devices
  • Strong grasp of distributed applications and highly-scalable systems design
  • Strong experience building software applications
  • Keen eye for clean, maintainable code
  • unit-test driven mindset
  • Excellent Communication Skills
  • Rapid and Agile Learner
  • Virtualization Concepts
  • CI/CD and Validation

Other signals

  • AI compute servers
  • AI hardware platform integration
  • optimizing and monitoring virtualized hardware
  • optimizing AI/ML workloads
  • virtualization stack