Senior Hardware Systems Engineer

Crusoe · Data AI · San Francisco, CA - US · Cloud Engineering

Crusoe is seeking a Hardware Production / Sustaining Engineer to manage the full hardware lifecycle of high-performance compute systems, focusing on debugging, validation, and production support for AI workloads. The role involves driving automation, in-depth issue resolution, and reliability across GPU- and CPU-based infrastructure, with specific expertise in PCIe, InfiniBand, and NVMe/storage.

What you'd actually do

  1. Drive the full hardware development and sustaining lifecycle, including feasibility, bring-up, validation, deployment, and ongoing production support.
  2. Develop and maintain scripting and automation frameworks for hardware testing, diagnostics, and continuous reliability improvements.
  3. Lead deep troubleshooting and debugging across:
       • PCIe (link training, topology, performance issues)
       • InfiniBand (fabric debugging, throughput, connectivity issues)
       • NVMe/storage (performance bottlenecks, firmware interactions, failure analysis)
  4. Conduct rigorous system validation and characterization for GPU, CPU, and high-performance compute platforms.
  5. Support end-to-end (E2E) integration and solution testing to ensure Crusoe Cloud products meet performance, reliability, and scalability expectations.

Skills

Required

  • 8–10+ years of experience in hardware development, validation, sustaining engineering, or production engineering.
  • Strong hands-on expertise in PCIe, InfiniBand, and NVMe/storage debugging and development.
  • Deep proficiency in hardware bring-up, board-level debugging, and system-level validation.
  • Ability to design and implement automation frameworks for hardware testing (Python, Shell, or similar).
  • Technical background in digital and analog design, server architecture, and high-performance compute hardware.
  • Experience working across thermal, mechanical, firmware, and software functions in multidisciplinary environments.
  • Strong analytical and problem-solving skills with a data-driven approach.
  • Excellent communication and collaboration skills for working with internal teams and external partners.
  • Bachelor’s or Master’s degree in Electrical Engineering, Computer Engineering, or equivalent experience.

Nice to have

  • Experience designing or optimizing GPU-to-GPU communication architectures for AI/ML workloads.
  • Direct experience integrating NVLink or other next-generation GPU interconnect technologies.
  • Familiarity with cutting-edge GPU architectures and how to leverage them in AI/HPC environments.
  • Expertise supporting or designing systems across both ARM and x86 server architectures.
  • Background in sustainable or energy-efficient hardware design practices.
  • Advanced certifications or coursework in AI/HPC hardware systems.

What the JD emphasized

  • PCIe
  • InfiniBand
  • NVMe/storage