Infrastructure Engineer

Crusoe · Data AI · Dublin - IE · Cloud Engineering

Crusoe is an AI infrastructure company that owns and operates its own hardware and energy resources to power AI workloads. This role focuses on ensuring the reliability and stability of its GPU hardware platform through hands-on diagnosis, repair, and automation development for fleet management and maintenance.

What you'd actually do

  1. Investigating and troubleshooting problems and hardware faults within our GPU platforms that our automation can’t diagnose. This will involve pulling data from system logs, kernel logs and BMC Redfish APIs and, where the data is not there, working with hardware and kernel engineers to add the information you need to make accurate determinations.
  2. Working closely with our Data Centre Operations, Hardware Engineering and Capacity Planning teams to repair and remediate failed hardware, ensure consistent delivery of new hardware to customers, and roll out upgrades across the fleet
  3. Automating routine processes and building Crusoe’s hardware diagnostics, provisioning and repair tooling
  4. Building processes, documentation and tooling, once you’ve figured out the best way to do something, to help the next person who runs into the same problem
  5. Conducting rigorous testing and validation on cutting-edge hardware and servers that come back from repair

Skills

Required

  • Linux internals
  • Server-class hardware & provisioning
  • Hardware and networking fundamentals
  • Excellent communication and collaboration skills
  • Bachelor's degree in Computer Science or a related field, or equivalent self-taught knowledge of computer science fundamentals

Nice to have

  • Large-scale GPU operations
  • Proficiency with at least one programming language (Python, Go, or similar)

What the JD emphasized

  • latest NVIDIA and AMD GPUs
  • cutting-edge hardware
  • latest generation AI hardware