Staff Software Engineer

Crusoe · Data AI · San Francisco, CA - US · Cloud Engineering

This role focuses on developing software for managing a fleet of GPU servers and data centers, with an emphasis on diagnostics, observability, automation, and repair tooling for high-performance GPU compute clusters. It involves developing AI agents for hardware diagnosis and remediation, and tooling for critical environment management and post-repair validation.

What you'd actually do

  1. Developing and implementing deep-level diagnostics and troubleshooting of hardware faults within GPU racks and high-density compute systems.
  2. Developing troubleshooting and automation tooling for GPU platforms, including NVIDIA A100, H200, GB200, and B200 and AMD MI350X / MI355X.
  3. Developing automation and AI agents for executing component-level diagnosis and remediation for failed or degraded hardware.
  4. Developing innovative tooling and AI agents, in conjunction with data center operations, for managing the critical environment.
  5. Developing tooling for post-repair validation and testing, using burn-in tests, PyTorch, and NVIDIA NCCL, to ensure system stability and performance.

Skills

Required

  • Software engineering experience
  • Ability to identify a problem, rapidly develop a scalable solution and ship it.
  • Ability to lean in and assist team members working on critical or complex technical initiatives.
  • Ability to set the technical direction for a specific project and execute.
  • Expertise in distributed systems, reliability, and cloud platforms (Kubernetes, IaC, GCP, etc.).
  • Strength in at least one programming language, such as Go, Python, Java, or Rust.
  • Strong analytical and problem-solving skills.
  • Excellent communication and collaboration skills.
  • Ability to work independently and within a team.

Nice to have

  • Experience with Temporal and Kubernetes.
  • Experience working directly with hardware vendors.
  • Background in large-scale GPU fleet operations or hyperscale data center environments.

What the JD emphasized

  • rapidly develop a scalable solution and ship it
  • AI agents

Other signals

  • Developing automation and AI agents for executing component-level diagnosis and remediation for failed or degraded hardware.
  • Developing innovative tooling and AI agents, in conjunction with data center operations, for managing the critical environment.