Staff Software Engineer, Cape

Crusoe · Data AI · San Francisco, CA - US · Cloud Engineering

Crusoe is looking for a Staff Software Engineer to architect, design, and develop the intelligence layer that manages GPU nodes across its AI infrastructure fleet, with a focus on node assignment, monetization, and lifecycle automation. The role involves building foundational systems for a vertically integrated, AI-first cloud platform.

What you'd actually do

  1. Building the Virtual Pool Service (VP Service), a physical infrastructure classification layer that serves as the single source of truth for every GPU node's state, pool membership, and transition history across Crusoe's fleet
  2. Designing and implementing Capacity Management Intelligence (CMI), the automation layer that handles priority-descending allocation, forward availability forecasting, and automated node lifecycle transitions, replacing manual spreadsheet workflows with enforced, auditable, event-driven automation
  3. Collaborating extensively across teams to architect and implement physical infrastructure management systems, availability platforms, and frameworks that meet end-to-end customer use cases
  4. Championing reliability, scalability, and security of our systems, designing high-performing, highly available cloud architectures optimized for both performance and cost-effectiveness
  5. Streamlining cloud deployment, configuration management, and operations using Go, gRPC, NATS event streaming, PostgreSQL (CNPG on Kubernetes), and NetBox as the physical source of truth

Skills

Required

  • 10+ years of relevant experience building and operating distributed systems at scale
  • Proven experience building reliable, scalable, and secure cloud platforms and running them in production
  • Strong distributed systems thinking with the ability to reason about consistency, failure modes, event ordering, and correctness invariants
  • Fluency in Go, Rust, Java, or C++; Go is our primary language
  • Collaborative, platform-minded approach to building robust systems and driving adoption across dev and ops teams
  • Ownership mentality with comfort owning a system end to end: design, implementation, testing, ops, and iteration
  • Good judgment under ambiguity, with the ability to drive open-ended technical decisions to resolution
  • Excellent communication and troubleshooting skills across cross-functional teams

Nice to have

  • Hands-on experience deploying, managing, and troubleshooting Kubernetes clusters
  • Prior experience with event-driven architectures or message streaming systems (NATS, Kafka, Kinesis)
  • Experience with capacity planning, resource scheduling, or fleet management systems
  • Background in GPU compute, AI/ML platform infrastructure, or fast-paced startup environments
  • A passion for sustainability, clean energy, and building AI infrastructure that scales responsibly

What the JD emphasized

  • architect, design, and develop the intelligence layer
  • first engineers on both the Virtual Pool Service and Capacity Management Intelligence systems
  • shaping implementation
  • making real design decisions
  • building the foundational infrastructure
  • driving key business revenue metrics at scale
  • physical infrastructure classification layer
  • single source of truth
  • automation layer
  • priority-descending allocation
  • forward availability forecasting
  • automated node lifecycle transitions
  • replacing manual spreadsheet workflows
  • enforced, auditable, event-driven automation
  • physical infrastructure management systems
  • availability platforms
  • end-to-end customer use cases
  • reliability, scalability, and security
  • high-performing, highly available cloud architectures
  • performance and cost-effectiveness
  • cloud deployment, configuration management, and operations
  • Netbox as the physical source of truth
  • 10+ years of relevant experience building and operating distributed systems at scale
  • Proven experience building reliable, scalable, and secure cloud platforms and running them in production
  • Strong distributed systems thinking
  • reason about consistency, failure modes, event ordering, and correctness invariants
  • Ownership mentality
  • owning a system end to end: design, implementation, testing, ops, and iteration
  • Good judgment under ambiguity
  • drive open-ended technical decisions to resolution
  • Excellent communication and troubleshooting skills across cross-functional teams