Senior Staff Software Engineer, Cape

Crusoe · Data AI · San Francisco, CA - US · Cloud Engineering

This role focuses on building and operating the intelligence layer for GPU node assignment, monetization, and management within an AI infrastructure company. It involves designing and implementing systems for physical infrastructure classification, capacity management, and automation, with a strong emphasis on distributed systems, reliability, scalability, and security. The role also requires setting technical vision, making architectural decisions, and influencing engineering culture.

What you'd actually do

  1. Building the Virtual Pool Service (VP Service), a physical infrastructure classification layer that serves as the single source of truth for every GPU node's state, pool membership, and transition history across Crusoe's fleet
  2. Designing and implementing Capacity Management Intelligence (CMI), the automation layer replacing manual spreadsheet workflows with enforced, auditable, event-driven automation for allocation, forecasting, and node lifecycle transitions
  3. Collaborating extensively across teams to architect physical infrastructure management systems, availability platforms, and frameworks that meet end-to-end customer use cases
  4. Championing the reliability, scalability, and security of Crusoe's systems, and designing highly available cloud architectures optimized for performance and cost-effectiveness
  5. Streamlining cloud deployment, configuration management, and operations using Go, gRPC, NATS event streaming, PostgreSQL (CNPG on Kubernetes), and Netbox

Skills

Required

  • Go
  • gRPC
  • NATS event streaming
  • PostgreSQL
  • Kubernetes
  • distributed systems
  • cloud platforms
  • reliability
  • scalability
  • security
  • Rust, Java, or C++ (fluency in any one of these or Go)
  • capacity planning
  • resource scheduling
  • fleet management systems
  • GPU compute
  • AI/ML platform infrastructure

Nice to have

  • Netbox
  • event-driven architectures
  • message streaming systems
  • startup environments

What the JD emphasized

  • set the technical vision
  • lead the development
  • define architecture
  • make foundational design decisions
  • shape the engineering culture
  • physical infrastructure classification layer
  • Capacity Management Intelligence (CMI)
  • event-driven automation
  • end-to-end customer use cases
  • highly available cloud architectures
  • multi-quarter roadmap planning
  • architectural governance
  • company-wide technical forums
  • influencing cross-organizational standards
  • multi-system trade-offs
  • driving alignment
  • organizational engineering capability
  • mentorship programs
  • hiring bar definition
  • onboarding frameworks
  • 12+ years of relevant experience
  • building and operating distributed systems at scale
  • Proven experience building reliable, scalable, and secure cloud platforms
  • running them in production
  • Strong distributed systems thinking
  • reason about consistency, failure modes, event ordering, and correctness invariants
  • Fluency in Go, Rust, Java, or C++
  • Demonstrated ability to define and drive multi-year technical strategy
  • platform-scale systems
  • visible company-level impact
  • track record of independently owning ambiguous, high-stakes technical problems
  • delivering results
  • Experience influencing engineering culture and standards beyond your immediate team
  • hiring, design review processes, or org-wide tooling adoption
  • Excellent communication and troubleshooting skills
  • cross-functional teams
  • Hands-on experience deploying, managing, and troubleshooting Kubernetes clusters
  • Prior experience with event-driven architectures or message streaming systems
  • Experience with capacity planning, resource scheduling, or fleet management systems
  • Background in GPU compute, AI/ML platform infrastructure
  • fast-paced startup environments