Staff Engineer, Command Center Insights & Actions

Crusoe · Data AI · San Francisco, CA - US · Cloud Engineering

A Staff Engineer role focused on building and scaling the detection and intelligence systems for Crusoe's Command Center platform. The work involves defining heuristics, thresholds, and rules for alerting and anomaly detection, integrating ML/RL techniques where appropriate, and shipping customer-facing features. It requires expertise in anomaly detection, distributed systems, and production software engineering, with an emphasis on infrastructure telemetry and actionable insights.

What you'd actually do

  1. Own the full detection stack — heuristics, threshold calibration, precision/recall tuning, and the rule systems that define what "something is wrong" means for the platform.
  2. Design and maintain detection systems including straggler node detection, GPU health signals, and fleet-level behavioral baselines.
  3. Drive detection fidelity by reducing false positives, increasing signal coverage, and building feedback loops that keep thresholds accurate as the fleet grows.
  4. Evaluate and integrate machine learning and reinforcement learning techniques where they outperform rule-based approaches — and know when not to reach for a model.
  5. Ship customer-facing features end-to-end across the Command Center Insights & Actions (CCIA) stack: alert rule engine, control plane APIs, automated action systems, and insights delivery surfaces.

Skills

Required

  • Anomaly Detection & Heuristics Expertise
  • Threshold & Signal Calibration
  • Distributed Systems Fundamentals
  • Full Software Engineering Craft (5+ years shipping production software; experience with modern compiled or systems languages like Go, Rust, C++, Java)
  • Data & Observability Fluency (time-series data, telemetry pipelines, observability primitives)
  • Communication Skills

Nice to have

  • GPU profiling tools (Nsight, NCCL Inspector) or hardware-level infrastructure diagnostics
  • Observability platforms or products
  • Reinforcement learning applied to operational or infrastructure problems
  • Large-scale fleet management or cloud infrastructure
  • Building team culture and engineering quality of life

What the JD emphasized

  • deep expertise in anomaly detection, heuristics, and machine/reinforcement learning—applied to real infrastructure at global scale
  • 5+ years shipping production software
  • Deep experience building anomaly detection systems, heuristics-based rule engines, or ML/RL systems for infrastructure or data-intensive domains.
  • Demonstrated ability to reason about precision/recall trade-offs and build feedback loops that keep detection systems accurate over time.

Other signals

  • anomaly detection
  • ML/RL integration
  • infrastructure telemetry
  • customer-facing features