Staff Software Engineer - Cloud Availability Platform Engineering (Observability)

Crusoe · Data AI · San Francisco, CA - US · Cloud Engineering

Crusoe is an AI infrastructure company seeking a Staff Software Engineer to lead the architecture and evolution of its observability platform at scale. The role defines and drives the strategy for collecting, processing, and using telemetry across Crusoe’s global cloud infrastructure, which supports high-volume AI/ML, GPU, and distributed workloads. The engineer will architect and operate scalable telemetry systems (metrics, logs, traces) and drive adoption of monitoring tools such as Prometheus, Grafana, and OpenTelemetry, keeping the platform reliable, efficient, and scalable.

What you'd actually do

  1. Lead the architecture and long-term strategy for Crusoe’s observability platform, which supports multi-region, multi-datacenter Kubernetes infrastructure
  2. Design and operate large-scale telemetry systems (metrics, logs, traces) capable of supporting high-volume AI/ML, GPU, and distributed workloads
  3. Architect end-to-end telemetry pipelines, including ingestion, storage, indexing, querying, and visualization layers
  4. Drive adoption and evolution of Crusoe’s monitoring ecosystem, including Prometheus, Alertmanager, Thanos/Cortex/Mimir, Grafana, and OpenTelemetry
  5. Design highly scalable log ingestion and processing pipelines using technologies such as Fluent Bit, Vector, Loki, or ELK/OpenSearch

Skills

Required

  • infrastructure
  • platform engineering
  • distributed systems
  • observability platforms
  • telemetry systems
  • metrics platforms
  • logging pipelines
  • tracing systems
  • Go
  • Python
  • Kubernetes
  • reliability engineering
  • performance debugging
  • telemetry pipelines
  • leadership
  • mentorship

Nice to have

  • Contributions to open-source observability projects
  • AI/ML infrastructure
  • GPU-heavy compute environments
  • event streaming
  • telemetry pipelines using Kafka, NATS, or Pulsar
  • cost optimization strategies for large-scale observability platforms
  • incident response
  • chaos engineering
  • resilience testing

What the JD emphasized

  • 10+ years of experience in infrastructure, platform engineering, or distributed systems, with deep expertise in observability platforms
  • Extensive experience building and operating telemetry systems at scale
  • Strong programming skills in Go or Python, with experience building infrastructure tooling, operators, or platform automation
  • Experience operating observability platforms in large Kubernetes environments across multi-region or multi-datacenter infrastructure
  • Deep understanding of distributed systems, reliability engineering, and performance debugging in complex production environments
  • Proven ability to design and scale telemetry pipelines handling high-cardinality and high-throughput data workloads
  • Strong leadership and mentorship capabilities, helping engineers and teams improve observability maturity