Staff+ Software Engineer, Observability

Anthropic · AI Frontier · London, United Kingdom · Software Engineering - Infrastructure

This Staff+ role at Anthropic focuses on building and operating the monitoring and telemetry infrastructure (metrics, logging, tracing, error analytics) for AI systems. The engineer will design and build scalable ingest and storage pipelines, own core observability platforms, develop instrumentation, and drive alerting and SLO infrastructure. They will also reduce mean time to detection and resolution by building cross-signal correlation and AI-assisted diagnostic tooling. The role involves partnering with Research, Inference, Product, and Infrastructure teams. Experience with high-throughput data pipelines, columnar storage, and observability platforms is required. Interest in applying AI/LLMs to operational workflows is a plus.

What you'd actually do

  1. Design and build scalable telemetry ingest and storage pipelines for metrics, logs, traces, and error data across Anthropic’s multi-cluster infrastructure
  2. Own and evolve core observability platforms, driving migrations and architectural improvements that improve reliability, reduce cost, and scale with organizational growth
  3. Build instrumentation libraries, SDKs, and integrations that make it easy for engineering teams to emit high-quality telemetry from their services
  4. Drive alerting and SLO infrastructure that enables teams to define, monitor, and respond to reliability targets with minimal noise
  5. Reduce mean time to detection and resolution by building cross-signal correlation, unified query interfaces, and AI-assisted diagnostic tooling

Skills

Required

  • Python
  • Rust
  • Go
  • observability platforms
  • Prometheus
  • Grafana
  • ClickHouse
  • OpenTelemetry
  • high-throughput data pipelines
  • columnar storage engines

Nice to have

  • very high cardinality metrics systems
  • log storage migrations
  • BigQuery
  • OpenTelemetry collector pipelines
  • tail-based sampling strategies
  • alerting platforms
  • on-call tooling
  • SLO frameworks
  • Kubernetes-native monitoring
  • eBPF-based observability
  • continuous profiling
  • applying AI/LLMs to operational workflows
  • automated root cause analysis
  • anomaly detection
  • intelligent alerting

What the JD emphasized

  • 10+ years of relevant industry experience building and operating large-scale observability or monitoring infrastructure
  • high-throughput data pipelines
  • ingesting and querying telemetry data at scale
  • operating metrics systems at very high cardinality
  • operating columnar databases

Other signals

  • building next-generation observability systems
  • high-throughput ingest pipelines
  • cost-efficient columnar storage
  • unified query layers across signals
  • agentic diagnostic tools