Member of Technical Staff - Observability

xAI · AI Frontier · Palo Alto, CA · Product

Build and operate the core infrastructure for monitoring, debugging, and optimizing AI systems, handling telemetry at massive scale with strict performance and availability requirements. Own critical systems for metrics, logs, tracing, and alerting to enable engineering teams to operate services, identify issues, and drive reliability improvements.

What you'd actually do

  1. Design and implement scalable observability infrastructure for metrics, logging, and tracing.
  2. Build high-performance telemetry pipelines that handle massive ingestion volumes.
  3. Develop APIs, query engines, and UIs that allow engineers to get real-time insights into their services.
  4. Define and enforce best practices for instrumentation, alerting, and reliability across the company.
  5. Partner with infrastructure and product teams to deeply integrate observability into our internal platforms.

Skills

Required

  • Proficiency in Go, Rust, Scala, or a similar language
  • Deep understanding of distributed systems and telemetry architecture
  • Experience building and operating infrastructure at scale
  • Familiarity with observability stacks such as Prometheus, Grafana, OpenTelemetry, VictoriaMetrics, or ClickHouse
  • Experience with Kafka, Redis, or large-scale time series databases
  • Experience operating observability pipelines in Kubernetes or similar orchestration environments

What the JD emphasized

  • observability infrastructure
  • telemetry pipelines
  • observability stacks
  • observability pipelines