Senior Manager, Observability

Weights & Biases · Data AI · Sunnyvale, CA · Technology

The Senior Manager, Observability Engineering leads a team responsible for building, scaling, and operating observability systems across metrics, logs, traces, and telemetry pipelines. The role combines technical leadership, operational ownership, and team management to ensure that observability platforms scale with business and customer needs in support of AI infrastructure.

What you'd actually do

  1. lead a team responsible for building, scaling, and operating observability systems across metrics, logs, traces, and telemetry pipelines
  2. define strategy and roadmap, drive platform reliability and performance improvements, and guide architectural decisions across observability infrastructure
  3. partner closely with infrastructure, platform, security, and application engineering teams to improve instrumentation and production visibility
  4. combine technical leadership, operational ownership, and team management to ensure observability platforms scale with business and customer needs

Skills

Required

  • software engineering experience with production systems at scale
  • engineering management experience
  • building and operating observability platforms
  • reliability engineering concepts including SLOs, SLIs, incident management, error budgets, and fault-tolerant design
  • scaling telemetry systems
  • distributed systems
  • performance engineering
  • hiring and managing engineering teams

Nice to have

  • OpenTelemetry, Grafana, Prometheus-compatible systems, log aggregation, and distributed tracing tools
  • operating cloud-native infrastructure, including Kubernetes environments
  • supporting large-scale cloud, developer platforms, or AI/ML infrastructure
  • capacity planning for high-ingest telemetry systems
  • scaling platforms in high-growth environments

What the JD emphasized

  • 8+ years of software engineering experience with production systems at scale
  • 4+ years of engineering management experience leading senior engineers and technical leads
  • Experience building and operating observability platforms across logs, metrics, traces, and alerting in distributed systems
  • Experience scaling telemetry systems including collection pipelines, storage backends, and query layers
  • Experience supporting large-scale cloud, developer platforms, or AI/ML infrastructure