Staff Software Engineer, Observability

Weights & Biases · Data AI · Bellevue, WA +3 · Technology

Staff Software Engineer focused on building and maintaining scalable observability systems (logging, tracing, metrics) for a cloud provider specializing in AI infrastructure. This role involves leading engineers, managing production clusters, and ensuring reliability of critical infrastructure.

What you'd actually do

  1. Lead and mentor engineers, fostering a culture of collaboration and continuous improvement.
  2. Scale logging, tracing, and metrics platforms to support a global datacenter footprint.
  3. Develop and refine monitoring and alerting to enhance system reliability.
  4. Advise engineers across CoreWeave on optimal usage of observability systems.
  5. Automate interactions with CoreWeave’s Compute Infrastructure layer.

Skills

Required

  • Software Engineering
  • Site Reliability Engineering
  • DevOps
  • ClickHouse
  • Elastic
  • Loki
  • VictoriaMetrics
  • Prometheus
  • Thanos
  • Grafana
  • Kubernetes
  • containerization
  • microservices architectures
  • incident management
  • post-mortem analysis

Nice to have

  • running and scaling observability tools as a cloud provider
  • administering large-scale Kubernetes clusters
  • data-streaming systems

What the JD emphasized

  • Observability
  • logging
  • tracing
  • metrics
  • Kubernetes