Senior Engineer, Network Observability

Weights & Biases · Data AI · Bellevue, WA +5 · Remote · Technology

As a Senior Engineer, Network Observability, you will design, develop, and maintain the monitoring, telemetry, and observability systems for a GPU cloud network. The role focuses on building solutions that give real-time insight into network performance, detect issues proactively, and speed up resolution. Responsibilities include developing observability platforms in Python and Golang; ingesting and unifying logs, metrics, and events; designing scalable telemetry solutions; and collaborating with network engineering, SRE, and security teams.

What you'd actually do

  1. Develop, optimize, and maintain network observability platforms. Use your skills in Python and Golang to create and automate collectors, exporters, and dashboards that provide deep visibility into network health and performance.
  2. Collaborate with Network Engineering and Platform teams to ingest and unify logs, metrics, and events from a variety of platforms (Arista EOS, NVIDIA Cumulus Linux, Nokia SR OS, SR Linux, etc.) into a single observability pipeline.
  3. Design and implement scalable telemetry solutions using protocols such as gNMI and SNMP, together with streaming analytics. Build alerting and anomaly detection with tools such as Prometheus, Grafana, and Alertmanager.
  4. Work closely with network developers, site reliability engineers, and security teams to integrate observability solutions across the broader infrastructure. Participate in design discussions, RFCs, and architectural decisions.
  5. Join a rotating on-call schedule to troubleshoot and resolve observability-related issues. Provide timely support to operations teams, quickly isolating and fixing problems when they arise.

Skills

Required

  • Prometheus
  • Grafana
  • Alertmanager
  • gNMI
  • SNMP
  • Python
  • Golang
  • Bash
  • Linux systems
  • IP networking
  • Arista EOS
  • NVIDIA Cumulus Linux
  • Nokia SR OS
  • SR Linux
  • Kubernetes

Nice to have

  • Extending custom metric collectors/exporters
  • Network Engineer experience
  • SRE experience
  • Software Developer experience
  • Systems Administrator experience
  • Building and operating robust telemetry and monitoring solutions
  • Automating tasks and processes
  • Configuration management and templating tools (e.g., Ansible, Jinja2)
  • Machine learning for anomaly detection
  • TensorFlow
  • scikit-learn
  • Network certifications (CCNA, CCNP)
  • Data pipelines
  • Event correlation
  • Anomaly detection in large-scale environments
  • OpenTelemetry
  • Jaeger
  • Zipkin