Senior Software Engineer, Observability

Weights & Biases · New York, NY · Technology

Senior Software Engineer on the Observability team, responsible for designing, building, and maintaining core observability infrastructure (metrics, logging, tracing, telemetry) for AI workloads on GPU infrastructure. The role focuses on scalable distributed systems, reliability, and performance.

What you'd actually do

  1. Design, build, and maintain core observability infrastructure spanning metrics, logging, tracing, and telemetry pipelines.
  2. Develop highly reliable and scalable systems, collaborating with internal engineering teams to embed observability best practices.
  3. Tackle performance and reliability challenges across clusters of thousands of GPUs.
  4. Contribute to platform strategy and participate in on-call rotations to ensure critical production systems remain robust and operational.

Skills

Required

  • 5+ years of experience in software or infrastructure engineering designing, building, and operating large-scale distributed systems in production
  • Proficient in Go or Python, writing clean, testable, and resilient production code
  • Hands-on experience with Kubernetes, containerization, and microservices architectures in production environments
  • Proven ability to design and deliver scalable, robust systems with high-quality code, automated testing, and progressive release strategies
  • Skilled in decomposing complex problems in distributed architectures into manageable, well-scoped work
  • Familiar with Helm and YAML-based configurations for deploying and managing services, including templating, automation, and infrastructure-as-code practices
  • Experience participating in on-call rotations for critical production systems
  • Bachelor’s degree in Computer Science, Electrical Engineering, Mathematics, or related field

Nice to have

  • Experience designing, operating, or scaling logging, metrics, or tracing platforms (e.g., Loki, ClickHouse, Elasticsearch, Prometheus, VictoriaMetrics, Grafana, Thanos)
  • Familiarity with data streaming systems for observability pipelines (e.g., Kafka, Kafka Connect)
  • Experience automating infrastructure provisioning using tools like Terraform
  • Knowledge of OpenTelemetry for unified telemetry collection and instrumentation
  • Exposure to modern AI workloads and GPU-based infrastructure, including large-scale training and inference

What the JD emphasized

  • large-scale distributed systems in production
  • Kubernetes, containerization, and microservices architectures in production environments
  • scalable, robust systems with high-quality code, automated testing, and progressive release strategies
  • decomposing complex problems in distributed architectures into manageable, well-scoped work
  • logging, metrics, or tracing platforms
  • data streaming systems for observability pipelines
  • automating infrastructure provisioning
  • OpenTelemetry for unified telemetry collection and instrumentation
  • modern AI workloads and GPU-based infrastructure, including large-scale training and inference