Senior / Staff Software Engineer (Observability / SRE)

Waabi · Robotics · Toronto, ON +3 · Remote · Software Engineering

Senior/Staff Software Engineer focused on designing, leading, and developing Waabi's observability and SRE stack for cloud and on-prem environments. This includes developing workloads and benchmarks for ML/AI, optimizing performance across various system layers, and building automation tooling for CI/CD, telemetry, and anomaly detection. The role also involves supporting client teams and influencing system architecture for scalability and monitoring.

What you'd actually do

  1. Design and lead the architecture and development of Waabi’s monitoring and observability stack, used to monitor the health and performance of cloud and on-prem environments.
  2. Develop and extend workloads and benchmarks (compute, storage, network, ML/AI) and integrate stress, chaos, and regression tests to validate hardware and platform choices.
  3. Analyze and optimize end-to-end performance across hardware, firmware, Linux kernel, runtimes, and distributed services using advanced profiling tools (perf, eBPF, flamegraphs, tracing frameworks).
  4. Build automation and observability tooling (Go/Python/Java, Kubernetes/Docker) for CI/CD-based performance regression detection, telemetry, alerting, and anomaly detection.
  5. Work with client teams to support their applications’ observability requirements.

Skills

Required

  • 5+ years software engineering or systems/performance engineering experience
  • Proficient in at least one of: Python, Rust, C/C++
  • Strong CS fundamentals and system design skills
  • Hands-on with Linux internals (CPU scheduling, memory, I/O, networking)
  • Experience with performance tooling (perf, eBPF, flamegraphs, tracing frameworks)
  • Experience with Kubernetes, microservices, and distributed systems
  • Comfortable building production services and pipelines
  • Proven track record of clear communication, writing design docs, and leading cross-functional efforts

Nice to have

  • Experience deploying and managing observability platforms (OpenTelemetry, Grafana OSS)
  • Performance tuning for databases/streaming/batch/ML platforms
  • GPU/xPU or Arm performance exposure
  • Experience tuning stream processing, batch or ML platforms (e.g. Argo Workflows, PyTorch)
  • Familiarity with microservices debugging and distributed tracing (OpenTelemetry, Prometheus)

What the JD emphasized

  • ML/AI workloads and benchmarks
  • performance tuning for databases/streaming/batch/ML platforms

Other signals

  • monitoring and observability stack
  • ML/AI workloads and benchmarks
  • performance across hardware, firmware, Linux kernel, runtimes, and distributed services
  • automation and observability tooling
  • anomaly detection
  • support their applications’ observability requirements
  • system architecture and tooling decisions
  • scales its infrastructure