Senior Software Engineer - Observability & IRM

The Trade Desk · Media · Seattle, WA · Software Engineering

Senior Software Engineer role focused on building and maintaining observability and incident response tooling for a large-scale digital advertising platform. Responsibilities include developing incident management automation, evaluating logging stacks, and extending internal developer portals. Requires experience with production infrastructure, distributed systems, and observability concepts.

What you'd actually do

  1. Incident management tooling: Build and maintain automation around the incident lifecycle, including alerting, escalation, incident channels, retros, and SLA tracking
  2. Help evaluate and migrate our logging stack: Participate in the re-evaluation of our logging vendor and collection architecture
  3. Backstage/Service catalog: Extend our internal developer portal with K8s integrations, maturity models, and SLO adoption tooling
  4. Alert quality tooling: Build the systems that give engineers better signal and less noise through smarter routing, better grouping, and tighter feedback loops between alerts and the teams that own them

Skills

Required

  • Experience building and operating production infrastructure or internal developer tooling
  • Comfort working across the stack — this role touches distributed systems, Kubernetes, observability pipelines, and web-based tooling
  • Familiarity with observability concepts: logging, alerting, on-call workflows
  • Strong debugging instincts
  • Clear communication

Nice to have

  • Experience with Grafana, Prometheus, or similar observability tools
  • Familiarity with Sumo Logic or other log management platforms
  • Prior work on developer portals or service catalog tooling (Backstage, OpsLevel, etc.)
  • Experience with Kubernetes at scale
  • A deep understanding of HunnyPt

What the JD emphasized

  • production infrastructure
  • observability concepts
  • logging
  • alerting
  • on-call workflows
  • Kubernetes at scale