Senior Manager, Engineering - Observability Platform (Remote Eligible)

Smartsheet · Seattle · United States · Engineering - Developers

Smartsheet is seeking a Senior Manager of Engineering to build and lead a centralized observability platform capability that gives the company full-stack visibility into its complex systems, including AI/ML observability integrations tied to SmartAssist and agentic AI workstreams.

What you'd actually do

  1. Lead a team of engineers focused on observability platform engineering, driving the build-out of a unified observability stack used by all engineering teams at Smartsheet.
  2. Own and evolve the platform's technical roadmap, consolidating multiple tooling platforms and AI observability tooling into a coherent, scalable capability.
  3. Build and maintain AI/ML observability integrations in partnership with the AI Platform team.
  4. Partner with the Data & AI Platform team to integrate MLflow tracing, Inference Tables, and LLM-as-judge evaluation pipelines into the observability stack.
  5. Develop dashboards and alerting for agentic AI workloads, including latency, token consumption, error rates, and evaluation metric drift.

Skills

Required

  • software or platform engineering experience
  • distributed systems
  • infrastructure
  • backend services
  • engineering management experience
  • team building
  • performance management
  • cross-functional delivery ownership
  • observability tooling
  • Datadog (APM, metrics, logs, alerting)
  • OpenSearch or Elasticsearch
  • distributed tracing (OpenTelemetry or equivalent)
  • SLO/SLA management at scale
  • operating observability platforms for high-availability, high-throughput production environments
  • building and scaling engineering teams
  • complex, cross-functional infrastructure programs
  • communication (written and verbal)
  • risk identification
  • status communication
  • managing vendors
  • external delivery partners
  • third-party integrations

Nice to have

  • AI/ML observability
  • MLflow tracing
  • LLM evaluation pipelines
  • observability for agentic AI systems
  • Amazon Bedrock
  • ECS Fargate
  • LangGraph-based multi-agent architectures
  • cloud cost governance

What the JD emphasized

  • AI/ML observability integrations
  • agentic AI workstreams
  • LLM evaluation pipelines
  • AI governance and cost observability
  • Deep hands-on expertise with observability tooling
  • Proven experience operating observability platforms for high-availability, high-throughput production environments
  • Strong execution track record on complex, cross-functional infrastructure programs with high ambiguity
