Staff Applied Scientist - Dashboards

Datadog · Enterprise · New York, NY · Dev Eng

Staff Applied Scientist focused on defining and guaranteeing the quality of an AI system (Dashboards product) at scale. This involves owning the evaluation strategy, building eval datasets and regression harnesses, and driving improvements in retrieval relevance and tool-selection accuracy. The role requires leadership in ML/GenAI initiatives and significant experience in evaluation and measurement of ML systems.

What you'd actually do

  1. Own the evaluation strategy for Dashboards and for sister teams within our organization. Define the metrics (offline and online, quality and cost, single-turn and multi-turn) that the team and the broader organization optimize against.
  2. Build the eval datasets, golden traces, and regression harnesses that catch quality changes before they hit customers, and make those assets reusable by every team building dashboards and widgets through agents.
  3. Drive measurable improvements to retrieval relevance, tool-selection accuracy, and context efficiency, partnering closely with the engineers on the team.
  4. Provide technical leadership across the Dashboards team and the broader organization through design reviews, working groups, and mentorship.

Skills

Required

  • BS/MS/PhD in a scientific field, or equivalent experience
  • 10+ years of relevant engineering or applied science experience, including time as a technical lead
  • Proven track record of leading ML or GenAI initiatives in a product-driven environment, from research through production
  • Significant experience with evaluation, experimentation, or measurement of ML systems at scale
  • Strong product mindset
  • Comfortable driving initiatives across cross-functional teams
  • Ability to make sound technical calls when the path isn’t yet defined

Nice to have

  • Experience with Datadog products
  • Experience in observability
  • Experience with hybrid workplace environments

What the JD emphasized

  • tool-selection accuracy (critical given the growing catalog of data sources and visualizations)
  • evaluation strategy
  • evaluation, experimentation, or measurement of ML systems at scale

Other signals

  • Defining and guaranteeing the quality of an AI system at scale
  • Evaluating agents end-to-end with non-deterministic trajectories
  • Scoring tool-selection accuracy against numerous data sources and visualizations
  • Building measurement systems for regressions across all widget types and data sources
  • Driving improvements in retrieval relevance, tool-selection accuracy, and context efficiency