Data Scientist, Integrity Measurement

OpenAI · AI Frontier · San Francisco, CA · Data Science

This role focuses on developing and implementing AI-first methods for measuring and monitoring complex harms on OpenAI's platforms. The Data Scientist will own measurement and metrics for severe usage harms, build productionised safety metrics, optimise LLM prompts for measurement, and use agentic products for automation. The role is central to ensuring the integrity and security of OpenAI's technology as it scales.

What you'd actually do

  1. Own measurement and quantitative analysis for a group of severe, actor- and network-based usage harm verticals.
  2. Develop and implement AI-first methods for prevalence measurement and other productionised safety metrics, which may need to draw on off-platform indicators or other non-standard datasets.
  3. Build metrics that can be used for goal-setting or A/B tests when prevalence or other top-line metrics are not suitable.
  4. Own dashboards and metrics reporting for harm verticals.
  5. Conduct analyses and generate insights that inform improvements to review, detection, or enforcement, and that influence roadmaps.

Skills

Required

  • Data programming languages (R or Python, and SQL)
  • Statistics skills
  • Sampling methods
  • Prevalence estimation
  • Trust and safety experience
  • Ability to drive measurement direction

Nice to have

  • Experience with AI harms
  • Leveraging AI for measurement
  • Experience with severe and sensitive harm areas (child safety, violence)
  • Cross-functional collaboration skills

What the JD emphasized

  • senior DS with trust and safety experience who can drive measurement direction
  • deep statistics skills, specifically around sampling methods and prevalence estimation of complicated problem areas
  • experience working with severe and sensitive harm areas like child safety or violence
  • AI harms or leveraging AI for measurement

Other signals

  • measurement for complex, actor- and sometimes network-level harms
  • AI-first methods for prevalence measurement
  • productionised safety metrics
  • optimise LLM prompts for the purpose of measurement
  • leveraging our agentic products