People Research Data Scientist, AI Fairness & Bias

OpenAI · AI Frontier · San Francisco, CA · People

This role focuses on establishing how OpenAI evaluates AI-assisted People systems and talent processes by designing and conducting rigorous assessments to identify, measure, and mitigate potential bias across models, agents, and automated workflows. It involves defining fairness strategies, conducting algorithmic audits, evaluating human-AI decision systems, developing approaches for generative AI and agents, investigating sources of disparities, and building scalable fairness-evaluation infrastructure.

What you'd actually do

  1. Define and lead fairness and bias-testing strategies for AI-assisted People processes, models, agents, and decision-support systems from development through deployment and ongoing monitoring.
  2. Design rigorous algorithmic audits and validation studies, including adverse-impact analysis, subgroup and intersectional evaluation, error-rate analysis, calibration, measurement invariance, reliability, criterion-related validity, and sensitivity testing.
  3. Identify the appropriate fairness criteria for each use case, evaluate tradeoffs among competing definitions of fairness, and clearly document the assumptions, limitations, and residual risks of each approach.
  4. Evaluate end-to-end human-AI decision systems, including model outputs, user behavior, human overrides, escalation pathways, and whether AI assistance changes the quality, consistency, or equity of decisions.
  5. Develop evaluation approaches for generative and agentic AI, including test-set design, counterfactual testing, behavioral evaluation, human-rating studies, robustness testing, and analysis of disparate performance across populations and contexts.

Skills

Required

  • Python or R
  • SQL
  • algorithmic fairness
  • bias measurement
  • responsible AI
  • psychometrics
  • applied statistics
  • evaluation of high-impact decision systems
  • research design
  • measurement
  • experimentation
  • causal inference
  • statistical modeling
  • subgroup and intersectional analysis
  • adverse-impact testing
  • equalized-odds and equal-opportunity analysis
  • demographic-parity assessment
  • calibration analysis
  • counterfactual testing
  • measurement invariance
  • reliability analysis
  • validation studies
  • fairness metrics
  • machine-learning models
  • generative AI systems
  • agents
  • human-AI workflows
  • reproducible evaluation pipelines
  • automated testing frameworks
  • analytical tools
  • monitoring systems
  • governed research workflows
  • stakeholder communication

Nice to have

  • Fairlearn
  • AI Fairness 360
  • responsible-AI evaluation frameworks
  • explainability methods
  • evaluating large language models

What the JD emphasized

  • rigorous assessments
  • identify, measure, and mitigate potential bias
  • evaluate both technical systems and the broader human-AI decision processes
  • defensible evaluation strategies
  • scalable testing infrastructure
  • clear recommendations
  • rigorous algorithmic audits and validation studies
  • appropriate fairness criteria
  • evaluate tradeoffs among competing definitions of fairness
  • clearly document the assumptions, limitations, and residual risks
  • evaluate end-to-end human-AI decision systems
  • investigate the sources of observed disparities
  • recommend and evaluate mitigations
  • build scalable fairness-evaluation infrastructure
  • establish research and documentation standards
  • translate complex findings into concise, decision-ready narratives
  • Deep expertise in algorithmic fairness, bias measurement, responsible AI, psychometrics, applied statistics, or the evaluation of high-impact decision systems.
  • Exceptional strength in research design, measurement, experimentation, causal inference, and statistical modeling.
  • Hands-on experience applying methods such as subgroup and intersectional analysis, adverse-impact testing, equalized-odds and equal-opportunity analysis, demographic-parity assessment, calibration analysis, counterfactual testing, measurement invariance, reliability analysis, and validation studies.
  • Strong judgment about the limitations of fairness metrics
  • Experience evaluating machine-learning models, generative AI systems, agents, or human-AI workflows using quantitative and qualitative evidence.
  • High proficiency in Python or R and SQL
  • Experience building reproducible evaluation pipelines, automated testing frameworks, analytical tools, monitoring systems, or governed research workflows.
  • Ability to distinguish statistical disparities from their potential causes and communicate findings without overstating certainty or making unsupported causal or legal conclusions.
  • Ability to work effectively with technical, operational, legal, privacy, and executive stakeholders and influence consequential decisions through evidence and sound judgment.

Other signals

  • AI fairness and bias testing
  • evaluating AI-assisted People systems