Research Engineer/Scientist - Human Alignment, Consumer Devices

OpenAI · AI Frontier · San Francisco, CA · Consumer Products

Research Engineer/Scientist role focused on RLHF and post-training for personalized, multimodal AI systems within the Consumer Devices group. The work centers on building the learning and evaluation foundations for adaptive models: reward modeling, preference learning, and long-horizon evaluation that improve model behavior in realistic user settings, with strong product grounding.

What you'd actually do

  1. Develop RLHF and post-training methods for multimodal models.
  2. Build reward models and preference-learning pipelines for adaptive, personalized model behavior.
  3. Design datasets, rubrics, and evaluation frameworks that capture user preferences, contextual appropriateness, and long-term value in realistic tasks.
  4. Run experiments on policy improvement using explicit feedback, implicit signals, and model-based grading.
  5. Work on long-horizon evaluation problems, where model quality depends not just on a single response but on whether behavior improves outcomes over time.

Skills

Required

  • Machine learning research
  • RLHF
  • reward modeling
  • preference optimization
  • post-training for large models
  • reinforcement learning
  • ranking
  • recommender systems
  • personalization
  • memory
  • human-in-the-loop evaluation
  • rigorous empirical work
  • experiment design
  • reliable evals
  • decision-useful metrics
  • training models against nuanced behavioral objectives
  • building datasets
  • eval pipelines grounded in human preferences
  • rubrics
  • real-world product behavior
  • data generation
  • labeling strategy
  • training runs
  • reward functions
  • analysis
  • multimodal AI
  • learning from richer interaction signals
  • product-shaping research
  • high stakes for trust
  • alignment
  • long-term user value
  • collaboration with engineers, designers, and safety researchers

Nice to have

  • RLHF
  • post-training
  • personalized AI
  • multimodal AI
  • long-horizon evaluation
  • reward modeling
  • preference learning
  • adaptive models
  • context-aware
  • user modeling
  • personalization systems
  • broader goals
  • values
  • well-being
  • immediate satisfaction
  • model behavior
  • realistic user settings
  • one-turn assistant behavior
  • systems that improve through feedback
  • richer signals
  • meaningful notions of user value
  • explicit feedback
  • implicit signals
  • model-based grading
  • policy improvement
  • behavioral decisions
  • bounded by clear constraints
  • training recipes
  • data pipelines
  • evaluation suites
  • product-relevant behaviors
  • trust
  • appropriateness
  • long-term user benefit
  • San Francisco
  • hybrid work model
  • relocation assistance

What the JD emphasized

  • long-term trust
  • aligned
  • long-term user value
  • long-horizon evaluation
  • user value
  • reward design
  • feedback loops
  • evaluation frameworks
  • user preferences
  • long-term value
  • real-world use
  • product-grounded
