Principal Applied Scientist, Agentic AI

Zillow · Consumer · United States · Remote

A Principal Applied Scientist role focused on RL post-training for production models. The role leads the design and deployment of learning systems that shape model behavior in line with user value, safety, and business objectives. The work spans supervised fine-tuning, preference modeling, RL-based alignment (RLHF, RLAIF, DPO), reward model development, and translating data into training signals. It also involves partnering with model and platform teams on training and evaluation efficiency, and mentoring other scientists.

What you'd actually do

  1. Lead the technical direction and strategy for RL post-training of production models, partnering with other scientists, engineers, and product leaders to align models with customer and business needs.
  2. Design and implement post-training pipelines that combine techniques such as supervised fine-tuning on curated demonstrations, preference modeling and pairwise ranking, and RL-based alignment approaches like RLHF, RLAIF, or DPO for multi-objective optimization.
  3. Develop reward models and objective formulations that balance constraints such as helpfulness, safety, fairness, compliance, and customer satisfaction, and iterate on them using human and AI feedback at scale through online and batch adaptation loops with strong guardrails.
  4. Translate conversational logs, behavioral signals, and structured attributes into training, reward, and evaluation signals for post-training and reinforcement learning, turning heterogeneous data into actionable supervision.
  5. Partner with model and platform teams to improve the efficiency and robustness of training and evaluation, including off-policy evaluation, replay strategies, and controlled rollouts, as well as metrics and evaluation frameworks such as win rates versus baselines, safety and quality metrics, and expert-review programs.
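
For readers unfamiliar with two of the techniques named above, here is a minimal Python sketch of what a DPO objective and a win-rate metric typically compute. It is illustrative only and does not describe Zillow's pipeline. The function names (`dpo_loss`, `win_rate`), the `beta` default, and the convention of counting a tie as half a win are all assumptions made for this example.

```python
import math


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x))."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair.

    Inputs are sequence log-probabilities of the preferred ("chosen") and
    dispreferred ("rejected") responses under the policy being trained and
    under a frozen reference model. beta=0.1 is an assumed default.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    return neg_log_sigmoid(beta * (chosen_gain - rejected_gain))


def win_rate(outcomes: list[str]) -> float:
    """Win rate versus a baseline; a tie counts as half a win (assumed)."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    wins = sum(1 for o in outcomes if o == "win")
    ties = sum(1 for o in outcomes if o == "tie")
    return (wins + 0.5 * ties) / len(outcomes)
```

When the policy matches the reference model, the loss equals log 2, and it falls as the policy shifts probability toward the preferred response relative to the reference.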

Skills

Required

  • reinforcement learning
  • post-training methods
  • production models
  • supervised fine-tuning
  • DPO
  • RLHF/RLAIF
  • preference modeling
  • multi-objective optimization
  • evaluation and monitoring of aligned models
  • win-rate experiments
  • human and AI feedback loops
  • long-horizon evaluation
  • safety or guardrail metrics
  • modern transformer-based models
  • LLMs
  • multimodal models
  • vector search
  • orchestration frameworks
  • cross-functional partners
  • domains where safety, trust, or regulation matter
  • technical leadership
  • mentorship
  • communication of complex technical ideas

Nice to have

  • publication record
  • open-source contributions

What the JD emphasized

  • RL post-training
  • production models
  • user value, safety, and business objectives
  • post-training and adaptation
  • reward models
  • multi-objective optimization
  • helpfulness, safety, fairness, compliance, and customer satisfaction
  • human and AI feedback at scale
  • guardrails
  • reinforcement learning
  • evaluation and monitoring of aligned models
  • domains where safety, trust, or regulation matter

Other signals

  • post-training
  • RLHF
  • reward models
  • alignment
  • production models