Research Engineer, Post-training

Harvey · AI Frontier · San Francisco, CA · Engineering

Research Engineer focused on post-training LLMs to improve agent performance for legal services. The role involves driving training experiments, optimizing agent harnesses, designing reward systems, and studying agent behavior to inform model improvements. It requires hands-on experience with post-training techniques and strong Python and research-engineering skills.

What you'd actually do

  1. Drive post-training experiments, pushing agent performance while navigating the Pareto frontier of cost, latency, security, and governance.
  2. Optimize agent harnesses, including domain-specific skills, tools, subagents, retrieval strategies, and validation loops that improve quality on long-horizon legal work.
  3. Design and develop grading and reward systems that are reliable enough for evaluation, efficient enough for iteration, and strict enough for high-stakes legal work.
  4. Study agent behavior, identifying patterns that correlate with successful work product, and converting those findings into training data, evals, or harness changes.
  5. Work with Harvey researchers and external research partners to define experiments, evaluate methodology, review results, and keep projects moving toward concrete model improvements.

Skills

Required

  • post-training
  • model training
  • SFT
  • preference optimization
  • RLHF/RLAIF
  • reward modeling
  • distillation
  • adapting open-weight models
  • Python
  • research-engineering
  • experiment debugging
  • self-management of ambiguous applied research projects

Nice to have

  • data infrastructure for ML
  • evaluation infrastructure for ML
  • dataset curation pipelines
  • model-output processing
  • experiment tracking
  • evaluation dashboards
  • regression analysis tooling
  • distributed training
  • inference systems
  • GPU workloads
  • large-scale ML experimentation
  • research publications
  • open-source contributions
  • shipped industry work in LLMs, agents, evaluation, or ML systems

What the JD emphasized

  • Hands-on experience with post-training or model-training work, such as SFT, preference optimization, RLHF/RLAIF, reward modeling, distillation, or adapting open-weight models to specialized domains.
  • Strong judgment about model behavior: you can read traces, inspect outputs, identify failure modes, and reason about whether a metric is measuring the thing that matters.
  • Ability to self-manage ambiguous applied research projects.

Other signals

  • post-training
  • agent performance
  • model training experiments
  • applied research