Staff Research Engineer, Post-training & Evaluation

Reddit · Consumer · United States · Remote · Anti-Evil Engineering

Staff Research Engineer focused on the post-training and evaluation science of Reddit's foundational LLMs. This role defines the 'Reddit Benchmark' for model quality, owns evaluation reliability, designs model-as-a-judge methodologies, sets post-training strategies, and drives synthetic data generation. It acts as a critical feedback loop for model development, with the goal of ensuring models are safe, smart, and 'Reddit-native'.

What you'd actually do

  1. Define the "Reddit Benchmark" evaluation standard: Own the methodology, not just the harness, for rigorously measuring model quality across safety, reasoning, representation/retrieval, and Reddit-specific knowledge. Decide what "Reddit-native" means in measurable terms and set the bar the org trains against.
  2. Own evaluation reliability and statistical rigor: Establish the science behind trustworthy evals — judge variance, multi-sample scoring, inter-rater/inter-sample agreement, sampling and temperature effects, and calibration of automated judges. You are accountable for whether a benchmark delta is real or noise. Drive the practice of evaluation as a release gate — offline against frozen datasets, and pre-merge in CI/CD — so regressions are caught before endpoints ship.
  3. Design model-as-a-judge methodology: Own judge selection, prompt design, calibration, and reliability for automated evaluation using frontier external models, enabling rapid, trustworthy iteration cycles.
  4. Set post-training recipes and strategy: Design SFT recipes (data mixtures, curriculum, ablation strategy) that convert base models into helpful, well-aligned endpoints; partner with engineering to scale them.
  5. Evaluate base and continuous pre-training (CPT) checkpoints, not just endpoints: Design checkpoint-selection methodology across CPT experiments and learning-rate (LR) studies, so we pick the right base before committing post-training compute.
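
To make the reliability questions in items 2 and 3 concrete, here is a minimal sketch of two common checks: a paired bootstrap confidence interval for a benchmark delta between two models scored on the same frozen dataset, and Cohen's kappa as one way to measure agreement between an automated judge and human labels. Nothing here comes from the posting itself. The function names, parameters, and the choice of these two statistics are illustrative assumptions, not Reddit's actual methodology.

```python
import random
from collections import Counter


def paired_bootstrap_delta(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Mean per-example delta (B minus A) with a percentile bootstrap CI.

    Hypothetical helper: scores_a and scores_b are per-example scores for two
    models on the same frozen eval set, listed in the same example order.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    # Resample examples with replacement and recompute the mean delta each time.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lo, hi


def cohens_kappa(labels_1, labels_2):
    """Chance-corrected agreement between two raters (e.g. judge vs. human)."""
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_1)
    p_observed = sum(x == y for x, y in zip(labels_1, labels_2)) / n
    c1, c2 = Counter(labels_1), Counter(labels_2)
    # Agreement expected if both raters labeled independently at their own rates.
    p_expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    if p_expected == 1.0:
        # Assumed convention: both raters used one identical label throughout.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    # Made-up toy scores, purely to show the call shape.
    base = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    tuned = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]
    delta, lo, hi = paired_bootstrap_delta(base, tuned)
    print(f"delta={delta:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
    print("judge/human kappa:", cohens_kappa(["pass", "fail", "pass"], ["pass", "fail", "fail"]))
```

If the interval for the delta includes zero, the difference is plausibly noise at that sample size. A low kappa suggests the judge needs recalibration before its scores are used as a release gate.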

Skills

Required

  • LLM post-training
  • LLM evaluation
  • evaluation reliability
  • statistical rigor
  • model-as-a-judge methodology
  • custom evaluation harnesses
  • generation evaluation
  • representation/classification evaluation
  • Continuous Pre-training (CPT)
  • Instruction Tuning (SFT)
  • Python
  • data-pipeline engineering
  • eval-harness engineering
  • Hugging Face Transformers
  • vLLM
  • lm-eval-harness
  • PyTorch
  • distributed training (FSDP2, DeepSpeed ZeRO-3)

Nice to have

  • MLflow
  • fine-tuning frameworks (Axolotl, TorchTune)
  • PyTorch-native training stacks (TorchTitan)
  • synthetic data generation techniques (Self-Instruct)
  • preference optimization (DPO, RLHF, RLAIF, GRPO)
  • Publications in NLP/ML/FAccT

What the JD emphasized

  • direct focus on LLM post-training and evaluation
  • evaluation reliability
  • custom, domain-specific evaluation harnesses
  • evaluating both generation and representation/classification
  • deep understanding of Continuous Pre-training (CPT) and Instruction Tuning (SFT)

Other signals

  • building foundational LLMs
  • post-training and evaluation science
  • Reddit Benchmark
  • evaluation reliability and statistical rigor
  • model-as-a-judge methodology