Research Engineer: Post-training & Small Language Models (SLMs), Healthcare AI

Deloitte is running an AI-first effort to rebuild the healthcare system's decision-making machinery with reasoning models and agentic systems. This role covers the full post-training lifecycle of clinical reasoning models: data and reward design, training, evaluation, and alignment, backed by large-scale GPU compute and training infrastructure. The goal is to make healthcare faster, fairer, and less wasteful.

What you'd actually do

  1. Design and execute post-training pipelines: supervised fine-tuning (SFT), preference optimization, and reinforcement learning / alignment workflows.
  2. Build and optimize training workflows using techniques such as SFT, RLHF, PPO, DPO, GRPO, RLAIF, and Constitutional AI, and understand how each affects reasoning quality, safety, latency, cost, and reliability.
  3. Train reasoning models for healthcare decisioning using verifiable-reward RL, which includes designing reward signals and verifiers grounded in clinical guidelines, policies and criteria, and adjudicated outcomes.
  4. Develop reward models and preference datasets to improve reasoning quality, factuality, safety, policy adherence, and task performance.
  5. Curate, clean, synthesize, and evaluate large-scale instruction, preference, and domain-specific datasets, with rigorous filtering, deduplication, and quality control.

Skills

Required

  • Supervised fine-tuning (SFT)
  • Preference optimization
  • Reinforcement learning / alignment workflows
  • RLHF
  • PPO
  • DPO
  • GRPO
  • RLAIF
  • Constitutional AI
  • Reward modeling
  • Preference datasets
  • Data curation, cleaning, synthesis, and evaluation
  • Instruction datasets
  • LoRA
  • QLoRA
  • PEFT
  • Adapter-based approaches
  • Distributed training (DeepSpeed, FSDP, Megatron-LM, Ray)
  • Inference optimization (latency, throughput, quantization)
  • vLLM
  • TensorRT-LLM
  • TGI
  • Training and optimizing open-weight models (Llama, Qwen, Mistral, DeepSeek)
  • Building specialized small language models (SLMs)
  • Evaluation frameworks
  • Reasoning evaluation
  • Hallucination detection
  • Factuality evaluation
  • Instruction-following evaluation
  • Structured-output evaluation
  • Domain-specific metrics

Nice to have

  • Healthcare domain knowledge

What the JD emphasized

  • post-training at scale
  • real signals
  • own the post-training stack end to end
  • advanced post-training
  • post-training pipelines
  • training using techniques such as SFT, RLHF, PPO, DPO, GRPO, RLAIF, and Constitutional AI
  • train reasoning models
  • reward modeling & data
  • Efficient fine-tuning, training & inference infrastructure
  • Small language models & open-weight models
  • Evaluation, safety & red teaming

Other signals

  • AI-first effort
  • ground-up rebuild of decision-making machinery
  • real signals for reward
  • improve and shape model behavior through advanced post-training