Senior Staff Software Engineer, AI Model Lifecycle

Crusoe · Data AI · San Francisco, CA - US · Cloud Engineering

Senior Staff Software Engineer focused on building and managing the AI model lifecycle, including fine-tuning, training, and dataset management for large foundation models and LLMs.

What you'd actually do

  1. Manage fine-tuning systems for large foundation models (SFT, PEFT, LoRA, adapters), including multi-node orchestration, checkpointing, failure recovery, and cost-efficient scaling.
  2. Implement and maintain end-to-end training pipelines for Large Language Models.
  3. Add Reinforcement Fine-Tuning (RFT) and reinforcement learning to the fine-tuning and training systems.
  4. Build distillation and reinforcement learning pipelines (e.g., preference optimization, policy optimization, reward modeling).
  5. Dataset, model, and experiment management: versioning, lineage, evaluation, and reproducible fine-tuning at scale.

Skills

Required

  • Generative AI (Large Language Models, Multimodal)
  • training LLMs
  • fine-tuning LLMs
  • aligning LLMs
  • Reinforcement Learning
  • Reinforcement Fine-Tuning (RFT)
  • dataset management
  • model management
  • experiment management
  • multi-node orchestration
  • checkpointing
  • failure recovery
  • cost-efficient scaling
  • SFT
  • PEFT
  • LoRA
  • adapters
  • distillation
  • policy optimization
  • reward modeling
  • versioning
  • lineage
  • evaluation
  • reproducible fine-tuning

Nice to have

  • Golang
  • Python
  • PyTorch
  • vLLM
  • Performance optimizations on GPU systems
  • inference frameworks

What the JD emphasized

  • 8+ years of industry experience leading and driving impactful projects in the AI space
  • Hands-on experience training, fine-tuning, and aligning LLMs using Reinforcement Learning and Reinforcement Fine-Tuning (RFT) techniques.

Other signals

  • building a comprehensive managed platform for the entire application development lifecycle
  • leveraging Machine Learning models, including Large Language Models (LLMs)
  • Manage fine-tuning systems for large foundation models
  • Implement and maintain end-to-end training pipelines for Large Language Models
  • Dataset, model, and experiment management