Senior Software Engineer, AI Model Lifecycle

Crusoe · Data AI · San Francisco, CA - US · Cloud Engineering

This role focuses on building and managing a platform for the AI model lifecycle, specifically for Large Language Models (LLMs). Responsibilities include managing fine-tuning systems (SFT, PEFT, LoRA, adapters), implementing training pipelines, handling distillation and reinforcement learning (RLHF, RLAIF), and managing datasets, models, and experiments. The role requires experience with Generative AI and with training, fine-tuning, and aligning LLMs; experience with performance optimizations on GPU systems and with inference frameworks is a plus.

What you'd actually do

  1. Manage fine-tuning systems for large foundation models (SFT, PEFT, LoRA, adapters), including multi-node orchestration, checkpointing, failure recovery, and cost-efficient scaling.
  2. Implement and maintain end-to-end training pipelines for Large Language Models.
  3. Apply Reinforcement Fine-Tuning (RFT) and reinforcement learning within the fine-tuning and training workflows.
  4. Handle distillation and reinforcement learning pipelines (e.g., preference optimization, policy optimization, reward modeling).
  5. Manage datasets, models, and experiments: versioning, lineage, evaluation, and reproducible fine-tuning at scale.

Skills

Required

  • Generative AI (Large Language Models, Multimodal)
  • training LLMs
  • fine-tuning LLMs
  • aligning LLMs
  • Reinforcement Learning
  • Reinforcement Fine-Tuning (RFT) techniques
  • dataset management
  • model management
  • experiment management

Nice to have

  • Golang
  • Python
  • PyTorch
  • vLLM
  • performance optimizations on GPU systems
  • inference frameworks

What the JD emphasized

  • 4-5+ years of industry experience leading and driving impactful projects in the AI space
  • Hands-on experience training, fine-tuning, and aligning LLMs using Reinforcement Learning and Reinforcement Fine-Tuning (RFT) techniques.

Other signals

  • building a comprehensive managed platform for the entire application development lifecycle
  • leveraging Machine Learning models, including Large Language Models (LLMs)
  • fine-tuning systems for large foundation models
  • implement and maintain end-to-end training pipelines for Large Language Models
  • dataset, model, and experiment management