Software Engineer - Training Product

Baseten · Data AI · San Francisco, CA · EPD

Software Engineer focused on building and shipping training products for AI companies, working across the full stack from API to infrastructure, including fine-tuning models and partnering with research engineers. The role involves developing features like multi-node training and serverless RL, with a focus on developer experience and reliability.

What you'd actually do

  1. Iterate like crazy
  2. Design ergonomic APIs and abstractions to model complex resources and lifecycles
  3. Work throughout the stack (API layer, backend and database implementation, infra layer; frontend is a plus) to implement features
  4. Fine-tune and deploy models to develop intuition around training workflows
  5. Partner closely with model developers and world-class research engineers to understand the requirements and pain points of post-training workflows

Skills

Required

  • 5+ years of experience building software applications
  • Deep knowledge of the web stack, databases, and distributed systems
  • Experience developing developer tooling or infrastructure products for external or internal users
  • Good taste in product, particularly developer-oriented tools
  • Interest in ML/AI infrastructure and willingness to learn
  • Driven by high agency and ownership
  • Strong communication skills with the ability to bridge technical depth and business needs

Nice to have

  • Experience launching features and products through different release cycles (MVP, Beta, GA, etc.)
  • Experience with model development methods and paradigms, like Supervised Fine-Tuning, Reinforcement Learning, Synthetic Data Generation, LoRA, Full Finetunes, etc.
  • Familiarity or experience with the open source training stack and frameworks (NCCL, PyTorch, Megatron, NemoRL, VeRL, Axolotl, HF Trainer) and distributed training techniques (FSDP, DeepSpeed)
  • Experience developing AI products, tooling, or agents
  • Frontend fluency

What the JD emphasized

  • fine-tune models yourself
  • post-training workflows

Other signals

  • Own features like multi-node training and products like serverless reinforcement learning (RL) from conception to MVP
  • Work throughout the stack, architecting solutions from API and UI down to our infrastructure layer
  • Fine-tune models yourself to develop an understanding of user workflows
  • Partner closely with research engineers leveraging state-of-the-art training techniques
  • Checkpointing pipeline
  • Multi-node training
  • Training DX