Staff Software Engineer, ML Infrastructure

Decagon · Vertical AI · San Francisco, CA · Engineering

Decagon, a conversational AI platform, is hiring a Staff Software Engineer for ML Infrastructure. The role owns the platforms for model training (LLM and multimodal fine-tuning and post-training) and inference, including multi-provider routing and optimization. The team works at the intersection of research and production, turning ML models into scalable systems.

What you'd actually do

  1. Design and build distributed training platforms for LLM and multimodal fine-tuning and post-training at scale
  2. Integrate state-of-the-art training algorithms into production pipelines
  3. Own inference architecture and multi-provider routing, including failover and optimization
  4. Lead initiatives to improve latency and cost efficiency across the training and serving stack
  5. Build evaluation and experimentation infrastructure that enables rapid, reliable iteration
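To make item 3 concrete, here is a minimal sketch of what multi-provider routing with failover can look like. All names (`ProviderRouter`, the provider functions) are illustrative assumptions, not Decagon's actual stack; a production router would also weigh latency, cost, and rate limits per provider.

```python
import time


class ProviderRouter:
    """Minimal sketch: try providers in priority order, fail over on error."""

    def __init__(self, providers):
        # providers: list of (name, call_fn) pairs in priority order
        self.providers = providers

    def complete(self, prompt, max_attempts_per_provider=2):
        last_error = None
        for name, call_fn in self.providers:
            for attempt in range(max_attempts_per_provider):
                try:
                    # Return the first successful response and which provider served it.
                    return name, call_fn(prompt)
                except Exception as exc:
                    last_error = exc
                    # Simple exponential backoff before retrying the same provider.
                    time.sleep(0.1 * (2 ** attempt))
        raise RuntimeError(f"all providers failed: {last_error}")


# Usage: a flaky primary provider fails over to a stable backup.
def flaky(prompt):
    raise TimeoutError("primary timed out")


def stable(prompt):
    return f"echo: {prompt}"


router = ProviderRouter([("primary", flaky), ("backup", stable)])
print(router.complete("hello"))  # ('backup', 'echo: hello')
```

The priority-ordered list is the simplest routing policy; swapping it for a cost- or latency-aware scorer is where the "optimization" part of the role comes in.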

Skills

Required

  • Distributed training systems
  • LLM inference architecture
  • Multi-provider routing
  • Latency and cost optimization
  • Evaluation and experimentation infrastructure
  • Technical leadership
  • Mentoring engineers
  • Best practices for ML infrastructure
  • 10+ years of experience

Nice to have

  • Fine-tuning
  • Post-training
  • Multimodal models
  • Fault tolerance
  • GPU clusters

What the JD emphasized

  • 10+ years building ML infrastructure or production systems at scale
  • Deep experience with distributed training: multi-node GPU clusters, fault tolerance, and optimization
  • Strong understanding of LLM inference: latency optimization, provider tradeoffs, and serving architecture
  • Proven track record leading complex, multi-quarter technical projects
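The fault-tolerance requirement above can be illustrated with a small checkpoint/resume loop, the core pattern behind surviving node failures in multi-node training. This is a hedged sketch with made-up names (`train_with_checkpoints`, `fake_step`); real systems use framework checkpoint APIs and sharded storage rather than a local JSON file.

```python
import json
import os
import tempfile


def train_with_checkpoints(num_steps, state_path, step_fn):
    """Sketch of fault-tolerant training: resume from the last checkpoint."""
    # Resume if a checkpoint exists, otherwise start fresh.
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "loss_sum": 0.0}

    while state["step"] < num_steps:
        state["loss_sum"] += step_fn(state["step"])
        state["step"] += 1
        # Atomic write via rename, so a crash mid-write never corrupts the checkpoint.
        tmp = state_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, state_path)
    return state


# Usage: the first run "crashes" after 3 steps; the second resumes and finishes.
path = os.path.join(tempfile.mkdtemp(), "ckpt.json")


def fake_step(i):
    return 1.0  # stand-in for a real training step's loss


train_with_checkpoints(3, path, fake_step)
final = train_with_checkpoints(10, path, fake_step)
print(final["step"], final["loss_sum"])  # 10 10.0
```

The atomic-rename detail is the part that matters at scale: a checkpoint that can be half-written is worse than no checkpoint at all.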

Other signals

  • ML Infrastructure
  • Distributed Training
  • LLM Inference
  • Model Lifecycle