Senior Software Engineer, ML Infrastructure

Decagon · Vertical AI · San Francisco, CA · Engineering

Decagon is hiring a Senior ML Infrastructure Engineer to own the platforms behind its model training and inference: distributed training systems, inference architecture across multiple providers, and frameworks for research and product teams. The role centers on technical depth, leading initiatives, and shaping the architecture of the ML stack.

What you'd actually do

  1. Design and build distributed training platforms for LLM and multimodal fine-tuning and post-training at scale
  2. Integrate state-of-the-art training algorithms into production pipelines
  3. Own inference architecture and multi-provider routing, including failover and optimization
  4. Lead initiatives to improve latency and cost efficiency across the training and serving stack
  5. Build evaluation and experimentation infrastructure that enables rapid, reliable iteration
  6. Drive technical direction, mentor engineers, and establish best practices for ML infrastructure

Skills

Required

  • ML infrastructure
  • production systems
  • distributed training
  • multi-node GPU clusters
  • fault tolerance
  • optimization
  • LLM inference
  • latency optimization
  • provider tradeoffs
  • serving architecture
  • technical leadership
  • multi-quarter technical projects

Nice to have

  • multimodal fine-tuning
  • post-training
  • state-of-the-art training algorithms
  • multi-provider routing
  • failover
  • cost efficiency
  • evaluation and experimentation infrastructure
  • mentoring engineers

What the JD emphasized

  • 6+ years building ML infrastructure or production systems at scale
  • Deep experience with distributed training: multi-node GPU clusters, fault tolerance, and optimization
  • Strong understanding of LLM inference: latency optimization, provider tradeoffs, and serving architecture
  • Proven track record leading complex, multi-quarter technical projects

Other signals

  • building distributed training systems
  • owning inference architecture
  • creating frameworks for research and product teams