Senior Backend Engineer, Inference Platform

Together AI · Data AI · San Francisco, CA · Engineering

Senior Backend Engineer focused on building and optimizing the inference platform for advanced generative AI models, including LLMs and multimodal models, at scale. The role involves optimizing latency, throughput, and resource allocation across tens of thousands of GPUs, collaborating with researchers to productionize frontier models, and contributing to open-source inference projects.

What you'd actually do

  1. Build and optimize global and local request routing, ensuring low-latency load balancing across data centers and model engine pods.
  2. Develop auto-scaling systems to dynamically allocate resources and meet strict SLOs across dozens of data centers.
  3. Design systems for multi-tenant traffic shaping that tune both resource allocation and request handling, including smart rate limiting and regulation, to ensure fairness and a consistent experience for all users.
  4. Engineer trade-offs between latency and throughput to serve diverse workloads efficiently.
  5. Optimize prefix caching to reduce model compute and speed up responses.

Skills

Required

  • 5+ years of demonstrated experience building large-scale, fault-tolerant, distributed systems and API microservices.
  • Strong background in designing, analyzing, and improving efficiency, scalability, and stability of complex systems.
  • Expert-level programming in one or more of: Rust, Go, Python, or TypeScript.

Nice to have

  • Knowledge of modern LLMs and generative models and how they are served in production is a plus.
  • Experience working with the open-source ecosystem around inference is highly valuable; familiarity with SGLang, vLLM, or NVIDIA Dynamo will be especially handy.
  • Experience with Kubernetes or container orchestration is a strong plus.
  • Familiarity with GPU software stacks (CUDA, Triton, NCCL) and HPC technologies (InfiniBand, NVLink, MPI) is a plus.
  • Bachelor’s or Master’s degree in Computer Science, Computer Engineering, or related field, or equivalent practical experience.

What the JD emphasized

  • optimizing latency down to the last millisecond
  • tens of thousands of GPUs
  • fully utilize every FLOP and every gigabyte of memory
  • low-latency load balancing
  • strict SLOs
  • multi-tenant traffic shaping
  • low-level OS concepts: multi-threading, memory management, networking, and storage performance

Other signals

  • inference platform
  • LLMs
  • multimodal models
  • serving at scale
  • GPU optimization