Software Engineer - Model APIs

Baseten · Data AI · San Francisco, CA · EPD

A software engineering role focused on building, optimizing, and operating Model APIs for AI inference, spanning distributed systems, model serving, and developer experience. The role emphasizes performance work, structured outputs, tool/function calling, and multi-modal serving.
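For context on what "tool/function calling" means in an API surface like this, here is a minimal sketch of a request payload. The field names follow the widely adopted chat-completions convention; the endpoint, model name, and tool are illustrative placeholders, not a Baseten-specific spec:

```python
import json

# A chat-completions-style request declaring one tool the model may call.
# "example-llm" and get_weather are placeholders for illustration only.
request = {
    "model": "example-llm",
    "messages": [{"role": "user", "content": "What's the weather in SF?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            # JSON Schema constrains the arguments the model may emit --
            # the same mechanism behind grammar-constrained generation.
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",
}
payload = json.dumps(request)
```

The `parameters` schema is what makes "structured outputs" enforceable server-side: the serving layer can constrain decoding so arguments always validate against it.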

What you'd actually do

  1. Design, build, and operate the Model APIs surface with a focus on advanced inference capabilities: structured outputs (JSON mode, grammar-constrained generation), tool/function calling, and multi-modal serving
  2. Profile and optimize TensorRT-LLM kernels, analyze CUDA kernel performance, implement custom CUDA operators, tune memory-allocation patterns for maximum throughput, and optimize communication patterns across multi-GPU setups
  3. Productionize performance improvements across runtimes with deep understanding of their internals: speculative decoding implementations, guided generation for structured outputs, custom scheduling and routing algorithms for high-performance serving
  4. Build comprehensive benchmarking frameworks that measure real-world performance across different model architectures, batch sizes, sequence lengths, and hardware configurations
  5. Ship runtime-level optimizations (e.g. TensorRT, TensorRT-LLM): quantization, batching, and KV-cache reuse.

Skills

Required

  • 3+ years experience building and operating distributed systems or large‑scale APIs
  • Proven track record of owning low‑latency, reliable backend services (rate‑limiting, auth, quotas, metering, migrations)
  • Infra instincts with performance sensibilities: profiling, tracing, capacity planning, and SLO management
  • Comfortable debugging complex systems, from runtime internals to GPU execution traces
  • Strong written communication

Nice to have

  • Experience with LLM runtimes (vLLM, SGLang, TensorRT-LLM, TGI) or contributions to open-source inference engines
  • Knowledge of Kubernetes, service meshes, API gateways, or distributed scheduling
  • Background in developer‑facing infrastructure or open‑source APIs
  • ML experience

What the JD emphasized

  • low-latency
  • reliable backend services
  • performance

Other signals

  • Model APIs
  • inference capabilities
  • performance optimization
  • distributed systems
  • developer experience