Software Engineer - Voice AI (Inference Runtime)

Baseten · Data AI · San Francisco, CA · EPD

Software Engineer focused on building and optimizing the inference runtime for Voice AI models, including state-of-the-art open-source models. The role involves developing large-scale, real-time infrastructure for multi-model voice agents: reducing latency, increasing throughput, and improving GPU efficiency. It also includes designing iteration loops for voice model customization.

What you'd actually do

  1. Own and lead Voice AI product areas end-to-end - from architecture and system design through implementation, rollout, and long-term production operations.
  2. Design, build, and operate real-time, large-scale, high-performance model serving systems for STT, TTS, and voice agent workloads in mission-critical customer deployments.
  3. Drive cross-team collaboration with sister engineering teams to solve full-stack technical problems, align on priorities, and coordinate end-to-end delivery across the product surface area.
  4. Mentor teammates through code reviews, design docs, and technical leadership.

Skills

Required

  • Bachelor's degree or higher in Computer Science or related field
  • Proven track record owning production-grade real-time, large-scale systems where tail latency (p99) matters.
  • Proficiency in one or more popular programming or scripting languages; Python experience is a plus.
  • Good product sense, particularly for developer-oriented tools
  • Interest in ML/AI infrastructure and willingness to learn
  • Strong collaboration and communication skills
  • Comfortable using AI coding assistants (e.g., Claude Code, Codex, Cursor) as a daily productivity multiplier — as an AI-native company, we see this as a must-have skill.
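To make the tail-latency requirement concrete: a p99 of a latency distribution is the value below which 99% of requests fall, so it surfaces slow outliers that an average hides. A minimal sketch of the nearest-rank p99 calculation, with entirely hypothetical sample numbers (not Baseten data):

```python
def p99_latency(samples_ms):
    """Return the 99th-percentile latency (ms) via the nearest-rank method.

    Sort the samples and take the value at rank ceil(0.99 * n).
    """
    ordered = sorted(samples_ms)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) without math.ceil
    return ordered[rank - 1]

# Hypothetical workload: 980 fast requests and 20 slow outliers.
# The mean is ~14.8 ms, but the p99 exposes the 250 ms tail.
samples = [10.0] * 980 + [250.0] * 20
print(p99_latency(samples))  # 250.0
```

For real-time voice, this tail is what a user perceives as a stalled agent, which is why the JD calls it out explicitly.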

Nice to have

  • Experience implementing pipeline-level model runtime optimizations such as dynamic batching, async scheduling, or decode-side throughput improvements.
  • Experience building developer platforms: SDKs, CLIs, APIs, and self-serve workflows for ML or infrastructure products.
  • Experience with containerization and orchestration technologies (Docker, Kubernetes), service meshes, or distributed scheduling.
  • Familiarity with speech/audio ML models (STT, TTS, speech-to-speech).
  • Familiarity with model-serving runtimes (vLLM, TensorRT, ONNX).
  • Familiarity with systems-level performance profiling across host-device boundaries (e.g., PyTorch Profiler) and diagnosing GPU utilization issues.
  • Exposure to customer-facing engineering: pre-sales prototyping, technical discovery, or working directly with customers to ship solutions.

What the JD emphasized

  • hard to build
  • primary owner of Baseten Voice AI
  • reduce end-to-end and tail latency
  • multi-model voice agents
  • real-time, large-scale, high-performance model serving systems
  • tail latency (p99) matters

Other signals

  • inference runtime
  • model serving
  • voice AI
  • real-time infrastructure
  • multi-model voice agents