Software Engineer, Machine Learning Infrastructure - Gen AI

DoorDash · Consumer · San Francisco, CA · 313 Infrastructure Engineering

This Software Engineer, Machine Learning Infrastructure - Gen AI role focuses on building and scaling the production infrastructure for Generative AI at DoorDash. It includes ownership of core platform surfaces such as the LLM Gateway, Agent Gateway, evals infrastructure, model serving, batch inference, guardrails, and cost attribution. The role involves designing scalable systems for AI agents, tool orchestration, retrieval, and evaluation workflows, and partnering with other teams to enable GenAI-powered products and automation.

What you'd actually do

  1. Build the infrastructure that helps DoorDash teams move GenAI ideas from prototype to production, increasing the velocity of business impact from AI across the company.
  2. Work on production GenAI platform surfaces including the LLM Gateway, Agent Gateway, evals infrastructure, open-weights model serving, batch inference, fine-tuning, guardrails, and cost attribution.
  3. Design scalable systems for AI agents, MCP/tool orchestration, retrieval, batch inference, model serving, and evaluation workflows that power real customer and internal automation use cases.
  4. Help product teams choose the right model and vendor strategy across closed-source and open-weight models, with reliability, fallback, observability, and cost controls built in.
  5. Build platforms that support rapid experimentation while meeting production standards for latency, scale, monitoring, SLOs, playbooks, and operational excellence.

Skills

Required

  • Python
  • distributed systems
  • building production services
  • APIs
  • data pipelines
  • ML infrastructure at scale
  • operating production systems
  • observability
  • debugging
  • reliability
  • incident response
  • performance/cost optimization
  • machine learning workflows
  • inference
  • evaluation
  • feature/data pipelines
  • model serving
  • experimentation
  • working in ambiguous, fast-moving technical areas
  • turning customer use cases into reusable platform capabilities

Nice to have

  • fine-tuning open-weights LLMs in production
  • serving open-weights LLMs in production
  • building and deploying AI agents in production
  • building and deploying MCP servers in production
  • LLM gateways
  • model routing
  • vendor abstraction
  • cost attribution
  • eval systems
  • LLM observability
  • tracing
  • LLM-as-judge workflows
  • RAG
  • search
  • vector databases
  • retrieval pipelines
  • Kubernetes
  • cloud infrastructure (AWS/GCP)
  • GPUs
  • high-throughput batch systems
  • developer platforms
  • internal platforms
  • self-serve infrastructure

What the JD emphasized

  • production infrastructure
  • GenAI
  • LLM Gateway
  • Agent Gateway
  • evals infrastructure
  • model serving
  • batch inference
  • guardrails
  • cost attribution
  • AI agents
  • tool orchestration
  • evaluation workflows
  • reliability
  • fallback
  • observability
  • cost controls
  • latency
  • scale
  • monitoring
  • SLOs
  • operational excellence

Other signals

  • building production infrastructure for Generative AI
  • increase the velocity of business impact from GenAI
  • production GenAI platform surfaces including the LLM Gateway, Agent Gateway, evals infrastructure, open-weights model serving, batch inference, fine-tuning, guardrails, and cost attribution
  • Design scalable systems for AI agents, MCP/tool orchestration, retrieval, batch inference, model serving, and evaluation workflows