Software Engineer II - Machine Learning

Uber · Consumer · Sunnyvale, CA · Engineering

Uber is seeking a Senior ML Engineer to build and scale an autonomous support agent for customer service. The role involves LLM orchestration, evaluation, safety guardrails, and ensuring reliability and cost efficiency in production systems handling millions of conversations. The engineer will also advance retrieval and reasoning pipelines and establish evaluation frameworks.

What you'd actually do

  1. Work on agent architecture: agentic planning and execution loops, long-term memory, persona/voice, knowledge routing, and policy enforcement for compliant, on‑brand conversations.
  2. Ship production systems that handle millions of conversations with rigorous SLOs, fallbacks, and canaries; design graceful degradation (e.g., human handoff) and safety guardrails (prompt‑injection, jailbreak, PII redaction).
  3. Advance retrieval & reasoning: build next-generation retrieval and reasoning pipelines in which the agent searches across different knowledge sources, applies policy-driven tools, and calls structured workflows, so that responses are consistently grounded.
  4. Establish evals that matter: offline rubrics, simulated scenarios, safety tests, cost/latency tradeoff suites, and LLM‑as‑judge (with calibrated human review) wired into CI/CD and experiment platforms.
  5. Drive automation at scale: partner with Product/Design/Operations on coverage, policy alignment, localization, and rollout strategy to improve customer experience and reduce cost per contact.

Skills

Required

  • Background in LLM-driven systems (inference optimization, prompt/program design, fine-tuning, distillation/LoRA, safety/guardrails, evals)
  • Strong software engineering in Python
  • Bachelor's degree (or above) in Computer Science or related field

Nice to have

  • Agentic architectures in production (planner/executor, memory, multi-step reasoning) and RAG over complex, policy-heavy knowledge bases.
  • Experience building support automation for large consumer platforms (routing, policy codification, internal tooling, co-pilot/auto-resolve).
  • Multilingual NLU/NLG (code-switching, low-resource languages), hallucination mitigation, safety red-teaming, and privacy-by-design.
  • Practical expertise balancing speed and reliability at scale: experiment frameworks, feature flags, canary/guarded rollouts, and clear kill-switches.

What the JD emphasized

  • autonomous support agent
  • LLM orchestration
  • evaluation
  • safety guardrails
  • reliability
  • cost efficiency
  • agentic planning and execution loops
  • long-term memory
  • policy enforcement
  • rigorous SLOs
  • graceful degradation
  • retrieval and reasoning pipelines
  • policy-driven tools
  • structured workflows
  • offline rubrics
  • simulated scenarios
  • safety tests
  • cost/latency tradeoff suites
  • LLM‑as‑judge
  • coverage
  • policy alignment
  • localization
  • rollout strategy
  • customer experience
  • cost per contact
  • LLM‑driven systems
  • inference optimization
  • prompt/program design
  • fine‑tuning
  • distillation/LoRA
  • safety/guardrails
  • evals
  • Python
  • Agentic architectures in production
  • planner/executor
  • memory
  • multi‑step reasoning
  • RAG
  • policy‑heavy knowledge bases
  • support automation
  • large consumer platforms
  • routing
  • policy codification
  • internal tooling
  • co‑pilot/auto‑resolve
  • Multilingual NLU/NLG
  • code‑switching
  • low‑resource languages
  • hallucination mitigation
  • safety red‑teaming
  • privacy‑by‑design
  • balancing speed and reliability at scale
  • experiment frameworks
  • feature flags
  • canary/guarded rollouts
  • clear kill‑switches
