Senior Software Development Engineer, Stores Foundational AI - Rufus

Amazon · Big Tech · Palo Alto, CA · Software Development

Senior Software Development Engineer focused on building and scaling foundational LLMs for Amazon Stores. The role involves architecting and building ML infrastructure for LLM training and post-training workflows (fine-tuning, RL, continuous learning), transforming customer interactions into training signals, optimizing RL systems, and partnering with scientists to productionize frontier techniques like RLHF and agentic workflows. It emphasizes end-to-end system ownership, from design and implementation through deployment and observability, along with low-level optimization (e.g., CUDA kernels) and ML platforms such as vLLM, SGLang, and TensorRT.

What you'd actually do

  1. Architect and build scalable ML infrastructure powering LLM training and post-training workflows, including supervised fine-tuning, reinforcement learning, and continuous learning from live traffic
  2. Transform real-world customer interactions into high-quality training signals, enabling continuous model improvement and better customer experiences
  3. Build and optimize post-training and RL systems, including reward modeling, policy optimization, and data collection loops
  4. Drive experimentation and iteration velocity by building tooling and frameworks that enable rapid hypothesis testing, signal validation, and model quality improvements
  5. Partner closely with applied scientists to translate frontier techniques (e.g., RLHF, agentic workflows, multi-turn optimization) into reliable, production-grade systems

Skills

Required

  • 5+ years of non-internship professional software development experience
  • 5+ years of experience programming in at least one software programming language
  • 5+ years of experience leading design or architecture (design patterns, reliability, and scaling) of new and existing systems
  • Experience as a mentor or tech lead, or leading an engineering team
  • Experience with vLLM, SGLang, TensorRT or similar platforms in production environments
  • Experience with CUDA kernels or ML/low-level kernels

Nice to have

  • 5+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Bachelor's degree in computer science or equivalent
  • Experience with Machine Learning and Large Language Model fundamentals, including architecture, training/inference lifecycles, and optimization of model execution

What the JD emphasized

  • post-training
  • reinforcement learning
  • continuous learning
  • customer interactions
  • frontier research
  • RLHF
  • agentic workflows
  • multi-turn optimization
  • production-grade systems
  • vLLM
  • SGLang
  • TensorRT
  • CUDA kernels
  • ML/low-level kernels

Other signals

  • building foundational LLMs
  • continuous learning from real-world customer interactions
  • large-scale systems
  • ML infrastructure
  • frontier research
  • post-training
  • reinforcement learning
  • production at Amazon scale