Sr Software Dev Engineer, Stores Foundational AI -sfai

Amazon · Big Tech · Seattle, WA · Software Development

Senior Software Development Engineer role focused on building and scaling ML infrastructure for foundational LLMs in Amazon Stores: RL post-training pipelines, training stability and efficiency, and translating research into production systems.

What you'd actually do

  1. Architect and build scalable ML infrastructure that powers the training and deployment of large language models—directly shaping the future of AI-driven shopping experiences for all Amazon customers
  2. Drive technical innovation by designing experimentation frameworks and tooling that accelerate breakthrough insights, enabling scientists and engineers to iterate faster and smarter
  3. Lead cross-functional initiatives partnering with applied scientists and engineering teams to translate frontier research into production systems that delight customers
  4. Mentor and elevate the team through technical leadership, code reviews, and architectural guidance—raising the bar for engineering excellence across the organization
  5. Own impactful projects end-to-end across diverse technologies—from distributed computing and ML operations to prompt engineering—while navigating ambiguity and making strategic trade-offs that balance innovation with delivery

Skills

Required

  • 5+ years of non-internship professional software development experience
  • 5+ years of programming experience with at least one software programming language
  • 4+ years of experience leading design or architecture (design patterns, reliability and scaling) of new and existing systems
  • Experience as a mentor or tech lead, or leading an engineering team
  • Knowledge of Machine Learning and LLM fundamentals, including transformer architecture, training/inference lifecycles, and optimization techniques
  • Demonstrated ability to drive technical direction and influence engineering decisions across teams

Nice to have

  • Knowledge of ML frameworks including JAX, PyTorch, vLLM, SGLang, Dynamo, TorchXLA, and TensorRT
  • Experience with distributed systems, big data technologies, and machine learning infrastructure

What the JD emphasized

  • architect and build scalable ML infrastructure
  • design experimentation frameworks and tooling
  • translate frontier research into production systems
  • end-to-end RL post-training pipelines
  • improve RL training stability
  • optimize RL post-training efficiency
  • translate new RL algorithms into scalable, production-ready systems
  • build observability systems for training dynamics

Other signals

  • building foundational LLM for Amazon Stores
  • develop generative AI for shopping