Software Dev Engineer II, Stores Foundational AI - SFAI

Amazon · Big Tech · Seattle, WA · Software Development

Software Development Engineer II role focused on building and improving generative AI for shopping using LLMs. Responsibilities include designing and implementing stable and efficient training systems for model training and reinforcement learning, developing scalable data infrastructure, and optimizing RL post-training pipelines. The role involves collaborating with scientists and engineers to accelerate innovation and translate research into production-ready systems.

What you'd actually do

  1. Design and implement a stable and efficient training system for model training and reinforcement learning that scales to a variety of model sizes and architectures.
  2. Collaborate with other talented applied scientists and engineers to improve training efficiency and reliability in ways that accelerate innovation.
  3. Design and implement scalable data infrastructure that handles Amazon-scale data ingestion, processing, and delivery across different training and evaluation stages.
  4. Quickly learn and adopt state-of-the-art technologies and algorithms in the field of Generative AI.

Skills

Required

  • Machine Learning and LLM fundamentals
  • transformer architecture
  • training/inference lifecycles
  • optimization techniques
  • software development experience
  • design or architecture of new and existing systems
  • programming in at least one programming language

Nice to have

  • JAX
  • PyTorch
  • vLLM
  • SGLang
  • Dynamo
  • TorchXLA
  • TensorRT
  • system performance
  • memory management
  • parallel computing principles
  • CUDA/C++/kernel development

What the JD emphasized

  • training system
  • reinforcement learning
  • scalable data infrastructure
  • RL post-training pipelines
  • RL training stability
  • RL post-training efficiency
  • production-ready systems
  • observability systems for training dynamics

Other signals

  • Generative AI for shopping
  • LLM training system
  • RL post-training pipelines
  • scalable data infrastructure