Software Dev Engineer II, Stores Foundational AI - SFAI

Amazon · Big Tech · Palo Alto, CA · Software Development

Software Development Engineer II at Amazon on the Stores Foundational AI team, focusing on building and optimizing large-scale LLM training infrastructure, including pretraining and RL post-training pipelines, data infrastructure, and observability systems for generative AI in shopping.

What you'd actually do

  1. Design and implement a stable, efficient training system for model training and reinforcement learning that scales across a variety of model sizes and architectures.
  2. Collaborate with other applied scientists and engineers to improve training efficiency and reliability, accelerating innovation.
  3. Design and implement scalable data infrastructure that handles Amazon-scale data ingestion, processing, and delivery across different training and evaluation stages.
  4. Quickly learn and adopt state-of-the-art technologies and algorithms in the field of Generative AI.
  5. Design and build end-to-end RL post-training pipelines (rollout → reward → optimization) at cluster scale.

Skills

Required

  • Software development experience
  • System design and architecture
  • Proficiency in at least one programming language
  • Machine learning and LLM fundamentals
  • Transformer architecture
  • Training/inference lifecycles
  • Optimization techniques

Nice to have

  • JAX
  • PyTorch
  • vLLM
  • SGLang
  • Dynamo
  • TorchXLA
  • TensorRT
  • System performance
  • Memory management
  • Parallel computing principles
  • CUDA/C++/kernel development

What the JD emphasized

  • Amazon-scale data ingestion
  • RL post-training pipelines
  • RL training stability
  • RL post-training efficiency
  • production-ready systems
  • observability systems for training dynamics

Other signals

  • develop generative AI for shopping
  • design and implementation of a stable and efficient training system
  • design and implement scalable data infrastructure
  • build end-to-end RL post-training pipelines
  • improve RL training stability
  • optimize RL post-training efficiency
  • translate new RL algorithms into scalable, production-ready systems
  • build observability systems for training dynamics