Software Dev Engineer II, Stores Foundational AI - SFAI

Amazon · Big Tech · Palo Alto, CA · Software Development

Software Development Engineer II focused on building and optimizing generative AI training systems, specifically for LLMs and RL post-training pipelines, at Amazon's Stores Foundational AI team. The role involves designing scalable data infrastructure, improving training efficiency and reliability, and translating research algorithms into production-ready systems.

What you'd actually do

  1. Design and implement a stable, efficient training system for model training and reinforcement learning that scales across a range of model sizes and architectures.
  2. Collaborate with applied scientists and engineers to improve training efficiency and reliability, accelerating innovation.
  3. Design and implement scalable data infrastructure that handles Amazon-scale data ingestion, processing, and delivery across the different training and evaluation stages.
  4. Quickly learn and adopt state-of-the-art technologies and algorithms in the field of Generative AI.
  5. Design and build end-to-end RL post-training pipelines (rollout → reward → optimization) at cluster scale.

Skills

Required

  • Software development experience
  • System design and architecture
  • Programming in at least one programming language
  • Machine Learning and LLM fundamentals
  • Transformer architecture
  • Training/inference lifecycles
  • Optimization techniques

Nice to have

  • JAX
  • PyTorch
  • vLLM
  • SGLang
  • Dynamo
  • TorchXLA
  • TensorRT
  • System performance
  • Memory management
  • Parallel computing principles
  • CUDA/C++/Kernel development

What the JD emphasized

  • training system
  • RL post-training
  • scalable data infrastructure
  • training efficiency
  • training stability
  • RL training stability
  • RL post-training efficiency
  • production-ready systems
  • observability systems

Other signals

  • develop generative AI for shopping
  • design and implementation of a stable and efficient training system
  • design and implement scalable data infrastructure
  • build end-to-end RL post-training pipelines
  • improve RL training stability
  • optimize RL post-training efficiency
  • translate new RL algorithms into scalable, production-ready systems
  • build observability systems for training dynamics