Software Dev Engineer II, Stores Foundational AI - SFAI

Amazon · Big Tech · Palo Alto, CA · Software Development

Software Development Engineer II role focused on building and scaling generative AI training infrastructure, specifically for LLMs. Responsibilities include designing and implementing stable and efficient training systems, scalable data infrastructure, and end-to-end RL post-training pipelines. The role involves collaborating with scientists and engineers to improve training efficiency and reliability and to optimize the stability and efficiency of RL training. It also includes building observability systems and contributing to system design and technical roadmaps for a unified LLM training platform.

What you'd actually do

  1. Design and implement a stable and efficient training system for model training and reinforcement learning that scales across a range of model sizes and architectures.
  2. Collaborate with other talented applied scientists and engineers to improve training efficiency and reliability, accelerating innovation.
  3. Design and implement scalable data infrastructure that handles Amazon-scale data ingestion, processing, and delivery across different training and evaluation stages.
  4. Quickly learn and adopt state-of-the-art technologies and algorithms in the field of Generative AI.
  5. Design and build end-to-end RL post-training pipelines (rollout → reward → optimization) at cluster scale.

Skills

Required

  • machine learning and LLM fundamentals
  • transformer architecture
  • training/inference lifecycles
  • optimization techniques
  • software development experience
  • system design and architecture

Nice to have

  • JAX
  • PyTorch
  • vLLM
  • SGLang
  • Dynamo
  • TorchXLA
  • TensorRT
  • system performance
  • memory management
  • parallel computing principles
  • CUDA/C++/kernel development

What the JD emphasized

  • training system
  • scalable data infrastructure
  • RL post-training pipelines
  • RL training stability
  • RL post-training efficiency
  • observability systems for training dynamics

Other signals

  • develop generative AI for shopping
  • design and implementation of a stable and efficient training system
  • scalable data infrastructure
  • end-to-end RL post-training pipelines
  • translate new RL algorithms into scalable, production-ready systems
  • unified platform for large-scale LLM training