Member of Technical Staff - Reinforcement Learning (Infrastructure), AGI Autonomy

Amazon · Big Tech · San Francisco, CA · Applied Science

Develop training infrastructure for large-scale reinforcement learning on LLMs, working across the technology stack including ML systems, orchestration, and data management. Analyze, troubleshoot, and profile ML systems, and conduct MLSys research for new techniques and tooling.

What you'd actually do

  1. Develop training infrastructure so that large-scale reinforcement learning on LLMs runs efficiently and robustly.
  2. Work across the entire technology stack, including low-level ML systems, job orchestration, and data management.
  3. Analyze, troubleshoot, and profile complex ML systems; identify and address performance bottlenecks.
  4. Work closely with researchers and conduct MLSys research to create new techniques, infrastructure, and tooling around emerging research capabilities.

Skills

Required

  • PhD, or Master's degree and 3+ years of applied research experience
  • Python
  • Java
  • C++
  • deep learning methods
  • machine learning
  • training and deploying machine learning systems
  • troubleshooting and debugging technical systems

Nice to have

  • various machine learning techniques and parameters that affect their performance
  • large-scale machine learning systems
  • profiling and debugging
  • system performance and scalability
  • distributed systems
  • Megatron
  • vLLM
  • Ray
  • working with GPUs
  • patents or publications at top-tier peer-reviewed conferences or journals

What the JD emphasized

  • large-scale reinforcement learning
  • LLMs
  • MLSys research

Other signals

  • reinforcement learning
  • large-scale training infrastructure
  • ML systems