Sr Applied Scientist, ML Codesign, Edge AI Platform

Amazon · Big Tech · Sunnyvale, CA · Applied Science

This role focuses on the joint optimization of model compression and silicon architecture for Amazon's edge and cloud inference accelerators. The scientist will define the hardware-aware compression roadmap, own the joint optimization of compression algorithms with the underlying hardware, and represent applied science in silicon architecture reviews. The goal is to ship advanced quantization and distillation techniques in production for large language models.

What you'd actually do

  1. Define the hardware-aware compression roadmap for next-generation accelerators, working backward from accuracy targets on standard language and reasoning benchmarks including Massive Multitask Language Understanding (MMLU), GSM8K, HumanEval, and Instruction Following Evaluation (IFEval).
  2. Own the joint optimization of compression algorithms (post-training quantization, quantization-aware training, knowledge distillation, structured pruning) with the underlying hardware.
  3. Represent applied science in silicon architecture reviews and influence decisions across the memory and compute subsystems of the accelerator.
  4. Set the science roadmap for the compression techniques the next architecture must support; validate that compression algorithms achieve target accuracy on the benchmarks our products are evaluated against.
  5. Mentor a team of senior and mid-level applied scientists working on compression and hardware-aware training.

Skills

Required

  • 3+ years of experience building machine learning models for business applications
  • PhD, or Master's degree and 6+ years of applied research experience
  • Experience programming in Java, C++, Python, or a related language
  • Experience with deep learning (neural network) methods and machine learning

Nice to have

  • Experience with modeling tools such as R, scikit-learn, Spark MLlib, MXNet, TensorFlow, NumPy, and SciPy
  • Experience with large-scale distributed systems such as Hadoop and Spark

What the JD emphasized

  • hardware-aware compression
  • accuracy targets
  • quantization
  • knowledge distillation
  • structured pruning
  • silicon architecture
  • compression algorithms
  • quantization-aware training

Other signals

  • joint optimization of model compression and silicon architecture
  • next generation of edge and cloud inference accelerators
  • advanced quantization and large-model distillation in production
  • multi-billion parameter language models at inference economics
  • senior architect of the next-generation accelerator and of the compression algorithms it executes natively