Principal Applied Scientist, ML Codesign

Amazon · Big Tech · Sunnyvale, CA · Applied Science

This role is for a Principal Applied Scientist focused on the joint optimization of model compression and silicon architecture for AI inference accelerators. The scientist will define the hardware-aware compression roadmap, own the joint optimization of compression algorithms with the underlying hardware, and influence silicon architecture decisions. The goal is to ship advanced compression techniques and large models on next-generation accelerators, closing the gap between model accuracy and hardware efficiency.

What you'd actually do

  1. Define the hardware-aware compression roadmap for next-generation accelerators, working backward from accuracy targets on standard language and reasoning benchmarks including Massive Multitask Language Understanding (MMLU), GSM8K, HumanEval, and Instruction Following Evaluation (IFEval).
  2. Own the joint optimization of compression algorithms (post-training quantization, quantization-aware training, knowledge distillation, structured pruning) with the underlying hardware.
  3. Represent applied science in silicon architecture reviews and influence decisions across the memory and compute subsystems of the accelerator.
  4. Set the science roadmap for the compression techniques the next architecture must support; validate that compression algorithms achieve target accuracy on the benchmarks our products are evaluated against.
  5. Mentor a team of senior and mid-level applied scientists working on compression and hardware-aware training.

Skills

Required

  • Master's or PhD in Computer Science, Electrical Engineering, or a related field, or equivalent industry experience
  • Eight or more years of industry experience with a track record of first-author or senior-author publications at top-tier venues in machine learning systems, computer architecture, or efficient machine learning
  • Demonstrated experience defining or co-defining a hardware architecture that shipped, whether in silicon, on a Field-Programmable Gate Array (FPGA), or as a large-scale software accelerator
  • Deep expertise in at least two of the following: low-bit quantization, structured and unstructured pruning, knowledge distillation, sparse computation, hardware-aware neural architecture search
  • Working knowledge of computer architecture fundamentals: memory hierarchy, dataflow architectures, on-chip interconnect

Nice to have

  • Direct experience contributing to silicon architecture for machine learning inference
  • Published work demonstrating hardware-software codesign, where the compression algorithm and the hardware were optimized jointly rather than sequentially
  • Experience applying compression techniques at large-model scale (tens of billions of parameters)
  • Familiarity with Application-Specific Integrated Circuit (ASIC) development flow, Register Transfer Level (RTL) review, or compiler intermediate representations including Multi-Level Intermediate Representation (MLIR) and OpenXLA
  • Experience with Mixture-of-Experts (MoE) inference architectures
  • Track record of mentoring senior applied scientists and shaping a multi-year research agenda
  • Prior experience with vertically integrated stacks where the same team owns model, compiler, runtime, and silicon

What the JD emphasized

  • published at MLSys, ISCA, MICRO, NeurIPS, or ICML on quantization, pruning, or hardware-aware training
  • shipping chips
  • vertical stack—model, compression, compiler, runtime, operating system, silicon—where the same engineering organization owns every layer
  • silicon architecture
  • compression algorithms

Other signals

  • influences model, compiler, runtime, and silicon
  • joint optimization of model compression and silicon architecture