Applied Scientist, Edge AI and Science

Amazon · Big Tech · Cambridge, MA, United Kingdom · Machine Learning Science

Applied Scientist role focused on compressing generative AI models (LLMs, VLMs, speech, audio, omni) for edge and cloud deployment. The role involves applying and extending state-of-the-art compression techniques (knowledge distillation, pruning, quantization), designing healing recipes (fine-tuning) to recover accuracy, building reference implementations for partner teams, and defining benchmarks that evaluate the trade-offs among accuracy, latency, memory, and cost. The goal is to make the path from training to deployment seamless.

What you'd actually do

  1. Apply and extend compression recipes (knowledge distillation, structured pruning, and both post-training quantization and quantization-aware training, including low-bit and mixed-precision) to assigned models, achieving 20x to 100x compression while preserving model quality.
  2. Design and run healing recipes (fine-tuning and distillation that recover accuracy lost to compression), iterating on data mixes, objectives, and training settings until the compressed model meets its quality bar.
  3. Track emerging model architectures and dissect how they work internally, so you can choose where to compress, anticipate where accuracy will break, and design recovery strategies grounded in the model's actual structure.
  4. Build a library of compression-ready model entries: reference implementations, compression recipes, model cards, and benchmark results that partner teams can run self-service to produce deployment-ready artifacts for edge and cloud targets.
  5. Define the datasets, benchmarks, and KPIs that matter for your models, and build evaluation methodology that makes accuracy, latency, memory, and cost trade-offs explicit.
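
For readers unfamiliar with the terms above, here is a minimal sketch of post-training quantization with an explicit size-versus-error report. It is illustrative only and is not taken from the job description: the function names (quantize_int8, dequantize, report) and the example weights are hypothetical, and it assumes symmetric per-tensor quantization from 32-bit floats to 8-bit integers.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to signed 8-bit integers."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [v * scale for v in q]


def report(weights, bits_before=32, bits_after=8):
    """Make the trade-off explicit: storage saved versus error introduced."""
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    return {
        "compression_ratio": bits_before / bits_after,
        "max_abs_error": max_err,
        "scale": scale,
    }


if __name__ == "__main__":
    # Hypothetical weights, chosen only for illustration.
    print(report([0.5, -1.0, 0.25, 0.0]))
```

Quantizing 32-bit weights to 8 bits gives only a 4x reduction, so the 20x to 100x range named in item 1 would presumably require combining several techniques (for example lower bit widths, pruning, and distillation into a smaller model), followed by the healing step in item 2 to recover lost accuracy.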

Skills

Required

  • knowledge distillation
  • structured pruning
  • post-training quantization
  • quantization-aware training
  • low-bit precision
  • mixed-precision
  • fine-tuning
  • model architectures
  • model quality
  • benchmarks
  • evaluation methodology
  • latency
  • memory
  • cost
  • reproducible code
  • testable code
  • well-documented code

Nice to have

  • LLMs
  • vision-language models
  • speech models
  • audio models
  • omni models
  • edge deployment
  • cloud deployment
  • MLOps
  • SDE I bar

What the JD emphasized

  • compression science
  • compression techniques
  • compression recipes
  • compression-ready
  • compression
  • compressing models
  • compressing a model
  • compress
  • compression-ready model entries
  • compression-science partners

Other signals

  • compressing models
  • deployment workflows
  • model compression
  • quantization
  • fine-tuning
  • inference