Research Engineer, Prime Video - Content Understanding

Amazon · Big Tech · Seattle, WA · Systems, Quality, & Security Engineering

Research Engineer role focused on building experiment frameworks, infrastructure, and deployment packages for Generative AI models within Prime Video. Responsibilities include managing GPU clusters, containerizing models, and developing monitoring pipelines for production deployment. The role supports applied scientists and involves end-to-end ownership of model training and deployment.

What you'd actually do

  1. Architect and operate SLURM cluster infrastructure for distributed training and evaluation workloads
  2. Maintain healthy job status and manage queues from different users and teams
  3. Monitor feedback regarding cluster features and functions
  4. Identify and drive infrastructure optimization opportunities
  5. Develop self-service tooling, automation, and APIs
  6. Gather requirements from collaborators on research teams
  7. Containerize machine learning models
  8. Shepherd models through deployment into beta environments, including debugging builds
  9. Develop data collection, aggregation and monitoring pipelines
  10. Build observability and monitoring systems or dashboards for ML infrastructure

Skills

Required

  • Experience in automating, deploying, and supporting large-scale infrastructure
  • Experience programming with at least one modern language such as Python, Ruby, Golang, Java, C++, C#, Rust
  • Experience with Linux/Unix
  • Bachelor's degree, or 2+ years of experience working with Advanced Compute technologies including, but not limited to: Accelerated Compute, High Performance Compute, Visual/Spatial Compute, and/or IoT
  • 3+ years of cloud computing technologies experience
  • Experience in developing and deploying LLMs in production on GPUs, Neuron, TPU or other AI acceleration hardware, or experience leading the architecture and design (architecture, design patterns, reliability and scaling) of new and current systems

Nice to have

  • Experience with distributed systems at scale
  • Master's degree
  • Experience with large-scale machine learning systems, including profiling, debugging, and understanding system performance and scalability
  • Publications or contributions to open-source HPC projects

What the JD emphasized

  • end-to-end ownership of model training and deployment
  • managing our GPU cluster
  • containerizing trained models for deployment into production

Other signals

  • build and deploy Generative AI models
  • end-to-end ownership of model training and deployment
  • managing our GPU cluster
  • containerizing trained models for deployment into production