Member of Technical Staff - ML Engineer, Frontier AI Robotics

Amazon · Big Tech · San Francisco, CA · Software Development

ML Engineer role focused on building and optimizing distributed training infrastructure for large-scale deep learning and transformer-based models, specifically for frontier AI robotics applications. The role involves working with scientists and engineers to deliver scalable, high-performance systems, leveraging PyTorch, Python, and C++, and optimizing GPU performance for training.

What you'd actually do

  1. Design, build, and optimize machine learning infrastructure for large-scale training and inference.
  2. Apply PyTorch, Python, and C++ skills to engineer modular, scalable ML systems.
  3. Evaluate and implement parallelism techniques such as data, tensor, model, and pipeline parallelism.
  4. Monitor and optimize GPU memory and throughput for training large models efficiently.
  5. Collaborate cross-functionally with research and data infrastructure teams to integrate new models and features.

Skills

Required

  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship experience in the design or architecture (design patterns, reliability, and scaling) of new and existing systems
  • Experience programming in at least one programming language
  • Deep understanding of LLM algorithms and deep learning frameworks such as PyTorch.
  • Mathematics and Statistics: Strong understanding of linear algebra, calculus, probability, and statistics.

Nice to have

  • 3+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Bachelor's degree in computer science or equivalent

What the JD emphasized

  • large-scale machine learning models
  • distributed training infrastructure
  • deep learning
  • transformer-based architectures
  • state-of-the-art AI research
  • foundation models
  • end-to-end learned systems
  • multimodal perception
  • sophisticated manipulation strategies
  • parallelism techniques
  • GPU memory and throughput optimization
