SDE, ML Engineer, Frontier AI Robotics

Amazon · Big Tech · San Francisco, CA · Software Development

Machine Learning Systems Engineer on the Frontier AI Robotics team, focused on building and optimizing distributed training infrastructure for large-scale deep learning and transformer models. The role involves engineering scalable, high-performance systems for AI research and applications, with an emphasis on robotics, multimodal perception, and manipulation strategies. Requires strong expertise in software development, ML infrastructure, and deep learning frameworks.

What you'd actually do

  1. Design, build, and optimize machine learning infrastructure for large-scale training and inference.
  2. Apply PyTorch, Python, and C++ skills to engineer modular, scalable ML systems.
  3. Evaluate and implement parallelism techniques such as data, tensor, model, and pipeline parallelism.
  4. Monitor and optimize GPU memory and throughput for training large models efficiently.
  5. Collaborate cross-functionally with research and data infrastructure teams to integrate new models and features.
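To make item 3 concrete, here is a minimal, hedged sketch of the idea behind data parallelism: each worker computes gradients on its own shard of the batch, and the gradients are averaged (the all-reduce step) before one shared parameter update. This is a pure-Python toy on a one-parameter least-squares problem, not production code; a real system would use e.g. `torch.nn.parallel.DistributedDataParallel`.

```python
def grad(w, shard):
    # Gradient of mean((w*x - y)^2) over the shard, w.r.t. w.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, batch, n_workers, lr=0.01):
    # Split the batch into equal shards, one per simulated worker.
    shard_size = len(batch) // n_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(n_workers)]
    # Each worker computes its local gradient (conceptually in parallel).
    grads = [grad(w, s) for s in shards]
    # All-reduce: average the per-worker gradients, then take one shared step.
    avg = sum(grads) / n_workers
    return w - lr * avg

# Toy data with true slope 3: y = 3x.
batch = [(x, 3.0 * x) for x in range(1, 9)]
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, batch, n_workers=4)
print(round(w, 3))  # → 3.0
```

With equal shard sizes, the averaged shard gradients equal the full-batch gradient, which is why data parallelism preserves the single-worker update.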
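Item 4 refers to routine instrumentation of training loops. Below is a hedged, stdlib-only sketch of throughput tracking (the `step_fn`, step count, and token count are illustrative placeholders); in a real PyTorch job this would be paired with GPU memory queries such as `torch.cuda.max_memory_allocated()`.

```python
import time

def measure_throughput(step_fn, n_steps, tokens_per_step):
    # Time n_steps calls of the training step and report tokens/sec.
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return n_steps * tokens_per_step / elapsed

# Stand-in for an actual training step.
tps = measure_throughput(lambda: sum(range(10_000)), n_steps=20, tokens_per_step=4096)
print(f"{tps:.0f} tokens/sec")
```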

Skills

Required

  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship experience designing or architecting new and existing systems (design patterns, reliability, scaling)
  • Experience programming in at least one programming language
  • PyTorch
  • Python
  • C++
  • Deep understanding of LLM algorithms and of deep learning frameworks such as PyTorch
  • Mathematics and Statistics: Strong understanding of linear algebra, calculus, probability, and statistics

Nice to have

  • 3+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Bachelor's degree in computer science or equivalent

What the JD emphasized

  • large-scale training
  • large models
  • large-scale machine learning models
  • large-scale training and inference

Other signals

  • distributed training infrastructure
  • large-scale machine learning models
  • deep learning
  • transformer-based architectures
  • scalable, high-performance systems
  • frontier foundation models
  • end-to-end learned systems
  • multimodal perception
  • sophisticated manipulation strategies
  • parallelism techniques