Senior ML Engineer, Fauna

Amazon · Big Tech · NY +1 · Software Development

This Senior ML Engineer role builds and scales ML systems for intelligent robots, focusing on designing and maintaining the infrastructure used to train, evaluate, and deploy ML models. The role sits at the intersection of ML and systems engineering: it is responsible for keeping training and deployment systems robust, efficient, and scalable, with particular emphasis on optimizing model inference for edge devices.

What you'd actually do

  1. Design and build scalable ML training infrastructure, including distributed training pipelines and GPU cluster management, both in the cloud and on-premises
  2. Develop systems for experiment tracking, model versioning, and reproducibility
  3. Build deployment infrastructure for serving ML models on robotic hardware with strict latency requirements
  4. Optimize model inference for edge devices and embedded systems
  5. Collaborate with research teams to accelerate the path from experimentation to production

Skills

Required

  • 5+ years of non-internship professional software development experience
  • 5+ years of experience programming in at least one software programming language
  • 5+ years of experience leading the design or architecture (design patterns, reliability, and scaling) of new and existing systems
  • Experience as a mentor or tech lead, or in leading an engineering team
  • Bachelor's degree or above (including a Master's degree) in computer science, machine learning, engineering, or a related field
  • Experience with Machine Learning and Large Language Model fundamentals, including architecture, training/inference lifecycles, and optimization of model execution, or experience in development in the last 3 years
  • Experience with machine learning (ML) tools and methods
  • Experience with Kubernetes, Docker, or the broader container ecosystem; or experience demonstrating strong analytical skills, attention to detail, and effective communication, along with programming/scripting experience (Batch, VB, PowerShell, Java, C#, Chef, Perl, Ruby, and/or PHP)

Nice to have

  • Experience building and operating a cloud-based architecture
  • Experience with robotics data (sensor streams, video, point clouds) and real-time inference systems
  • Familiarity with model optimization techniques (quantization, pruning, distillation)
  • Experience with reinforcement learning or simulation-based training pipelines

What the JD emphasized

  • strict latency requirements
  • edge devices
  • embedded systems
  • robotics data
  • real-time inference systems
  • reinforcement learning
  • simulation-based training pipelines

Other signals

  • build and scale ML systems for robots
  • design and maintain infrastructure for training, evaluating, and deploying ML models
  • intersection of ML and systems engineering
  • robust, efficient, and scalable ML training and deployment systems
  • optimize model inference for edge devices