Sr. Software Development Engineer, MLOps

Amazon · Big Tech · Bellevue, WA · Software Development

Senior Software Development Engineer focused on building and operating ML training infrastructure for robot learning at scale. The role covers designing and implementing distributed GPU training pipelines, CI/CD for ML models, experiment tracking, and data pipelines for robotics datasets, as well as operationalizing novel ML models, with a focus on Kubernetes and large-scale distributed systems.

What you'd actually do

  1. Design and implement scalable ML training infrastructure on Kubernetes (EKS) with GPU scheduling and fault-tolerant distributed training
  2. Build and maintain CI/CD pipelines for ML models — from data ingestion through training, evaluation, and deployment
  3. Develop tooling for experiment tracking, hyperparameter optimization, and reproducibility
  4. Architect data pipelines that handle large-scale robotics datasets (telemetry, sensor recordings, demonstrations)
  5. Collaborate with research scientists to operationalize novel ML models into production

Skills

Required

  • 5+ years of non-internship professional software development experience
  • 5+ years of programming experience in at least one programming language
  • 5+ years of experience leading the design or architecture (design patterns, reliability, and scaling) of new and existing systems
  • Experience as a mentor or tech lead, or leading an engineering team

Nice to have

  • 5+ years of experience across the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Bachelor's degree in computer science or equivalent
  • Knowledge of Machine Learning and LLM fundamentals, including transformer architecture, training/inference lifecycles, and optimization techniques
  • Knowledge of ML frameworks including JAX, PyTorch, vLLM, SGLang, Dynamo, TorchXLA, and TensorRT

What the JD emphasized

  • ML training infrastructure
  • robot learning at scale
  • distributed GPU training
  • operationalize novel ML models
  • large-scale robotics datasets

Other signals

  • ML training infrastructure
  • robot learning at scale
  • distributed GPU training
  • MLOps
  • operationalize novel ML models