Sr. Software Engineer, Infrastructure, MLOps, Autonomy

Rivian · Auto · Belgrade, Serbia · Autonomous Driving

The Sr. Software Engineer, Infrastructure, MLOps, Autonomy role at Rivian focuses on building and optimizing ML infrastructure for autonomous driving. Responsibilities include leading ML platform engineering, optimizing training performance, setting up scalable data pipelines, owning ML lifecycle CI/CD, and managing GPU and cloud costs. The role requires expertise in Kubernetes, AWS, Python, Go or Java, Terraform, and monitoring tools.

What you'd actually do

  1. Lead ML Platform Engineering: Build, test, and release mission-critical infrastructure specifically designed for large-scale ML workloads (training, evaluation, and simulation) on AWS and on-prem.
  2. Optimize Training Performance: Partner with the Perception ML team to improve training throughput, GPU utilization, and model iteration speed.
  3. Scalable Data Pipelines: Set up fault-tolerant, multi-region environments for massive-scale data preparation and ingestion required for autonomous driving models.
  4. Own ML Lifecycle CI/CD: Design and maintain specialized pipelines for model versioning, automated evaluation, and seamless deployment to edge or cloud environments.
  5. GPU & Cloud Cost Management: Drive aggressive cost optimization strategies in AWS, focusing on high-cost resources like P-family instances, Spot instances, and large-scale S3 storage.

Skills

Required

  • Software Engineering
  • DevOps
  • MLOps
  • Kubernetes (EKS)
  • AWS (S3, RDS, Secrets Manager, CloudWatch)
  • Python
  • Go or Java
  • GitLab
  • Terraform
  • Datadog, Prometheus, or Weights & Biases
  • GitOps
  • ML-specific tools (e.g., ArgoCD, Kubeflow, or Metaflow)
  • Microservice architectures

Nice to have

  • CUDA
  • NCCL
  • PyTorch
  • TensorFlow
  • Linux internals
  • High-speed networking
  • Distributed computing
  • AWS Solutions Architect or Machine Learning Specialty certification

What the JD emphasized

  • mission-critical infrastructure
  • large-scale ML workloads
  • training
  • evaluation
  • simulation
  • Perception ML team
  • training throughput
  • GPU utilization
  • model iteration speed
  • Scalable Data Pipelines
  • massive-scale data preparation
  • autonomous driving models
  • ML Lifecycle CI/CD
  • automated evaluation
  • deployment to edge or cloud environments
  • GPU & Cloud Cost Management
  • aggressive cost optimization strategies
  • high-cost resources
  • P-family instances
  • Spot instances
  • large-scale S3 storage
  • Cross-Functional Collaboration
  • ML Research
  • deep learning architectures
  • 4+ Yrs. experience in a Software Engineering, DevOps, or MLOps role
  • 4+ Yrs. experience managing production-grade distributed systems
  • high-throughput data
  • compute-heavy workloads
  • Expertise in ML Infrastructure
  • Deep knowledge of Kubernetes (EKS)
  • scheduling GPU workloads
  • managing node groups
  • large-scale EFS/FSx for Lustre storage
  • Cloud Proficiency
  • Advanced skills in the AWS stack
  • high-performance computing (HPC) patterns
  • Automation & Coding
  • Hands-on proficiency in Python
  • Go or Java
  • GitLab for orchestration
  • Infrastructure as Code
  • Expert-level mastery of Terraform
  • reproducible ML environments
  • Monitoring & Observability
  • monitoring high-performance clusters
  • MLOps Patterns
  • GitOps
  • ML-specific tools
  • microservice architectures
  • Knowledge of CUDA
  • NCCL
  • deep learning frameworks (PyTorch/TensorFlow)
  • Linux internals
  • high-speed networking
  • distributed computing
  • AWS Solutions Architect or Machine Learning Specialty certification

Other signals

  • MLOps
  • ML Infrastructure
  • Training
  • Evaluation
  • CI/CD
  • GPU Optimization
  • Data Pipelines