What you'd actually do

Lead ML Platform Engineering: Build, test, and release mission-critical infrastructure specifically designed for large-scale ML workloads (training, evaluation, and simulation) on AWS and on-prem.

Optimize Training Performance: Partner with the Perception ML team to improve training throughput, GPU utilization, and model iteration speed.

Scalable Data Pipelines: Set up fault-tolerant, multi-region environments for massive-scale data preparation and ingestion required for autonomous driving models.

Own ML Lifecycle CI/CD: Design and maintain specialized pipelines for model versioning, automated evaluation, and seamless deployment to edge or cloud environments.

GPU & Cloud Cost Management: Drive aggressive cost optimization strategies in AWS, focusing on high-cost resources like P-family instances, Spot instances, and large-scale S3 storage.

Skills

Required

Software Engineering
DevOps
MLOps
Kubernetes (EKS)
AWS (S3, RDS, Secrets Manager, CloudWatch)
Python
Go or Java
GitLab
Terraform
Datadog, Prometheus, or Weights & Biases
GitOps
ML-specific tools (e.g., ArgoCD, Kubeflow, or Metaflow)
Microservice architectures

Nice to have

CUDA
NCCL
PyTorch
TensorFlow
Linux internals
High-speed networking
Distributed computing
AWS Solutions Architect or Machine Learning Specialty certification

What the JD emphasized

mission-critical infrastructure

large-scale ML workloads

training

evaluation

simulation

Perception ML team

training throughput

GPU utilization

model iteration speed

Scalable Data Pipelines

massive-scale data preparation

autonomous driving models

ML Lifecycle CI/CD

automated evaluation

deployment to edge or cloud environments

GPU & Cloud Cost Management

aggressive cost optimization strategies

high-cost resources

P-family instances

Spot instances

large-scale S3 storage

Cross-Functional Collaboration

ML Research

deep learning architectures

4+ Yrs. experience in a Software Engineering, DevOps, or MLOps role

4+ Yrs. experience managing production-grade distributed systems

high-throughput data

compute-heavy workloads

Expertise in ML Infrastructure

Deep knowledge of Kubernetes (EKS)

scheduling GPU workloads

managing node groups

large-scale EFS/FSx for Lustre storage

Cloud Proficiency

Advanced skills in the AWS stack

high-performance computing (HPC) patterns

Automation & Coding

Hands-on proficiency in Python

Go or Java

GitLab for orchestration

Infrastructure as Code

Expert-level mastery of Terraform

reproducible ML environments

Monitoring & Observability

monitoring high-performance clusters

MLOps Patterns

GitOps

ML-specific tools

microservice architectures

Knowledge of CUDA

NCCL

deep learning frameworks (PyTorch/TensorFlow)

Linux internals

high-speed networking

distributed computing

AWS Solutions Architect or Machine Learning Specialty certification

About Rivian

Rivian is on a mission to keep the world adventurous forever. This goes for the emissions-free Electric Adventure Vehicles we build, and the curious, courageous souls we seek to attract. As a company, we constantly challenge what’s possible, never simply accepting what has always been done. We reframe old problems, seek new solutions and operate comfortably in areas that are unknown. Our backgrounds are diverse, but our team shares a love of the outdoors and a desire to protect it for future generations.

Role Summary

As a member of Rivian's ADAS team, you'll be a key Software Engineer responsible for building, testing, and releasing mission-critical infrastructure services. Your work will directly support our autonomous driving initiatives by ensuring the reliability, scalability, and security of our cloud-based and on-premises systems. This role is crucial for automating processes and maintaining our CI/CD pipelines to enable the rapid development of safety-critical self-driving features.

Responsibilities

Lead ML Platform Engineering: Build, test, and release mission-critical infrastructure specifically designed for large-scale ML workloads (training, evaluation, and simulation) on AWS and on-prem.

Optimize Training Performance: Partner with the Perception ML team to improve training throughput, GPU utilization, and model iteration speed.

Scalable Data Pipelines: Set up fault-tolerant, multi-region environments for massive-scale data preparation and ingestion required for autonomous driving models.

Own ML Lifecycle CI/CD: Design and maintain specialized pipelines for model versioning, automated evaluation, and seamless deployment to edge or cloud environments.

GPU & Cloud Cost Management: Drive aggressive cost optimization strategies in AWS, focusing on high-cost resources like P-family instances, Spot instances, and large-scale S3 storage.

Cross-Functional Collaboration: Act as the bridge between "generic" infrastructure and ML Research, ensuring the platform meets the unique requirements of deep learning architectures.

Qualifications

4+ years of experience in a Software Engineering, DevOps, or MLOps role.

4+ years of experience managing production-grade distributed systems, specifically those handling high-throughput data or compute-heavy workloads.

Expertise in ML Infrastructure: Deep knowledge of Kubernetes (EKS), specifically for scheduling GPU workloads, managing node groups, and handling large-scale EFS/FSx for Lustre storage.

Cloud Proficiency: Advanced skills in the AWS stack (S3, RDS, Secrets Manager, CloudWatch) with a focus on high-performance computing (HPC) patterns.

Automation & Coding: Hands-on proficiency in Python (required for ML scripting) and Go or Java, using GitLab for orchestration.

Infrastructure as Code: Expert-level mastery of Terraform for reproducible ML environments.

Monitoring & Observability: Experience monitoring high-performance clusters using Datadog, Prometheus, or Weights & Biases.

MLOps Patterns: Experience with GitOps and ML-specific tools (e.g., ArgoCD, Kubeflow, or Metaflow) and microservice architectures.

The "Plus" List: Knowledge of CUDA, NCCL, or deep learning frameworks (PyTorch/TensorFlow). Understanding of Linux internals, high-speed networking, and distributed computing. AWS Solutions Architect or Machine Learning Specialty certification.

Equal Opportunity

Rivian is an equal opportunity employer and complies with all applicable federal, state, and local fair employment practices laws. All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, ancestry, sex, sexual orientation, gender, gender expression, gender identity, genetic information or characteristics, physical or mental disability, marital/domestic partner status, age, military/veteran status, medical condition, or any other characteristic protected by law.

Rivian is committed to ensuring that our hiring process is accessible for persons with disabilities. If you have a disability or limitation, such as those covered by the Americans with Disabilities Act, that requires accommodations to assist you in the search and application process, please email us at candidateaccommodations@rivian.com.

Candidate Data Privacy

Rivian may collect, use and disclose your personal information or personal data (within the meaning of the applicable data protection laws) when you apply for employment and/or participate in our recruitment processes (“Candidate Personal Data”). This data includes contact, demographic, communications, educational, professional, employment, social media/website, network/device, recruiting system usage/interaction, security and preference information.

Rivian may use your Candidate Personal Data for the purposes of (i) tracking interactions with our recruiting system; (ii) carrying out, analyzing and improving our application and recruitment process, including assessing you and your application and conducting employment, background and reference checks; (iii) establishing an employment relationship or entering into an employment contract with you; (iv) complying with our legal, regulatory and corporate governance obligations; (v) recordkeeping; (vi) ensuring network and information security and preventing fraud; and (vii) as otherwise required or permitted by applicable law.

Rivian may share your Candidate Personal Data with (i) internal personnel who have a need to know such information in order to perform their duties, including individuals on our People Team, Finance, Legal, and the team(s) with the position(s) for which you are applying; (ii) Rivian affiliates; and (iii) Rivian’s service providers, including providers of background checks, staffing services, and cloud services. Rivian may transfer or store internationally your Candidate Personal Data, including to or in the United States, Canada, the United Kingdom, and the European Union and in the cloud, and this data may be subject to the laws and accessible to the courts, law enforcement and national security authorities of such jurisdictions.

Please note that we are currently not accepting applications from third-party application services.

Sr. Software Engineer, Infrastructure, MLOps, Autonomy
