Software Engineer, ML Infrastructure

Nuro · Robotics · CA · AI Platform

Software Engineer focused on building and evolving the core ML infrastructure platform behind Nuro's self-driving technology. The role spans scaling automated infrastructure-as-code, optimizing workload orchestration for massive-scale distributed training, designing robust pipelines for petabyte-scale sensor data, implementing feature caching and storage, and contributing to a unified ML platform. It requires experience in ML Infrastructure, Backend Platform Engineering, or Distributed Systems, with specific skills in IaC, workload scheduling, distributed data processing, and feature management.

What you'd actually do

  1. Scaling automated infrastructure-as-code (IaC) pipelines to manage thousands of GPU/CPU nodes across diverse environments (see the Pulumi sketch after this list).
  2. Designing and optimizing workload orchestration to maximize hardware utilization, minimize job wait times, and handle massive-scale distributed training (see the Ray gang-scheduling sketch below).
  3. Designing robust pipelines that extract and transform petabyte-scale sensor and telemetry data into ML-ready formats (see the Beam sketch below).
  4. Implementing feature caching and storage solutions that reduce redundant computation and ensure low-latency access to precomputed features (see the Redis read-through sketch below).
  5. Contributing to a unified ML platform that abstracts complex cloud infrastructure for end-users (see the facade sketch below).
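
To make item 1 concrete: a minimal Pulumi sketch (Pulumi is one of the IaC tools the JD names) for provisioning an autoscaling GPU node pool on GKE. The cluster name, machine type, accelerator type, and node counts are illustrative assumptions, not values from the posting.

```python
import pulumi
import pulumi_gcp as gcp

# Hypothetical GPU node pool attached to an existing (assumed) GKE cluster.
# Scale-to-zero plus a high max count mirrors the "thousands of nodes" framing.
gpu_pool = gcp.container.NodePool(
    "training-gpu-pool",                  # illustrative resource name
    cluster="ml-training-cluster",        # assumed pre-existing cluster
    initial_node_count=0,
    autoscaling=gcp.container.NodePoolAutoscalingArgs(
        min_node_count=0,
        max_node_count=512,
    ),
    node_config=gcp.container.NodePoolNodeConfigArgs(
        machine_type="n1-standard-16",
        guest_accelerators=[
            gcp.container.NodePoolNodeConfigGuestAcceleratorArgs(
                type="nvidia-tesla-t4",
                count=1,
            )
        ],
    ),
)

pulumi.export("gpu_pool_name", gpu_pool.name)
```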
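
For item 2, a hedged Ray sketch of gang scheduling for distributed training: a placement group reserves all GPU bundles up front so a multi-worker job either starts whole or waits, rather than deadlocking on partial capacity. Worker counts and resource shapes are illustrative; the training body is a placeholder.

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(address="auto")  # assumes an existing Ray cluster

NUM_WORKERS = 8

# One bundle per training worker; each bundle pins 1 GPU and 8 CPUs.
pg = placement_group([{"GPU": 1, "CPU": 8}] * NUM_WORKERS, strategy="PACK")
ray.get(pg.ready())  # block until the whole gang is schedulable

@ray.remote(num_gpus=1, num_cpus=8)
def train_shard(rank: int) -> float:
    # Placeholder for a real distributed-training step (e.g., DDP/NCCL setup).
    return 0.0

refs = [
    train_shard.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
    ).remote(rank)
    for rank in range(NUM_WORKERS)
]
losses = ray.get(refs)
```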
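
For item 3, a minimal Apache Beam sketch of the extract/transform shape described: read raw sensor records, decode and filter them, and write an ML-ready columnar output. The paths, schema, and decode_frame() function are hypothetical stand-ins for real sensor formats.

```python
import apache_beam as beam
import pyarrow as pa

# Assumed ML-ready output schema; a real pipeline would carry tensor payloads.
OUTPUT_SCHEMA = pa.schema([
    ("timestamp_us", pa.int64()),
    ("sensor_id", pa.string()),
    ("payload", pa.binary()),
])

def decode_frame(raw: bytes) -> dict:
    # Hypothetical decoder: parse one serialized sensor frame into a record.
    return {"timestamp_us": 0, "sensor_id": "unknown", "payload": raw}

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadRawFrames" >> beam.io.ReadFromTFRecord("gs://sensor-logs/raw/*.tfrecord")
        | "Decode" >> beam.Map(decode_frame)
        | "DropEmpty" >> beam.Filter(lambda rec: rec["payload"])
        | "WriteParquet" >> beam.io.WriteToParquet(
            "gs://sensor-logs/ml_ready/frames", OUTPUT_SCHEMA
        )
    )
```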
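
For item 4, a minimal sketch of read-through feature caching over Redis, in the spirit of the caching layers the role describes. compute_features() and the key scheme are hypothetical; a production layer would add TTL tuning, batched lookups, and versioned keys.

```python
import json
import redis

cache = redis.Redis(host="localhost", port=6379)

def compute_features(entity_id: str) -> dict:
    # Placeholder for the expensive path (e.g., a Spark/Beam backfill result).
    return {"entity_id": entity_id, "speed_mean_5s": 3.2}

def get_features(entity_id: str, ttl_s: int = 300) -> dict:
    key = f"features:v1:{entity_id}"
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)               # low-latency precomputed path
    features = compute_features(entity_id)   # redundant-computation path
    cache.set(key, json.dumps(features), ex=ttl_s)
    return features
```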
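
For item 5, a purely illustrative facade sketch of what "abstracts complex cloud infrastructure for end-users" can look like: the researcher submits a training spec and the platform decides cluster, quota, and scheduling. Every name here (SubmitSpec, PlatformClient) is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SubmitSpec:
    entrypoint: str        # e.g., "train.py --config perception.yaml"
    num_gpus: int = 8
    priority: str = "batch"

class PlatformClient:
    """Hypothetical user-facing API; infrastructure choices live behind it."""

    def submit(self, spec: SubmitSpec) -> str:
        # A real platform would pick a cluster/region, reserve quota, and
        # hand off to an orchestrator (Kubernetes, Ray, Slurm, ...).
        job_id = f"job-{abs(hash((spec.entrypoint, spec.num_gpus))) % 10_000}"
        print(f"scheduled {spec.entrypoint!r} on {spec.num_gpus} GPUs as {job_id}")
        return job_id

job = PlatformClient().submit(SubmitSpec("train.py --config perception.yaml"))
```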

Skills

Required

  • 3+ years of professional experience in ML Infrastructure, Backend Platform Engineering, or Distributed Systems.
  • Deep familiarity with modern Infrastructure-as-Code and provisioning tools such as Terraform, Pulumi, or Crossplane.
  • Hands-on experience building or managing large-scale orchestrators for compute-heavy workloads (e.g., Kubernetes, KubeRay, Ray, Slurm, or Volcano).
  • Proficiency in at least one distributed processing framework, such as Apache Spark or Apache Beam, for large-scale data extraction and transformation.
  • Experience implementing or maintaining feature stores and caching layers (e.g., Feast, Hopsworks, or Redis-based custom caching).
  • A strong understanding of distributed systems, networking, and storage bottlenecks in the context of high-performance computing.

Nice to have

  • Active contributor to open-source projects in the MLOps or Cloud-Native ecosystem (e.g., CNCF, Ray, or Kubeflow communities).
  • Experience with high-performance storage systems (e.g., Lustre, Ceph, or specialized NVMe caching) for ML data loading.
  • Knowledge of cost-optimization strategies for large-scale GPU clusters in public clouds (AWS, GCP, or Azure).

What the JD emphasized

  • ML Infrastructure
  • large-scale infrastructure
  • workload orchestration
  • data processing
  • automated resource provisioning
  • high-performance workload scheduling
  • efficient feature management
  • petabyte-scale sensor and telemetry data
  • ML-ready formats
  • feature caching and storage solutions
  • unified ML platform
  • GPU/CPU nodes
  • distributed training
  • feature stores
  • high-performance computing
