Software Engineer, ML Infrastructure

Nuro · Robotics · CA · N/A

Software Engineer focused on building and evolving the core ML infrastructure platform for Nuro's self-driving vehicles. Responsibilities include scaling automated resource provisioning (IaC), designing intelligent workload orchestration for massive-scale distributed training, building robust pipelines for petabyte-scale sensor data transformation, and implementing feature caching/storage solutions for low-latency access. The role aims to abstract complex cloud infrastructure for researchers and engineers to accelerate Nuro Driver™ development.

What you'd actually do

  1. Scaling automated infrastructure-as-code (IaC) pipelines to manage thousands of GPU/CPU nodes across diverse environments.
  2. Designing and optimizing workload orchestration to maximize hardware utilization, minimize job wait times, and handle massive-scale distributed training.
  3. Designing robust pipelines for the extraction and transformation of petabyte-scale sensor and telemetry data into ML-ready formats.
  4. Implementing robust feature caching and storage solutions to reduce redundant computations and ensure low-latency access to pre-computed features.
  5. Contributing to a unified ML platform that abstracts complex cloud infrastructure for end-users.

Skills

Required

  • 3+ years of professional experience in ML Infrastructure, Backend Platform Engineering, or Distributed Systems.
  • Deep familiarity with modern Infrastructure-as-Code and provisioning tools such as Terraform, Pulumi, or Crossplane.
  • Hands-on experience building or managing large-scale orchestrators for compute-heavy workloads (e.g., Kubernetes, KubeRay, Ray, Slurm, or Volcano).
  • Proficiency in at least one distributed processing framework, such as Apache Spark or Apache Beam, for large-scale data extraction and transformation.
  • Experience implementing or maintaining feature stores and caching layers (e.g., Feast, Hopsworks, or Redis-based custom caching).
  • A strong understanding of distributed systems, networking, and storage bottlenecks in the context of high-performance computing.

Nice to have

  • Active contributor to open-source projects in the MLOps or Cloud-Native ecosystem (e.g., CNCF, Ray, or Kubeflow communities).
  • Experience with high-performance storage systems (e.g., Lustre, Ceph, or specialized NVMe caching) for ML data loading.
  • Knowledge of cost-optimization strategies for large-scale GPU clusters in public clouds (AWS, GCP, or Azure).

What the JD emphasized

  • large-scale infrastructure
  • workload orchestration
  • data processing
  • automated resource provisioning
  • high-performance workload scheduling
  • efficient feature management
  • GPU/CPU nodes
  • distributed training
  • petabyte-scale sensor and telemetry data
  • ML-ready formats
  • feature caching and storage
  • unified ML platform

Other signals

  • ML Infrastructure