Software Engineer, ML Infrastructure

Nuro · Robotics · CA · AI Platform

Software Engineer focused on building and evolving the core ML infrastructure platform behind Nuro's self-driving technology. The role spans scaling automated infrastructure-as-code, optimizing workload orchestration for massive-scale distributed training, designing robust pipelines for petabyte-scale sensor data, implementing feature caching and storage, and contributing to a unified ML platform. It requires experience in ML Infrastructure, Backend Platform Engineering, or Distributed Systems, with specific skills in IaC, workload scheduling, distributed data processing, and feature management.

What you'd actually do

  1. Scaling automated infrastructure-as-code (IaC) pipelines to manage thousands of GPU/CPU nodes across diverse environments (see the Pulumi sketch after this list).
  2. Designing and optimizing workload orchestration to maximize hardware utilization, minimize job wait times, and handle massive-scale distributed training (see the Ray gang-scheduling sketch below).
  3. Designing robust pipelines that extract and transform petabyte-scale sensor and telemetry data into ML-ready formats (see the Beam sketch below).
  4. Implementing feature caching and storage solutions that reduce redundant computation and ensure low-latency access to precomputed features (see the Redis read-through sketch below).
  5. Contributing to a unified ML platform that abstracts complex cloud infrastructure for end-users (see the facade sketch below).
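
To make item 1 concrete: a minimal Pulumi sketch (Pulumi is one of the IaC tools the JD names) for provisioning an autoscaling GPU node pool on GKE. The cluster name, machine type, accelerator type, and node counts are illustrative assumptions, not values from the posting.

```python
import pulumi
import pulumi_gcp as gcp

# Hypothetical GPU node pool attached to an existing (assumed) GKE cluster.
# Scale-to-zero plus a high max count mirrors the "thousands of nodes" framing.
gpu_pool = gcp.container.NodePool(
    "training-gpu-pool",                  # illustrative resource name
    cluster="ml-training-cluster",        # assumed pre-existing cluster
    initial_node_count=0,
    autoscaling=gcp.container.NodePoolAutoscalingArgs(
        min_node_count=0,
        max_node_count=512,
    ),
    node_config=gcp.container.NodePoolNodeConfigArgs(
        machine_type="n1-standard-16",
        guest_accelerators=[
            gcp.container.NodePoolNodeConfigGuestAcceleratorArgs(
                type="nvidia-tesla-t4",
                count=1,
            )
        ],
    ),
)

pulumi.export("gpu_pool_name", gpu_pool.name)
```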
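
For item 2, a hedged Ray sketch of gang scheduling for distributed training: a placement group reserves all GPU bundles up front so a multi-worker job either starts whole or waits, rather than deadlocking on partial capacity. Worker counts and resource shapes are illustrative; the training body is a placeholder.

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(address="auto")  # assumes an existing Ray cluster

NUM_WORKERS = 8

# One bundle per training worker; each bundle pins 1 GPU and 8 CPUs.
pg = placement_group([{"GPU": 1, "CPU": 8}] * NUM_WORKERS, strategy="PACK")
ray.get(pg.ready())  # block until the whole gang is schedulable

@ray.remote(num_gpus=1, num_cpus=8)
def train_shard(rank: int) -> float:
    # Placeholder for a real distributed-training step (e.g., DDP/NCCL setup).
    return 0.0

refs = [
    train_shard.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
    ).remote(rank)
    for rank in range(NUM_WORKERS)
]
losses = ray.get(refs)
```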
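
For item 3, a minimal Apache Beam sketch of the extract/transform shape described: read raw sensor records, decode and filter them, and write an ML-ready columnar output. The paths, schema, and decode_frame() function are hypothetical stand-ins for real sensor formats.

```python
import apache_beam as beam
import pyarrow as pa

# Assumed ML-ready output schema; a real pipeline would carry tensor payloads.
OUTPUT_SCHEMA = pa.schema([
    ("timestamp_us", pa.int64()),
    ("sensor_id", pa.string()),
    ("payload", pa.binary()),
])

def decode_frame(raw: bytes) -> dict:
    # Hypothetical decoder: parse one serialized sensor frame into a record.
    return {"timestamp_us": 0, "sensor_id": "unknown", "payload": raw}

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadRawFrames" >> beam.io.ReadFromTFRecord("gs://sensor-logs/raw/*.tfrecord")
        | "Decode" >> beam.Map(decode_frame)
        | "DropEmpty" >> beam.Filter(lambda rec: rec["payload"])
        | "WriteParquet" >> beam.io.WriteToParquet(
            "gs://sensor-logs/ml_ready/frames", OUTPUT_SCHEMA
        )
    )
```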
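
For item 4, a minimal sketch of read-through feature caching over Redis, in the spirit of the caching layers the role describes. compute_features() and the key scheme are hypothetical; a production layer would add TTL tuning, batched lookups, and versioned keys.

```python
import json
import redis

cache = redis.Redis(host="localhost", port=6379)

def compute_features(entity_id: str) -> dict:
    # Placeholder for the expensive path (e.g., a Spark/Beam backfill result).
    return {"entity_id": entity_id, "speed_mean_5s": 3.2}

def get_features(entity_id: str, ttl_s: int = 300) -> dict:
    key = f"features:v1:{entity_id}"
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)               # low-latency precomputed path
    features = compute_features(entity_id)   # redundant-computation path
    cache.set(key, json.dumps(features), ex=ttl_s)
    return features
```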
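
For item 5, a purely illustrative facade sketch of what "abstracts complex cloud infrastructure for end-users" can look like: the researcher submits a training spec and the platform decides cluster, quota, and scheduling. Every name here (SubmitSpec, PlatformClient) is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SubmitSpec:
    entrypoint: str        # e.g., "train.py --config perception.yaml"
    num_gpus: int = 8
    priority: str = "batch"

class PlatformClient:
    """Hypothetical user-facing API; infrastructure choices live behind it."""

    def submit(self, spec: SubmitSpec) -> str:
        # A real platform would pick a cluster/region, reserve quota, and
        # hand off to an orchestrator (Kubernetes, Ray, Slurm, ...).
        job_id = f"job-{abs(hash((spec.entrypoint, spec.num_gpus))) % 10_000}"
        print(f"scheduled {spec.entrypoint!r} on {spec.num_gpus} GPUs as {job_id}")
        return job_id

job = PlatformClient().submit(SubmitSpec("train.py --config perception.yaml"))
```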

Skills

Required

  • 3+ years of professional experience in ML Infrastructure, Backend Platform Engineering, or Distributed Systems.
  • Deep familiarity with modern Infrastructure-as-Code and provisioning tools such as Terraform, Pulumi, or Crossplane.
  • Hands-on experience building or managing large-scale orchestrators for compute-heavy workloads (e.g., Kubernetes, KubeRay, Ray, Slurm, or Volcano).
  • Proficiency in at least one distributed processing framework, such as Apache Spark or Apache Beam, for large-scale data extraction and transformation.
  • Experience implementing or maintaining feature stores and caching layers (e.g., Feast, Hopsworks, or Redis-based custom caching).
  • A strong understanding of distributed systems, networking, and storage bottlenecks in the context of high-performance computing.

Nice to have

  • Active contributor to open-source projects in the MLOps or Cloud-Native ecosystem (e.g., CNCF, Ray, or Kubeflow communities).
  • Experience with high-performance storage systems (e.g., Lustre, Ceph, or specialized NVMe caching) for ML data loading.
  • Knowledge of cost-optimization strategies for large-scale GPU clusters in public clouds (AWS, GCP, or Azure).

What the JD emphasized

  • ML Infrastructure
  • large-scale infrastructure
  • workload orchestration
  • data processing
  • automated resource provisioning
  • high-performance workload scheduling
  • efficient feature management
  • petabyte-scale sensor and telemetry data
  • ML-ready formats
  • feature caching and storage solutions
  • unified ML platform
  • GPU/CPU nodes
  • distributed training
  • feature stores
  • high-performance computing
