Software Engineer - Data Infra Reliabil… at Luma AI

Where You Come In

As our models scale to "omni" capabilities, our data infrastructure must be unbreakable. We are looking for a Data Reliability Engineer who brings a Site Reliability Engineering (SRE) mindset to the world of massive-scale data. You will be responsible for the resilience, automation, and scalability of the petabyte-scale pipelines that feed our research. This is not just about keeping the lights on; it’s about treating infrastructure as code and building self-healing data systems that allow our researchers to train on massive datasets without interruption. Whether you are a junior engineer with a passion for automation or a seasoned SRE veteran, you will play a critical role in hardening the backbone of Luma’s intelligence.

What You'll Do

Automate Everything: Apply Infrastructure-as-Code (IaC) principles using Terraform to provision, manage, and scale our data infrastructure.
Harden Data Pipelines: Build reliability and fault tolerance into our core data ingestion and processing workflows, ensuring high availability for research jobs.
Scale Kubernetes & Ray: Operate and optimize large-scale Kubernetes clusters and Ray deployments to handle bursty, high-throughput workloads.
Define Reliability: Establish Service Level Objectives (SLOs) and observability standards (Prometheus/Grafana) for our data platforms.
Debug & Heal: Serve as the first line of defense for complex infrastructure failures, diagnosing root causes in distributed storage and compute systems.

Who You Are

Deep SRE/DevOps Proficiency: You live and breathe Linux, networking, and automation.
Infrastructure-as-Code Native: You have extensive experience with Terraform, Ansible, or similar tools to manage complex cloud environments (AWS/GCP).
Kubernetes Expert: You have managed Kubernetes in production and understand its internals, not just how to deploy containers.
Python Proficiency: You can write high-quality Python code for automation, tooling, and infrastructure management.
Data-Minded: You understand the specific challenges of stateful data systems and high-throughput storage (S3/Object Store).

What Sets You Apart (Bonus Points)

Experience managing GPU clusters or AI/ML workloads.
Background in both Software Engineering and Operations (DevOps).
Experience with high-performance networking (InfiniBand/RDMA).
