Engineer, Supercomputing & Distributed Systems

Krea AI · Multimodal · San Francisco, CA · Engineering

The role focuses on building and operating the infrastructure behind AI research and inference: distributed training on large GPU clusters, petabyte-scale data pipelines, custom distributed datastores, job orchestration systems, and streaming pipelines. Day to day, that means designing multi-stage data pipelines, scaling training and inference workloads, and optimizing dataloaders and networking for large training runs. It requires strong systems thinking and experience with distributed systems, Python, Kubernetes, and data tools.

What you'd actually do

  1. Design multi-stage pipelines that turn petabytes of raw data into clean, annotated datasets
  2. Manage distributed training and inference on 1000+ GPU Kubernetes clusters
  3. Solve orchestration and scaling for large-scale GPU job processing
  4. Profile and optimize dataloaders streaming thousands of images per second
  5. Build fault-tolerance systems for large-scale pretraining

Skills

Required

  • Python
  • Kubernetes
  • Torch
  • DuckDB
  • Arrow
  • distributed systems
  • containerization
  • operating systems
  • file systems
  • networking
  • streaming and event processing systems

Nice to have

  • K8s experience
  • ML experience
  • PyArrow
  • SQL
  • massive relational databases
  • Pandas
  • NumPy
  • Designing and implementing large-scale ETL systems
  • Distributed training systems (NCCL, InfiniBand, RDMA)
  • PyTorch internals
  • custom dataloaders
  • training infrastructure

What the JD emphasized

  • build a lot of this from scratch
  • custom distributed datastores
  • job orchestration systems
  • streaming pipelines that replace tools like Kafka and Ray for modern AI workloads at scale
  • 1000+ GPU Kubernetes clusters
  • large-scale GPU job processing
  • large-scale pretraining
  • massive multimedia data
  • billions of images
  • millions of videos
  • raw cluster capacity
  • distributed systems
  • custom distributed database
  • distributed training systems
  • streaming and event processing systems

Other signals

  • distributed training
  • GPU infrastructure
  • petabyte-scale data pipelines
  • custom distributed datastores
  • job orchestration systems
  • streaming pipelines