Member of Technical Staff - Large Scale Data Infrastructure

Black Forest Labs · Multimodal · Freiburg, San Francisco · Engineering

Infrastructure engineering role focused on building and optimizing data systems for large-scale AI model training runs at petabyte-to-exabyte scale, including scalable data loaders, efficient storage, and a multi-cloud object storage abstraction.

What you'd actually do

  1. Build scalable data loaders for training runs across thousands of GPUs
  2. Build efficient storage and retrieval systems for petabyte-scale datasets
  3. Build a multi-cloud object storage abstraction
  4. Execute large-scale data migrations across storage systems and providers
  5. Debug and resolve performance bottlenecks in distributed data loading

Skills

Required

  • Python
  • PyTorch DataLoader internals
  • Object storage (e.g. S3, Azure Blob, GCS)
  • Parquet for metadata
  • Video: FFmpeg, PyAV, codec fundamentals

Nice to have

  • Streaming dataset formats (e.g. WebDataset)
  • Video codec internals and frame-accurate seeking
  • Distributed systems experience
  • Slurm and Kubernetes for job orchestration
  • Object storage performance tuning across providers

What the JD emphasized

  • Built and operated data pipelines at petabyte scale
  • Optimized data loading
  • Worked with petabyte-scale video and image datasets
  • Written processing jobs operating on millions of files
  • Debugged distributed system bottlenecks across large fleets of machines

Other signals

  • petabyte-scale data systems
  • thousands of GPUs
  • data loaders for training runs
  • storage and retrieval systems
  • multi-cloud object storage abstraction