Member of Technical Staff - ML Infrastructure Engineer

Black Forest Labs · Multimodal · Freiburg, San Francisco · Engineering

Designs, deploys, and maintains ML infrastructure for training and inference clusters, optimizing for researcher iteration speed and production inference performance. Focuses on cloud platforms, Kubernetes, Slurm, IaC, and CI/CD for ML workflows.

What you'd actually do

  1. Design, deploy, and maintain cloud-based ML training clusters (Slurm) and inference clusters (Kubernetes) that researchers and products depend on
  2. Implement and manage network-based cloud file systems and blob/S3 storage solutions optimized for ML workloads at scale
  3. Develop and maintain Infrastructure as Code (IaC) for resource provisioning, because manual configuration doesn't scale and configuration drift breaks things
  4. Implement and optimize CI/CD pipelines for ML workflows, making it easy for researchers to go from experiment to production
  5. Design and implement custom autoscaling solutions for ML workloads where standard approaches fall short

Skills

Required

  • cloud platforms (AWS, Azure, or GCP) with focus on ML/AI services
  • Kubernetes and Slurm cluster management in production environments
  • Infrastructure as Code tools (Terraform, Ansible, etc.)
  • managing and optimizing network-based cloud file systems and object storage for ML workloads
  • CI/CD tools and practices (CircleCI, GitHub Actions, ArgoCD, etc.) in ML contexts
  • security principles and best practices in cloud environments
  • monitoring and observability tools (Prometheus, Grafana, Loki, etc.)
  • ML workflows and GPU infrastructure management

Nice to have

  • building custom autoscaling solutions for ML workloads
  • cost optimization strategies for cloud-based ML infrastructure
  • MLOps practices and tools
  • high-performance computing (HPC) environments
  • data versioning and experiment tracking for ML
  • network optimization techniques for distributed ML training
  • multi-cloud or hybrid cloud architectures
  • container security and vulnerability scanning tools

What the JD emphasized

  • ML infrastructure at scale
  • supporting AI research is fundamentally different from traditional cloud infrastructure
  • paged because a training run failed
  • debugged why storage became the bottleneck
  • infrastructure that works when researchers depend on it for months-long experiments

Other signals

  • ML infrastructure backbone
  • frontier AI research
  • training clusters
  • inference clusters
  • ML workloads at scale
  • efficient ML operations