Member of Technical Staff - Research Engineer

Black Forest Labs Black Forest Labs · Multimodal · San Francisco, CA · Remote · Research

This role focuses on optimizing and stabilizing large-scale training systems for multimodal generative models. It involves deep work on GPU performance, distributed training, low-precision techniques, and debugging complex training issues, bridging the gap between research ideas and production reality.

What you'd actually do

  1. Improve the performance, reliability, and numerical stability of production training runs for large multimodal generative models
  2. Profile full training steps across model code, attention, kernels, data loading, encoders, communication, optimizer steps, checkpointing, and memory pressure
  3. Implement and validate GPU-level optimizations: fused kernels, attention paths, low-precision matmuls, quantization kernels, CUDA/Triton/CuTe/CUTLASS experiments, and no-compile alternatives where they make sense
  4. Push lower-precision training forward, including FP8 / MXFP8 / FP4-style paths, weight and activation quantization, accumulation choices, convergence risk, and quality tradeoffs against baseline training runs
  5. Work with researchers to translate architecture changes into efficient training implementations, and help distinguish real model-quality progress from changes that only look good in a microbenchmark

Skills

Required

  • PyTorch fluency
  • distributed training concepts (FSDP, parallelism, checkpointing, NCCL)
  • GPU workload profiling (Nsight Systems, Nsight Compute, torch profiler)
  • low-precision training and quantization tradeoffs
  • research judgment
  • operating in ambiguity

Nice to have

  • experience supporting frontier foundation model training
  • writing/improving GPU kernels
  • attention performance optimization
  • experience on Hopper or Blackwell-class GPUs
  • experience with diffusion, flow matching, DiT, and multimodal generative model training

What the JD emphasized

  • deep technical ownership
  • make progress in ambiguous training-system problems
  • verify your results
  • own the outcome
  • deeply on large-scale training systems
  • comfort reading and modifying low-level training code
  • Hands-on experience improving training throughput, memory footprint, or stability in real training runs
  • Practical GPU performance judgment
  • need the understanding to verify correctness, numerical behavior, and performance, and to own the result
  • Understanding of low-precision training and quantization tradeoffs
  • Good research judgment
  • Comfortable operating in ambiguity
  • supported or co-owned training for a frontier foundation model that shipped or reached a major release
  • written or substantially improved forward/backward GPU kernels
  • worked on attention performance, variable sequence length training, non-standard attention patterns
  • worked on low-precision training
  • experience with diffusion, flow matching, DiT, and multimodal generative model training

Other signals

  • large-scale training
  • GPU optimization
  • distributed training
  • low-precision training