About Black Forest Labs

We're the team behind Latent Diffusion, Stable Diffusion, and FLUX, foundational technologies that changed how the world creates images and video. We build the generative models behind tools used by millions of creators, developers, and businesses worldwide. Our FLUX models are among the most advanced in the world, and we're just getting started.

Headquartered in Freiburg, Germany, with a growing presence in San Francisco, we’re scaling fast while staying true to what makes us different: research excellence, open science, and building technology that expands human creativity.

Why This Role

Large-scale training is where research ideas become real, and where many of the hardest problems are no longer cleanly separated into “research” or “engineering.” A promising architecture only matters if we can train it stably, efficiently, and correctly across large GPU fleets.

In this role, you will be embedded in production training and help where the hardest systems and performance problems arise: attention performance, custom kernels, low-precision training, profiling, memory behavior, data movement, distributed training stability, and throughput regressions. You will work directly with researchers, but your output will often be code, measurements, kernels, debugging tools, and training-system changes that make better research possible.

We are open to a range of seniority for this role. The common thread is deep technical ownership: you should be able to make progress in ambiguous training-system problems, verify your results, and own the outcome.

What You’ll Work On

Improve the performance, reliability, and numerical stability of production training runs for large multimodal generative models
Profile full training steps across model code, attention, kernels, data loading, encoders, communication, optimizer steps, checkpointing, and memory pressure
Implement and validate GPU-level optimizations: fused kernels, attention paths, low-precision matmuls, quantization kernels, CUDA/Triton/CuTe/CUTLASS experiments, and no-compile alternatives where they make sense
Push lower-precision training forward, including FP8 / MXFP8 / FP4-style paths, weight and activation quantization, accumulation choices, convergence risk, and quality tradeoffs against baseline training runs
Work with researchers to translate architecture changes into efficient training implementations, and help distinguish real model-quality progress from changes that only look good in a microbenchmark
Debug distributed training failures: NaNs, loss spikes, silent numerical drift, memory leaks, stragglers, bad nodes, NCCL issues, and throughput cliffs
Build benchmarking and profiling harnesses that make performance claims trustworthy across hardware, shapes, sequence lengths, and training configurations
Help the training team move quickly when an urgent bottleneck appears, while turning repeated failures into better abstractions and tools

What We’re Looking For

Experience working deeply on large-scale training systems, ideally as part of a training group working closely with researchers
Strong PyTorch fluency, including comfort reading and modifying low-level training code rather than only using high-level APIs
Experience with distributed training concepts such as FSDP, tensor/model/context/sequence parallelism, activation checkpointing, NCCL, and overlapping compute and communication
Hands-on experience improving training throughput, memory footprint, or stability in real training runs
Experience profiling GPU workloads with tools like Nsight Systems, Nsight Compute, torch profiler, trace viewers, or custom telemetry
Practical GPU performance judgment: you may use modern coding agents and tools as much as you want, but you need the understanding to verify correctness, numerical behavior, and performance, and to own the result
Understanding of low-precision training and quantization tradeoffs: FP8, MXFP8, FP4/NVFP4-style formats, scaling, accumulation, numerical validation, and convergence risk
Good research judgment: you can partner with researchers on ablations, understand what the measurements do and do not prove, and keep optimization work tied to model-quality outcomes
Comfortable operating in ambiguity: sometimes the task is a clean implementation, sometimes it is a production fire, and sometimes it is figuring out which of three plausible explanations is actually true

We'd be especially excited if you:

Have supported or co-owned training for a frontier foundation model that shipped or reached a major release
Have written or substantially improved forward/backward GPU kernels, or have shown you can make progress on kernel-level work with strong measurement and validation discipline
Have worked on attention performance, variable-sequence-length training, or non-standard attention patterns
Have experience on Hopper or Blackwell-class GPUs
Have worked on low-precision training
Have experience with diffusion, flow matching, DiT, and multimodal generative model training, or, if your deepest background is in autoregressive or LLM training systems, are excited to learn the diffusion and multimodal modeling stack quickly
Can move naturally between profiler traces, kernel code, distributed systems failures, and research discussions

How We Work Together

We’re a distributed team with real offices that people actually use. Depending on your role, you’ll either join us in Freiburg or SF at least 2 days a week (or one full week every other week), or work remotely with a monthly in-person week to stay connected. We’ll cover reasonable travel costs to make this possible. We think in-person time matters, and we’ve structured things to make it accessible to all. We’ll discuss what this will look like for the role during our interview process.

Everything we do is grounded in four values:

Obsessed. We are a frontier research lab. The science has to be right, the understanding deep, the product beautiful.
Low Ego. The work speaks. The best idea wins, no matter who said it. Credit is shared. Nobody is above any task.
Bold. We take the ambitious bet. We ship; we do not wait for conditions to be perfect.
Kind. People over politics. We treat each other with genuine warmth. Agency without empathy creates chaos.

If this sounds like work you’d enjoy, we’d love to hear from you.

Base Annual Salary:

US: $180,000 - $290,000 + equity

Member of Technical Staff - Research Engineer
