About Black Forest Labs

We're the team behind Latent Diffusion, Stable Diffusion, and FLUX — foundational technologies that changed how the world creates images and video. Our models power the tools used by millions of creators, developers, and businesses worldwide, and FLUX is among the most advanced generative systems in the world.

Headquartered in Freiburg, Germany, with a growing presence in San Francisco, we're scaling fast while staying true to what makes us different: research excellence, open science, and building technology that expands human creativity.

Why This Role

Vision-language models are becoming foundational to how people interact with generative AI — but most VLM research happens in isolation from the generation stack. At Black Forest Labs, we're integrating VLMs directly into FLUX in ways that make our models more powerful, more controllable, and more aligned with what creators actually want.

This role is about pioneering that integration. You won't be applying off-the-shelf VLMs — you'll develop novel approaches, innovate on architectures, and answer questions that haven't been solved yet: how vision and language representations inform each other, how multimodal understanding improves generation quality, and how to make these capabilities deployable at scale without compromising what makes FLUX exceptional.

This is a Staff / Senior IC role. We're looking for someone who has pretrained or significantly advanced a VLM, not just fine-tuned one.

What You'll Work On

Lead development and training of state-of-the-art multimodal vision-language models within the FLUX stack — innovating on architectures, not just applying existing ones
Design fine-tuning strategies that adapt VLMs to specialized creative use cases (captioning, editing instructions, prompt enhancement) that general-purpose models can't handle
Research integrations between VLM/LLM capabilities and our diffusion and flow pipelines — finding creative ways to improve generation quality and controllability without computational bottlenecks
Evaluate emerging multimodal architectures, translating the best of recent research into practical improvements

What We're Looking For

You've pretrained or significantly advanced a VLM (not just SFT'd or LoRA'd one) that was deployed in a production system or released publicly
Strong publication record or unambiguous production track record showing you push the frontier on multimodal architectures
Deep understanding of how vision and language representations interact: tokenization, alignment, grounding, cross-modal attention, and the failure modes of each
Experience with distributed training at multi-node scale
Comfortable at the research/production boundary — you care whether the work ships and generalizes, not just whether it reads well
Experience with diffusion or flow-based generative models is a strong plus — especially if you've thought about how autoregressive and diffusion paradigms can compose

How We Work Together

We’re a distributed team with real offices that people actually use. Depending on your role, you’ll either join us in Freiburg or SF at least 2 days a week (or one full week every other week), or work remotely with a monthly in-person week to stay connected. We’ll cover reasonable travel costs to make this possible. We think in-person time matters, and we’ve structured things to make it accessible to all. We’ll discuss what this will look like for the role during our interview process.

Everything we do is grounded in four values:

Obsessed. We are a frontier research lab. The science has to be right, the understanding deep, the product beautiful.
Low Ego. The work speaks. The best idea wins, no matter who said it. Credit is shared. Nobody is above any task.
Bold. We take the ambitious bet. We ship; we do not wait for conditions to be perfect.
Kind. People over politics. We treat each other with genuine warmth. Agency without empathy creates chaos.

If this sounds like work you’d enjoy, we’d love to hear from you.

Member of Technical Staff - VLM
