Multimodal Generative AI Researcher

at Stability AI · AI Frontier · Remote · Research




Location: Remote

About the Role

We’re looking for a Research Scientist with deep expertise in training and fine-tuning large Vision-Language and Language Models (VLMs / LLMs) for downstream multimodal tasks. You’ll help push the frontier of models that reason across vision, language, and 3D, bridging research breakthroughs with scalable engineering.

What You’ll Do

  • Design and fine-tune large-scale VLMs / LLMs — and hybrid architectures — for tasks such as visual reasoning, retrieval, 3D understanding, and embodied interaction.
  • Build robust, efficient training and evaluation pipelines (data curation, distributed training, mixed precision, scalable fine-tuning).
  • Conduct in-depth analysis of model performance: ablations, bias / robustness checks, and generalisation studies.
  • Collaborate across research, engineering, and 3D / graphics teams to bring models from prototype to production.
  • Publish impactful research and help establish best practices for multimodal model adaptation.
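To make the mixed-precision item above concrete: small fp16 gradients underflow to zero, so training loops scale the loss up before the backward pass and unscale the gradients in fp32 afterwards. A minimal NumPy sketch of the idea (the scale factor 1024 is an illustrative choice; frameworks such as PyTorch's GradScaler pick it dynamically):

```python
import numpy as np

# A gradient value too small for fp16: it underflows to exactly 0.0,
# silently stalling training for the affected parameter.
grad_fp32 = np.float32(1e-8)
naive_fp16 = np.float16(grad_fp32)
assert naive_fp16 == 0.0  # underflow

# Loss scaling: multiply the loss (and hence all gradients) by a large
# constant before casting to fp16, then divide it back out in fp32.
scale = np.float32(1024.0)
scaled_fp16 = np.float16(grad_fp32 * scale)   # now representable in fp16
recovered = np.float32(scaled_fp16) / scale   # unscale in fp32

assert scaled_fp16 != 0.0
# The recovered gradient matches the true value up to fp16 rounding.
assert abs(recovered - grad_fp32) / grad_fp32 < 0.01
```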

What You Bring

  • PhD (or equivalent experience) in Machine Learning, Computer Vision, NLP, Robotics, or Computer Graphics.
  • Proven track record in fine-tuning or training large-scale VLMs / LLMs for real-world downstream tasks.
  • Strong engineering mindset — you can design, debug, and scale training systems end-to-end.
  • Deep understanding of multimodal alignment and representation learning (vision–language fusion, CLIP-style pre-training, retrieval-augmented generation).
  • Familiarity with recent trends, including video-language and long-context VLMs, spatio-temporal grounding, agentic multimodal reasoning, and Mixture-of-Experts (MoE) fine-tuning.
  • Awareness of 3D-aware multimodal models — using NeRFs, Gaussian splatting, or differentiable renderers for grounded reasoning and 3D scene understanding.
  • Hands-on experience with PyTorch / DeepSpeed / Ray and distributed or mixed-precision training.
  • Excellent communication skills and a collaborative mindset.
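For concreteness on the CLIP-style pre-training mentioned above: matched image/text embedding pairs are pulled together with a symmetric cross-entropy over a temperature-scaled cosine-similarity matrix. A toy NumPy sketch, with random vectors standing in for real encoder outputs and an illustrative temperature value:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, dim = 4, 8

# Stand-ins for image/text encoder outputs; pair i is a matched pair.
img = rng.normal(size=(batch, dim)).astype(np.float32)
txt = img + 0.1 * rng.normal(size=(batch, dim)).astype(np.float32)

# L2-normalise so the dot product is cosine similarity.
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

temperature = 0.07                  # illustrative value
logits = img @ txt.T / temperature  # (batch, batch) similarity matrix

def cross_entropy(logits, targets):
    # Row-wise softmax cross-entropy against integer class targets.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

targets = np.arange(batch)  # the diagonal holds the matched pairs
# Symmetric loss: image->text and text->image directions.
loss = 0.5 * (cross_entropy(logits, targets) + cross_entropy(logits.T, targets))

# With near-duplicate pairs, each row's best match is its own pair.
assert (logits.argmax(axis=1) == targets).all()
```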

Bonus / Preferred

  • Experience integrating 3D and graphics pipelines into training workflows (e.g., mesh or point-cloud encoding, differentiable rendering, 3D VLMs).
  • Research or implementation experience with vision-language-action models, world-model-style architectures, or multimodal agents that perceive and act.
  • Familiarity with parameter-efficient fine-tuning methods (LoRA, QLoRA, adapters) and distillation for edge deployment.
  • Knowledge of video and 4D generation trends, latent diffusion / rectified flow methods, or multimodal retrieval and reasoning pipelines.
  • Background in GPU optimisation, quantisation, or model compression for real-time inference.
  • Open-source or publication track record in top-tier ML / CV / NLP venues.
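As a sketch of the LoRA adaptation mentioned above: the pretrained weight W stays frozen and only a low-rank update scaled by alpha / r is trained, shrinking trainable parameters from d * k to r * (d + k). The dimensions, rank, and alpha below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 512, 8      # illustrative layer size and LoRA rank

W = rng.normal(size=(d, k)).astype(np.float32)  # frozen pretrained weight
A = rng.normal(size=(r, k)).astype(np.float32)  # trainable, rank r
B = np.zeros((d, r), dtype=np.float32)          # trainable, zero-initialised
alpha = 16.0                                    # illustrative scaling

def forward(x):
    # Effective weight is W + (alpha / r) * B @ A; since B = 0 at init,
    # the adapted model starts out identical to the base model.
    return x @ (W + (alpha / r) * (B @ A)).T

x = rng.normal(size=(2, k)).astype(np.float32)
assert np.allclose(forward(x), x @ W.T)  # identity at initialisation

# Trainable parameters shrink from d*k to r*(d + k): here ~3% of the layer.
assert r * (d + k) < 0.04 * (d * k)
```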

Equal Employment Opportunity:

We are an equal opportunity employer and do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, disability or other legally protected statuses.