What you'd actually do

Lead and contribute to research efforts focused on real-time, multimodal generation—including text, image, video, and audio synthesis—as well as orchestration of agentic platform infrastructure

Design and prototype novel algorithms and architectures for high-fidelity, real-time multimodal synthesis and interactive experiences

Focus on real-time aspects of model inference and synthesis across modalities

Work on diffusion model distillation and/or develop diffusion-based world models for multimodal applications

Train and fine-tune autoregressive and diffusion models in LLM, VLM, or Audio LM contexts with a focus on real-time performance

Skills

Required

Python
PyTorch
TensorFlow
large language models
vision-language models
audio language models
deep learning
generative models
autoregressive models
diffusion models
real-time systems
agentic orchestration

Nice to have

video synthesis
audio synthesis
diffusion model distillation
world models
multimodal datasets

What the JD emphasized

5+ years of relevant experience

Demonstrated impact as first author on major publications in top conferences or journals

Deep expertise in at least one area: language modeling (LLM), vision-language modeling (VLM), or audio language modeling (Audio LM)

Strong experience with generative models, including autoregressive and diffusion models, and their real-time deployment

Experience developing and deploying real-time systems and/or agentic orchestration infrastructure

Multimodal LLM Researcher (MLLM)

About the Role

At Pika, we are pioneering next-generation creative infrastructure built around real-time, multimodal generation and intelligent, agentic platforms. We are seeking accomplished Multimodal LLM Researchers (LLM, VLM, and Audio LM) to drive forward our mission to make agentic real-time generative technology accessible, dynamic, and transformative for millions of creators.

As a core member of our research team, you will be integral to designing and building foundational technologies, developing novel approaches for large multimodal language models (LLMs/VLMs/Audio LMs), and orchestrating intelligent agentic systems that power scalable, interactive multimedia experiences. You will collaborate closely with engineering and product teams, shaping the future of real-time creative platforms.

What You’ll Do

Lead and contribute to research efforts focused on real-time, multimodal generation—including text, image, video, and audio synthesis—as well as orchestration of agentic platform infrastructure
Design and prototype novel algorithms and architectures for high-fidelity, real-time multimodal synthesis and interactive experiences
Focus on real-time aspects of model inference and synthesis across modalities
Work on diffusion model distillation and/or develop diffusion-based world models for multimodal applications
Train and fine-tune autoregressive and diffusion models in LLM, VLM, or Audio LM contexts with a focus on real-time performance
Curate specific datasets, especially for video, audio, cross-modal, and sensory-rich data
Collaborate with cross-functional teams to bring research advancements into production-ready technologies
Publish work in top-tier conferences and journals; communicate research results internally and externally
Stay at the cutting edge of real-time multimodal generative AI and agentic orchestration

What We’re Looking For

5+ years of relevant experience, including research during graduate studies, in large language models, vision-language models, audio language models, deep learning, or related fields
Demonstrated impact as first author on major publications in top conferences or journals (e.g., NeurIPS, CVPR, ICML, ICCV, SIGGRAPH, Interspeech)
Deep expertise in at least one area: language modeling (LLM), vision-language modeling (VLM), or audio language modeling (Audio LM)
Strong experience with generative models, including autoregressive and diffusion models, and their real-time deployment
Hands-on experience curating, constructing, or augmenting large, high-quality multimodal datasets
Experience developing and deploying real-time systems and/or agentic orchestration infrastructure
Strong programming and prototyping skills (Python, PyTorch, TensorFlow, etc.)
Passion for building creative tools and platforms that empower users
Excellent communication and collaboration skills

What We Offer

Competitive salary and substantial equity in a high-growth startup
Full health benefits, 401(k) matching, and more
Collaborative, mission-driven team environment with major growth opportunities
Flexible on-site/remote hybrid (HQ in Palo Alto, CA)

About Pika

Pika empowers creators by building state-of-the-art agentic and multimedia platforms. Our vision is to break down technical barriers to creativity, making real-time generation and intelligent orchestration accessible to all. Join us and shape the next evolution of creative technology!

If you are a leading researcher excited by real-time multimodal AI and agentic platforms, we want to hear from you.
