Senior Deep Learning Scientist, Multimodal Conversational AI

NVIDIA · Semiconductors · Santa Clara, CA

Senior Deep Learning Scientist role focused on developing, training, fine-tuning, and deploying streaming multimodal conversational AI systems spanning speech, audio, vision, voice chat, action, and human-AI interaction. The role applies research to define algorithmic improvements and scales them through the Nemotron platform, working on high-impact LLM products.

What you'd actually do

  1. Develop, train, fine-tune, and deploy streaming large language models that power multimodal conversational AI systems, spanning multimodal understanding, speech synthesis, speech-to-speech conversation, video generation, UI and animation rendering and control, environment interaction, and dialog reasoning and tool systems
  2. Apply the latest fundamental and applied research to develop products for multimodal conversational artificial intelligence
  3. Apply techniques such as instruction tuning, reinforcement learning from human feedback (RLHF), reinforcement learning with verifiable reward (RLVR), and parameter-efficient fine-tuning methods such as p-tuning, adapters, and LoRA (see the sketch after this list) to improve embodied conversational LLMs across multiple use cases
  4. Lead the collection, development, and labeling of domain-specific datasets to train LLMs for various multimodal tasks and applications
  5. Measure and benchmark model and application performance; analyze model accuracy and bias and recommend next steps and improvements; collaborate with other teams on new product features and improvements to existing products
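
For context, here is a minimal sketch of the LoRA idea named in item 3, in plain PyTorch. The class name, layer sizes, rank, and scaling below are illustrative assumptions, not taken from the posting or any NVIDIA codebase:

```python
# Hypothetical LoRA wrapper: freezes a pretrained linear layer and learns a
# low-rank additive update, y = Wx + (alpha / r) * B A x. Shapes are assumed.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no-op at step 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(1024, 1024))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")  # 16,384 low-rank vs ~1M frozen
```

Only the two rank-8 factors receive gradients, which is what makes the approach parameter-efficient when adapting large embodied conversational LLMs.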

Skills

Required

  • Master’s degree (or equivalent experience) or PhD in Computer Science, Electrical Engineering, Artificial Intelligence, or Applied Math with 8+ years of experience
  • Excellent programming skills in Python with strong fundamentals in programming, optimizations, and software development
  • Strong knowledge of ML/DL techniques, algorithms, and tools, with exposure to CNNs, RNNs (LSTMs), and Transformers (ViT, BERT, BART, GPT/T5, Megatron, LLMs, MoEs)
  • Experience training real-time audio-language, streaming visual-language, and streaming real-time audio-visual language models, as well as ViT, BERT, GPT, and Nemotron models, for computer vision, NLP, and dialog-system tasks using PyTorch, including data wrangling, tokenization, and multimodal alignment (a minimal alignment sketch follows this list)
  • Practical experience in natural language processing, speech/audio processing, computer vision, machine learning, and human-AI interaction
  • Hands-on experience with conversational AI technologies such as natural language understanding, natural language generation, dialog systems (including system integration, state tracking, and action prediction), information retrieval, question answering, and machine translation
  • Understanding of the model development life cycle, with experience in model development workflows, traceability, and dataset versioning, including working knowledge of database management and queries (e.g., SQL, MongoDB)
  • Strong collaborative and interpersonal skills, specifically a proven ability to effectively guide and influence within a dynamic matrix environment
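
For the data-wrangling and multimodal-alignment requirement above, a hedged sketch of one common pattern: projecting vision-encoder features into an LLM's token-embedding space so they can be interleaved with text. All dimensions and names are assumptions for illustration only:

```python
# Hypothetical modality projector: maps encoder features (e.g., ViT patches or
# audio frames) into an assumed LLM embedding space. Dimensions are made up.
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    def __init__(self, enc_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_patches_or_frames, enc_dim)
        return self.proj(feats)

vision_feats = torch.randn(2, 196, 768)  # e.g., a 14x14 ViT patch grid
text_embeds = torch.randn(2, 32, 4096)   # embedded text prompt (assumed dim)
mm_inputs = torch.cat([ModalityProjector()(vision_feats), text_embeds], dim=1)
print(mm_inputs.shape)  # torch.Size([2, 228, 4096])
```

The concatenated sequence can then be fed to a decoder-only LLM; a streaming variant would project audio/visual features chunk by chunk rather than encoding a whole clip at once.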

Nice to have

  • Native or near-native fluency in one of these non-English languages: Spanish, Mandarin, German, Japanese, Russian, French, UK English, Arabic, Korean, Italian, or Portuguese.
  • Demonstrated background in building LLMs that combine knowledge discovery with reasoning abilities, including disambiguation, clarification, anticipation, and effective error handling for embodied AI systems
  • Proven experience adapting LLMs to domains such as gaming, virtual assistants, and video conferencing
  • Experience integrating embodied AI systems with various sensor inputs (camera, microphone, touch, and so on) and backend action-fulfillment systems
  • Experience with long-term reasoning for embodied AI tasks (navigation, mobile manipulation, instruction following, and collaboration with humans) in gaming/physical environments, given natural-language instructions.

What the JD emphasized

  • streaming multimodal conversational AI
  • multimodal understanding
  • speech synthesis
  • speech-to-speech conversation
  • video generation
  • dialog reasoning and tool systems
  • instruction tuning
  • reinforcement learning from human feedback (RLHF)
  • reinforcement learning with verifiable reward (RLVR)
  • parameter-efficient finetuning methods
  • embodied conversational LLMs
  • domain-specific datasets
  • model accuracy and bias
  • conversational AI Technologies
  • embodied AI systems
  • long-term reasoning for embodied AI tasks

Other signals

  • multimodal conversational AI
  • large language models
  • streaming audio language
  • streaming visual language
  • streaming real-time audio-visual language models