Senior Deep Learning Scientist, Multimodal Conversational AI

NVIDIA · Semiconductors · Santa Clara, CA

Senior Deep Learning Scientist role focused on developing, training, fine-tuning, and deploying streaming multimodal conversational AI systems spanning speech, audio, vision, voice chat, action, and human-AI interaction. The role applies research to define algorithmic improvements and scales them through the Nemotron platform, working on high-impact LLM products.

What you'd actually do

  1. Develop, train, fine-tune, and deploy streaming large language models that power multimodal conversational AI systems, spanning multimodal understanding, speech synthesis, speech-to-speech conversation, video generation, UI and animation rendering and control, environment interaction, and dialog reasoning and tool systems
  2. Apply the latest fundamental and applied research to develop products for multimodal conversational artificial intelligence
  3. Apply techniques such as instruction tuning, reinforcement learning from human feedback (RLHF), reinforcement learning with verifiable reward (RLVR), and parameter-efficient fine-tuning methods such as p-tuning, adapters, and LoRA (see the sketch after this list) to improve embodied conversational LLMs across multiple use cases
  4. Lead the collection, development, and labeling of domain-specific datasets to train LLMs for various multimodal tasks and applications
  5. Measure and benchmark model and application performance; analyze model accuracy and bias and recommend next steps and improvements; collaborate with other teams on new product features and improvements to existing products
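
For context, here is a minimal sketch of the LoRA idea named in item 3, in plain PyTorch. The class name, layer sizes, rank, and scaling below are illustrative assumptions, not taken from the posting or any NVIDIA codebase:

```python
# Hypothetical LoRA wrapper: freezes a pretrained linear layer and learns a
# low-rank additive update, y = Wx + (alpha / r) * B A x. Shapes are assumed.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no-op at step 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(1024, 1024))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")  # 16,384 low-rank vs ~1M frozen
```

Only the two rank-8 factors receive gradients, which is what makes the approach parameter-efficient when adapting large embodied conversational LLMs.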

Skills

Required

  • Master’s degree (or equivalent experience) or PhD in Computer Science, Electrical Engineering, Artificial Intelligence, or Applied Math with 8+ years of experience
  • Excellent programming skills in Python with strong fundamentals in programming, optimizations, and software development
  • Strong knowledge of ML/DL techniques, algorithms, and tools, with exposure to CNNs, RNNs (LSTMs), and Transformers (ViT, BERT, BART, GPT/T5, Megatron, LLMs, MoEs)
  • Experience training real-time audio-language, streaming visual-language, and streaming real-time audio-visual language models, as well as ViT, BERT, GPT, and Nemotron models, for computer vision, NLP, and dialog-system tasks using PyTorch, including data wrangling, tokenization, and multimodal alignment (a minimal alignment sketch follows this list)
  • Practical experience in natural language processing, speech/audio processing, computer vision, machine learning, and human-AI interaction
  • Hands-on experience with conversational AI technologies such as natural language understanding, natural language generation, dialog systems (including system integration, state tracking, and action prediction), information retrieval, question answering, and machine translation
  • Understanding of the model development life cycle, with experience in model development workflows, traceability, and dataset versioning, including working knowledge of database management and queries (e.g., SQL, MongoDB)
  • Strong collaborative and interpersonal skills, specifically a proven ability to effectively guide and influence within a dynamic matrix environment
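
For the data-wrangling and multimodal-alignment requirement above, a hedged sketch of one common pattern: projecting vision-encoder features into an LLM's token-embedding space so they can be interleaved with text. All dimensions and names are assumptions for illustration only:

```python
# Hypothetical modality projector: maps encoder features (e.g., ViT patches or
# audio frames) into an assumed LLM embedding space. Dimensions are made up.
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    def __init__(self, enc_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_patches_or_frames, enc_dim)
        return self.proj(feats)

vision_feats = torch.randn(2, 196, 768)  # e.g., a 14x14 ViT patch grid
text_embeds = torch.randn(2, 32, 4096)   # embedded text prompt (assumed dim)
mm_inputs = torch.cat([ModalityProjector()(vision_feats), text_embeds], dim=1)
print(mm_inputs.shape)  # torch.Size([2, 228, 4096])
```

The concatenated sequence can then be fed to a decoder-only LLM; a streaming variant would project audio/visual features chunk by chunk rather than encoding a whole clip at once.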

Nice to have

  • Native or near-native fluency in one of these non-English languages: Spanish, Mandarin, German, Japanese, Russian, French, UK English, Arabic, Korean, Italian, or Portuguese.
  • Demonstrated background in building LLMs that combine knowledge discovery with reasoning abilities, including disambiguation, clarification, anticipation, and effective error handling for embodied AI systems
  • Proven experience adapting LLMs to domains such as gaming, virtual assistants, and video conferencing
  • Experience integrating embodied AI systems with various sensor inputs (camera, microphone, touch, and so on) and backend action-fulfillment systems
  • Experience with long-term reasoning for embodied AI tasks (navigation, mobile manipulation, instruction following, and collaboration with humans) in gaming/physical environments, given natural-language instructions.

What the JD emphasized

  • streaming multimodal conversational AI
  • multimodal understanding
  • speech synthesis
  • speech-to-speech conversation
  • video generation
  • dialog reasoning and tool systems
  • instruction tuning
  • reinforcement learning from human feedback (RLHF)
  • reinforcement learning with verifiable reward (RLVR)
  • parameter-efficient finetuning methods
  • embodied conversational LLMs
  • domain-specific datasets
  • model accuracy and bias
  • conversational AI Technologies
  • embodied AI systems
  • long-term reasoning for embodied AI tasks

Other signals

  • multimodal conversational AI
  • large language models
  • streaming audio language
  • streaming visual language
  • streaming real-time audio-visual language models