Senior Scientist, Synthetic Data Generation

NVIDIA · Semiconductors · Santa Clara, CA +5 · Remote

Senior Scientist focused on synthetic data generation for training frontier LLMs, contributing to open-source libraries and advancing multimodal data generation.

What you'd actually do

  1. Build synthetic data generation pipelines using LLM-based methods and automated quality evaluation, producing datasets that improve the pre- and post-training of LLMs such as Nemotron in reasoning, coding, structured output, and multimodal understanding.
  2. Advance multimodal synthetic data generation (image, document, video, and audio) in partnership with NVIDIA's model teams.
  3. Design and maintain open-source libraries and SDKs with clean APIs and strong documentation.
  4. Drive software excellence with modern tooling, configuration-driven architecture, and professional Git and CI/CD practices.
  5. Publish original research at top machine learning and AI conferences to maintain NVIDIA's technical leadership.

Skills

Required

  • synthetic data generation
  • generative modeling
  • multimodal machine learning
  • LLMs
  • software libraries
  • publication record

Nice to have

  • open-source contributions
  • multimodal generation
  • vision-language
  • document AI
  • video
  • audio
  • scalable data pipelines
  • agentic
  • tool-use
  • reinforcement-learning post-training

What the JD emphasized

  • PhD in Computer Science, Machine Learning, Statistics, or a related field, or equivalent experience.
  • A research background of 3+ years in synthetic data generation, generative modeling, multimodal machine learning, or related areas.
  • Deep technical understanding of LLMs, how data shapes their pre- and post-training, and inference frameworks such as vLLM or TGI.
  • Proven track record of developing or maintaining software libraries used by a broad developer community.
  • Strong publication record at premier venues such as NeurIPS, ICML, ICLR, ACL, or similar.

Other signals

  • synthetic data generation
  • training frontier models
  • open-source libraries