Senior Scientist, Synthetic Data and Privacy

NVIDIA · Semiconductors · Santa Clara, CA, US; Remote, US (NY, CO, CA, MA)

This role focuses on research and development of synthetic data generation and privacy-preserving AI techniques, contributing to open-source libraries within the NVIDIA NeMo ecosystem. It involves building advanced pipelines, researching privacy methods such as DP-SGD and NER-based PII detection, and designing software libraries. The role requires a PhD or equivalent experience, significant research experience in data privacy and synthetic data, a strong publication record, expertise in PyTorch and HuggingFace, and familiarity with LLM inference frameworks.

What you'd actually do

  1. Build and implement advanced pipelines for generating synthetic datasets using innovative LLM-based methodologies and automated quality evaluation frameworks.
  2. Research and implement privacy-preserving techniques such as differentially private training (DP-SGD), identifying and replacing sensitive information via NER models, and membership inference protection.
  3. Design and maintain open-source software libraries and SDKs with clean APIs and developer-facing documentation, applying robust software design patterns.
  4. Drive software excellence through modern development tooling, configuration-driven architecture, and professional Git and CI/CD workflows.
  5. Publish original research at top machine learning and AI conferences to maintain NVIDIA's technical leadership.

Skills

Required

  • PhD in Computer Science, Machine Learning, Statistics, or a related field, or equivalent experience
  • 5+ years of research experience in synthetic data generation, data privacy, differential privacy, federated learning, or trustworthy machine learning
  • Proven track record of developing or maintaining software libraries used by a broad developer community
  • Deep technical understanding of PyTorch
  • Deep technical understanding of HuggingFace Transformers ecosystem including PEFT and LoRA
  • Technical familiarity with LLM inference frameworks such as vLLM or TGI
  • Strong publication record at premier venues

Nice to have

  • Active contributions to open-source projects
  • Specialized expertise with differential privacy concepts and tools such as Opacus
  • Ability to build and optimize scalable data processing pipelines for large-scale models
  • Proficiency with NER-based PII detection and advanced anonymization techniques
  • Functional knowledge of global privacy regulations such as GDPR or CCPA

What the JD emphasized

  • A research background of 5+ years in synthetic data generation, data privacy, or related areas such as differential privacy, federated learning, or trustworthy machine learning is required
  • Strong publication record at premier venues such as NeurIPS, ICML, ICLR, ACL or similar

Other signals

  • synthetic data generation
  • privacy-preserving AI
  • open-source libraries
  • LLM-based methodologies
  • differentially private training
  • NER models
  • membership inference protection