Senior Scientist, Synthetic Data and Privacy

NVIDIA · Semiconductors · Santa Clara, CA +5 · Remote

Senior Scientist role focused on building LLM-based methods for synthetic data generation and privacy-preserving AI, contributing to open-source libraries within the NVIDIA NeMo ecosystem. The role involves applied research, software engineering, and optimizing LLMs for inference, with a strong emphasis on publishing original research.

What you'd actually do

  1. Build LLM-based methods for synthetic data generation, privacy, and context-aware anonymization, with automated evaluation across multilingual text, documents, and multimodal content.
  2. Optimize task-specific LLMs for low-latency, high-throughput inference (distillation, quantization), and scale our frameworks to run in real time.
  3. Design and maintain open-source libraries and SDKs with clean APIs and strong documentation.
  4. Drive software excellence with modern tooling, configuration-driven architecture, and disciplined Git and CI/CD practices.
  5. Publish original research at top machine learning and AI conferences to maintain NVIDIA's technical leadership.

Skills

Required

  • LLM/NLP research and engineering
  • synthetic data generation
  • anonymization and PII detection
  • software library development
  • publication record

Nice to have

  • open-source contributions
  • LLM inference optimization
  • quantization
  • distillation
  • latency/throughput tuning
  • vLLM
  • TGI
  • scalable data processing pipelines
  • global privacy regulations (GDPR, CCPA)

What the JD emphasized

  • PhD in Computer Science, Machine Learning, Statistics, or a related field, or equivalent experience.
  • A research background of 2+ years in applied LLM/NLP research and engineering, synthetic data generation, anonymization and PII detection, or related areas.
  • Proven track record of developing or maintaining software libraries used by a broad developer community.
  • Strong publication record at premier venues such as NeurIPS, ICML, ICLR, ACL, or similar.

Other signals

  • synthetic data generation
  • privacy-preserving AI
  • LLM-based methods
  • open-source libraries