Senior Research Manager, World Model Evaluation

NVIDIA · Semiconductors · Santa Clara, CA

Lead a research team focused on world-model evaluation and benchmarking for NVIDIA's Physical AI portfolio, defining the scientific roadmap for closed-system and open-system evaluations, developing benchmarks for various physical AI capabilities, and driving evaluation-to-model-improvement loops. The role requires publishing high-quality papers and establishing rigorous standards.

What you'd actually do

  1. Lead a team of Research Scientists focused on world-model evaluation, benchmarking, and diagnostics for NVIDIA Physical AI models, including world foundation models, world-action models (WAMs), synthetic data generation (SDG) systems, robotics, simulation, and embodied AI workflows.
  2. Define the scientific roadmap for closed-system and open-system evaluation, including open-loop and closed-loop benchmarks, metrics, failure taxonomy, model comparison, and evaluation-to-training feedback loops.
  3. Develop benchmarks for physical plausibility, temporal consistency, scene dynamics, object permanence, spatial reasoning, action conditioning, affordances, controllability, long-horizon coherence, SDG quality, and WAM usefulness.
  4. Develop open-system and mechanistic evaluation methods using model internals, including representation probing, causal interventions, activation analysis, ablations, sparse autoencoders, attention and feature analysis, and circuit-style diagnostics.
  5. Drive evaluation-to-model-improvement loops with training, post-training, data curation, simulation, robotics, SDG, WAM, and applied research teams, including failure discovery, data generation, post-training priorities, model roadmap feedback, and re-evaluation.

Skills

Required

  • Strong research background in machine learning, computer vision, multimodal AI, robotics, world models, representation learning, model evaluation, or mechanistic interpretability.
  • Experience leading research teams, research programs, or cross-functional technical initiatives with measurable scientific and product impact.
  • Deep understanding of modern foundation models, including video models, vision-language-action models, diffusion or flow models, self-supervised learning, or world-model architectures.
  • Experience designing serious benchmarks, evaluation datasets, metrics, diagnostic tools, or model analysis frameworks for complex ML systems.
  • Familiarity with world-model evaluation and open-system analysis techniques, such as physical plausibility, temporal consistency, action conditioning, counterfactual reasoning, representation probing, activation patching, causal interventions, sparse autoencoders, or feature attribution.
  • PhD or equivalent experience in Computer Science, Electrical Engineering, Robotics, Machine Learning, AI, or a related field.
  • 12+ years of relevant research or engineering experience overall.
  • 5+ years of management experience.

Nice to have

  • Built influential benchmarks, evaluation suites, model diagnostics, or interpretability tools used by research or production teams.
  • Published in areas such as world models, video generation, physical AI, embodied AI, robotics, representation learning, mechanistic interpretability, self-supervised learning, or model evaluation.
  • Experience evaluating generative video models, action-conditioned world models, robotics foundation models, world-action models, synthetic data generation systems, simulation systems, or vision-language-action models.
  • Strong point of view on what current benchmarks miss, and excitement to build the next generation of evaluation science for Physical AI.

What the JD emphasized

  • world-model evaluation
  • Physical AI
  • closed-system evaluations
  • open-system evaluations
  • mechanistic evaluation
  • evaluation-to-model-improvement loops
  • model evaluation
  • mechanistic interpretability
  • evaluating generative video models
  • evaluating action-conditioned world models
  • evaluating robotics foundation models
  • evaluating world-action models
  • evaluating synthetic data generation systems
  • evaluating simulation systems
  • evaluating vision-language-action models

Other signals

  • leading a research team
  • defining the scientific roadmap
  • building a closed improvement loop
  • publishing high-quality papers
  • establishing rigorous standards