Senior Research Manager, World Model Evaluation

NVIDIA · Semiconductors · Santa Clara, CA

Lead a research team focused on world-model evaluation and benchmarking for NVIDIA's Physical AI portfolio: define the scientific roadmap for closed-system and open-system evaluations, develop benchmarks for physical AI capabilities such as physical plausibility, temporal consistency, and action conditioning, and drive evaluation-to-model-improvement loops. The role also involves publishing high-quality papers and establishing rigorous standards for validity, reproducibility, dataset hygiene, leakage prevention, and model comparison.

What the JD emphasized

world-model evaluation
Physical AI
closed-system evaluations
open-system evaluations
mechanistic evaluation
evaluation-to-model-improvement loops
model evaluation
mechanistic interpretability
evaluating generative video models
evaluating action-conditioned world models
evaluating robotics foundation models
evaluating world-action models
evaluating synthetic data generation systems
evaluating simulation systems
evaluating vision-language-action models

Other signals

leading a research team
defining the scientific roadmap
building a closed improvement loop
publishing high-quality papers
establishing rigorous standards

Full job description

At NVIDIA, we’re not just building the future, we’re generating it! Our world model team is pushing the boundaries of multimodal AI, robotics, and world foundation models for Physical AI. We are looking for a Senior Research Manager to lead world-model evaluation and benchmarking across NVIDIA’s Physical AI model portfolio. This role will build the team and research agenda for evaluating world models through closed-system evaluations, where the model under test is pluggable, and open-system evaluations, where access to model internals enables deeper diagnostics, causal analysis, and mechanistic evaluation.

This is not only about leaderboards. It is about defining what makes a world model useful for Physical AI, discovering model failures, and turning those findings into better data, training recipes, model roadmaps, and downstream systems. The team will build a closed improvement loop across model evaluation, failure discovery, data generation, post-training, and re-evaluation.

What you’ll be doing:

Lead a team of Research Scientists focused on world-model evaluation, benchmarking, and diagnostics for NVIDIA Physical AI models, including world foundation models, world-action models, synthetic data generation systems, robotics, simulation, and embodied AI workflows.
Define the scientific roadmap for closed-system and open-system evaluation, including open-loop and closed-loop benchmarks, metrics, failure taxonomy, model comparison, and evaluation-to-training feedback loops.
Develop benchmarks for physical plausibility, temporal consistency, scene dynamics, object permanence, spatial reasoning, action conditioning, affordances, controllability, long-horizon coherence, SDG quality, and WAM usefulness.
Develop open-system and mechanistic evaluation methods using model internals, including representation probing, causal interventions, activation analysis, ablations, sparse autoencoders, attention and feature analysis, and circuit-style diagnostics.
Drive evaluation-to-model-improvement loops with training, post-training, data curation, simulation, robotics, SDG, WAM, and applied research teams, including failure discovery, data generation, post-training priorities, model roadmap feedback, and re-evaluation.
Publish high-quality papers, technical reports, benchmarks, and open-source evaluation artifacts while establishing rigorous standards for validity, reproducibility, dataset hygiene, leakage prevention, and model comparison.

What we need to see:

Strong research background in machine learning, computer vision, multimodal AI, robotics, world models, representation learning, model evaluation, or mechanistic interpretability.
Experience leading research teams, research programs, or cross-functional technical initiatives with measurable scientific and product impact.
Deep understanding of modern foundation models, including video models, vision-language-action models, diffusion or flow models, self-supervised learning, or world-model architectures.
Experience designing serious benchmarks, evaluation datasets, metrics, diagnostic tools, or model analysis frameworks for complex ML systems.
Familiarity with world-model evaluation and open-system analysis techniques, such as physical plausibility, temporal consistency, action conditioning, counterfactual reasoning, representation probing, activation patching, causal interventions, sparse autoencoders, or feature attribution.
PhD or equivalent experience in Computer Science, Electrical Engineering, Robotics, Machine Learning, AI, or a related field, with 12+ overall years of relevant research or engineering experience as well as 5+ years of management experience.
Ability to work onsite at NVIDIA’s Santa Clara headquarters; this is not a remote position.

Ways to stand out from the crowd:

Built influential benchmarks, evaluation suites, model diagnostics, or interpretability tools used by research or production teams.
Published in areas such as world models, video generation, physical AI, embodied AI, robotics, representation learning, mechanistic interpretability, self-supervised learning, or model evaluation.
Experience evaluating generative video models, action-conditioned world models, robotics foundation models, world-action models, synthetic data generation systems, simulation systems, or vision-language-action models.
Strong point of view on what current benchmarks miss, and excitement to build the next generation of evaluation science for Physical AI.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people on the planet working for us. If you're creative, passionate and self-motivated, we want to hear from you! NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 272,000 USD - 431,250 USD.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 11, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.