Senior ML Evaluation Engineer - Autonom… at NVIDIA

Want to join a fun, creative company that is on the cutting edge of outstanding technologies? NVIDIA is developing groundbreaking solutions in some of the most exciting technology areas globally, including Virtual Reality, Artificial Intelligence, Deep Learning and Autonomous Vehicles.

NVIDIA's AV Eval team is building the next generation of driving behavior evaluation, moving beyond hand-crafted rules to learned evaluation using LLMs, VLMs, and agentic workflows. You'll define how we measure whether an autonomous vehicle drives well, building systems that bridge ML research and production evaluation. You'll ship systems that run at scale on real-world driving data and produce metrics that block or green-light software releases. In this role you will work on next-generation AV evaluation and have a direct impact on vehicle safety and shipping decisions. You will join a new team being built from scratch, with high ownership and high visibility to NVIDIA AV leadership.

What you will be doing:

Design and build learned evaluation pipelines that assess driving behavior using LLMs, VLMs, and multimodal models
Develop agentic workflows that chain model inference, retrieval, and structured reasoning to evaluate complex driving scenarios
Define evaluation-of-evaluation methodology — how do we know our learned evaluators are correct?
Build golden-set frameworks and calibration loops for learned metrics
Partner with AML (Alpamayo Logos) teams on model-specific eval needs (e.g., COT prediction quality, AML regression coverage)
Instrument evaluation systems with robust experiment tracking, A/B comparison tooling, and model versioning
Contribute to the team's transition from rule-based to learned evaluation: identify metrics and analyzers that are candidates for ML replacement and build the alternatives

What we need to see:

PhD with 4+ years, MS with 6+ years, or BS (or equivalent experience) with 8+ years of relevant experience in Computer Science, Computer Engineering, or a related technical field.
Hands-on experience building LLM/VLM-based pipelines — fine-tuning, prompt engineering, retrieval-augmented generation, chain-of-thought
Track record of shipping ML systems to production (not just prototyping or publishing)
Strong software engineering fundamentals — you write clean, tested, reviewable code in Python and C++
Experience with evaluation methodology: precision/recall, inter-rater reliability, calibration, annotation pipelines
Comfort with large-scale data processing (Spark, Dask, or similar)
Strong Python skills. Experience with PyTorch or JAX. Comfortable with GPU-based training workflows.

Ways to stand out from the crowd:

Autonomous driving, robotics, or safety-critical domain experience
Familiarity with driving behavior taxonomies (cut-ins, hard braking events, lane-keeping metrics, scenario-based evaluation)
Experience with video understanding models or multimodal evaluation
Knowledge of agentic AI frameworks (LangChain, DSPy, CrewAI, or custom)
Track record of influencing technical direction across team boundaries
Experience with LLM/VLM fine-tuning or application development

At NVIDIA, we’re dedicated to making self-driving vehicles a reality and believe this technology can save millions of lives. Join a team of innovative thinkers at one of the world’s most respected technology companies. If you’re motivated, curious, and ready to make a difference, we’d love to meet you! We believe that building self-driving vehicles will be a defining contribution of our generation: traffic accidents are responsible for ~1.25 million deaths per year worldwide. We have the funding and scale, but we need your help on our team. NVIDIA is widely considered to be one of the technology world’s most desirable employers, with some of the most forward-thinking people in the world working here. If you’re entrepreneurial and autonomous, we want to hear from you!

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until April 19, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.