Senior ML Evaluation Engineer - Autonomous Vehicles

NVIDIA · Semiconductors · Santa Clara, CA +4 · Remote

NVIDIA is seeking a Senior ML Evaluation Engineer for its Autonomous Vehicles team. The role involves designing and building learned evaluation pipelines that use LLMs, VLMs, and agentic workflows to assess driving behavior. The engineer will define evaluation methodologies, build golden-set frameworks, and contribute to the transition from rule-based to learned evaluation systems. The position requires a strong background in ML system development and software engineering, plus experience with large-scale data processing, with a focus on shipping production ML systems.

What you'd actually do

  1. Design and build learned evaluation pipelines that assess driving behavior using LLMs, VLMs, and multimodal models
  2. Develop agentic workflows that chain model inference, retrieval, and structured reasoning to evaluate complex driving scenarios
  3. Define evaluation-of-evaluation methodology — how do we know our learned evaluators are correct?
  4. Build golden-set frameworks and calibration loops for learned metrics
  5. Instrument evaluation systems with robust experiment tracking, A/B comparison tooling, and model versioning

Skills

Required

  • PhD with 4+ years, MS with 6+ years, or BS (or equivalent experience) with 8+ years of relevant experience in Computer Science, Computer Engineering, or a related technical field.
  • Hands-on experience building LLM/VLM-based pipelines — fine-tuning, prompt engineering, retrieval-augmented generation, chain-of-thought
  • Track record of shipping ML systems to production (not just prototyping or publishing)
  • Strong software engineering fundamentals — you write clean, tested, reviewable code in Python and C++
  • Experience with evaluation methodology: precision/recall, inter-rater reliability, calibration, annotation pipelines
  • Comfort with large-scale data processing (Spark, Dask, or similar)
  • Strong Python skills. Experience with PyTorch or JAX. Comfortable with GPU-based training workflows.

Nice to have

  • Autonomous driving, robotics, or safety-critical domain experience
  • Familiarity with driving behavior taxonomies (cut-ins, hard braking events, lane-keeping metrics, scenario-based evaluation)
  • Experience with video understanding models or multimodal evaluation
  • Knowledge of agentic AI frameworks (LangChain, DSPy, CrewAI, or custom)
  • Track record of influencing technical direction across team boundaries
  • Experience with LLM/VLM fine-tuning or application development

What the JD emphasized

  • Track record of shipping ML systems to production (not just prototyping or publishing)

Other signals

  • building systems that bridge ML research and production evaluation
  • ship systems that run at scale on real-world driving data
  • produce metrics that block or green-light software releases
  • define how we measure whether an autonomous vehicle drives well
  • building the next generation of driving behavior evaluation