Research Engineer, Benchmarking, Robotics, DeepMind

Google · Big Tech · Mountain View, CA +1

Research Engineer focused on benchmarking foundation models for robotics. The role involves designing evaluation protocols, tooling, and frameworks to assess robot policies in both simulated and real-world environments. Key responsibilities include building infrastructure for large-scale evaluation, root-causing policy failures, establishing evaluation criteria for model releases, and making real-world hardware evaluation faster and more reproducible. The goal is to provide data-driven insights into technological readiness for robotics development.

What you'd actually do

  1. Design, implement, and maintain scalable, robust frameworks to enable large-scale evaluation of robot policies across offline open-loop testing and real-world hardware evaluations.
  2. Partner with researchers to design the content of various benchmarks in order to maximize evaluation signal and stress-test model capabilities.
  3. Build diagnostic and visualization tools that allow the team to easily root-cause policy failures and track performance regressions.
  4. Establish evaluation criteria for model releases and own the stability and benchmarking of models slated for critical demos.
  5. Innovate on how to make real-world hardware evaluation faster, more reproducible, and less reliant on manual intervention.

Skills

Required

  • Python
  • machine learning tools and algorithms
  • deploying LLMs/VLMs and deep learning models
  • software engineering
  • AI/ML engineering

Nice to have

  • ROS/ROS2
  • on-device deployment constraints (Jetson, TPU)
  • managing large-scale multimodal datasets
  • time-series telemetry data
  • building automated pipelines for hardware-in-the-loop testing
  • operational realities of modern vision-language-action (VLA) models
  • behavior cloning policies
  • embodied AI

What the JD emphasized

  • benchmarking foundation models for robotics
  • design evaluation protocols, tooling, and frameworks
  • extract meaningful signals from the messiness of physical policy execution
  • build the infrastructure that allows the engineering team to effectively hillclimb
  • gives leadership a clear, data-driven understanding of technological readiness
