What you'd actually do

Design, implement, and maintain scalable, robust frameworks to enable large-scale evaluation of robot policies across offline open-loop testing and real-world hardware evaluations.

Partner with researchers to design the content of various benchmarks in order to maximize evaluation signal and stress-test model capabilities.

Build diagnostic and visualization tools that allow the team to easily root-cause policy failures and track performance regressions.

Establish evaluation criteria for model releases and own the stability and benchmarking of models slated for critical demos.

Innovate on how to make real-world hardware evaluation faster, more reproducible, and less reliant on manual human intervention.

Skills

Required

Python
machine learning tools and algorithms
deploying LLMs/VLMs and deep learning models
software engineering
AI/ML engineering

Nice to have

ROS/ROS2
on-device deployment constraints (Jetson, TPU)
managing large-scale multimodal datasets
time-series telemetry data
building automated pipelines for hardware-in-the-loop testing
operational realities of modern vision-language-action (VLA) models
behavior cloning policies
embodied AI

What the JD emphasized

benchmarking foundation models for robotics

design evaluation protocols, tooling, and frameworks

extract meaningful signals from the messiness of physical policy execution

build the infrastructure that allows the engineering team to effectively hillclimb

gives leadership a clear, data-driven understanding of technological readiness

At Google, research-focused Software Engineers are embedded throughout the company, allowing them to set up large-scale tests and deploy promising ideas quickly and broadly. Ideas may come from internal projects as well as from collaborations with research programs at partner universities and technical institutes all over the world.

From creating experiments and prototyping implementations to designing new architectures, engineers work on real-world problems including artificial intelligence, data mining, natural language processing, hardware and software performance analysis, improving compilers for mobile platforms, as well as core search and much more. But you stay connected to your research roots as an active contributor to the wider research community by partnering with universities and publishing papers.

Our mission is to bring advanced AI into the physical realm by building generalist robots that perceive, reason, and act naturally alongside humans.

As a Research Engineer, you will manage the practical challenges of benchmarking foundation models for robotics. You will have an understanding of how modern robotics foundation models work and where they currently fall short. Your mission is to design evaluation protocols, tooling, and frameworks that extract meaningful signals from the messiness of physical policy execution. You will build the infrastructure that allows the engineering team to effectively hillclimb and gives leadership a clear, data-driven understanding of technological readiness.

Artificial intelligence will be one of humanity’s most transformative inventions. At DeepMind, we are a pioneering AI lab with exceptional interdisciplinary teams focused on advancing AI development to solve complex global challenges and accelerate high-quality product innovation for billions of users. We use our technologies for widespread public benefit and scientific discovery, ensuring safety and ethics are always our highest priority.

We are pushing the boundaries across multiple domains. Our global teams offer learning opportunities and varied career pathways for those driven to achieve exceptional results through collective effort.

Individual pay is determined by factors including job-related skills, experience, and relevant education or training.

US: $147,000 - $211,000 (USD) + 15% bonus target + equity + benefits

Learn more about benefits at Google.

Qualifications

Minimum qualifications:

Bachelor’s degree in Computer Science, Robotics, or equivalent practical experience.
2 years of experience with machine learning tools and algorithms, specifically deploying LLMs/VLMs and deep learning models.
Experience in a technical role (software engineering, AI/ML engineering, or solutions architecture).
Experience with Python, and with modern AI-assisted development tools to accelerate prototyping.

Preferred qualifications:

Experience with ROS/ROS2, or on-device deployment constraints (Jetson, TPU).
Experience managing large-scale multimodal datasets, time-series telemetry data, or building automated pipelines for hardware-in-the-loop testing.
Familiarity with the operational realities of modern vision-language-action (VLA) models or behavior cloning policies and their common pitfalls like task overfitting.
A deep-seated interest in the future of embodied AI and a desire to build the testing bedrock for robotics development.