Senior Staff Research Engineer, DeepMind

Google · Mountain View, CA

Senior Staff Research Engineer at Google DeepMind focused on Agent Evals and Quality for GenAI model improvement and product development. The role involves developing, evaluating, and optimizing LLM-based agents for complex, multi-step tasks. Responsibilities include constructing quantitative benchmarks and automated evaluation frameworks (e.g., LLM-as-a-judge) to measure agent capabilities in reasoning, planning, and tool use, as well as creating and optimizing data mixes from user feedback for training and fine-tuning agents. The role also requires analyzing agent behavior to identify failure modes and performance bottlenecks.

What you'd actually do

  1. Construct quantitative benchmarks and automated evaluation frameworks (including LLM-as-a-judge) to measure agent capabilities in reasoning, planning, and tool use.
  2. Create and optimize data mixes extracted from user feedback for training and fine-tuning agents, enhancing performance on specific tool-use tasks.
  3. Analyze agent behavior to identify failure modes, edge cases, and performance bottlenecks, turning these insights into actionable improvements.
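The evaluation work in item 1 can be made concrete with a minimal harness sketch. This is a hypothetical illustration, not DeepMind's actual framework: `AgentTrace`, `evaluate`, and `toy_judge` are invented names, and in practice the judge callable would wrap an LLM prompted with a grading rubric rather than the heuristic stub shown here.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentTrace:
    """One agent run: the task, the steps taken, and the final answer."""
    task: str
    steps: list[str]   # reasoning steps and tool calls, e.g. "tool:calculator(...)"
    answer: str

# A "judge" maps a trace to a score in [0, 1]. In an LLM-as-a-judge setup
# this callable would prompt a grader model with a rubric; any callable works.
Judge = Callable[[AgentTrace], float]

def evaluate(traces: list[AgentTrace], judge: Judge,
             threshold: float = 0.5) -> tuple[list[float], float]:
    """Score every trace with the judge and report the pass rate."""
    scores = [judge(t) for t in traces]
    pass_rate = sum(s >= threshold for s in scores) / len(scores)
    return scores, pass_rate

# Stand-in judge (hypothetical rubric): reward traces that invoked a tool
# and produced a non-empty answer.
def toy_judge(trace: AgentTrace) -> float:
    used_tool = any(step.startswith("tool:") for step in trace.steps)
    return 1.0 if used_tool and trace.answer else 0.0

traces = [
    AgentTrace("What is 2+2?", ["tool:calculator(2+2)"], "4"),
    AgentTrace("What is 2+2?", ["thought: guess"], ""),
]
scores, rate = evaluate(traces, toy_judge)
print(scores, rate)  # [1.0, 0.0] 0.5
```

Keeping the judge as a plain callable makes the harness easy to unit-test with deterministic stubs before swapping in a (noisier, slower) model-based grader.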

Skills

Required

  • software development
  • design and architecture
  • testing/launching software products

Nice to have

  • Master’s degree or PhD in Engineering, Computer Science, or a related technical field
  • data structures and algorithms
  • technical leadership role leading project teams and setting technical direction
  • working in a complex, matrixed organization involving cross-functional, or cross-business projects

What the JD emphasized

  • Agent Evals and Quality
  • LLM-based agents
  • multi-step tasks and workflows
  • quantitative benchmarks
  • automated evaluation frameworks
  • LLM-as-a-judge
  • agent capabilities
  • reasoning, planning, and tool use
  • training, fine-tuning agents
  • agent behavior
  • failure modes
  • performance bottlenecks
