Senior Staff Research Engineer, DeepMind

Google · Mountain View, CA

Senior Staff Research Engineer at Google DeepMind focused on Agent Evals and Quality for GenAI model improvement and product development. The role involves developing, evaluating, and optimizing LLM-based agents for complex, multi-step tasks. Responsibilities include constructing quantitative benchmarks and automated evaluation frameworks (e.g., LLM-as-a-judge) to measure agent capabilities in reasoning, planning, and tool use, as well as creating and optimizing data mixes from user feedback for training and fine-tuning agents. The role also requires analyzing agent behavior to identify failure modes and performance bottlenecks.

What you'd actually do

  1. Construct quantitative benchmarks and automated evaluation frameworks (including LLM-as-a-judge) to measure agent capabilities in reasoning, planning, and tool use.
  2. Create and optimize data mixes extracted from user feedback for training and fine-tuning agents, enhancing performance on specific tool-use tasks.
  3. Analyze agent behavior to identify failure modes, edge cases, and performance bottlenecks, turning these insights into actionable improvements.
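The evaluation work in item 1 can be made concrete with a minimal harness sketch. This is a hypothetical illustration, not DeepMind's actual framework: `AgentTrace`, `evaluate`, and `toy_judge` are invented names, and in practice the judge callable would wrap an LLM prompted with a grading rubric rather than the heuristic stub shown here.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentTrace:
    """One agent run: the task, the steps taken, and the final answer."""
    task: str
    steps: list[str]   # reasoning steps and tool calls, e.g. "tool:calculator(...)"
    answer: str

# A "judge" maps a trace to a score in [0, 1]. In an LLM-as-a-judge setup
# this callable would prompt a grader model with a rubric; any callable works.
Judge = Callable[[AgentTrace], float]

def evaluate(traces: list[AgentTrace], judge: Judge,
             threshold: float = 0.5) -> tuple[list[float], float]:
    """Score every trace with the judge and report the pass rate."""
    scores = [judge(t) for t in traces]
    pass_rate = sum(s >= threshold for s in scores) / len(scores)
    return scores, pass_rate

# Stand-in judge (hypothetical rubric): reward traces that invoked a tool
# and produced a non-empty answer.
def toy_judge(trace: AgentTrace) -> float:
    used_tool = any(step.startswith("tool:") for step in trace.steps)
    return 1.0 if used_tool and trace.answer else 0.0

traces = [
    AgentTrace("What is 2+2?", ["tool:calculator(2+2)"], "4"),
    AgentTrace("What is 2+2?", ["thought: guess"], ""),
]
scores, rate = evaluate(traces, toy_judge)
print(scores, rate)  # [1.0, 0.0] 0.5
```

Keeping the judge as a plain callable makes the harness easy to unit-test with deterministic stubs before swapping in a (noisier, slower) model-based grader.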

Skills

Required

  • software development
  • design and architecture
  • testing/launching software products

Nice to have

  • Master’s degree or PhD in Engineering, Computer Science, or a related technical field
  • data structures and algorithms
  • technical leadership role leading project teams and setting technical direction
  • working in a complex, matrixed organization involving cross-functional, or cross-business projects

What the JD emphasized

  • Agent Evals and Quality
  • LLM-based agents
  • multi-step tasks and workflows
  • quantitative benchmarks
  • automated evaluation frameworks
  • LLM-as-a-judge
  • agent capabilities
  • reasoning, planning, and tool use
  • training, fine-tuning agents
  • agent behavior
  • failure modes
  • performance bottlenecks
