Research Engineer, Agents

Anthropic · AI Frontier · AI Research & Engineering

A Research Engineer role focused on advancing agentic AI systems: finetuning Claude for agentic tasks, developing tools for agents (memory, communication), prompt engineering, automated evaluation, and optimizing data mixes for model training. The role also covers building and maintaining infrastructure for prompt iteration and testing.

What you'd actually do

  1. Finetune new capabilities into Claude that maximize Claude’s performance or ease of use on agentic tasks
  2. Ideate, develop, and compare the performance of different tools for agents (e.g. memory, context compression, inter-agent communication architectures)
  3. Systematically discover and test prompt engineering best practices for agents
  4. Develop automated techniques for designing and evaluating agentic systems
  5. Assist with automated evaluation of Claude models and prompts across the training and product lifecycle
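As a concrete illustration of the tool development in point 2, here is a minimal sketch of a read-write memory tool an agent could call. Everything here (the `MemoryStore` name, its `write`/`search` interface, the keyword-overlap scoring) is a hypothetical example for illustration, not an Anthropic API:

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """A hypothetical read-write memory tool for an agent.

    The agent writes free-text notes and later retrieves the notes
    most relevant to a query, scored by simple keyword overlap.
    """
    notes: list[str] = field(default_factory=list)

    def write(self, note: str) -> None:
        """Append a note to the agent's memory."""
        self.notes.append(note)

    def search(self, query: str, k: int = 3) -> list[str]:
        """Return up to k stored notes sharing words with the query, best first."""
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(note.lower().split())), note)
            for note in self.notes
        ]
        scored = [(score, note) for score, note in scored if score > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [note for _, note in scored[:k]]


memory = MemoryStore()
memory.write("User prefers concise answers")
memory.write("Project deadline is Friday")
print(memory.search("what is the project deadline"))
```

A production version would swap keyword overlap for embedding similarity and expose `write`/`search` to the model as tool calls, but the read-write contract stays the same.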

Skills

Required

  • Significant ML and software engineering experience
  • High-level familiarity with the architecture and operation of large language models
  • Extensive prior experience exploring and testing language model behavior
  • Experience prompting and/or building products with language models
  • Strong communication skills
  • Interest in working with other researchers on difficult tasks

Nice to have

  • Developing complex agentic systems using LLMs
  • Large-scale RL on language models
  • Multi-agent systems

What the JD emphasized

  • A project built on LLMs that showcases your skill at getting them to do complex tasks
  • Design of complex agents
  • Quantitative experiments with prompting
  • Constructing model benchmarks
  • Synthetic data generation
  • Model finetuning
  • Application of LLMs to a complex task
  • Implementing and testing a novel retrieval, tool use, sub-agent, or memory architecture for Claude
  • Finetuning Claude to maximize its performance with a particular set of agent tools (e.g. a read-write memory or an inter-agent communication system)
  • Building the prompting and model orchestration for a production application backed by a language model
  • Building and testing an automatic prompt optimizer or an automatic LLM-driven evaluation system for judging a prompt's performance on a task
  • Building a scaled model evaluation framework driven by model-based evaluation techniques
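The last two bullets can be sketched as a tiny evaluation harness: run a prompt template over a task set, then have a judge model grade each output and average the scores. This is a hedged sketch under stated assumptions — `call_model`, `call_judge`, the task-dict shape, and the [0, 1] grading scale are all illustrative stand-ins, and the stubs below exist only so the example runs without API access:

```python
from typing import Callable


def evaluate_prompt(
    prompt_template: str,
    tasks: list[dict],
    call_model: Callable[[str], str],
    call_judge: Callable[[str, str, str], float],
) -> float:
    """Score a prompt template by running it on tasks and averaging judge grades.

    call_model: prompt text -> model output (a real system would call an LLM)
    call_judge: (task input, output, reference) -> score in [0, 1]
                (a real system would ask a judge model to grade the output)
    """
    scores = []
    for task in tasks:
        prompt = prompt_template.format(task=task["input"])
        output = call_model(prompt)
        scores.append(call_judge(task["input"], output, task["reference"]))
    return sum(scores) / len(scores)


# Deterministic stubs so the sketch runs offline.
def stub_model(prompt: str) -> str:
    return prompt.split()[-1]  # echoes the last word of the prompt


def stub_judge(task: str, output: str, reference: str) -> float:
    return 1.0 if output == reference else 0.0


tasks = [
    {"input": "paris", "reference": "paris"},
    {"input": "tokyo", "reference": "berlin"},
]
score = evaluate_prompt("Answer in one word: {task}", tasks, stub_model, stub_judge)
print(score)  # 0.5 with these stubs
```

An automatic prompt optimizer is then a search loop over candidate templates, keeping whichever scores highest under `evaluate_prompt`; scaling the framework mostly means batching the model and judge calls.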

Other signals

  • agentic systems
  • planning
  • tool use
  • memory
  • inter-agent coordination
  • finetuning
  • agent infrastructure
  • agent design