Prompt Engineer, Agent Prompts & Evals

Anthropic · AI Frontier · San Francisco, CA · Engineering & Design - Product

This role focuses on prompt engineering and evaluation development for AI-first products and features, bridging model capabilities with user experience. It involves designing, testing, and optimizing prompts, building evaluation suites, supporting model launches, and contributing to prompt development frameworks. The role requires strong software engineering skills, experience with LLMs and prompt engineering, and an understanding of evaluation methodologies.

What you'd actually do

  1. Design, test, and optimize system prompts and feature-specific prompts that shape Claude’s behavior across consumer and API products.
  2. Build and maintain comprehensive evaluation suites that ensure model quality and consistency across product launches and updates.
  3. Partner closely with product teams, research teams, and safeguards to ensure new features meet quality and safety standards.
  4. Play a critical role in model releases, ensuring smooth rollouts and catching regressions before they impact users.
  5. Help build and improve the frameworks and tools that allow teams to develop and test prompts and features with confidence.

Skills

Required

  • Python
  • LLMs
  • prompt engineering
  • evaluation methodologies
  • AI systems
  • written and verbal communication
  • project management
  • version control
  • CI/CD
  • modern software development practices

Nice to have

  • Claude or other frontier AI models
  • machine learning
  • NLP
  • A/B testing
  • experimentation frameworks
  • AI safety and alignment
  • AI/ML workflow tools and infrastructure

What the JD emphasized

  • 5+ years of software engineering experience with Python or similar languages
  • Demonstrated experience with LLMs and prompt engineering
  • Strong understanding of evaluation methodologies and metrics for AI systems
  • Experience with Claude or other frontier AI models in production settings
  • Track record of improving AI system performance through systematic evaluation and iteration

Other signals

  • prompt engineering
  • evaluation development
  • model launch support
  • cross-functional collaboration