Research Engineer, Safeguards Labs

Anthropic · AI Frontier · New York, NY · AI Research & Engineering

Research Engineer focused on AI safety, investigating novel methods for detecting misuse, strengthening model safeguards, and building evaluation methodologies for AI systems, particularly in agentic workflows. The role involves leading research projects, designing offline analyses, developing prototypes, and collaborating with production teams.

What you'd actually do

  1. Lead and contribute to research projects investigating new methods for detecting misuse of Claude, identifying malicious organizations and accounts, strengthening model safeguards, and other safety needs.
  2. Design and run offline analyses over model usage data to surface abuse patterns, build classifiers and detection systems, and evaluate their effectiveness.
  3. Develop and iterate on prototypes that could eventually feed signals into the real-time safeguards path, partnering with engineers on tech transfer.
  4. Contribute to a broader research portfolio investigating methods for detecting abusive behavior in chat-based or agentic workflows, and for training the model to robustly refrain from dangerous responses or behaviors without over-refusing.
  5. Build evaluations and methodologies for measuring whether safeguards actually work, including in agentic settings.

Skills

Required

  • Python
  • working with large datasets
  • societal impacts of AI
  • reducing real-world harm

Nice to have

  • building and training machine learning models
  • classifiers for abuse, fraud, integrity, or security applications
  • evaluation methodologies for language models
  • designing evals
  • agentic environments and evaluating model behavior in them
  • trust and safety
  • integrity
  • fraud detection
  • threat intelligence
  • adversarial ML
  • red teaming
  • jailbreak research
  • interpretability methods like steering vectors
  • taking research prototypes and transferring them into production systems

What the JD emphasized

  • track record of independently driving research projects from ambiguous problem statements to concrete results
  • evaluating the effectiveness of classifiers and detection systems
  • evaluations and methodologies for measuring whether safeguards actually work
  • evaluating model behavior in agentic environments

Other signals

  • researching novel safety methods
  • detecting misuse of Claude
  • strengthening model safeguards
  • building evaluations and methodologies for measuring whether safeguards actually work