Research Scientist, Mechanistic Interpretability, Special Projects

Google · Mountain View, CA (+1 location)

Research Scientist focused on mechanistic interpretability of large language models, with the goal of making them safe, aligned, and reliable. The role involves exploring emerging interpretability methods, developing open-source infrastructure, performing causal validation of discovered mechanisms, publishing findings, and writing code to run experiments on distributed compute clusters.

What you'd actually do

  1. Lead and co-lead research projects exploring emerging mechanistic interpretability methods, including dictionary-learning architectures (e.g., multi-token transcoders, Matryoshka sparse autoencoders), Patchscopes, and agentic interpretability (see the sparse-autoencoder sketch after this list).
  2. Design, develop, and maintain open-source infrastructure and evaluation suites (similar to SAEBench or the dictionary_learning library) to accelerate community and internal research.
  3. Perform causal validation of discovered features and circuits using activation patching and feature steering to mitigate undesired behaviors such as hallucinations or hidden objectives (see the patching sketch after this list).
  4. Write and present papers at machine learning conferences (e.g., NeurIPS, ICML) and author technical blog posts to communicate concepts to the broader AI safety community.
  5. Act as both a scientist and an engineer, writing code to run experiments on distributed compute clusters.
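
For item 1, the simplest dictionary-learning architecture is a one-layer sparse autoencoder trained to reconstruct cached model activations through an overcomplete ReLU bottleneck. A minimal sketch, assuming PyTorch; the dimensions, the L1 coefficient, and the single-layer design are illustrative placeholders, not a description of the team's actual tooling:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """One-layer SAE: encode activations into an overcomplete sparse code."""
    def __init__(self, d_model=768, d_dict=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(feats), feats

sae = SparseAutoencoder()
acts = torch.randn(64, 768)  # stand-in for cached residual-stream activations
recon, feats = sae(acts)
# Reconstruction error plus an L1 penalty that pushes features toward sparsity.
loss = ((recon - acts) ** 2).mean() + 3e-4 * feats.abs().sum(-1).mean()
loss.backward()
```

For item 3, activation patching splices an activation from a "clean" forward pass into a "corrupted" one and measures how the output shifts. The sketch below assumes a Hugging Face GPT-2 and the name-swap prompt pair common in the interpretability literature; the model, layer index, and causal-effect metric are placeholder choices:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 6  # which block to patch; in practice you sweep over layers

clean = tokenizer("When John and Mary went to the store, John gave the bag to",
                  return_tensors="pt")
corrupt = tokenizer("When John and Mary went to the store, Mary gave the bag to",
                    return_tensors="pt")
assert clean["input_ids"].shape == corrupt["input_ids"].shape

# The single position where the two prompts diverge (the second name token).
pos = (clean["input_ids"] != corrupt["input_ids"]).nonzero()[0, 1].item()

# 1. Run the clean prompt and cache the residual stream at the chosen block.
cache = {}
def save_hook(module, args, output):
    cache["resid"] = output[0].detach()  # GPT2Block returns (hidden_states, ...)

handle = model.transformer.h[LAYER].register_forward_hook(save_hook)
with torch.no_grad():
    model(**clean)
handle.remove()

# 2. Rerun the corrupted prompt, splicing the clean activation in at `pos`.
def patch_hook(module, args, output):
    hidden = output[0].clone()
    hidden[:, pos, :] = cache["resid"][:, pos, :]
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    logits = model(**corrupt).logits
handle.remove()

# 3. Did the patch move the corrupted run back toward the clean answer?
mary = tokenizer.encode(" Mary")[0]
john = tokenizer.encode(" John")[0]
print("logit(Mary) - logit(John):", (logits[0, -1, mary] - logits[0, -1, john]).item())
```

Sweeping `LAYER` and `pos` while recording the logit difference localizes which activations carry the behavior, which is the causal-validation step the item describes.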

Skills

Required

  • PhD in Computer Science, a related field, or equivalent practical experience.
  • Experience building machine learning solutions using a range of architectures (e.g., deep networks, LSTMs, convolutional networks) and open-source frameworks (e.g., TensorFlow, PyTorch).
  • Experience in Python programming.
  • One or more scientific publications submitted to conferences, journals, or public repositories (e.g., CVPR, ICCV, NeurIPS, ICML, ICLR).

Nice to have

  • 2 years of coding experience.
  • 1 year of experience managing and initiating research agendas.
  • Experience designing multi-modal, self-supervised pre-training tasks (e.g., contrastive learning, masked autoencoders) to improve data efficiency and handle sparse signals (a minimal contrastive-loss sketch follows this list).
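
A common concrete instance of the contrastive objective mentioned above is the InfoNCE / NT-Xent loss, in which each example's two augmented views are positives and the rest of the batch serves as negatives. A minimal sketch, assuming paired embeddings already produced by some encoder; the batch size, dimension, and temperature are placeholders:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE / NT-Xent loss over a batch of paired views.

    z1, z2: [batch, dim] embeddings of two augmentations of the same inputs.
    Row i of z1 and row i of z2 form a positive pair; every other row in
    the batch acts as a negative.
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature  # [batch, batch] cosine similarities
    targets = torch.arange(z1.size(0))  # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage with random embeddings standing in for an encoder's outputs.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(f"InfoNCE loss: {info_nce(z1, z2).item():.3f}")
```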

What the JD emphasized

  • mechanistic interpretability
  • safety
  • alignment
  • reliability
  • scientific publication submissions for conferences, journals, or public repositories

Other signals

  • reverse-engineer
  • compositional and structural mechanisms