2024 ScalingMonosemanticityExtractingInterpretableFeaturesFromClaude3Sonnet
- (Templeton et al., 2024) ⇒ Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L. Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. (2024). “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.” In: Transformer Circuits Thread.
Subject Headings: LLM Interpretability, Sparse Autoencoder.
Notes
The following points summarize the key contributions of the paper:
- The paper used sparse autoencoders to extract interpretable features from the residual stream activations of Anthropic's Claude 3 Sonnet language model (a minimal autoencoder sketch appears after this list).
- The paper scaled up the sparse autoencoders to extract millions of features from this large model, aided by scaling laws (see the curve-fitting sketch after this list).
- The paper found that the extracted features were highly interpretable, multilingual, multimodal, and able to generalize between concrete and abstract concepts.
- The paper explored local neighborhoods of features and revealed semantically related clusters of features.
- The paper identified many categories of interpretable features, including features for famous people, countries/cities, code syntax, list positions, and more.
- The paper used feature attribution and ablation to show how the features serve as interpretable computational intermediates in the model's processing.
- The paper discovered many features relevant to AI safety, including features related to deception, sycophancy, dangerous/criminal content, and more, and showed that activating these features could steer the model's behavior (a feature-clamping sketch appears after this list).
- The paper demonstrates the largest-scale application of interpretability techniques to date, indicating that interpretability can meaningfully scale to industry models.
- The paper rigorously establishes the interpretability and causal influence of the features through specificity analysis, multilingual/multimodal generalization, attribution, and feature steering experiments.
- The paper highlights limitations such as inability to find all features, interference between features, computational cost, and reliance on linear models of superposition, suggesting that fundamental advances may be needed.
- The paper marks substantial progress in mechanistic interpretability and its potential for aiding AI safety and transparency, while also laying out the significant open challenges remaining in the field.
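As referenced in the first note, here is a minimal sketch of the dictionary-learning setup: a one-hidden-layer autoencoder trained to reconstruct residual-stream activations under an L1 sparsity penalty on its feature activations. The class name, dimensions, and hyperparameters below (SparseAutoencoder, d_model=512, n_features=8192, l1_coeff) are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sparse-autoencoder sketch (PyTorch). Illustrative only: names and
# hyperparameters are assumptions, not Anthropic's actual code or settings.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """One-hidden-layer autoencoder with sparse, non-negative feature activations."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # activations -> feature coefficients
        self.decoder = nn.Linear(n_features, d_model)   # feature coefficients -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the residual-stream activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse feature use."""
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity

# Toy training step on random stand-in "activations" (real inputs would be
# residual-stream activations collected from the model being interpreted).
sae = SparseAutoencoder(d_model=512, n_features=8192)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 512)
opt.zero_grad()
x_hat, f = sae(acts)
loss = sae_loss(acts, x_hat, f)
loss.backward()
opt.step()
```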
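For the scaling-laws note, a hedged sketch of the generic procedure: fit a power law L(C) ≈ a·C^b to losses from small pilot runs (linear in log-log space) and extrapolate to a larger compute budget. The compute and loss values below are made up for illustration and are not measurements from the paper.

```python
# Hedged sketch of scaling-law-guided planning: fit L(C) ~ a * C**b to pilot runs
# and extrapolate. All numbers below are fabricated placeholders for illustration.
import numpy as np

compute = np.array([1e15, 1e16, 1e17, 1e18])   # hypothetical training FLOPs of pilot SAEs
loss = np.array([0.90, 0.62, 0.43, 0.30])      # hypothetical final losses

# Power law is linear in log-log space: log L = log a + b * log C (b < 0 here).
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a = np.exp(intercept)

target_compute = 1e20
predicted_loss = a * target_compute ** slope
print(f"fit: L(C) ~ {a:.3g} * C**({slope:.3f}); "
      f"predicted loss at C={target_compute:.0e}: {predicted_loss:.3f}")
```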
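For the attribution/ablation and steering notes, a hedged sketch of feature clamping: encode a residual-stream activation with the sparse autoencoder, pin one feature to a chosen value (zero to ablate, a large positive value to steer), and apply only the resulting change back to the residual stream via a forward hook. It assumes the SparseAutoencoder sketch above; module names and indices are placeholders, and this is not Anthropic's implementation.

```python
# Hedged sketch of feature clamping for ablation/steering via a PyTorch forward hook.
# `sae` is the SparseAutoencoder sketch above; model/layer names are placeholders.
import torch

def make_clamp_hook(sae, feature_idx: int, clamp_value: float):
    """Return a forward hook that edits one SAE feature in the residual stream."""
    def hook(module, inputs, output):
        x = output                                   # (batch, seq, d_model) residual stream
        f = torch.relu(sae.encoder(x))               # feature activations at each position
        x_hat = sae.decoder(f)                       # unedited SAE reconstruction
        f_edit = f.clone()
        f_edit[..., feature_idx] = clamp_value       # 0.0 ablates; a large value steers
        x_hat_edit = sae.decoder(f_edit)
        return x + (x_hat_edit - x_hat)              # apply only the feature's change
    return hook

# Usage sketch (placeholders: `model`, MIDDLE_LAYER, the feature index, and the scale):
# handle = model.layers[MIDDLE_LAYER].register_forward_hook(
#     make_clamp_hook(sae, feature_idx=123, clamp_value=10.0))
# ... run generation with the hook active, then handle.remove()
```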
Cited By
Quotes
Key Results
- Sparse autoencoders produce interpretable features for large models.
- Scaling laws can be used to guide the training of sparse autoencoders.
- The resulting features are highly abstract: multilingual, multimodal, and generalizing between concrete and abstract references.
- There appears to be a systematic relationship between the frequency of concepts and the dictionary size needed to resolve features for them.
- Features can be used to steer large models (see e.g. Influence on Behavior).
- We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content.
References