2024 ScalingMonosemanticityExtractin


Subject Headings: LLM Interpretability, Sparse Autoencoder.

Notes


Cited By

Quotes

Key Results

  • Sparse autoencoders produce interpretable features for large models (see the sketch after this list).
  • Scaling laws can be used to guide the training of sparse autoencoders.
  • The resulting features are highly abstract: multilingual, multimodal, and generalizing between concrete and abstract references.
  • There appears to be a systematic relationship between the frequency of concepts and the dictionary size needed to resolve features for them.
  • Features can be used to steer large models (see e.g. Influence on Behavior).
  • We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content.
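The dictionary-learning setup behind these results can be pictured with a short sketch. The PyTorch code below is a minimal, illustrative sparse autoencoder, not Anthropic's implementation: the layer sizes, the l1_coefficient value, and the single optimization step are assumptions, and the paper's refinements (scaling-law-guided hyperparameters, dictionaries with millions of features, training on residual-stream activations from Claude 3 Sonnet) are omitted here.

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        """Minimal sparse autoencoder over model activations (illustrative only)."""

        def __init__(self, d_model: int, n_features: int):
            super().__init__()
            # Encoder maps an activation vector into a much larger feature space.
            self.encoder = nn.Linear(d_model, n_features)
            # Decoder reconstructs the activation as a sparse sum of feature directions.
            self.decoder = nn.Linear(n_features, d_model)

        def forward(self, x: torch.Tensor):
            # ReLU keeps feature activations non-negative and mostly zero.
            features = torch.relu(self.encoder(x))
            reconstruction = self.decoder(features)
            return features, reconstruction

    def sae_loss(x, reconstruction, features, l1_coefficient=1e-3):
        # Reconstruction error plus an L1 sparsity penalty on feature activations.
        mse = (reconstruction - x).pow(2).sum(dim=-1).mean()
        sparsity = features.abs().sum(dim=-1).mean()
        return mse + l1_coefficient * sparsity

    if __name__ == "__main__":
        # Toy step on random vectors; real training would use activations
        # collected from the language model being interpreted.
        sae = SparseAutoencoder(d_model=512, n_features=8192)
        optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
        x = torch.randn(64, 512)
        features, reconstruction = sae(x)
        loss = sae_loss(x, reconstruction, features)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

The ReLU encoder combined with the L1 penalty pushes most feature activations to zero on any given input, which is what makes individual features plausible candidates for human interpretation and for the behavior-steering experiments noted above.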

References

  • Craig Citro, Chris Olah, Shan Carter, Tom Henighan, Andy Jones, Tom Conerly, Tristan Hume, Esin Durmus, Alex Tamkin, Adly Templeton, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Emmanuel Ameisen, Hoagy Cunningham, Nicholas L. Turner, Callum McDougall, Monte MacDiarmid, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, and Adam Jermyn (2024). "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." Transformer Circuits Thread, Anthropic.