Favourite Papers of 2025
There were too many good papers this year to include in a single list, so I have no intention of this being comprehensive. This is what I could think of on the spot, and I might add more over the next few days.
(Sorted by publication date.)
- Language Models Use Trigonometry to Do Addition (Kantamneni & Tegmark)
- Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models (Fel et al.)
- Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs (Betley et al.)
- Auditing language models for hidden objectives (Anthropic)
- When Models Manipulate Manifolds: The Geometry of a Counting Task (Anthropic): This paper felt nostalgic, reminiscent of the early mechanistic interpretability work on simple tasks.
- Circuit Tracing: Revealing Computational Graphs in Language Models & On the Biology of a Large Language Model (Anthropic)
- The Geometry of Self-Verification in a Task-Specific Reasoning Model (Lee et al.)
- Embryology of a Language Model (Wang et al.)
- Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (Wichers et al.): Being able to improve alignment in a targeted way at train time seems really important, and also quite challenging to do without safety downsides. This is a neat approach.
- Adversarial Attacks Leverage Interference Between Features in Superposition (Stevinson et al.): I’ve been interested in this connection since Elhage et al. (2022). I did some work on it this year and I’m excited that others are exploring it too!