This is a very short research note inspired by a conversation with someone at the ICML 2024 mechanistic interpretability workshop. As such, it’s missing some context that I’d typically supply. I’m exploring feature manifolds more and hope to publish something more substantial in the coming months.

I also just generally have a lot of uncertainty around how to think about this stuff! Please feel free to contact me with corrections.

---

In my recent paper, I trained sparse autoencoders (SAEs) on the first five layers of InceptionV1. One of the interesting findings was a large number of curve detector features, many more than previous work found studying individual neurons. By design, sparse autoencoders force features to be discrete directions, but it’s quite possible that this isn’t how they’re actually represented.

Feature manifolds are the idea that some features, like the curve detectors, aren’t really a set of discrete directions but a continuous manifold of features. Potentially, it’s better to understand the curve features as one manifold.

The simplest way of investigating this is just a 2D UMAP of the decoder vectors for the curve features.
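Concretely, a minimal sketch of this step might look like the following. It uses umap-learn with cosine distance; the random vectors and variable names are stand-ins (not from the paper’s actual code) so the snippet runs on its own.

```python
# Minimal sketch: 2D UMAP of curve feature decoder directions.
# `decoder_vectors` is a stand-in for the SAE decoder rows belonging to the
# curve features; random data is used here only so the snippet is runnable.
import numpy as np
import umap  # pip install umap-learn

rng = np.random.default_rng(0)
decoder_vectors = rng.normal(size=(120, 512))  # (n_curve_features, d_model) placeholder
decoder_vectors /= np.linalg.norm(decoder_vectors, axis=1, keepdims=True)

embedding_2d = umap.UMAP(n_components=2, metric="cosine", random_state=0).fit_transform(decoder_vectors)
print(embedding_2d.shape)  # (120, 2)
```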

[Figure: 2D UMAP of the curve feature decoder vectors]

This does, in fact, show what we would expect if the curve features were represented as a manifold: a circle.

A 4D UMAP preserves much more of the local structure, so I did that next and then used PCA to project the result down to 3D and 2D.
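Again as a rough sketch (same stand-in data as above, not the exact pipeline from the paper):

```python
# Sketch: 4D UMAP to preserve more local structure, then PCA to view it in 3D and 2D.
import numpy as np
import umap
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
decoder_vectors = rng.normal(size=(120, 512))  # stand-in for the curve feature decoder vectors
decoder_vectors /= np.linalg.norm(decoder_vectors, axis=1, keepdims=True)

embedding_4d = umap.UMAP(n_components=4, metric="cosine", random_state=0).fit_transform(decoder_vectors)
projection_3d = PCA(n_components=3).fit_transform(embedding_4d)
projection_2d = PCA(n_components=2).fit_transform(embedding_4d)
```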

[Figure: 4D UMAP of the curve feature decoder vectors, projected to 3D and 2D with PCA]

The 2D projection reveals what we might expect: a circle formed by curve features of different orientations.

The 3D PCA reveals something else interesting: although still circular, the curve appears to ripple.

However, the July 2024 update from the Anthropic interpretability team describes a possible motivation for rippling the manifold like this. It goes roughly: the model wants to linearly read off features with little noise, but nearby points on a smooth manifold are almost parallel, so representing them as nearly orthogonal features isn’t useful. By rippling the manifold instead, nearby points sit at more distinct angles.

At first, I wasn’t sure what to make of this result, so seeing that update was exciting! The theory makes a lot of sense, and it appears that not only would rippling the manifold be something the model might “want” to do, but it’s something that, at least when it comes to curve detectors, we’re able to observe neatly.
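To make that intuition concrete, here’s a toy calculation (it has nothing to do with the model’s actual learned geometry, and the amplitude and frequency values are arbitrary): two nearby directions on a smooth circle are nearly parallel, while adding a small high-frequency ripple in an extra dimension pushes the same two directions to a noticeably larger angle.

```python
# Toy illustration of the rippling argument, not the real curve manifold.
import numpy as np

def smooth(theta):
    # Point on a plain unit circle, embedded in 3D.
    return np.array([np.cos(theta), np.sin(theta), 0.0])

def rippled(theta, amplitude=0.3, frequency=20):
    # Same circle with a small high-frequency ripple in the third dimension.
    v = np.array([np.cos(theta), np.sin(theta), amplitude * np.sin(frequency * theta)])
    return v / np.linalg.norm(v)

def angle_deg(u, v):
    return np.degrees(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

theta, delta = 0.0, 0.05  # two nearby points on the manifold
print(angle_deg(smooth(theta), smooth(theta + delta)))    # ~2.9 degrees: nearly parallel
print(angle_deg(rippled(theta), rippled(theta + delta)))  # ~14 degrees: noticeably more distinct
```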

When considering the curve feature manifold, I’m left with a few questions:

What actually is a curve feature? What is the “true” set of curve features? Do those two questions even make sense? Does this mean that sparse autoencoders, which by design force discrete features, aren’t the right way to capture the most fundamental truth about how models represent curves?

It’s possible that the idea of a curve feature doesn’t actually capture the most fundamental truth about the model. But that doesn’t mean they aren’t an accurate and useful way of understanding InceptionV1.

It also seems quite likely that there isn’t a “true” set of features. By this I mean a set that 1) represents the manifold completely and 2) is the only set of features that does this. But (2) mightn’t be what matters; instead, recovering the manifold is sufficient, even if this does make the definition of features as the base computational unit of the model a little fuzzier. If our set contains every point on the manifold, then making a distinction between this and the manifold itself is pretty arbitrary.[1]

A good SAE[2] will learn a set of features that does this, but this won’t be a unique property of that specific set. It seems possible that SAEs are capturing the entire truth of the curve detectors and are just doing so in a way that makes it simpler to reason about and apply.

  1. I’m not claiming that the ideal set of curve features contains discrete directions for literally every point. SAEs of different sizes will recover different numbers of curve features; rather, the point is that we can combine these to represent additional curve orientations.

  2. As an aside, I’m not totally sure how we could actually recover both the manifolds that represent things like curves and also more discrete things (e.g. a specific dog breed in InceptionV1). Although this is an interesting research question, it seems hard enough that, as long as SAEs represent the entirety of the manifold, not being able to recover both shouldn’t invalidate them as a tool to push forward our understanding of NNs (especially with respect to safety).