“Non-linear representations” have become a catch-all objection to mechanistic interpretability work. The concern is worth taking seriously, but as typically stated, it collapses together cases with completely different implications and likelihoods.

Debates about non-linear features have often been unconstructive for several reasons. One common issue is that people have used the term “linear feature” to mean multiple different things, leading to miscommunication. Given this, the first contribution of this post is to clarify the terminology. I follow the standard mathematical definition of linearity (see, e.g., this clarified definition), not the looser meaning the term sometimes takes on in debate.1

I then turn to the intuitions for why features should be linear. Many practitioners have strong intuitions that features should be linear: that linear features are the only reasonable way to build circuits in a neural network. But when one tries to formalise these arguments, things get messier. The arguments turn out to be more like “once you’re doing something in the general vicinity of a circuit with linear features, there are reasons to expect actual linear features.” The space of all conceivable neural networks that don’t decompose into linear features is vast, and it’s hard for arguments – at least those I’m aware of – to confidently exclude every exotic possibility.

So the second contribution of this post is to identify certain constrained cases of non-linear neural networks, and then provide semi-formal arguments that in those cases, there are strong reasons to expect linear features.2

The first case I consider is a model that contains a mixture of linear and non-linear features that are “integrated” in various senses (e.g., sequentially or in parallel). I present arguments for why non-linear features are problematic in these setups. Since we have strong empirical evidence that neural networks contain many linear features, this also seems like a significant case against non-linear features.

I then consider a second case where models are composed of “write-linear” (but not necessarily “read-linear”) features. This is a broader category than traditional linear features. I show that for this case, there are only three possibilities: orthogonal features, “angular superposition”, and “magnitude superposition”, where angular and magnitude superposition aren’t mutually exclusive. These are each known to the interpretability community (angular superposition is just regular superposition; magnitude superposition overlaps with Csordás et al.’s “onion features”). What I believe is novel is characterising these three as exhaustive for write-linear features.

Only magnitude superposition creates true non-read-linear features3, so I then examine when we should expect it. There are good reasons to think it should be very rare.

More broadly, many of the arguments I make seem extendable in various ways. None obviously generalises to rule out every exotic possibility, but collectively they seem quite powerful. The two restricted cases I consider are ways to formalise them, but shouldn’t be taken as the full limits of the existing arguments. Instead, there’s a wide range of theoretical intuitions that argue against non-linear features.

To be totally honest, despite significant effort, I have not been able to imagine any kind of non-linear feature setup that is remotely plausible to me and against which many of these arguments don’t apply.

What is a Linear Feature?

Let \(\mathbf{x} \in \mathbb{R}^d\) denote the residual stream vector at a given layer, where \(d\) is the model’s hidden dimension.

We say the residual stream admits a linear feature decomposition if there exists a set of fixed feature directions \(\mathbf{v}_i \in \mathbb{R}^d\) and corresponding scalar activations \(a_i \in \mathbb{R}\) such that:

\[\mathbf{x} = \sum_{i=1}^{N} a_i \mathbf{v}_i\]

When features are approximately orthogonal, the activation can be recovered as \(a_i \approx \mathbf{v}_i^\top \mathbf{x}\) (assuming unit-norm directions). When \(N > d\), perfect orthogonality is impossible, and this read becomes noisy.

This decomposition encodes two properties:

  1. Composition as addition. The presence of multiple features is represented by vector addition.
  2. Intensity as scaling. The strength of feature \(i\) is encoded in the magnitude \(\lvert a_i \rvert\).

A feature that violates either property is non-linear.
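
As a minimal sketch of the definition above (using NumPy, with toy dimensions and orthonormal directions chosen purely for illustration), the decomposition and its dot-product read look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 16, 4   # toy residual stream dimension and number of features

# Orthonormal feature directions v_i (rows of V).
V = np.linalg.qr(rng.normal(size=(d, N)))[0].T   # shape (N, d)

a = np.array([0.0, 1.5, -0.7, 2.0])   # scalar activations a_i

x = a @ V        # writing linearly: x = sum_i a_i v_i
a_hat = V @ x    # reading linearly: each a_i recovered by a dot product with v_i

print(np.allclose(a_hat, a))   # True, because the directions are orthonormal
```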

Two Aspects of “Linear”

It’s helpful to separate linearity into two operations:

  1. Writing linearly. Feature \(i\) enters the residual stream by adding \(a_i \mathbf{v}_i\). Its presence is encoded through vector addition.
  2. Reading linearly. Feature \(i\) is approximately recovered via \(\hat{a}_i = \mathbf{w}_i^\top \mathbf{x}\).4 A linear projection extracts the activation.

These two properties are logically independent: you could have linear writes with non-linear reads, or vice versa.

Linear and non-linear features cannot gracefully co-exist

Converting Between Linear and Non-Linear Features Wastes Capacity

Because neural networks link everything together with linear transforms, it’s very easy for linear features to sequentially interact. A linear feature (and in fact, any linear combination of linear features) can be read out of the residual stream by any component. No distinct “reading” operation is needed; the read is fused into the linear maps the component already applies, requiring no additional work.

The situation is different when linear and non-linear features interact. If a feature isn’t read-linear, some additional non-linear read step must be done before it can be used in a downstream linear feature! Conversely, writing from a linear feature to a non-linear feature also requires some kind of non-linear circuit.

Both of these operations require MLP capacity.5 And any MLP capacity dedicated to these reads and writes is MLP capacity not being used to do more useful work.

What about non-linear features interacting with downstream non-linear features? It’s less clear how the argument goes through here, because it’s hard to imagine all non-linear constructions. But it’s hard to imagine this not also requiring MLP capacity.
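
To make the capacity tax concrete, here is a minimal sketch (NumPy; the magnitude-coded feature, the threshold, and the sharpness constant are all hypothetical choices of mine). Converting the non-linearly coded feature into a fresh linear direction consumes two ReLU units whose only job is the conversion itself:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

d = 8
rng = np.random.default_rng(1)
v = rng.normal(size=d); v /= np.linalg.norm(v)              # direction carrying the non-linear code
v_out = rng.normal(size=d); v_out /= np.linalg.norm(v_out)  # fresh linear direction to write to

theta, k = 2.0, 50.0  # illustrative threshold and sharpness

def convert(x):
    u = v @ x                                    # the only "free" operation: a linear projection
    # Soft indicator of u > theta, built from two ReLU units -- MLP capacity spent
    # purely on re-expressing the feature, not on new computation.
    gate = relu(k * (u - theta)) - relu(k * (u - theta) - 1.0)
    return gate * v_out                          # write the result along a clean linear direction

x_present = 2.5 * v   # magnitude-coded feature "on"  (u = 2.5 > theta)
x_absent  = 1.0 * v   # magnitude-coded feature "off" (u = 1.0 < theta)
print(round(v_out @ convert(x_present), 3))  # ~1.0
print(round(v_out @ convert(x_absent), 3))   # ~0.0
```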

Learning Dynamics Favour Linear Reads

Linear reads have constant Jacobians: for \(a_i = \mathbf{w}^\top\mathbf{x}\), the gradient is always \(\mathbf{w}\). For a non-linear read \(a_i(\mathbf{x}) = g(\mathbf{x})\), the gradient \(\nabla_\mathbf{x} g\) depends on \(\mathbf{x}\) itself.

This matters because of how gradients propagate to upstream components. With a linear read, every component contributing to a feature gets pushed in the same direction, so contributors can independently converge toward a shared representation. With a non-linear read, the gradient direction depends on what else is active in the residual stream, so the same conceptual objective can send different (even opposite) learning signals depending on context.6

The implication is that the entire upstream circuit for some feature must coordinate under a shifting gradient landscape. The more complex the circuit, the worse this gets.
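
A toy illustration of the point (and of footnote 6), in NumPy, with the non-linear read \(g(\mathbf{x}) = x_1 x_2\) written 0-indexed:

```python
import numpy as np

d = 4
rng = np.random.default_rng(2)
w = rng.normal(size=d)

# Linear read: the gradient with respect to x is always the same vector w.
grad_linear = lambda x: w

# Non-linear read from footnote 6 (indices shifted to 0-based): g(x) = x[0] * x[1].
def grad_nonlinear(x):
    g = np.zeros_like(x)
    g[0], g[1] = x[1], x[0]
    return g

ctx_a = np.array([0.5,  1.0, 0.0, 0.0])   # context with x[1] > 0
ctx_b = np.array([0.5, -1.0, 0.0, 0.0])   # context with x[1] < 0

print(grad_linear(ctx_a) - grad_linear(ctx_b))              # zero vector: identical signal in both contexts
print(grad_nonlinear(ctx_a)[0], grad_nonlinear(ctx_b)[0])   # +1.0 vs -1.0: sign flips with context
```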

Parallel Existence of Linear and Non-Linear Features Can Cause Problems

If we expect neural networks to contain both linear and non-linear features, then, unless they lie in orthogonal subspaces, there’s a significant risk of interference.

With linear features, even with angular superposition, interference interacts gracefully with addition. When reading feature \(i\) via \(\mathbf{w}_i^\top\mathbf{x}\), the interference from other features decomposes as a sum, each feature contributing independently according to its activation and its geometric relationship to \(\mathbf{w}_i\). Total interference can be minimised through geometric choices.

Non-linear reads destroy this structure. For a non-linear read \(g(\mathbf{x})\), the feature’s contribution and the interference interact inside \(g\) and cannot be separated additively.

For a linear read, the Jacobian is constant, so we can choose a direction approximately orthogonal to likely interference and use it everywhere. For a non-linear read, the Jacobian depends on \(\mathbf{x}\). A direction orthogonal to interference when the feature is weak may rotate into the interference subspace when the feature is strong.

The issue is that if the Jacobian ever fails to be orthogonal to linear features, the non-linear feature can suddenly become very sensitive to unrelated features. In principle, there are ways to avoid this, such as restricting linear and non-linear features to different subspaces. But this imposes stringent constraints.

Additionally, with linear reads, noise from different features sums independently. Non-linearity introduces cross-terms: the presence of one feature changes how the read responds to another. For certain classes of non-linear features, these cross-terms could amplify interference dramatically.
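
A small numerical illustration, with an arbitrary quadratic read standing in for “non-linear read” (the specific choice of non-linearity is mine, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 32
w = rng.normal(size=d); w /= np.linalg.norm(w)

signal = rng.normal(size=d)
interference = rng.normal(size=d)
x = signal + interference

# Linear read: the interference adds its own term, independent of the signal.
lin = lambda x: w @ x
print(np.isclose(lin(x), lin(signal) + lin(interference)))        # True

# An arbitrary non-linear read (a quadratic, purely for illustration): a cross-term
# appears, so the interference's effect now depends on what the signal is.
g = lambda x: (w @ x) ** 2
print(np.isclose(g(x), g(signal) + g(interference)))              # False
cross = 2 * (w @ signal) * (w @ interference)
print(np.isclose(g(x), g(signal) + g(interference) + cross))      # True
```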

Linear Writes as a Crux

At the level of a single layer, the write to the residual stream is always additive: \(\mathbf{x}_{l+1} = \mathbf{x}_l + \Delta\mathbf{x}_l\). Information enters via vector addition or not at all. Additivity holds at finer levels of granularity too, e.g., within an MLP, individual neurons contribute vectors that sum to form the layer’s output.7

This doesn’t immediately settle whether features are written linearly. The universal approximation theorem tells us that enough neurons can implement any non-linear operation, even though each one independently adds to the output. There are counterarguments, though: such schemes seem exotic and might need enormous numbers of neurons, and they can always be turned into a multi-step circuit with linear features (e.g., by treating the neurons as features) – I haven’t seen any examples that I don’t think are better understood this way. But I don’t intend to claim that it’s impossible.

The goal of this post is not to resolve that question. Instead, I want to propose linear writes as a point where, if we could agree, much else would follow.

When people discuss exotic non-linear representations, they are often implicitly rejecting linear writes. That rejection tends to remain tacit, making it hard to locate the real disagreement. (Indeed, despite this, they usually point to Csordás et al. (2024) as evidence, which does have linear writes!) Making the crux explicit helps: if you think such features exist, what’s your model of how they enter the residual stream?

Conversely, conditioning on linear writes turns out to be very constraining. If features are written linearly, we can enumerate the possible representational strategies (there are only three), analyse their trade-offs, and make predictions about which should dominate. That’s what the rest of this post does.

The Trichotomy of Representations: What Linear Writes Actually Buy You

If you accept that the residual stream is fundamentally additive, that is, that features enter via \(\mathbf{x} \mapsto \mathbf{x} + \sum_j a_j\mathbf{v}_j\), the natural question arises: what options does a decoder actually have for recovering a specific \(a_i\) from this mixture?

A vector in Euclidean space only has two intrinsic properties: its direction and its magnitude. Any decoding strategy must ultimately exploit one or both of these, and there are only three possibilities for said strategy.

Type 1: Orthogonality

The cleanest situation is when a feature direction \(\mathbf{v}_i\) is orthogonal to everything else. If \(\mathbf{v}_i^\top\mathbf{v}_j = 0\) for all \(j\neq i\), then recovering a feature is trivial:

\[\mathbf{v}_i^\top\mathbf{x} = a_i\]

All the interference projects to zero. The catch, of course, is that the model can only fit \(d\) mutually orthogonal directions in \(\mathbb{R}^d\). This capacity limitation then motivates other types of representation.

Type 2: Angular Superposition

When the model needs more than \(d\) features, the directions have to overlap. But it can choose nearly orthogonal directions, which minimises the interference between them.

If we probe with \(\mathbf{v}_i\), we get:

\[\mathbf{v}_i^\top\mathbf{x} = a_i + \underbrace{\sum_{j\neq i} a_j (\mathbf{v}_i^\top\mathbf{v}_j)}_{\text{interference}}\]

The strategy is to make the interference small enough that the signal is still recoverable. This works when: (a) the dot products \(\mathbf{v}_i^\top\mathbf{v}_j\) are small (low coherence), and (b) not too many other features are active at once (sparsity). Under these conditions, the signal \(a_i\) dominates the noise floor.

This is the regime most interpretability work implicitly assumes. Features have directions; those directions aren’t perfectly orthogonal, but techniques like linear probes still work because the interference stays below some tolerable threshold.
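
Here is a minimal sketch of angular superposition in NumPy (all dimensions, sparsity levels, and activation ranges are toy values I chose for illustration): random near-orthogonal directions plus sparse activations keep the linear read’s error well below the signal scale.

```python
import numpy as np

rng = np.random.default_rng(4)
d, N, k = 512, 2048, 5   # toy values: N >> d features, k active at a time (sparsity)

# Nearly orthogonal directions: random unit vectors have dot products of order 1/sqrt(d).
V = rng.normal(size=(N, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# A sparse activation pattern.
a = np.zeros(N)
active = rng.choice(N, size=k, replace=False)
a[active] = rng.uniform(1.0, 2.0, size=k)

x = a @ V          # linear write: x = sum_i a_i v_i
a_hat = V @ x      # linear read of every feature at once

inactive = np.setdiff1d(np.arange(N), active)
print("max error on active features:", np.max(np.abs(a_hat[active] - a[active])))
print("rms interference on inactive features:", np.sqrt(np.mean(a_hat[inactive] ** 2)))
# Both stay well below the 1-2 scale of the true activations.
```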

Type 3: Magnitude Superposition

Sometimes, angular separation is unavailable or insufficient. The extreme case: two features share the same direction (\(\mathbf{v}_i = \mathbf{v}_j\)). How could a decoder possibly tell them apart?

The only remaining degree of freedom is magnitude. The decoder observes a single scalar \(u = \mathbf{v}^\top \mathbf{x}\) and must infer which features are present and at what intensity.

This is a non-linear representation: despite being written linearly, it violates the “intensity as scaling” property. Decoding requires non-linear operations, and the scheme imposes structural constraints that angular superposition avoids. We return to these limitations in detail below.

Why Angular Superposition Should Be Expected in Practice

Non-Linear Decoding is a Tax on MLP Capacity

Earlier, we observed that MLP capacity must be wasted when linear and non-linear features interact. This is always an issue for non-read-linear features interacting with write-linear features: to construct a linear direction from a non-read-linear feature, extra MLP capacity must be spent. So that argument also applies here.

Magnitude Superposition Has Significant Problems with Co-Activation

Angular superposition preserves direction as identity and magnitude as intensity. Two features can co-activate with arbitrary intensities, and although there is some interference, the information is still approximately recoverable.

Magnitude superposition faces a more rigid constraint. Consider features sharing a single direction \(\mathbf{v}\), so we observe only the scalar \(u=\mathbf{v}^\top\mathbf{x}\).

If only one of the \(n\) features can be active at a time, we can partition the real line into \(n\) intervals and decode via thresholds. Each feature can even carry continuous intensity within its interval (although it must never have small pre-write values). Additionally, the outermost feature for a given direction will be read-linear.

However, co-activating features introduce significant problems. Co-activation can work if each feature is binary (i.e., encodes only the presence or absence of the feature). Assign each feature a distinct power of two: feature \(i\) contributes \(2^i\) when active. The sum uniquely identifies any subset, and a sequential peeling operation recovers all active features: check whether \(u \geq 2^i\) for the largest remaining \(i\), subtract, and repeat. This imposes the additional constraint that the information must be accessed in a fixed order (or else likely requires massive amounts of MLP capacity to do it in one step via the universal approximation theorem). The outermost feature can still be read with a linear projection followed by a binarising step.

Note that this scheme requires features to be precisely on or off. Outer shells must be exponentially larger than inner ones, so any uncertainty in an outer shell’s activation would be indistinguishable from a large activation of an inner shell.
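
A minimal sketch of the peeling scheme just described, with binary features assigned powers of two along a shared direction (the encoding and decoding functions are my own illustrative construction):

```python
def encode(flags):
    """flags[i] = 1 if feature i is active; feature i contributes 2**i along the shared direction."""
    return sum(flag * 2 ** i for i, flag in enumerate(flags))

def decode(u, n):
    """Recover the on/off flags from the single scalar u by peeling from the outermost shell."""
    flags = [0] * n
    for i in reversed(range(n)):   # fixed order: largest shell first
        if u >= 2 ** i:
            flags[i] = 1
            u -= 2 ** i
    return flags

flags = [1, 0, 1, 1]               # features 0, 2, 3 active
u = encode(flags)                  # 1 + 4 + 8 = 13
print(u, decode(u, n=4))           # 13 [1, 0, 1, 1]
```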

Co-active continuous features are impossible. Suppose features \(A\) and \(B\) each take values in [0, 1]. We observe \(u = a_A + 2\cdot a_B \in [0, 3]\), a single scalar, but we need to recover two degrees of freedom. This is impossible without additional structure, as there are infinitely many \((a_A, a_B)\) pairs consistent with any observed \(u\). Continuous features contribute unknown amounts that cannot be disentangled from a single observation. Continuous features can still be represented if we guarantee that everything sharing that direction is mutually exclusive, but that is quite a constraint to impose.

Magnitude Superposition is Extremely Sensitive to Noise

In the co-active features case, the exponential shell structure that makes magnitude superposition work also makes it fragile. Consider what happens when noise enters the system.

Small amounts of noise in outer shell features get amplified when reading inner shell features. If the outer shell operates at magnitude ~100 and the inner shell at ~1, then 1% noise in the outer shell (magnitude ~1) is the same scale as the entire inner shell signal. Any downstream circuit trying to read the inner shell must somehow filter out noise that’s comparable to or larger than what it’s looking for.
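
Using the illustrative numbers above, the arithmetic is stark:

```python
# Illustrative numbers from the paragraph above: outer shell ~100, inner shell ~1,
# and 1% noise on the outer shell.
outer, inner = 100.0, 1.0
noise = 0.01 * outer                 # = 1.0, the same size as the entire inner signal

u_clean = outer + inner              # 101.0
u_noisy = outer + noise + inner      # 102.0

# Reading the inner shell means stripping off the outer shell's nominal value;
# what remains is dominated by the outer shell's noise.
print(u_clean - outer)   # 1.0 (correct inner activation)
print(u_noisy - outer)   # 2.0 (100% error)
```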

The problem runs in the other direction, too. If inner shell features need to influence outer shell features in later layers, the weights must be enormous: scaling up from ~1 to ~100. But large weights amplify any noise present at the input, creating further instability.

Finally, there’s a discontinuity at shell boundaries. The transition from “large activation of shell \(k\)” to “small activation of shell \(k+1\)” is a sharp threshold, and noise near that boundary would cause dramatic oscillation between two completely different feature interpretations. Unlike angular superposition, where noise causes graceful degradation, magnitude superposition has cliff edges.

Angular Superposition Scales; Magnitude Superposition Doesn’t

For every feature the model represents, there is both a benefit and a cost to the loss, the cost being due to interference. At some point, we expect the interference cost of adding another feature to exceed its benefit, and the model will stop representing additional features.

Imagine packing progressively less useful features into the model. Magnitude superposition has a high per-feature cost (both in interference and, indirectly, in the additional MLP capacity it requires), and this cost is relatively fixed with model size. Let’s say there are \(N_{\text{magnitude}}\) features that would be worth representing in this way.

If angular superposition can represent at least this many features at a lower interference cost per feature than magnitude superposition, the model should use only angular superposition. One of the remarkable things about angular superposition is that for any given non-zero interference cost, it can represent exponentially many features as residual stream size increases. This means there must be a model size beyond which magnitude superposition is no longer worthwhile, and, given the exponential, that size probably isn’t very large.

(This is consistent with Csordás et al. (2024) only observing magnitude superposition in small models.)
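
A quick numerical illustration of why angular superposition scales (NumPy; the particular feature counts and dimensions are arbitrary toy values): the worst-case pairwise interference between random unit directions shrinks as the residual stream grows.

```python
import numpy as np

rng = np.random.default_rng(5)

def max_coherence(d, N):
    """Worst-case |v_i . v_j| over N random unit vectors in R^d."""
    V = rng.normal(size=(N, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    G = np.abs(V @ V.T)
    np.fill_diagonal(G, 0.0)
    return G.max()

N = 1000
for d in (64, 256, 1024):
    print(d, round(max_coherence(d, N), 3))
# The worst-case pairwise interference for the same number of features shrinks
# roughly like 1/sqrt(d); equivalently, a fixed interference tolerance admits
# exponentially many nearly-orthogonal directions as d grows.
```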

Conclusion

Debates about non-linear features have often stalled because “maybe features aren’t linear” is too vague to engage with. This post tries to make progress by getting specific.

One core observation is that linear writes are a natural crux. The transformer architecture is fundamentally additive, and while the universal approximation theorem means you can’t rule out exotic schemes in principle, such constructions would require unusual coordination across many neurons. If someone believes non-linear writes are common, the burden should be on them to explain the mechanism.

Conditioning on linear writes turns out to be surprisingly constraining. There are exactly three representational strategies: orthogonal features, angular superposition, and magnitude superposition. Only magnitude superposition produces genuinely non-linear features, and it faces serious obstacles: fragility to noise, inability to handle co-active continuous features, and poor scaling properties. While not definitive, these obstacles suggest it should be rare and confined to capacity-starved models.8

There’s also the question of co-existence. Even if some non-linear features exist, they must somehow live alongside the linear features we have strong evidence for. This creates additional problems: converting between the two wastes MLP capacity, non-linear reads interfere ungracefully with linear features in the residual stream, and learning dynamics become harder to coordinate. A world with widespread non-linear features isn’t just one where they are theoretically possible, but one where these coordination costs are somehow worth paying.

None of this proves that non-linear features don’t exist. But the arguments here are not merely “linear methods work well empirically.” They’re structural: the architecture favours additive composition, magnitude superposition has identifiable failure modes, and mixed linear/non-linear systems face coordination problems that pure linear systems avoid.

If you’re skeptical of linear features, I’d welcome a specific counter-proposal. What’s the mechanism? How does it enter the residual stream? What are the testable predictions? That’s a conversation we could actually have.

References

  1. Tamkin, A., Taufeeque, M., & Goodman, N. D. (2023). Codebook Features: Sparse and Discrete Interpretability for Neural Networks. https://arxiv.org/abs/2310.17230
  2. Csordás, R., Potts, C., Manning, C. D., & Geiger, A. (2024). Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations. https://arxiv.org/abs/2408.10920
  3. Olah, C. (2025). What is a Linear Representation? What is a Multidimensional Feature? In Transformer Circuits Thread.
  1. As an example, multidimensional features that live in linear subspaces are still linear even though the feature might trace out a different, non-line-based structure internally. 

  2. It’s likely that some of this overlaps with intuitions others have developed. The MLP capacity argument is one I’ve heard mentioned informally; the others may have precedents I’m unaware of, but I haven’t encountered them. 

  3. Whether something is read-linear is also not a binary. The features in Csordás et al. are read-linear at the time they are accessed. 

  4. With angular superposition, the linear read includes small amounts of noise from interference with other features. Downstream circuits may threshold or apply a nonlinearity to clean this up. This doesn’t change the fundamental character of the representation—the information is still carried in a linear direction—but it’s worth noting that “linear read” often means “linear projection plus minor cleanup” rather than a pure dot product. 

  5. Attention is particularly affected by this. Attention heads can only read via linear operations (QKV) and so any non-linear feature must first be decoded by an MLP into a linear direction before attention can operate on it directly (this is assuming we want attention to be able to operate directly on directions with minimal interference but presumably someone will object and say this is an unreasonable assumption). 

  6. As a concrete example, let \(a_i(\mathbf{x}) = x_1 x_2\). Then \(\partial a_i / \partial x_1 = x_2\). If we want to increase \(a_i\), the gradient pushes \(x_1\) up when \(x_2 > 0\) and down when \(x_2 < 0\)—the sign flips depending on context despite the same conceptual objective. 

  7. This suggests the constraints following from linear writes may extend beyond transformers to architectures with similar feedforward structure. 

  8. If magnitude superposition does turn out to matter in practice, Tamkin et al. (2023) suggests a promising direction for surfacing such features.