Sparse Crosscoders for Cross-Layer Features and Model Diffing

This note introduces sparse crosscoders, a variant of sparse autoencoders or transcoders for understanding models in superposition. Where autoencoders encode and predict activations at a single layer, and transcoders use activations from one layer to predict the next, a crosscoder reads from and writes to multiple layers. Crosscoders produce shared features across layers, and even across models. They have several applications, outlined below.
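To make the read/write pattern concrete, here is a minimal sketch of a crosscoder in PyTorch. The names, shapes, and loss weighting are illustrative assumptions rather than the authors' implementation: activations from every layer are encoded into one shared sparse feature vector, which is decoded back into a reconstruction of every layer, with a sparsity penalty that weights each feature by the norms of its per-layer decoder directions.

```python
# Minimal crosscoder sketch (hypothetical shapes and names, not the authors' code).
# It reads residual-stream activations from n_layers layers, encodes them into one
# shared sparse feature dictionary, and decodes a reconstruction for every layer.
import torch
import torch.nn as nn


class Crosscoder(nn.Module):
    def __init__(self, n_layers: int, d_model: int, n_features: int):
        super().__init__()
        # One encoder matrix and one decoder matrix per layer, all mapping
        # into / out of a single shared feature space.
        self.W_enc = nn.Parameter(torch.randn(n_layers, d_model, n_features) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.randn(n_layers, n_features, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(n_layers, d_model))

    def forward(self, acts):  # acts: (batch, n_layers, d_model)
        # Encode: sum contributions from every layer, then one shared ReLU,
        # so each feature has a single activation shared across layers.
        pre = torch.einsum("bld,ldf->bf", acts, self.W_enc) + self.b_enc
        feats = torch.relu(pre)  # (batch, n_features), sparse after training
        # Decode: each feature writes a (possibly different) direction at each layer.
        recon = torch.einsum("bf,lfd->bld", feats, self.W_dec) + self.b_dec
        return recon, feats


def loss_fn(recon, acts, feats, W_dec, l1_coeff=1e-3):
    # Reconstruction error summed over layers, plus a sparsity penalty that
    # weights each feature by the total norm of its per-layer decoder vectors.
    mse = (recon - acts).pow(2).sum(dim=(1, 2)).mean()
    dec_norms = W_dec.norm(dim=-1).sum(dim=0)  # (n_features,)
    sparsity = (feats * dec_norms).sum(dim=-1).mean()
    return mse + l1_coeff * sparsity
```

Setting n_layers to 1 with the decoder reading the encoder's input recovers an ordinary sparse autoencoder, and predicting a later layer from an earlier one recovers a transcoder; the crosscoder generalizes both.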

This note will cover some theoretical examples motivating crosscoders, and then present preliminary experiments applying them to cross-layer superposition and model diffing. We also briefly discuss the theory of how crosscoders might simplify circuit analysis, but leave results on this for a future update.

According to the superposition hypothesis, neural networks represent more features than they have neurons by allowing features to be non-orthogonal. One consequence of this is that most features are represented by linear combinations of multiple neurons.
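As a toy numerical illustration of this point (the numbers are hypothetical, chosen only to make the idea concrete), the sketch below packs 256 random unit directions into a 64-dimensional space: the directions are nearly but not exactly orthogonal, and any one of them spreads its weight across many neurons.

```python
# Toy illustration of superposition: more feature directions than neurons,
# at the cost of small interference between features.
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_features = 64, 256  # more features than neurons

# Random unit vectors in a 64-dim space are nearly, but not exactly, orthogonal.
directions = rng.standard_normal((n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Off-diagonal dot products are small but nonzero: interference between features.
gram = directions @ directions.T
off_diag = gram[~np.eye(n_features, dtype=bool)]
print(f"max |cos| between distinct features: {np.abs(off_diag).max():.2f}")

# A single feature direction has weight on many neurons at once, i.e.
# features are linear combinations of multiple neurons.
print(f"neurons carrying >1% of feature 0's mass: "
      f"{(directions[0] ** 2 > 0.01).sum()}")
```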

At first blush, the idea that this kind of superposition might be spread across layers may seem strange. But on closer consideration, it is actually quite natural in the context of a transformer with a reasonable number of layers.
