Sparse Autoencoders (SAEs) are a Mechanistic Interpretability technique for investigating Superposition and Polysemanticity.
The key idea is to use a separate learning algorithm to extract model features that often align with human-understandable concepts. Researchers can then amplify or suppress these features to observe their effect on the model’s behavior.
The primary operation is to train a sparse autoencoder on the activations of a specific layer in a neural network. The autoencoder learns to reconstruct the original activations while enforcing sparsity, meaning that only a small number of its hidden units are active for any given input. This encourages the autoencoder to learn more interpretable features.
This can be done by taking a matrix of activations from a specific layer in the model, where each row corresponds to a different input example and each column corresponds to a different neuron in that layer. The sparse autoencoder then learns encoder and decoder weights such that:
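A common formulation is a single hidden layer with a ReLU encoder and a linear decoder; the symbols below are illustrative, and exact parameterizations vary between papers:

$$
\mathbf{f}(\mathbf{x}) = \mathrm{ReLU}\left(W_{\text{enc}}\,\mathbf{x} + \mathbf{b}_{\text{enc}}\right), \qquad \hat{\mathbf{x}} = W_{\text{dec}}\,\mathbf{f}(\mathbf{x}) + \mathbf{b}_{\text{dec}}
$$

where $\mathbf{x}$ is one row of the activation matrix, $\mathbf{f}(\mathbf{x})$ is the sparse feature vector (usually much wider than $\mathbf{x}$), and $\hat{\mathbf{x}}$ is the reconstruction.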
The loss function for training the sparse autoencoder typically includes a reconstruction loss term and a sparsity penalty term, such as:
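A standard choice is mean squared reconstruction error plus an L1 penalty on the feature activations (a sketch; the exact penalty and coefficient vary):

$$
\mathcal{L}(\mathbf{x}) = \lVert \mathbf{x} - \hat{\mathbf{x}} \rVert_2^2 + \lambda\,\lVert \mathbf{f}(\mathbf{x}) \rVert_1
$$

where $\lambda$ trades off reconstruction fidelity against sparsity. As a concrete sketch, the following PyTorch snippet implements this setup; the class and function names, dimensions, and hyperparameters are illustrative rather than taken from any particular codebase:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """ReLU encoder into a wider latent space, linear decoder back to the
    original activation dimension."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = F.relu(self.encoder(x))  # sparse feature activations
        x_hat = self.decoder(f)      # reconstruction of the input activations
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    recon = F.mse_loss(x_hat, x)           # reconstruction term
    sparsity = f.abs().sum(dim=-1).mean()  # L1 penalty on feature activations
    return recon + l1_coeff * sparsity

# Train on a (num_examples, d_model) matrix of layer activations.
activations = torch.randn(4096, 512)  # stand-in for real model activations
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

for step in range(1000):
    batch = activations[torch.randint(0, activations.shape[0], (64,))]
    x_hat, f = sae(batch)
    loss = sae_loss(batch, x_hat, f)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After training, each column of the decoder weight matrix can be read as a feature direction in the original activation space, which is what makes amplifying or suppressing individual features possible.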