Polysemanticity refers to the tendency of neural networks, including large language models, to have individual neurons that respond to multiple unrelated features or concepts. This phenomenon is widely observed in deep learning models, where a single neuron may activate for different types of inputs that share no obvious connection.

Polysemanticity is what we would expect to observe if features were not aligned with individual neurons, that is, not aligned with the standard basis of the activation space. Under the superposition hypothesis, features cannot all align with that basis because the model embeds more features than it has neurons, so each feature is spread across several neurons and each neuron participates in representing several features.
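To make this concrete, here is a minimal numpy sketch, not taken from any particular model, that embeds six sparse features into a two-neuron activation space. The feature count, dimensions, and random directions are illustrative assumptions; the point is only that when there are more features than neurons, the feature directions cannot be orthogonal to the neuron basis, so each neuron ends up responding to several features.

```python
# Illustrative sketch (assumed toy setup): more sparse features than neurons.
import numpy as np

rng = np.random.default_rng(0)

n_features = 6   # assumed toy values: six features...
n_neurons = 2    # ...squeezed into two neurons

# One unit-norm direction in neuron space per feature. With
# n_features > n_neurons these directions cannot be orthogonal.
W = rng.normal(size=(n_neurons, n_features))
W /= np.linalg.norm(W, axis=0, keepdims=True)

# Activate one feature at a time and look at the neuron activations.
for active_feature in range(n_features):
    x = np.zeros(n_features)
    x[active_feature] = 1.0
    h = W @ x  # neuron activations for this single active feature
    print(f"feature {active_feature}: neurons = {np.round(h, 2)}")

# Each row of W is one neuron's response to every feature; several entries
# are far from zero, i.e. the neuron is polysemantic.
print("neuron 0 weights per feature:", np.round(W[0], 2))

# Approximate readout: because features are sparse, projecting back with
# W.T still identifies which feature was active despite the interference.
x = np.zeros(n_features)
x[3] = 1.0
x_hat = W.T @ (W @ x)
print("recovered feature scores:", np.round(x_hat, 2))
```

Running the sketch shows each neuron activating for several unrelated features, and the readout scores show why superposition can still work: as long as only a few features are active at once, the interference between overlapping directions stays small enough for the active feature to be recovered.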