t-SNE (t-distributed Stochastic Neighbor Embedding) is a latent-space visualization / dimensionality reduction technique particularly well suited to high-dimensional data. It focuses on preserving local structure in the data, making it effective for visualizing clusters and patterns.
In layman's terms, it does this by computing the nearest neighbors of each point in the high-dimensional space and then optimizing the positions of points in a low-dimensional space to match those neighbor relationships, so that points that were close in high-dimensional space remain close in low-dimensional space. The main difference between the original SNE and t-SNE is that t-SNE uses a heavy-tailed distribution (the Student t-distribution) in the low-dimensional space to better handle the "crowding problem", where points tend to cluster too tightly together. In addition, it simplifies the gradient, making each iteration cheaper to compute (though the algorithm is still very expensive for large datasets).
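To get a feel for why the heavy tails matter, the toy snippet below (a minimal sketch, not taken from any library) compares the Gaussian kernel used in the high-dimensional space with the Student-t kernel used in the map. At moderate distances the t-kernel still assigns noticeable similarity, which lets dissimilar points sit farther apart in the map without crowding everything else together.

```python
import numpy as np

# Illustrative pairwise distances at which to compare the two kernels.
d = np.array([0.5, 1.0, 2.0, 4.0])
gaussian = np.exp(-d ** 2 / 2)      # kernel used for high-dimensional similarities (sigma = 1)
student_t = 1.0 / (1.0 + d ** 2)    # heavy-tailed kernel used in the low-dimensional map

for dist, g, t in zip(d, gaussian, student_t):
    print(f"d = {dist}: gaussian = {g:.4f}, student-t = {t:.4f}")
```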

t-SNE applied to a 19th-century word embeddings dataset, showing how words with similar meanings cluster together.
Formulation
High-Dim Similarities
t-SNE first computes pairwise similarities in the original high-dimensional space using a Gaussian distribution. The conditional probability that point $x_i$ would pick $x_j$ as its neighbor is:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$
where $\sigma_i$ is the variance of the Gaussian centered on $x_i$. This variance is determined by the perplexity parameter.
To make the similarity metric symmetric, we use:

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$
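A minimal NumPy sketch of these two steps, assuming the per-point bandwidths $\sigma_i$ have already been chosen (function and variable names are illustrative, not from any particular implementation):

```python
import numpy as np

def conditional_probabilities(X, sigmas):
    """Gaussian conditional similarities p_{j|i}, given per-point bandwidths sigma_i."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-sq_dists / (2 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)              # a point is never its own neighbor
    P /= P.sum(axis=1, keepdims=True)     # normalize each row -> p_{j|i}
    return P

def joint_probabilities(X, sigmas):
    """Symmetrized joint similarities p_{ij} = (p_{j|i} + p_{i|j}) / (2n)."""
    P_cond = conditional_probabilities(X, sigmas)
    n = X.shape[0]
    return (P_cond + P_cond.T) / (2 * n)
```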
Low-Dim Similarities
In the low-dimensional map, t-SNE uses a Student t-distribution with one degree of freedom (also known as the Cauchy distribution). This heavy-tailed distribution is key to avoiding the "crowding problem":

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$$

where $y_i$ and $y_j$ are the low-dimensional representations of $x_i$ and $x_j$.
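The corresponding computation for the map, again as an illustrative NumPy sketch:

```python
import numpy as np

def low_dim_similarities(Y):
    """Student-t (1 d.o.f.) similarities q_{ij} over the low-dimensional map Y."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)     # heavy-tailed kernel (1 + ||y_i - y_j||^2)^(-1)
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum()           # normalize over all pairs k != l
```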
Optimization Objective
t-SNE minimizes the Kullback-Leibler divergence between these two distributions:

$$C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
The gradient of this cost function with respect to the low-dimensional points is:

$$\frac{\partial C}{\partial y_i} = 4 \sum_{j} \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}$$
This gradient is optimized using gradient descent with momentum.
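A sketch of how that gradient and a momentum update might look in NumPy, reusing `low_dim_similarities` from the sketch above (the learning rate, momentum, and iteration count are illustrative defaults, not the schedule from the original paper):

```python
import numpy as np

def tsne_gradient(P, Q, Y):
    """dC/dy_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j) (1 + ||y_i - y_j||^2)^(-1)."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(inv, 0.0)
    PQ = (P - Q) * inv                                    # per-pair weights
    return 4.0 * (PQ.sum(axis=1)[:, None] * Y - PQ @ Y)   # sum_j PQ[i,j] * (y_i - y_j)

def optimize(P, Y0, learning_rate=200.0, momentum=0.8, n_iter=500):
    """Plain gradient descent with momentum on the map coordinates."""
    Y = Y0.copy()
    velocity = np.zeros_like(Y)
    for _ in range(n_iter):
        Q = low_dim_similarities(Y)   # from the earlier sketch
        velocity = momentum * velocity - learning_rate * tsne_gradient(P, Q, Y)
        Y += velocity
    return Y
```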
Perplexity and $\sigma_i$
The perplexity is related to the entropy of the conditional probability distribution:

$$\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_{j} p_{j|i} \log_2 p_{j|i}$$
For each point $x_i$, a binary search finds the $\sigma_i$ that produces the desired perplexity. This means each point effectively has a different "bandwidth" depending on the local density of the data.
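One way this binary search could be written, as a rough sketch (the `sigma_for_perplexity` name, bounds, and tolerance are all illustrative):

```python
import numpy as np

def sigma_for_perplexity(sq_dists_i, target_perplexity, tol=1e-5, max_iter=50):
    """Binary-search sigma_i so the row's perplexity 2^H(P_i) matches the target.

    `sq_dists_i` holds the squared distances from point i to all *other* points.
    """
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = (lo + hi) / 2.0
        p = np.exp(-sq_dists_i / (2.0 * sigma ** 2))
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))   # H(P_i), in bits
        perplexity = 2.0 ** entropy
        if abs(perplexity - target_perplexity) < tol:
            break
        if perplexity > target_perplexity:
            hi = sigma    # distribution too flat: shrink the bandwidth
        else:
            lo = sigma    # distribution too peaked: widen the bandwidth
    return sigma
```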
Intuition
How t-SNE works can be understood in two main steps:
- Build a probability distribution over pairs of high-dimensional objects in such a way that similar objects have a high probability of being picked, while dissimilar points have an extremely small probability of being picked.
- Then, define a similar probability distribution over the points in the low-dimensional map, and minimize the Kullback-Leibler divergence between the two distributions with respect to the locations of the points in the map.
A key parameter in t-SNE is perplexity, which can be thought of as a smooth measure of the effective number of neighbors. It influences how the algorithm balances attention between local and global aspects of the data. Lower perplexity values focus more on local structures, while higher values capture broader patterns.
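In practice you rarely implement all of this by hand; a typical way to explore the effect of perplexity is scikit-learn's `TSNE`, here applied to the 64-dimensional digits dataset (the dataset and parameter choices are just for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Small high-dimensional dataset: 8x8 digit images flattened to 64 features.
X, y = load_digits(return_X_y=True)

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, perplexity in zip(axes, [5, 30, 50, 100]):
    emb = TSNE(n_components=2, perplexity=perplexity,
               init="pca", random_state=0).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, s=4, cmap="tab10")
    ax.set_title(f"perplexity = {perplexity}")
plt.show()
```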

Different perplexity values (5 to 100) and how they change the depiction of the data.
Limitations
While a great approximation for many dimensionality reduction tasks, t-SNE has a few key issues. Namely, it is non-deterministic (results depend on the random initialization), and it cannot generalize to new data the way Principal Component Analysis can. It is a "trained" model in the sense that it learns the distribution of this specific dataset, and that also comes with the downside of computational complexity, $O(n^2)$ in the number of points. In addition, many features of the visualization have no inherent meaning: the distances between different clusters hold no value, and the cluster sizes themselves are arbitrary.
When To Use What
Use t-SNE when: You want to visualize high-dimensional data with focus on local structure and clustering patterns.
Use PCA when: You need deterministic results, global structure preservation, or the ability to transform new data.
Use UMAP when: You want fast computation with both local and global structure preserved and the ability to transform new data.
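A small illustration of the "transform new data" distinction, using synthetic data: scikit-learn's `PCA` exposes `transform` for unseen points, while its `TSNE` only offers `fit_transform`, so embedding new points means re-fitting from scratch.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 50))   # "seen" data
X_new = rng.normal(size=(10, 50))      # "unseen" data

# PCA learns an explicit linear projection, so unseen points can be mapped later.
pca = PCA(n_components=2).fit(X_train)
new_points_2d = pca.transform(X_new)

# scikit-learn's TSNE has no .transform(); embedding X_new would require
# re-running fit_transform on the combined dataset.
train_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_train)
```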