ML Atlas

t-SNE

Visualize high-dimensional data by preserving local neighborhoods in 2D.

IntermediateUnsupervisedDim. Reduction
32 min read
Probability distributions (Gaussian, Student-t)KL divergence and information theory basicsMatrix operations and gradient calculusUnderstanding of PCA (what dimensionality reduction is)
  • Single-cell RNA-seq visualization in computational biology (most common scientific use)
  • Word embedding visualization (GPT / BERT token clusters)
  • Image feature space exploration (CNN intermediate layer activations)
  • Fraud detection: visualizing transaction embedding clusters before labeling
  • Drug discovery: visualizing molecular fingerprint similarity landscapes
  • Customer segmentation: understanding cluster separation before K-Means
01

In Plain English

t-SNE is a technique that takes data with hundreds or thousands of dimensions and crunches it down to 2D or 3D so you can plot it. It does this by preserving which points are neighbors of which — nearby points in high dimensions end up nearby in 2D.

Why It Exists

PCA was the standard dimensionality reduction tool for decades, but it only preserves global linear structure. Real data (gene expression, images, text embeddings) has complex non-linear cluster structure that PCA destroys. t-SNE was designed specifically to reveal this cluster structure visually.

Problem It Solves

Given a dataset of n points in d dimensions (d can be hundreds or thousands), find a 2D or 3D embedding where points that were neighbors in high-d are still neighbors in low-d. The goal is visual insight into cluster structure, not a general-purpose projection.

Real-Life Analogy

"Imagine you have thousands of people and you know, for each person, their 500 closest friends. You want to draw them all on a map such that best friends sit next to each other. t-SNE does exactly this: it builds a friendship map from high-dimensional proximity and then projects it onto paper while keeping friendships intact."

When To Use

  • Exploratory data analysis — before clustering, to visually confirm there are distinct groups
  • Visualizing embedding spaces (NLP tokens, CNN features, gene expression profiles)
  • Debugging ML models — checking if learned representations make intuitive sense
  • Class overlap analysis — do classes separable in high-d also look separable in 2D?
  • Single-cell sequencing and other high-dimensional biology tasks
  • Understanding anomalies — outlier points appear isolated in t-SNE plots

When NOT To Use

  • When you need to preserve global distances (t-SNE does NOT preserve inter-cluster distances)
  • When you need a deterministic, reproducible algorithm without random initialization
  • When you need to project new points without refitting (t-SNE has no out-of-sample extension)
  • Preprocessing for machine learning models — use PCA or autoencoders instead
  • When dataset has > 100K points (t-SNE is O(n²) and very slow without approximations)
  • When you need to interpret exact coordinate positions — t-SNE coordinates are meaningless
02

In high-dimensional space, t-SNE starts by computing a probability distribution over pairs of points: nearby points get high probability of being 'neighbors' and far-apart points get near-zero probability. This distribution is defined using a Gaussian kernel, and the width of the Gaussian (controlled by perplexity) is adapted per-point so that each point has approximately the same number of effective neighbors.

Then t-SNE creates a matching distribution over pairs of points in 2D using a Student-t distribution with one degree of freedom (which is actually just a Cauchy distribution). It optimizes the 2D coordinates so that this low-dimensional distribution is as similar as possible to the high-dimensional one. Similarity is measured by KL divergence.

The crucial design choice: using a Student-t distribution in low dimensions instead of a Gaussian. The Student-t has much heavier tails than a Gaussian. This means that in 2D, moderately distant points get much more repulsion than they would under a Gaussian. This solves the 'crowding problem': in high-d, moderately distant points can coexist in the crowded space, but in 2D there's no room for them — the heavy-tailed distribution gives them space by pushing them further away, naturally creating the crisp cluster separations t-SNE is famous for.

SNE (Stochastic Neighbor Embedding), the predecessor, used a Gaussian in both high-d and low-d. The crowding problem was severe: clusters collapsed into indistinct blobs. t-SNE's only major change from SNE was replacing the low-dimensional Gaussian with a Student-t. This single change dramatically improved visualization quality.

The Metaphor

"Think of social cliques in a school. In reality (high dimensions), students have many acquaintances at different levels of closeness — a rich social fabric. When you photograph the school (project to 2D), everyone gets squished together. A Gaussian lens would create an indistinct crowd. A Student-t lens (heavy tails) pushes acquaintances far away while keeping best friends glued together — creating the distinct friend-group clusters you'd expect."

Beginner Mental Model

Step 1: For every pair of points, compute how close they are in high-d (using Gaussian probabilities). Step 2: Place all points randomly in 2D. Step 3: Use gradient descent to move points in 2D so that their closeness pattern matches what it was in high-d. Points that should be close get attracted; points that should be far get repelled. After many iterations, you get a stable 2D picture where clusters in high-d appear as clusters in 2D.

03

Given n points {x₁, …, xₙ} ∈ ℝᵈ, t-SNE finds a low-dimensional embedding {y₁, …, yₙ} ∈ ℝ² (or ℝ³) by minimizing the KL divergence between the high-dimensional joint probability distribution P (Gaussian-based) and the low-dimensional distribution Q (Student-t-based): C = KL(P‖Q) = Σᵢ Σⱼ pᵢⱼ log(pᵢⱼ / qᵢⱼ), where pᵢⱼ is the symmetrized probability of i and j being neighbors in high-d and qᵢⱼ is the probability in low-d.

Perplexity
A hyperparameter controlling the effective number of neighbors per point. Formally, perplexity = 2^H(Pᵢ) where H(Pᵢ) is the Shannon entropy of the conditional distribution for point i. Typical values: 5–50. Higher perplexity → more global structure; lower → more local.
Crowding Problem
In high dimensions, moderately distant points can be accommodated in the space around each point. When projected to 2D, there is not enough space for all these moderately distant points — they get crowded into the center. t-SNE's heavy-tailed Q distribution alleviates this by creating extra repulsion for mid-range points.
SNE (Stochastic Neighbor Embedding)
The predecessor to t-SNE (Hinton & Roweis, 2002). Uses Gaussian distributions in both high-d and low-d. Suffers from the crowding problem and harder optimization (asymmetric KL divergence). t-SNE (van der Maaten & Hinton, 2008) fixed both issues.
Student-t Distribution (1 df)
The probability distribution used for low-dimensional similarities in t-SNE. With 1 degree of freedom it reduces to the Cauchy distribution: q(|y|) ∝ (1 + |y|²)⁻¹. Its heavy tails allow points that are moderately far in high-d to be pushed much further apart in 2D.
Early Exaggeration
An optimization trick in t-SNE: during the first ~250 iterations, multiply all pᵢⱼ by a large constant (typically 4 or 12). This makes clusters attract strongly and repel more cleanly, forming well-separated groups early. Exaggeration is then turned off for the remainder of optimization.
KL Divergence (Kullback-Leibler)
The objective t-SNE minimizes: KL(P‖Q) = Σᵢⱼ pᵢⱼ log(pᵢⱼ/qᵢⱼ). It is asymmetric: it heavily penalizes modeling nearby high-d points as far apart in 2D (q≈0 when p is large). This asymmetry is what makes t-SNE preserve local neighborhoods over global structure.
Barnes-Hut Approximation
An O(n log n) approximate version of t-SNE that uses a quadtree to approximate gradient contributions from far-away points. sklearn's TSNE uses this by default (method='barnes_hut'), making it feasible for ~100K points.
  1. 1. Compute pairwise distances ||xᵢ - xⱼ||² for all pairs in high-dimensional space.
  2. 2. For each point i, find σᵢ (Gaussian bandwidth) such that the perplexity of the conditional distribution Pⱼ|ᵢ equals the target perplexity (binary search per point).
  3. 3. Compute conditional probabilities: pⱼ|ᵢ = exp(-||xᵢ - xⱼ||²/2σᵢ²) / Σₖ≠ᵢ exp(-||xᵢ - xₖ||²/2σᵢ²).
  4. 4. Symmetrize: pᵢⱼ = (pⱼ|ᵢ + pᵢ|ⱼ) / 2n. This is the high-d joint distribution P.
  5. 5. Initialize 2D embeddings yᵢ ~ N(0, 0.0001·I) (small random initialization).
  6. 6. Apply early exaggeration: multiply all pᵢⱼ by 4 (or 12) for the first ~250 iterations.
  7. 7. Compute low-d similarities: qᵢⱼ = (1 + ||yᵢ - yⱼ||²)⁻¹ / Σₖ≠ₗ (1 + ||yₖ - yₗ||²)⁻¹.
  8. 8. Compute gradient of KL divergence with respect to yᵢ.
  9. 9. Update yᵢ using gradient descent with momentum (Adam or SGD with momentum).
  10. 10. Turn off early exaggeration after ~250 iterations and continue optimizing for ~750 more iterations.
  11. 11. Return the final 2D coordinates {y₁, …, yₙ}.

Feature matrix X ∈ ℝⁿˣᵈ (n samples, d features). Typically preprocessed with StandardScaler. For large d (> 50), it's common practice to first reduce to 50 dimensions with PCA before applying t-SNE.

2D (or 3D) embedding matrix Y ∈ ℝⁿˣ² where each row is the 2D coordinate of the corresponding input point. These coordinates have no intrinsic interpretation — only relative positions matter.

01Local neighborhood structure is meaningful and informative in the original high-d space.
02Data does not have extreme intrinsic dimensionality (works best when true structure is low-dimensional).
03The user understands that cluster sizes and inter-cluster distances in t-SNE output are NOT meaningful — only topology (which points are near which) matters.
04Perplexity is chosen appropriately for the dataset size (rule of thumb: perplexity < n/3 and perplexity 5–50).
  • Very small datasets (n < 30): perplexity must be < n/3 or the algorithm degenerates; use PCA instead.
  • Very large datasets (n > 100K): standard t-SNE is O(n²) and infeasible; use Barnes-Hut approximation or switch to UMAP.
  • Perplexity > n/3: the binary search for σᵢ may not converge; sklearn will warn.
  • All points identical: trivial degenerate case — all similarities are equal and the embedding is random.
  • Pre-applying PCA with too few components: destroys information that t-SNE needs; 50 PCA components is standard.
04

t-SNE is an end-of-pipeline visualization tool. It is applied after all feature engineering, scaling, and optional PCA pre-reduction. It produces coordinates used only for plotting — never for downstream ML models.

  • 01.Standardize features with StandardScaler (zero mean, unit variance) — t-SNE is sensitive to scale differences.
  • 02.For high-d data (d > 50), apply PCA first to reduce to 50 components. This denoises the data and speeds up t-SNE dramatically without losing cluster structure.
  • 03.Handle missing values before t-SNE — impute with mean/median or use a dedicated imputer.
  • 04.Remove or winsorize severe outliers — they can distort the neighborhood structure.
  • 05.For sparse data (e.g., text TF-IDF), consider dense embedding (e.g., LSA/NMF) before t-SNE.
  • 06.Normalize if features are on wildly different scales — this is captured by StandardScaler.
  • 01.Choose perplexity: start with 30 for most datasets; try 5, 30, and 50 to see how structure changes.
  • 02.Set n_iter >= 1000 (default 1000 is usually sufficient; complex data may need 2000–5000).
  • 03.Run t-SNE multiple times with different random seeds — results will differ; look for consistent structure.
  • 04.Use early_exaggeration=12 (sklearn default) which was empirically found to work well.
  • 05.Learning rate: sklearn default is 'auto' (= max(n/early_exaggeration/4, 50)) — rarely needs changing.
  • 06.Inspect KL divergence at convergence (printed if verbose=1) — lower is better, but absolute value depends on dataset.
  • 07.Color points by known labels/metadata to validate that t-SNE clusters align with ground truth.

perplexity

Controls the effective number of neighbors each point considers. Balances local vs. global structure. Related to bandwidth σᵢ via binary search.

30 (range: 5–50; for very large n try up to 100)

n_iter

Number of optimization iterations (gradient descent steps).

1000 (increase to 2000–5000 for complex datasets)

learning_rate

Step size for gradient descent. sklearn auto-computes a good default: max(n / early_exaggeration / 4, 50).

'auto' or 200

early_exaggeration

Multiplier applied to pᵢⱼ in the first phase of optimization to encourage well-separated clusters.

12 (sklearn default)

n_components

Dimensionality of the output embedding.

2 (for visualization); 3 for 3D plots

metric

Distance metric used for computing high-dimensional pairwise distances.

'euclidean' (default); 'cosine' for text/embeddings; 'precomputed' if you have custom distances

  1. 1pip install scikit-learn numpy matplotlib seaborn
  2. 2Load and explore data: df.shape, df.describe(), check for NaNs
  3. 3Preprocess: StandardScaler → (optional) PCA(n_components=50)
  4. 4Instantiate: tsne = TSNE(n_components=2, perplexity=30, n_iter=1000, random_state=42)
  5. 5Fit and transform: embedding = tsne.fit_transform(X_scaled)
  6. 6Plot: plt.scatter(embedding[:,0], embedding[:,1], c=labels, cmap='tab10')
  7. 7Repeat with different perplexity values (5, 30, 50) to verify robustness of cluster structure
05
06
python
1import numpy as np
2
3
4def compute_pairwise_distances(X):
5    """Squared Euclidean pairwise distances."""
6    sum_sq = np.sum(X ** 2, axis=1)
7    D = sum_sq[:, None] + sum_sq[None, :] - 2 * X @ X.T
8    np.fill_diagonal(D, 0.0)
9    return D
10
11
12def gaussian_affinities(D, perplexity=30.0, tol=1e-5, max_iter=50):
13    """
14    Compute symmetrized joint probabilities P via binary search for sigma.
15    Returns P (n x n) with zeros on diagonal.
16    """
17    n = D.shape[0]
18    P = np.zeros((n, n))
19    log_perp = np.log(perplexity)
20
21    for i in range(n):
22        beta_min, beta_max = -np.inf, np.inf
23        beta = 1.0  # beta = 1 / (2 * sigma_i^2)
24        Di = D[i, np.concatenate([np.arange(0, i), np.arange(i + 1, n)])]
25
26        for _ in range(max_iter):
27            exp_Di = np.exp(-Di * beta)
28            sum_exp = exp_Di.sum()
29            H = np.log(sum_exp + 1e-10) + beta * np.sum(Di * exp_Di) / (sum_exp + 1e-10)
30            H_diff = H - log_perp
31
32            if abs(H_diff) < tol:
33                break
34
35            if H_diff > 0:   # entropy too high → increase beta (narrow Gaussian)
36                beta_min = beta
37                beta = (beta + (beta_max if beta_max != np.inf else beta * 2)) / 2
38            else:            # entropy too low → decrease beta
39                beta_max = beta
40                beta = (beta + (beta_min if beta_min != -np.inf else beta / 2)) / 2
41
42        p_i = np.exp(-Di * beta)
43        p_i /= p_i.sum()
44        mask = np.concatenate([np.arange(0, i), np.arange(i + 1, n)])
45        P[i, mask] = p_i
46
47    # Symmetrize and normalize
48    P = (P + P.T) / (2 * n)
49    P = np.maximum(P, 1e-12)  # numerical stability
50    return P
51
52
53def student_t_affinities(Y):
54    """Low-dim Student-t (1 df) joint probabilities Q."""
55    sum_sq = np.sum(Y ** 2, axis=1)
56    D = sum_sq[:, None] + sum_sq[None, :] - 2 * Y @ Y.T
57    np.fill_diagonal(D, 0.0)
58    Q_num = (1.0 + D) ** -1
59    np.fill_diagonal(Q_num, 0.0)
60    Q = Q_num / Q_num.sum()
61    Q = np.maximum(Q, 1e-12)
62    return Q, Q_num
63
64
65def tsne_gradient(P, Q, Q_num, Y):
66    """Compute gradient of KL(P||Q) w.r.t. Y."""
67    n = Y.shape[0]
68    PQ_diff = (P - Q)[:, :, None]      # (n, n, 1)
69    Q_num_3d = Q_num[:, :, None]        # (n, n, 1)
70    YY_diff = Y[:, None, :] - Y[None, :, :]  # (n, n, 2)
71    # grad[i] = 4 * sum_j (p_ij - q_ij) * q_num_ij * (y_i - y_j)
72    grad = 4.0 * np.sum(PQ_diff * Q_num_3d * YY_diff, axis=1)  # (n, 2)
73    return grad
74
75
76def tsne(X, n_components=2, perplexity=30.0, n_iter=1000,
77         learning_rate=200.0, early_exaggeration=12.0,
78         exaggeration_iter=250, random_state=42):
79    """
80    Simplified t-SNE implementation.
81    Note: O(n^2) — only suitable for n <= ~3000.
82    """
83    rng = np.random.RandomState(random_state)
84    n = X.shape[0]
85
86    # Step 1: High-dim affinities
87    print("Computing pairwise distances...")
88    D = compute_pairwise_distances(X)
89    print("Computing Gaussian affinities (binary search for sigma)...")
90    P = gaussian_affinities(D, perplexity=perplexity)
91
92    # Step 2: Initialize Y
93    Y = rng.randn(n, n_components) * 1e-4
94    Y_prev = Y.copy()
95    momentum = 0.5
96    gains = np.ones_like(Y)
97    kl_history = []
98
99    # Step 3: Gradient descent with momentum
100    print("Optimizing embedding...")
101    for t in range(1, n_iter + 1):
102        # Early exaggeration phase
103        P_eff = P * early_exaggeration if t <= exaggeration_iter else P
104        if t == exaggeration_iter + 1:
105            momentum = 0.8  # switch to higher momentum after exaggeration
106
107        Q, Q_num = student_t_affinities(Y)
108        grad = tsne_gradient(P_eff, Q, Q_num, Y)
109
110        # Adaptive learning rate per dimension (simplified)
111        gains = (gains + 0.2) * ((grad > 0) != (Y - Y_prev > 0)) +                 (gains * 0.8) * ((grad > 0) == (Y - Y_prev > 0))
112        gains = np.maximum(gains, 0.01)
113
114        Y_new = Y - learning_rate * gains * grad + momentum * (Y - Y_prev)
115        Y_prev = Y.copy()
116        Y = Y_new
117
118        # Track KL divergence
119        if t % 100 == 0:
120            kl = np.sum(P * np.log(P / Q))
121            kl_history.append((t, kl))
122            print(f"  Iter {t:4d} | KL divergence: {kl:.4f}")
123
124    return Y, kl_history
125
126
127# ── Demo ──────────────────────────────────────────────────────────────────────
128from sklearn.datasets import load_digits
129from sklearn.preprocessing import StandardScaler
130from sklearn.decomposition import PCA
131
132digits = load_digits()
133X = digits.data       # (1797, 64)
134y = digits.target
135
136# Preprocess: scale then PCA to 30 dims (standard practice)
137X_scaled = StandardScaler().fit_transform(X)
138X_pca = PCA(n_components=30, random_state=42).fit_transform(X_scaled)
139
140# Run t-SNE on a subset for speed (our implementation is O(n^2))
141idx = np.random.RandomState(0).choice(len(X_pca), 500, replace=False)
142embedding, kl_hist = tsne(X_pca[idx], perplexity=30, n_iter=1000,
143                           learning_rate=200, random_state=42)
144
145print(f"\nFinal embedding shape: {embedding.shape}")
146print(f"Final KL divergence: {kl_hist[-1][1]:.4f}")
147# Expect: 10 distinct clusters (digits 0-9) visible in a scatter plot
The from-scratch implementation reveals the key mechanics: (1) binary search for per-point sigma so every point has the same effective neighbor count; (2) symmetrized P to make the problem well-behaved; (3) Student-t Q for heavy-tailed low-d similarities; (4) gradient as attractive/repulsive forces; (5) early exaggeration to form initial cluster structure. The adaptive gains (per-parameter learning rate scaling) are t-SNE's specific momentum trick from the original paper.
X = digits.data  # shape (1797, 64) — pixel intensities of handwritten digits
y = digits.target  # shape (1797,) — digit labels 0-9
embedding shape: (1797, 2)
KL divergence at convergence: ~0.85
Plot shows 10 distinct clusters, one per digit class, with digit '1' and '7' sometimes overlapping (visually similar strokes)
  • Always apply PCA (n_components=50) before t-SNE when d > 50. This reduces runtime significantly and often improves embedding quality by denoising.
  • Run t-SNE multiple times with different random seeds. If the same cluster structure appears consistently, it reflects real structure. If it changes every run, the structure may be an artifact.
  • init='pca' is more stable than init='random' and typically requires fewer iterations to converge.
  • Cluster sizes and inter-cluster distances in t-SNE are NOT interpretable. Two large clusters and one small cluster does NOT mean the small cluster is less important.
  • The kl_divergence_ attribute after fitting is a convergence diagnostic — if it's still decreasing rapidly, increase n_iter.
  • For very large datasets (n > 50K), use method='barnes_hut' (default) and consider UMAP as a faster, often better alternative.
  • Not applying PCA before t-SNE on high-d data — dramatically slows computation and adds noise.
  • Running t-SNE only once and trusting the result — always run multiple times to check stability.
  • Interpreting cluster sizes as cluster importance — t-SNE distorts cluster sizes arbitrarily.
  • Using t-SNE coordinates as input features to a classifier — t-SNE has no transform() for new data and is stochastic.
  • Comparing inter-cluster distances across different t-SNE runs or perplexity settings — distances are not preserved.
  • Setting perplexity larger than n/3 — this violates t-SNE's assumptions and produces degenerate results.
07
🧠

High-dimensional embeddings (NLP, vision)

Excellent

t-SNE was designed for exactly this: visualizing dense, high-dimensional feature spaces like word embeddings, BERT representations, or CNN activations. The Gaussian affinities capture semantic similarity beautifully.

💡 Use cosine metric for NLP embeddings. Apply PCA(50) first if d > 50.
🧬

Genomics / single-cell RNA-seq

Excellent

Single-cell biology is t-SNE's most prominent scientific application. Thousands of genes per cell compressed to 2D for cell-type visualization. t-SNE is considered standard in computational biology.

💡 Use Scanpy (Python bioinformatics library) which has optimized t-SNE/UMAP for single-cell data.
📊

Small-medium tabular (n < 10K)

Good

Works well for exploring structure in tabular data with many features. Not necessary if d is already small (< 10) — just plot directly.

💡 Standard sklearn TSNE is feasible. Ensure features are scaled and consider PCA pre-reduction.
🗄️

Large dataset (n > 100K)

Poor

Even Barnes-Hut t-SNE is O(n log n) in time and O(n) in memory, making it slow at this scale. A 1M-point t-SNE can take hours.

💡 Switch to UMAP for large-scale data. If t-SNE is required, subsample to 50K points.
⚖️

Imbalanced multi-class data

Context-Dependent

t-SNE can reveal class imbalance visually — rare classes appear as small isolated clusters. But cluster SIZE in t-SNE is not proportional to class size.

💡 Color by class to check overlap. Use perplexity tuning — low perplexity reveals small cluster structure better.
📈

Time-series / sequential data

Poor

t-SNE does not understand temporal order — it treats each time step as an independent point. Trajectory structure is lost.

💡 Use UMAP with a temporal connectivity matrix, or embed sequences before applying t-SNE.
08

Mandatory Visual Blueprint

What should move

At least one parameter, threshold, split, cluster state, or metric should change interactively.

What to observe

The learner should see how the concept affects error, fit, grouping, or decision quality.

Planned visual type

Interactive chart, step animation, or side-by-side failure-mode comparison.

Reference image slot

If no live lab exists yet, attach a relevant diagram/reference image before marking the page complete.

Topic key: tsne

KL Divergence Convergence Curve

t-SNE minimizes KL(P‖Q) via gradient descent. The curve should drop steeply in the early exaggeration phase (first ~250 iterations), then continue decreasing more gradually. A flattened curve indicates convergence. If KL is still dropping at n_iter, increase the iteration count.

Gradient descent convergence — MSE decreasing over iterations

Perplexity Effect on Cluster Structure

Low perplexity (5) creates many small, tight, fragmented clusters. High perplexity (50) creates fewer, larger, more connected groups — more global structure. The 'right' perplexity reveals the structure you're looking for. Always compare multiple settings.

Scores are qualitative (1–10). The 'best' perplexity is dataset-dependent. Always try at least 3 values and report consistent findings.
09
  • Exceptional cluster visualization

    t-SNE consistently produces visually striking, well-separated cluster plots for real-world high-dimensional data. No other standard technique matches its ability to reveal non-linear cluster structure in 2D.

  • Handles non-linear structure

    Unlike PCA (which only captures linear variance), t-SNE can reveal manifold structure, curved clusters, and complex topologies that linear methods completely miss.

  • Adaptive bandwidth (perplexity)

    The per-point sigma adaptation means t-SNE handles datasets with varying local densities gracefully. Dense and sparse regions are both represented with appropriate neighborhood sizes.

  • No assumptions about cluster shape

    t-SNE makes no assumptions that clusters are spherical, convex, or linearly separable. It can reveal elongated, curved, or nested clusters.

  • Solves the crowding problem

    The Student-t distribution's heavy tails provide the extra repulsion needed to prevent all points from collapsing into the center of the 2D space — a fundamental limitation of the earlier SNE algorithm.

  • Scientifically validated and widely adopted

    t-SNE has been cited over 15,000 times and is the de facto visualization standard in computational biology, NLP, and computer vision research.

  • Non-deterministic and run-dependent

    With random initialization, each t-SNE run produces a different layout. Two valid t-SNE plots of the same data can look completely different. You must run multiple times and look for consistent structure.

  • Does not preserve global distances

    The distance between two clusters in a t-SNE plot is meaningless. Cluster A being 'closer' to cluster B than to cluster C in 2D says nothing about true high-d distances between the clusters.

  • No out-of-sample extension

    t-SNE has no transform() method. To embed a new point, you must refit the entire model on the extended dataset. This makes it impossible to use in production pipelines that require embedding new data.

  • Quadratic computational complexity

    Exact t-SNE is O(n²) in both time and memory. Even Barnes-Hut approximation is O(n log n). For n > 100K, runtime becomes prohibitive and UMAP is the practical choice.

  • Cluster sizes are meaningless artifacts

    The size of a cluster in t-SNE output does not correspond to the number of points in the cluster or its spread in high-d space. This is one of the most commonly misinterpreted aspects of t-SNE.

  • Sensitive to hyperparameter choices

    Different perplexity values can make the same dataset look like it has 5 or 20 clusters. Without domain knowledge, it can be unclear which structure is 'real'. Always report hyperparameters used.

  • Cannot be used as a preprocessing step

    Because t-SNE is stochastic, non-invertible, and distorts distances, using its output as features for a classifier is a bad practice. Use PCA or autoencoders for preprocessing.

10
Computational Biology

Single-cell RNA-seq cell type visualization

Each cell is a high-dimensional gene expression vector. t-SNE projects thousands of cells to 2D, revealing distinct cell-type clusters (T-cells, B-cells, stem cells) without knowing labels in advance. Standard in papers from major biology journals.

Natural Language Processing

Word and sentence embedding visualization

Visualizing word2vec, GloVe, or BERT embeddings in 2D. Semantically similar words cluster together — king/queen near royalty words, countries near their capitals. Used to validate embedding quality.

Computer Vision

CNN feature space exploration

Take activations from the penultimate layer of a trained ResNet and visualize with t-SNE. Images of the same category should cluster together. Useful for debugging misclassifications: points that appear in the wrong cluster are likely to be confused by the classifier.

Fraud Detection

Anomaly cluster visualization before labeling

Transaction embeddings visualized with t-SNE. Fraud transactions often form isolated clusters or appear at the periphery. t-SNE helps analysts identify which unlabeled clusters to prioritize for manual review.

Drug Discovery

Chemical space visualization

Molecular fingerprints (bit vectors of chemical substructures) projected to 2D. Compounds with similar biological activity cluster together. Guides lead optimization by showing which chemical neighborhoods have been explored.

Customer Analytics

Customer segmentation validation

After computing customer feature embeddings (RFM, behavioral), t-SNE verifies that K-Means cluster assignments make sense: do customers labeled as 'high value' form a tight distinct cluster? Used before presenting segments to business stakeholders.

11

t-SNE is one of several dimensionality reduction techniques. The right choice depends on your goal: visualization, preprocessing, or compression.

PCA (Principal Component Analysis)

Both reduce dimensionality; both can produce 2D outputs for visualization

PCA is linear, deterministic, fast, and preserves global variance. t-SNE is non-linear, stochastic, slow, and preserves local neighborhood structure. PCA has a transform() for new data; t-SNE does not.

PCA for preprocessing, feature reduction, or when you need an interpretable linear transformation. t-SNE for visualization when you suspect non-linear cluster structure.

UMAP (Uniform Manifold Approximation and Projection)

Both are non-linear, neighborhood-preserving, and produce 2D/3D visualizations

UMAP is much faster (O(n) after graph construction), better preserves global structure than t-SNE, supports out-of-sample embedding (transform()), and is more theoretically grounded (Riemannian geometry). t-SNE is older, more widely cited, and arguably produces tighter clusters for purely local visualization.

UMAP for large datasets (n > 50K), when you need transform() for new data, or when global structure matters. t-SNE for publication-quality 2D cluster visualizations on smaller datasets.

Autoencoders

Both learn non-linear low-dimensional representations

Autoencoders are trained to reconstruct input — they learn a 2D bottleneck that captures the information needed for reconstruction, not neighborhood structure. They have an encoder for new data. t-SNE optimizes purely for neighborhood preservation in 2D.

Autoencoders for preprocessing (their latent space can feed into classifiers), generative modeling, or when you need invertible compression. t-SNE only for visualization.

Isomap

Both are non-linear dimensionality reduction methods

Isomap preserves geodesic (manifold) distances — it's globally distance-preserving on manifolds. t-SNE only preserves local neighborhoods. Isomap is deterministic; t-SNE is stochastic. Isomap fails on non-convex manifolds with holes.

Isomap when the data lies on a smooth convex manifold and global distance preservation matters. t-SNE for cluster visualization without distance constraints.

Propertyt-SNEPCAUMAPAutoencoder
Preserves local structure✓ ExcellentPartial✓ GoodPartial
Preserves global structure✗ No✓ YesPartialPartial
Out-of-sample transform✗ No✓ Yes✓ Yes✓ Yes
Deterministic✗ No✓ Yes✗ No✗ No
Scalability (n > 100K)✗ Slow✓ Fast✓ Fast✓ Fast
Interpretable axes✗ No✓ Yes (PCs)✗ No✗ No
Best for visualization✓ BestLimited✓ GoodLimited

You want the best possible 2D cluster visualization for a dataset with n < 50K, you have time to run multiple seeds, and you don't need to embed new points or use the result as ML features.

12

KL Divergence (Convergence Diagnostic)

The optimization objective. Lower means the 2D embedding better matches the high-d neighborhood structure. Access via tsne.kl_divergence_ after fit. Used only to confirm convergence — lower is better but absolute values are not comparable across datasets.

Target: Depends on dataset; check that it has plateaued

Trustworthiness

Measures how often k-nearest neighbors in the embedding were also neighbors in high-d. T=1: embedding perfectly preserves local structure. T<0.9: many false neighbors introduced. Available in sklearn.manifold.trustworthiness.

Target: > 0.90

Continuity

Complement of trustworthiness: measures how often true high-d neighbors are preserved in the embedding (vs. trustworthiness which checks for false neighbors). Use both together for a complete picture.

Target: > 0.90

Silhouette Score (on embedding)

If you have labels, compute silhouette score on the 2D t-SNE coordinates using those labels. High score means labeled clusters are well-separated in the embedding — a good sign that t-SNE has captured meaningful structure. Not a quality metric for t-SNE itself, but for the downstream clustering.

Target: > 0.5 for well-separated clusters

  1. 01.1. Verify convergence: check tsne.kl_divergence_ — compare to earlier iterations (verbose=1).
  2. 02.2. Run with 3 different random_state values — consistent cluster topology means real structure.
  3. 03.3. Compute trustworthiness with sklearn.manifold.trustworthiness(X_pca, embedding, n_neighbors=10).
  4. 04.4. Color by known labels or metadata — does t-SNE cluster agree with ground truth?
  5. 05.5. Try 3 perplexity values (5, 30, 50) — consistent clusters across settings are more credible.
  6. 06.6. If you have clusters, compute silhouette score on the embedding to quantify separation.
  • Reporting t-SNE as evidence of a specific number of clusters — the number of visual clusters depends on perplexity, not just the data.
  • Comparing t-SNE plots from different runs without the same perplexity and n_iter — they are not comparable.
  • Using t-SNE coordinates as features in a downstream model — they are stochastic and have no out-of-sample meaning.
  • Concluding that two clusters are 'far apart' in high-d because they are far in the t-SNE plot — inter-cluster distances are not preserved.
  • Publishing a t-SNE without reporting perplexity, n_iter, and number of random seeds tested — the plot cannot be reproduced or interpreted.

Single-cell RNA-seq (3000 cells, 2000 genes): after PCA(50) and t-SNE(perplexity=30, n_iter=2000), the plot shows 8 distinct tight clusters. Colored by cell-type marker expression, cluster 3 is high for CD4 (T-helper cell marker) and cluster 7 is high for CD8 (cytotoxic T-cell marker). KL divergence: 1.23 (plateaued after 1500 iterations). Trustworthiness at k=10: 0.94. Consistent structure confirmed across 5 random seeds. Conclusion: the 8 clusters are real biological cell types, not t-SNE artifacts.

13
  • ×Believing cluster sizes in t-SNE represent the relative sizes of those clusters in real data.
  • ×Running t-SNE once and reporting the result without checking stability across random seeds.
  • ×Not applying PCA before t-SNE on high-dimensional data — adds noise and greatly slows computation.
  • ×Thinking the x-axis and y-axis of a t-SNE plot have meaningful units or directions — they don't.
  • ×Using t-SNE for preprocessing (feeding its output to a classifier) — it's visualization-only.
  • ×Not setting random_state — results differ every run, making debugging impossible.
  • ×Setting perplexity too high (close to n) or too low (< 5) without understanding the effect.
  • ×Using t-SNE on the raw high-d features when PCA pre-reduction would have improved both quality and speed.
  • ×Not checking whether KL divergence has converged (increase n_iter if it hasn't plateaued).
  • ×Calling tsne.transform(X_new) — t-SNE has no transform method; it will raise AttributeError.
  • ×Saying 't-SNE preserves distances' — it preserves neighborhoods, not distances. Global distances between clusters are meaningless.
  • ×Not knowing what perplexity controls or its typical range (5–50).
  • ×Saying t-SNE is 'better than PCA' without qualification — they have completely different purposes.
  • ×Not mentioning that t-SNE is non-deterministic when discussing reproducibility.
  • ×Claiming t-SNE can be used to embed new test points — it cannot without retraining.
  • ×Publishing t-SNE plots without reporting hyperparameters (perplexity, n_iter, random seed) — not reproducible.
  • ×Drawing conclusions about inter-cluster biological distance from t-SNE plots — distances between clusters are not meaningful.
  • ×Using t-SNE to 'prove' the number of clusters — the visible cluster count depends heavily on perplexity.
  • ×Applying t-SNE to n > 200K without switching to UMAP or subsampling — will run for hours or run out of memory.
  • ×Not doing a sanity-check with multiple perplexity values — a finding is only credible if it persists across settings.
14

What kind of bias does this model have?

Bias depends on distance and shape assumptions in feature space.

What kind of variance does it have?

Variance increases when cluster structure is unstable or high-dimensional.

How does it overfit?

Overfitting usually appears as strong train performance but weaker validation/test behavior.

How do we regularize it?

Use complexity constraints, robust validation, and data-centric cleanup.

What kind of data does it like?

Prefers scaled features with meaningful geometric distance.

What kind of data breaks it?

Breaks under leakage, severe distribution drift, noisy labels, and poorly engineered features.

14

Quick Revision Reference

  • t-SNE converts high-d pairwise distances to probabilities (Gaussian P) and minimizes KL(P‖Q) in 2D
  • Low-d similarities Q use Student-t (1 df) — heavy tails solve the crowding problem vs. SNE's Gaussian
  • Perplexity controls effective neighbor count; typical range 5–50; try multiple values
  • Early exaggeration (first ~250 iters) multiplies P by 12 to form well-separated initial clusters
  • KL divergence objective: heavily penalizes mapping true neighbors far apart; lenient about mapping non-neighbors close
  • Cluster sizes and inter-cluster distances in t-SNE output are NOT meaningful — only local topology is
  • Always: scale → PCA(50) → t-SNE; run multiple seeds; check convergence (kl_divergence_)
  • No out-of-sample transform() — cannot embed new points without refitting
  • For n > 50K, use UMAP instead
High-d similarity (Gaussian)
Low-d similarity (Student-t)
KL divergence objective
Gradient
  • 2D/3D visualization of high-dimensional cluster structure
  • Validating embeddings (NLP, vision, genomics)
  • Exploratory data analysis before clustering
  • Presenting intuitive visual summaries of complex data
  • n > 100K (use UMAP)
  • You need to embed new points (use PCA or UMAP)
  • You need reproducible deterministic output
  • You need to preserve global distances between clusters
  • You're preprocessing for ML (use PCA or autoencoders)
Explain the crowding problem and why Student-t solves it vs. SNE's Gaussian
Describe what perplexity controls and typical values
State clearly that cluster sizes and inter-cluster distances are not meaningful
Know the KL divergence objective and its asymmetry (preserves local, ignores global)
Compare t-SNE vs. PCA vs. UMAP: purpose, scalability, determinism, out-of-sample support
Know that t-SNE has no transform() method
15
16

These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.