In Plain English
t-SNE is a technique that takes data with hundreds or thousands of dimensions and crunches it down to 2D or 3D so you can plot it. It does this by preserving which points are neighbors of which — nearby points in high dimensions end up nearby in 2D.
Why It Exists
PCA was the standard dimensionality reduction tool for decades, but it only preserves global linear structure. Real data (gene expression, images, text embeddings) has complex non-linear cluster structure that PCA destroys. t-SNE was designed specifically to reveal this cluster structure visually.
Problem It Solves
Given a dataset of n points in d dimensions (d can be hundreds or thousands), find a 2D or 3D embedding where points that were neighbors in high-d are still neighbors in low-d. The goal is visual insight into cluster structure, not a general-purpose projection.
Real-Life Analogy
"Imagine you have thousands of people and you know, for each person, their 500 closest friends. You want to draw them all on a map such that best friends sit next to each other. t-SNE does exactly this: it builds a friendship map from high-dimensional proximity and then projects it onto paper while keeping friendships intact."
When To Use
- Exploratory data analysis — before clustering, to visually confirm there are distinct groups
- Visualizing embedding spaces (NLP tokens, CNN features, gene expression profiles)
- Debugging ML models — checking if learned representations make intuitive sense
- Class overlap analysis — do classes separable in high-d also look separable in 2D?
- Single-cell sequencing and other high-dimensional biology tasks
- Understanding anomalies — outlier points appear isolated in t-SNE plots
When NOT To Use
- When you need to preserve global distances (t-SNE does NOT preserve inter-cluster distances)
- When you need a deterministic, reproducible algorithm without random initialization
- When you need to project new points without refitting (t-SNE has no out-of-sample extension)
- Preprocessing for machine learning models — use PCA or autoencoders instead
- When dataset has > 100K points (t-SNE is O(n²) and very slow without approximations)
- When you need to interpret exact coordinate positions — t-SNE coordinates are meaningless
In high-dimensional space, t-SNE starts by computing a probability distribution over pairs of points: nearby points get high probability of being 'neighbors' and far-apart points get near-zero probability. This distribution is defined using a Gaussian kernel, and the width of the Gaussian (controlled by perplexity) is adapted per-point so that each point has approximately the same number of effective neighbors.
Then t-SNE creates a matching distribution over pairs of points in 2D using a Student-t distribution with one degree of freedom (which is actually just a Cauchy distribution). It optimizes the 2D coordinates so that this low-dimensional distribution is as similar as possible to the high-dimensional one. Similarity is measured by KL divergence.
The crucial design choice: using a Student-t distribution in low dimensions instead of a Gaussian. The Student-t has much heavier tails than a Gaussian. This means that in 2D, moderately distant points get much more repulsion than they would under a Gaussian. This solves the 'crowding problem': in high-d, moderately distant points can coexist in the crowded space, but in 2D there's no room for them — the heavy-tailed distribution gives them space by pushing them further away, naturally creating the crisp cluster separations t-SNE is famous for.
SNE (Stochastic Neighbor Embedding), the predecessor, used a Gaussian in both high-d and low-d. The crowding problem was severe: clusters collapsed into indistinct blobs. t-SNE's only major change from SNE was replacing the low-dimensional Gaussian with a Student-t. This single change dramatically improved visualization quality.
The Metaphor
"Think of social cliques in a school. In reality (high dimensions), students have many acquaintances at different levels of closeness — a rich social fabric. When you photograph the school (project to 2D), everyone gets squished together. A Gaussian lens would create an indistinct crowd. A Student-t lens (heavy tails) pushes acquaintances far away while keeping best friends glued together — creating the distinct friend-group clusters you'd expect."
Beginner Mental Model
Step 1: For every pair of points, compute how close they are in high-d (using Gaussian probabilities). Step 2: Place all points randomly in 2D. Step 3: Use gradient descent to move points in 2D so that their closeness pattern matches what it was in high-d. Points that should be close get attracted; points that should be far get repelled. After many iterations, you get a stable 2D picture where clusters in high-d appear as clusters in 2D.
Formal Definition
Given n points {x₁, …, xₙ} ∈ ℝᵈ, t-SNE finds a low-dimensional embedding {y₁, …, yₙ} ∈ ℝ² (or ℝ³) by minimizing the KL divergence between the high-dimensional joint probability distribution P (Gaussian-based) and the low-dimensional distribution Q (Student-t-based): C = KL(P‖Q) = Σᵢ Σⱼ pᵢⱼ log(pᵢⱼ / qᵢⱼ), where pᵢⱼ is the symmetrized probability of i and j being neighbors in high-d and qᵢⱼ is the probability in low-d.
Key Terms
- Perplexity
- A hyperparameter controlling the effective number of neighbors per point. Formally, perplexity = 2^H(Pᵢ) where H(Pᵢ) is the Shannon entropy of the conditional distribution for point i. Typical values: 5–50. Higher perplexity → more global structure; lower → more local.
- Crowding Problem
- In high dimensions, moderately distant points can be accommodated in the space around each point. When projected to 2D, there is not enough space for all these moderately distant points — they get crowded into the center. t-SNE's heavy-tailed Q distribution alleviates this by creating extra repulsion for mid-range points.
- SNE (Stochastic Neighbor Embedding)
- The predecessor to t-SNE (Hinton & Roweis, 2002). Uses Gaussian distributions in both high-d and low-d. Suffers from the crowding problem and harder optimization (asymmetric KL divergence). t-SNE (van der Maaten & Hinton, 2008) fixed both issues.
- Student-t Distribution (1 df)
- The probability distribution used for low-dimensional similarities in t-SNE. With 1 degree of freedom it reduces to the Cauchy distribution: q(|y|) ∝ (1 + |y|²)⁻¹. Its heavy tails allow points that are moderately far in high-d to be pushed much further apart in 2D.
- Early Exaggeration
- An optimization trick in t-SNE: during the first ~250 iterations, multiply all pᵢⱼ by a large constant (typically 4 or 12). This makes clusters attract strongly and repel more cleanly, forming well-separated groups early. Exaggeration is then turned off for the remainder of optimization.
- KL Divergence (Kullback-Leibler)
- The objective t-SNE minimizes: KL(P‖Q) = Σᵢⱼ pᵢⱼ log(pᵢⱼ/qᵢⱼ). It is asymmetric: it heavily penalizes modeling nearby high-d points as far apart in 2D (q≈0 when p is large). This asymmetry is what makes t-SNE preserve local neighborhoods over global structure.
- Barnes-Hut Approximation
- An O(n log n) approximate version of t-SNE that uses a quadtree to approximate gradient contributions from far-away points. sklearn's TSNE uses this by default (method='barnes_hut'), making it feasible for ~100K points.
Step-by-Step Working
- 1. Compute pairwise distances ||xᵢ - xⱼ||² for all pairs in high-dimensional space.
- 2. For each point i, find σᵢ (Gaussian bandwidth) such that the perplexity of the conditional distribution Pⱼ|ᵢ equals the target perplexity (binary search per point).
- 3. Compute conditional probabilities: pⱼ|ᵢ = exp(-||xᵢ - xⱼ||²/2σᵢ²) / Σₖ≠ᵢ exp(-||xᵢ - xₖ||²/2σᵢ²).
- 4. Symmetrize: pᵢⱼ = (pⱼ|ᵢ + pᵢ|ⱼ) / 2n. This is the high-d joint distribution P.
- 5. Initialize 2D embeddings yᵢ ~ N(0, 0.0001·I) (small random initialization).
- 6. Apply early exaggeration: multiply all pᵢⱼ by 4 (or 12) for the first ~250 iterations.
- 7. Compute low-d similarities: qᵢⱼ = (1 + ||yᵢ - yⱼ||²)⁻¹ / Σₖ≠ₗ (1 + ||yₖ - yₗ||²)⁻¹.
- 8. Compute gradient of KL divergence with respect to yᵢ.
- 9. Update yᵢ using gradient descent with momentum (Adam or SGD with momentum).
- 10. Turn off early exaggeration after ~250 iterations and continue optimizing for ~750 more iterations.
- 11. Return the final 2D coordinates {y₁, …, yₙ}.
Inputs
Feature matrix X ∈ ℝⁿˣᵈ (n samples, d features). Typically preprocessed with StandardScaler. For large d (> 50), it's common practice to first reduce to 50 dimensions with PCA before applying t-SNE.
Outputs
2D (or 3D) embedding matrix Y ∈ ℝⁿˣ² where each row is the 2D coordinate of the corresponding input point. These coordinates have no intrinsic interpretation — only relative positions matter.
Model Assumptions
Important Edge Cases
- ▸Very small datasets (n < 30): perplexity must be < n/3 or the algorithm degenerates; use PCA instead.
- ▸Very large datasets (n > 100K): standard t-SNE is O(n²) and infeasible; use Barnes-Hut approximation or switch to UMAP.
- ▸Perplexity > n/3: the binary search for σᵢ may not converge; sklearn will warn.
- ▸All points identical: trivial degenerate case — all similarities are equal and the embedding is random.
- ▸Pre-applying PCA with too few components: destroys information that t-SNE needs; 50 PCA components is standard.
Role in the ML Pipeline
t-SNE is an end-of-pipeline visualization tool. It is applied after all feature engineering, scaling, and optional PCA pre-reduction. It produces coordinates used only for plotting — never for downstream ML models.
Data Preprocessing
- 01.Standardize features with StandardScaler (zero mean, unit variance) — t-SNE is sensitive to scale differences.
- 02.For high-d data (d > 50), apply PCA first to reduce to 50 components. This denoises the data and speeds up t-SNE dramatically without losing cluster structure.
- 03.Handle missing values before t-SNE — impute with mean/median or use a dedicated imputer.
- 04.Remove or winsorize severe outliers — they can distort the neighborhood structure.
- 05.For sparse data (e.g., text TF-IDF), consider dense embedding (e.g., LSA/NMF) before t-SNE.
- 06.Normalize if features are on wildly different scales — this is captured by StandardScaler.
Training Process
- 01.Choose perplexity: start with 30 for most datasets; try 5, 30, and 50 to see how structure changes.
- 02.Set n_iter >= 1000 (default 1000 is usually sufficient; complex data may need 2000–5000).
- 03.Run t-SNE multiple times with different random seeds — results will differ; look for consistent structure.
- 04.Use early_exaggeration=12 (sklearn default) which was empirically found to work well.
- 05.Learning rate: sklearn default is 'auto' (= max(n/early_exaggeration/4, 50)) — rarely needs changing.
- 06.Inspect KL divergence at convergence (printed if verbose=1) — lower is better, but absolute value depends on dataset.
- 07.Color points by known labels/metadata to validate that t-SNE clusters align with ground truth.
Hyperparameters
Name
perplexity
Description
Controls the effective number of neighbors each point considers. Balances local vs. global structure. Related to bandwidth σᵢ via binary search.
Typical
30 (range: 5–50; for very large n try up to 100)
Name
n_iter
Description
Number of optimization iterations (gradient descent steps).
Typical
1000 (increase to 2000–5000 for complex datasets)
Name
learning_rate
Description
Step size for gradient descent. sklearn auto-computes a good default: max(n / early_exaggeration / 4, 50).
Typical
'auto' or 200
Name
early_exaggeration
Description
Multiplier applied to pᵢⱼ in the first phase of optimization to encourage well-separated clusters.
Typical
12 (sklearn default)
Name
n_components
Description
Dimensionality of the output embedding.
Typical
2 (for visualization); 3 for 3D plots
Name
metric
Description
Distance metric used for computing high-dimensional pairwise distances.
Typical
'euclidean' (default); 'cosine' for text/embeddings; 'precomputed' if you have custom distances
Implementation Checklist
- 1
pip install scikit-learn numpy matplotlib seaborn - 2
Load and explore data: df.shape, df.describe(), check for NaNs - 3
Preprocess: StandardScaler → (optional) PCA(n_components=50) - 4
Instantiate: tsne = TSNE(n_components=2, perplexity=30, n_iter=1000, random_state=42) - 5
Fit and transform: embedding = tsne.fit_transform(X_scaled) - 6
Plot: plt.scatter(embedding[:,0], embedding[:,1], c=labels, cmap='tab10') - 7
Repeat with different perplexity values (5, 30, 50) to verify robustness of cluster structure
1import numpy as np
2
3
4def compute_pairwise_distances(X):
5 """Squared Euclidean pairwise distances."""
6 sum_sq = np.sum(X ** 2, axis=1)
7 D = sum_sq[:, None] + sum_sq[None, :] - 2 * X @ X.T
8 np.fill_diagonal(D, 0.0)
9 return D
10
11
12def gaussian_affinities(D, perplexity=30.0, tol=1e-5, max_iter=50):
13 """
14 Compute symmetrized joint probabilities P via binary search for sigma.
15 Returns P (n x n) with zeros on diagonal.
16 """
17 n = D.shape[0]
18 P = np.zeros((n, n))
19 log_perp = np.log(perplexity)
20
21 for i in range(n):
22 beta_min, beta_max = -np.inf, np.inf
23 beta = 1.0 # beta = 1 / (2 * sigma_i^2)
24 Di = D[i, np.concatenate([np.arange(0, i), np.arange(i + 1, n)])]
25
26 for _ in range(max_iter):
27 exp_Di = np.exp(-Di * beta)
28 sum_exp = exp_Di.sum()
29 H = np.log(sum_exp + 1e-10) + beta * np.sum(Di * exp_Di) / (sum_exp + 1e-10)
30 H_diff = H - log_perp
31
32 if abs(H_diff) < tol:
33 break
34
35 if H_diff > 0: # entropy too high → increase beta (narrow Gaussian)
36 beta_min = beta
37 beta = (beta + (beta_max if beta_max != np.inf else beta * 2)) / 2
38 else: # entropy too low → decrease beta
39 beta_max = beta
40 beta = (beta + (beta_min if beta_min != -np.inf else beta / 2)) / 2
41
42 p_i = np.exp(-Di * beta)
43 p_i /= p_i.sum()
44 mask = np.concatenate([np.arange(0, i), np.arange(i + 1, n)])
45 P[i, mask] = p_i
46
47 # Symmetrize and normalize
48 P = (P + P.T) / (2 * n)
49 P = np.maximum(P, 1e-12) # numerical stability
50 return P
51
52
53def student_t_affinities(Y):
54 """Low-dim Student-t (1 df) joint probabilities Q."""
55 sum_sq = np.sum(Y ** 2, axis=1)
56 D = sum_sq[:, None] + sum_sq[None, :] - 2 * Y @ Y.T
57 np.fill_diagonal(D, 0.0)
58 Q_num = (1.0 + D) ** -1
59 np.fill_diagonal(Q_num, 0.0)
60 Q = Q_num / Q_num.sum()
61 Q = np.maximum(Q, 1e-12)
62 return Q, Q_num
63
64
65def tsne_gradient(P, Q, Q_num, Y):
66 """Compute gradient of KL(P||Q) w.r.t. Y."""
67 n = Y.shape[0]
68 PQ_diff = (P - Q)[:, :, None] # (n, n, 1)
69 Q_num_3d = Q_num[:, :, None] # (n, n, 1)
70 YY_diff = Y[:, None, :] - Y[None, :, :] # (n, n, 2)
71 # grad[i] = 4 * sum_j (p_ij - q_ij) * q_num_ij * (y_i - y_j)
72 grad = 4.0 * np.sum(PQ_diff * Q_num_3d * YY_diff, axis=1) # (n, 2)
73 return grad
74
75
76def tsne(X, n_components=2, perplexity=30.0, n_iter=1000,
77 learning_rate=200.0, early_exaggeration=12.0,
78 exaggeration_iter=250, random_state=42):
79 """
80 Simplified t-SNE implementation.
81 Note: O(n^2) — only suitable for n <= ~3000.
82 """
83 rng = np.random.RandomState(random_state)
84 n = X.shape[0]
85
86 # Step 1: High-dim affinities
87 print("Computing pairwise distances...")
88 D = compute_pairwise_distances(X)
89 print("Computing Gaussian affinities (binary search for sigma)...")
90 P = gaussian_affinities(D, perplexity=perplexity)
91
92 # Step 2: Initialize Y
93 Y = rng.randn(n, n_components) * 1e-4
94 Y_prev = Y.copy()
95 momentum = 0.5
96 gains = np.ones_like(Y)
97 kl_history = []
98
99 # Step 3: Gradient descent with momentum
100 print("Optimizing embedding...")
101 for t in range(1, n_iter + 1):
102 # Early exaggeration phase
103 P_eff = P * early_exaggeration if t <= exaggeration_iter else P
104 if t == exaggeration_iter + 1:
105 momentum = 0.8 # switch to higher momentum after exaggeration
106
107 Q, Q_num = student_t_affinities(Y)
108 grad = tsne_gradient(P_eff, Q, Q_num, Y)
109
110 # Adaptive learning rate per dimension (simplified)
111 gains = (gains + 0.2) * ((grad > 0) != (Y - Y_prev > 0)) + (gains * 0.8) * ((grad > 0) == (Y - Y_prev > 0))
112 gains = np.maximum(gains, 0.01)
113
114 Y_new = Y - learning_rate * gains * grad + momentum * (Y - Y_prev)
115 Y_prev = Y.copy()
116 Y = Y_new
117
118 # Track KL divergence
119 if t % 100 == 0:
120 kl = np.sum(P * np.log(P / Q))
121 kl_history.append((t, kl))
122 print(f" Iter {t:4d} | KL divergence: {kl:.4f}")
123
124 return Y, kl_history
125
126
127# ── Demo ──────────────────────────────────────────────────────────────────────
128from sklearn.datasets import load_digits
129from sklearn.preprocessing import StandardScaler
130from sklearn.decomposition import PCA
131
132digits = load_digits()
133X = digits.data # (1797, 64)
134y = digits.target
135
136# Preprocess: scale then PCA to 30 dims (standard practice)
137X_scaled = StandardScaler().fit_transform(X)
138X_pca = PCA(n_components=30, random_state=42).fit_transform(X_scaled)
139
140# Run t-SNE on a subset for speed (our implementation is O(n^2))
141idx = np.random.RandomState(0).choice(len(X_pca), 500, replace=False)
142embedding, kl_hist = tsne(X_pca[idx], perplexity=30, n_iter=1000,
143 learning_rate=200, random_state=42)
144
145print(f"\nFinal embedding shape: {embedding.shape}")
146print(f"Final KL divergence: {kl_hist[-1][1]:.4f}")
147# Expect: 10 distinct clusters (digits 0-9) visible in a scatter plotSample Input
X = digits.data # shape (1797, 64) — pixel intensities of handwritten digits y = digits.target # shape (1797,) — digit labels 0-9
Sample Output
embedding shape: (1797, 2) KL divergence at convergence: ~0.85 Plot shows 10 distinct clusters, one per digit class, with digit '1' and '7' sometimes overlapping (visually similar strokes)
Key Implementation Insights
- →Always apply PCA (n_components=50) before t-SNE when d > 50. This reduces runtime significantly and often improves embedding quality by denoising.
- →Run t-SNE multiple times with different random seeds. If the same cluster structure appears consistently, it reflects real structure. If it changes every run, the structure may be an artifact.
- →init='pca' is more stable than init='random' and typically requires fewer iterations to converge.
- →Cluster sizes and inter-cluster distances in t-SNE are NOT interpretable. Two large clusters and one small cluster does NOT mean the small cluster is less important.
- →The kl_divergence_ attribute after fitting is a convergence diagnostic — if it's still decreasing rapidly, increase n_iter.
- →For very large datasets (n > 50K), use method='barnes_hut' (default) and consider UMAP as a faster, often better alternative.
Common Implementation Mistakes
- ✗Not applying PCA before t-SNE on high-d data — dramatically slows computation and adds noise.
- ✗Running t-SNE only once and trusting the result — always run multiple times to check stability.
- ✗Interpreting cluster sizes as cluster importance — t-SNE distorts cluster sizes arbitrarily.
- ✗Using t-SNE coordinates as input features to a classifier — t-SNE has no transform() for new data and is stochastic.
- ✗Comparing inter-cluster distances across different t-SNE runs or perplexity settings — distances are not preserved.
- ✗Setting perplexity larger than n/3 — this violates t-SNE's assumptions and produces degenerate results.
High-dimensional embeddings (NLP, vision)
t-SNE was designed for exactly this: visualizing dense, high-dimensional feature spaces like word embeddings, BERT representations, or CNN activations. The Gaussian affinities capture semantic similarity beautifully.
Genomics / single-cell RNA-seq
Single-cell biology is t-SNE's most prominent scientific application. Thousands of genes per cell compressed to 2D for cell-type visualization. t-SNE is considered standard in computational biology.
Small-medium tabular (n < 10K)
Works well for exploring structure in tabular data with many features. Not necessary if d is already small (< 10) — just plot directly.
Large dataset (n > 100K)
Even Barnes-Hut t-SNE is O(n log n) in time and O(n) in memory, making it slow at this scale. A 1M-point t-SNE can take hours.
Imbalanced multi-class data
t-SNE can reveal class imbalance visually — rare classes appear as small isolated clusters. But cluster SIZE in t-SNE is not proportional to class size.
Time-series / sequential data
t-SNE does not understand temporal order — it treats each time step as an independent point. Trajectory structure is lost.
Mandatory Visual Blueprint
What should move
At least one parameter, threshold, split, cluster state, or metric should change interactively.
What to observe
The learner should see how the concept affects error, fit, grouping, or decision quality.
Planned visual type
Interactive chart, step animation, or side-by-side failure-mode comparison.
Reference image slot
If no live lab exists yet, attach a relevant diagram/reference image before marking the page complete.
Topic key: tsne
KL Divergence Convergence Curve
t-SNE minimizes KL(P‖Q) via gradient descent. The curve should drop steeply in the early exaggeration phase (first ~250 iterations), then continue decreasing more gradually. A flattened curve indicates convergence. If KL is still dropping at n_iter, increase the iteration count.
Gradient descent convergence — MSE decreasing over iterations
Perplexity Effect on Cluster Structure
Low perplexity (5) creates many small, tight, fragmented clusters. High perplexity (50) creates fewer, larger, more connected groups — more global structure. The 'right' perplexity reveals the structure you're looking for. Always compare multiple settings.
Advantages
Exceptional cluster visualization
t-SNE consistently produces visually striking, well-separated cluster plots for real-world high-dimensional data. No other standard technique matches its ability to reveal non-linear cluster structure in 2D.
Handles non-linear structure
Unlike PCA (which only captures linear variance), t-SNE can reveal manifold structure, curved clusters, and complex topologies that linear methods completely miss.
Adaptive bandwidth (perplexity)
The per-point sigma adaptation means t-SNE handles datasets with varying local densities gracefully. Dense and sparse regions are both represented with appropriate neighborhood sizes.
No assumptions about cluster shape
t-SNE makes no assumptions that clusters are spherical, convex, or linearly separable. It can reveal elongated, curved, or nested clusters.
Solves the crowding problem
The Student-t distribution's heavy tails provide the extra repulsion needed to prevent all points from collapsing into the center of the 2D space — a fundamental limitation of the earlier SNE algorithm.
Scientifically validated and widely adopted
t-SNE has been cited over 15,000 times and is the de facto visualization standard in computational biology, NLP, and computer vision research.
Limitations
Non-deterministic and run-dependent
With random initialization, each t-SNE run produces a different layout. Two valid t-SNE plots of the same data can look completely different. You must run multiple times and look for consistent structure.
Does not preserve global distances
The distance between two clusters in a t-SNE plot is meaningless. Cluster A being 'closer' to cluster B than to cluster C in 2D says nothing about true high-d distances between the clusters.
No out-of-sample extension
t-SNE has no transform() method. To embed a new point, you must refit the entire model on the extended dataset. This makes it impossible to use in production pipelines that require embedding new data.
Quadratic computational complexity
Exact t-SNE is O(n²) in both time and memory. Even Barnes-Hut approximation is O(n log n). For n > 100K, runtime becomes prohibitive and UMAP is the practical choice.
Cluster sizes are meaningless artifacts
The size of a cluster in t-SNE output does not correspond to the number of points in the cluster or its spread in high-d space. This is one of the most commonly misinterpreted aspects of t-SNE.
Sensitive to hyperparameter choices
Different perplexity values can make the same dataset look like it has 5 or 20 clusters. Without domain knowledge, it can be unclear which structure is 'real'. Always report hyperparameters used.
Cannot be used as a preprocessing step
Because t-SNE is stochastic, non-invertible, and distorts distances, using its output as features for a classifier is a bad practice. Use PCA or autoencoders for preprocessing.
Single-cell RNA-seq cell type visualization
Each cell is a high-dimensional gene expression vector. t-SNE projects thousands of cells to 2D, revealing distinct cell-type clusters (T-cells, B-cells, stem cells) without knowing labels in advance. Standard in papers from major biology journals.
Word and sentence embedding visualization
Visualizing word2vec, GloVe, or BERT embeddings in 2D. Semantically similar words cluster together — king/queen near royalty words, countries near their capitals. Used to validate embedding quality.
CNN feature space exploration
Take activations from the penultimate layer of a trained ResNet and visualize with t-SNE. Images of the same category should cluster together. Useful for debugging misclassifications: points that appear in the wrong cluster are likely to be confused by the classifier.
Anomaly cluster visualization before labeling
Transaction embeddings visualized with t-SNE. Fraud transactions often form isolated clusters or appear at the periphery. t-SNE helps analysts identify which unlabeled clusters to prioritize for manual review.
Chemical space visualization
Molecular fingerprints (bit vectors of chemical substructures) projected to 2D. Compounds with similar biological activity cluster together. Guides lead optimization by showing which chemical neighborhoods have been explored.
Customer segmentation validation
After computing customer feature embeddings (RFM, behavioral), t-SNE verifies that K-Means cluster assignments make sense: do customers labeled as 'high value' form a tight distinct cluster? Used before presenting segments to business stakeholders.
t-SNE is one of several dimensionality reduction techniques. The right choice depends on your goal: visualization, preprocessing, or compression.
PCA (Principal Component Analysis)
Similarity
Both reduce dimensionality; both can produce 2D outputs for visualization
Key Difference
PCA is linear, deterministic, fast, and preserves global variance. t-SNE is non-linear, stochastic, slow, and preserves local neighborhood structure. PCA has a transform() for new data; t-SNE does not.
Choose When
PCA for preprocessing, feature reduction, or when you need an interpretable linear transformation. t-SNE for visualization when you suspect non-linear cluster structure.
UMAP (Uniform Manifold Approximation and Projection)
Similarity
Both are non-linear, neighborhood-preserving, and produce 2D/3D visualizations
Key Difference
UMAP is much faster (O(n) after graph construction), better preserves global structure than t-SNE, supports out-of-sample embedding (transform()), and is more theoretically grounded (Riemannian geometry). t-SNE is older, more widely cited, and arguably produces tighter clusters for purely local visualization.
Choose When
UMAP for large datasets (n > 50K), when you need transform() for new data, or when global structure matters. t-SNE for publication-quality 2D cluster visualizations on smaller datasets.
Autoencoders
Similarity
Both learn non-linear low-dimensional representations
Key Difference
Autoencoders are trained to reconstruct input — they learn a 2D bottleneck that captures the information needed for reconstruction, not neighborhood structure. They have an encoder for new data. t-SNE optimizes purely for neighborhood preservation in 2D.
Choose When
Autoencoders for preprocessing (their latent space can feed into classifiers), generative modeling, or when you need invertible compression. t-SNE only for visualization.
Isomap
Similarity
Both are non-linear dimensionality reduction methods
Key Difference
Isomap preserves geodesic (manifold) distances — it's globally distance-preserving on manifolds. t-SNE only preserves local neighborhoods. Isomap is deterministic; t-SNE is stochastic. Isomap fails on non-convex manifolds with holes.
Choose When
Isomap when the data lies on a smooth convex manifold and global distance preservation matters. t-SNE for cluster visualization without distance constraints.
| Property | t-SNE | PCA | UMAP | Autoencoder |
|---|---|---|---|---|
| Preserves local structure | ✓ Excellent | Partial | ✓ Good | Partial |
| Preserves global structure | ✗ No | ✓ Yes | Partial | Partial |
| Out-of-sample transform | ✗ No | ✓ Yes | ✓ Yes | ✓ Yes |
| Deterministic | ✗ No | ✓ Yes | ✗ No | ✗ No |
| Scalability (n > 100K) | ✗ Slow | ✓ Fast | ✓ Fast | ✓ Fast |
| Interpretable axes | ✗ No | ✓ Yes (PCs) | ✗ No | ✗ No |
| Best for visualization | ✓ Best | Limited | ✓ Good | Limited |
Choose t-SNE when:
You want the best possible 2D cluster visualization for a dataset with n < 50K, you have time to run multiple seeds, and you don't need to embed new points or use the result as ML features.
KL Divergence (Convergence Diagnostic)
The optimization objective. Lower means the 2D embedding better matches the high-d neighborhood structure. Access via tsne.kl_divergence_ after fit. Used only to confirm convergence — lower is better but absolute values are not comparable across datasets.
Target: Depends on dataset; check that it has plateaued
Trustworthiness
Measures how often k-nearest neighbors in the embedding were also neighbors in high-d. T=1: embedding perfectly preserves local structure. T<0.9: many false neighbors introduced. Available in sklearn.manifold.trustworthiness.
Target: > 0.90
Continuity
Complement of trustworthiness: measures how often true high-d neighbors are preserved in the embedding (vs. trustworthiness which checks for false neighbors). Use both together for a complete picture.
Target: > 0.90
Silhouette Score (on embedding)
If you have labels, compute silhouette score on the 2D t-SNE coordinates using those labels. High score means labeled clusters are well-separated in the embedding — a good sign that t-SNE has captured meaningful structure. Not a quality metric for t-SNE itself, but for the downstream clustering.
Target: > 0.5 for well-separated clusters
Evaluation Process
- 01.1. Verify convergence: check tsne.kl_divergence_ — compare to earlier iterations (verbose=1).
- 02.2. Run with 3 different random_state values — consistent cluster topology means real structure.
- 03.3. Compute trustworthiness with sklearn.manifold.trustworthiness(X_pca, embedding, n_neighbors=10).
- 04.4. Color by known labels or metadata — does t-SNE cluster agree with ground truth?
- 05.5. Try 3 perplexity values (5, 30, 50) — consistent clusters across settings are more credible.
- 06.6. If you have clusters, compute silhouette score on the embedding to quantify separation.
Evaluation Traps
- ▸Reporting t-SNE as evidence of a specific number of clusters — the number of visual clusters depends on perplexity, not just the data.
- ▸Comparing t-SNE plots from different runs without the same perplexity and n_iter — they are not comparable.
- ▸Using t-SNE coordinates as features in a downstream model — they are stochastic and have no out-of-sample meaning.
- ▸Concluding that two clusters are 'far apart' in high-d because they are far in the t-SNE plot — inter-cluster distances are not preserved.
- ▸Publishing a t-SNE without reporting perplexity, n_iter, and number of random seeds tested — the plot cannot be reproduced or interpreted.
Real-World Interpretation Example
Single-cell RNA-seq (3000 cells, 2000 genes): after PCA(50) and t-SNE(perplexity=30, n_iter=2000), the plot shows 8 distinct tight clusters. Colored by cell-type marker expression, cluster 3 is high for CD4 (T-helper cell marker) and cluster 7 is high for CD8 (cytotoxic T-cell marker). KL divergence: 1.23 (plateaued after 1500 iterations). Trustworthiness at k=10: 0.94. Consistent structure confirmed across 5 random seeds. Conclusion: the 8 clusters are real biological cell types, not t-SNE artifacts.
Students
- ×Believing cluster sizes in t-SNE represent the relative sizes of those clusters in real data.
- ×Running t-SNE once and reporting the result without checking stability across random seeds.
- ×Not applying PCA before t-SNE on high-dimensional data — adds noise and greatly slows computation.
- ×Thinking the x-axis and y-axis of a t-SNE plot have meaningful units or directions — they don't.
- ×Using t-SNE for preprocessing (feeding its output to a classifier) — it's visualization-only.
Developers
- ×Not setting random_state — results differ every run, making debugging impossible.
- ×Setting perplexity too high (close to n) or too low (< 5) without understanding the effect.
- ×Using t-SNE on the raw high-d features when PCA pre-reduction would have improved both quality and speed.
- ×Not checking whether KL divergence has converged (increase n_iter if it hasn't plateaued).
- ×Calling tsne.transform(X_new) — t-SNE has no transform method; it will raise AttributeError.
In Interviews
- ×Saying 't-SNE preserves distances' — it preserves neighborhoods, not distances. Global distances between clusters are meaningless.
- ×Not knowing what perplexity controls or its typical range (5–50).
- ×Saying t-SNE is 'better than PCA' without qualification — they have completely different purposes.
- ×Not mentioning that t-SNE is non-deterministic when discussing reproducibility.
- ×Claiming t-SNE can be used to embed new test points — it cannot without retraining.
Real Projects
- ×Publishing t-SNE plots without reporting hyperparameters (perplexity, n_iter, random seed) — not reproducible.
- ×Drawing conclusions about inter-cluster biological distance from t-SNE plots — distances between clusters are not meaningful.
- ×Using t-SNE to 'prove' the number of clusters — the visible cluster count depends heavily on perplexity.
- ×Applying t-SNE to n > 200K without switching to UMAP or subsampling — will run for hours or run out of memory.
- ×Not doing a sanity-check with multiple perplexity values — a finding is only credible if it persists across settings.
What kind of bias does this model have?
Bias depends on distance and shape assumptions in feature space.
What kind of variance does it have?
Variance increases when cluster structure is unstable or high-dimensional.
How does it overfit?
Overfitting usually appears as strong train performance but weaker validation/test behavior.
How do we regularize it?
Use complexity constraints, robust validation, and data-centric cleanup.
What kind of data does it like?
Prefers scaled features with meaningful geometric distance.
What kind of data breaks it?
Breaks under leakage, severe distribution drift, noisy labels, and poorly engineered features.
Quick Revision Reference
Key Takeaways
- t-SNE converts high-d pairwise distances to probabilities (Gaussian P) and minimizes KL(P‖Q) in 2D
- Low-d similarities Q use Student-t (1 df) — heavy tails solve the crowding problem vs. SNE's Gaussian
- Perplexity controls effective neighbor count; typical range 5–50; try multiple values
- Early exaggeration (first ~250 iters) multiplies P by 12 to form well-separated initial clusters
- KL divergence objective: heavily penalizes mapping true neighbors far apart; lenient about mapping non-neighbors close
- Cluster sizes and inter-cluster distances in t-SNE output are NOT meaningful — only local topology is
- Always: scale → PCA(50) → t-SNE; run multiple seeds; check convergence (kl_divergence_)
- No out-of-sample transform() — cannot embed new points without refitting
- For n > 50K, use UMAP instead
Critical Formulas
Best For
- ✓2D/3D visualization of high-dimensional cluster structure
- ✓Validating embeddings (NLP, vision, genomics)
- ✓Exploratory data analysis before clustering
- ✓Presenting intuitive visual summaries of complex data
Avoid When
- ✗n > 100K (use UMAP)
- ✗You need to embed new points (use PCA or UMAP)
- ✗You need reproducible deterministic output
- ✗You need to preserve global distances between clusters
- ✗You're preprocessing for ML (use PCA or autoencoders)
Interview Must-Know
These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.