ML Atlas

UMAP

Topology-powered manifold learning that preserves local and global structure — faster and more faithful than t-SNE.

IntermediateUnsupervisedDim. Reduction
38 min read
Linear algebra: matrix operations, distances, dot productsBasic topology concepts: open sets, neighborhoods (optional but helpful)Probability: probability distributions, entropyFamiliarity with t-SNE or PCA is helpful but not required
  • Single-cell RNA sequencing visualization (standard in scRNA-seq pipelines via Seurat, Scanpy)
  • Drug discovery: projecting molecular embeddings to explore chemical space
  • NLP: visualizing word/sentence embeddings from BERT and GPT models
  • Anomaly detection: projecting network traffic logs to find outlier clusters
  • Image recognition: visualizing CNN feature spaces to understand learned representations
  • Recommendation systems: mapping user/item embeddings to understand preference clusters
01

In Plain English

UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that takes high-dimensional data and compresses it into 2 or 3 dimensions while preserving the relationships between data points — both the tight local clusters and the broader global arrangement. It does this by modeling the data as a fuzzy topological structure and then finding a low-dimensional representation with a matching topological structure.

Why It Exists

t-SNE was the dominant non-linear visualization method, but it has serious limitations: it is extremely slow on large datasets, it destroys global structure (clusters are placed arbitrarily), and it cannot be used to transform new data points. UMAP was developed in 2018 by McInnes, Healy, and Melville to address all of these limitations using a rigorous mathematical foundation from algebraic topology and Riemannian geometry.

Problem It Solves

Given a dataset with d features (often hundreds or thousands), find a 2D or 3D representation where points that are similar in the original high-dimensional space remain neighbors in the low-dimensional space — preserving not just which points are nearby (local structure) but also the relative arrangement of clusters (global structure) — all within seconds rather than hours.

Real-Life Analogy

"Imagine you have a massive library where every book is connected to its most similar books by colored threads. UMAP first maps out this network of relationships — which books form tight clusters, which clusters are related to each other, how isolated certain topics are. Then it recreates this entire relationship map on a 2D floor plan, placing books so nearby books remain neighbors and the spatial layout of groups mirrors the original topic structure. t-SNE makes a beautiful floor plan of each island in isolation but arranges the islands randomly. UMAP tries to also get the map of the whole archipelago right."

When To Use

  • Large datasets (> 10,000 samples) where t-SNE is too slow
  • When you need to transform new/unseen data points using the learned projection
  • When global structure (relative cluster positions, cluster sizes) matters for interpretation
  • Exploratory data analysis of high-dimensional embeddings (NLP, images, genomics)
  • Pre-processing before clustering algorithms — UMAP projections often improve k-means and HDBSCAN results
  • When you need a reproducible layout (set random_state) for reporting or publishing

When NOT To Use

  • When you need a fully interpretable, invertible linear transformation (use PCA)
  • When exact distances in the embedding are critical — UMAP distances are not directly interpretable
  • Very small datasets (< 50 samples) — n_neighbors must be smaller than n_samples, and manifold estimates are unreliable
  • When computational resources are extremely constrained and linear PCA suffices
  • When you need a probability model of the data (use VAE or GMM instead)
02

Every dataset, no matter how many dimensions it lives in, tends to cluster on a lower-dimensional surface — a manifold. Consider 1,000-pixel face images: there are vastly fewer face configurations than the 1,000-dimensional pixel space allows. The actual face data lives on a curved surface (a manifold) embedded in that high-dimensional space. UMAP's goal is to find and flatten this manifold into 2D.

To do this, UMAP first builds a model of the manifold using fuzzy set theory. For each data point, it finds its k nearest neighbors and constructs a 'fuzzy' connection to them — points that are very close get a connection strength near 1, points at the boundary of the neighborhood get a strength near 0, and points beyond k neighbors get 0. The result is a weighted graph called the fuzzy simplicial complex, a topological object that captures the shape of the manifold. Crucially, UMAP normalizes these connection strengths so that each point's neighborhood has the same total 'volume' — this is the Riemannian metric adaptation that makes UMAP uniform.

Then UMAP initializes a low-dimensional layout (usually via spectral embedding or PCA) and optimizes it using a cross-entropy loss between the high-dimensional fuzzy graph and an equivalent fuzzy graph built in the low-dimensional space. Unlike t-SNE's KL divergence, this cross-entropy has both an attractive force (pulling neighbors together) and a repulsive force (pushing non-neighbors apart). The optimization uses stochastic gradient descent with negative sampling, which is why UMAP is orders of magnitude faster than t-SNE.

The Metaphor

"Imagine a crumpled piece of paper (your high-dimensional manifold). UMAP first traces all the connections between nearby points — even around curves and folds. Then it carefully unfolds the paper flat, preserving the neighborhood connections as faithfully as possible. t-SNE does something similar but cuts the paper into islands and doesn't care where the islands land relative to each other. UMAP tries to unfold the whole sheet without cutting."

Beginner Mental Model

Think of UMAP as building a social network graph of your data: each point is friends with its k nearest neighbors, with friendship strength based on distance. Then UMAP recreates this social network in 2D — friends stay near friends. The 'uniform' part means UMAP adjusts for local density, so sparse regions aren't over-compressed and dense regions aren't over-expanded.

03

UMAP constructs a fuzzy topological representation of the data. For each point xᵢ, it finds k nearest neighbors and computes a local metric ρᵢ (distance to nearest neighbor) and σᵢ (bandwidth chosen so that Σⱼ exp(-(d(xᵢ,xⱼ) - ρᵢ)/σᵢ) = log₂(k)). High-dimensional edge weights are: wᵢⱼ = exp(-(d(xᵢ,xⱼ) - ρᵢ)/σᵢ). The symmetric fuzzy union is: w̃ᵢⱼ = wᵢⱼ + wⱼᵢ - wᵢⱼwⱼᵢ. In low-dimensional space, edge weights are: v̂ᵢⱼ = (1 + a·||yᵢ - yⱼ||²ᵇ)⁻¹. UMAP minimizes the cross-entropy C = -Σᵢⱼ [w̃ᵢⱼ log(v̂ᵢⱼ) + (1 - w̃ᵢⱼ) log(1 - v̂ᵢⱼ)] via stochastic gradient descent with negative sampling.

Manifold
A topological space that locally resembles Euclidean space. High-dimensional data typically lies on a low-dimensional manifold. A sphere is a 2D manifold in 3D space — its intrinsic dimensionality is 2 even though it exists in 3D.
Fuzzy Simplicial Set
A mathematical object from algebraic topology. In UMAP, it's a weighted graph where edge weights represent the probability (or degree of membership) that two points are connected in the topological sense. Values are in [0,1] — a 'fuzzy' version of a graph.
Riemannian Metric
A way of measuring distances that can vary from point to point. UMAP assumes the data manifold has a Riemannian metric — distances near dense regions are scaled differently than distances in sparse regions. This is how UMAP handles varying data density.
n_neighbors
The number of nearest neighbors used to construct the local manifold approximation. Controls the tradeoff between local and global structure. Small values emphasize local detail; large values preserve broader structure.
min_dist
Minimum distance between points in the low-dimensional embedding. Controls how tightly points cluster together. Small values pack clusters tightly; large values create a more uniform spread.
Spectral Embedding
The default initialization for UMAP — uses eigenvectors of the graph Laplacian of the fuzzy simplicial complex to place points in the low-dimensional space before optimization begins. Better than random initialization.
Cross-Entropy Loss
The objective UMAP minimizes. It measures how different the high-dimensional and low-dimensional fuzzy graphs are. Unlike t-SNE's KL divergence, cross-entropy penalizes both attractive (pulling nearby points together) and repulsive (pushing distant points apart) forces.
Negative Sampling
A trick from word2vec used in UMAP's optimization. Instead of computing repulsion against all non-neighbor pairs (expensive), UMAP samples a small number of 'negative' (non-neighbor) pairs for each update. This is the key to UMAP's speed advantage.
  1. 1. Find k nearest neighbors for each point using a fast approximate search (typically NN-descent or PyNNDescent). Store distances d(xᵢ, xⱼ).
  2. 2. Compute local metric: ρᵢ = distance to xᵢ's nearest neighbor (makes self-loops zero). Compute σᵢ by binary search to satisfy: Σⱼ exp(-(d(xᵢ,xⱼ) - ρᵢ)/σᵢ) = log₂(k).
  3. 3. Compute directed edge weights: wᵢⱼ = exp(-(max(0, d(xᵢ,xⱼ) - ρᵢ))/σᵢ). Symmetrize via fuzzy set union: w̃ᵢⱼ = wᵢⱼ + wⱼᵢ - wᵢⱼwⱼᵢ.
  4. 4. Initialize low-dimensional layout Y ∈ ℝⁿˣ² using spectral embedding of the graph Laplacian (or PCA as a fallback for large n).
  5. 5. Fit parameters a, b from min_dist using a non-linear least squares fit to (1 + a·x²ᵇ)⁻¹ — these define the low-dimensional similarity curve shape.
  6. 6. Optimize layout via stochastic gradient descent: for each epoch, sample edges proportional to w̃ᵢⱼ. Apply attractive gradient to pull yᵢ toward yⱼ. Apply repulsive gradient using n_negative_samples randomly chosen non-neighbor points.
  7. 7. Run for n_epochs (default 200 for large data, 500 for small). Learning rate decays over epochs.
  8. 8. Return final Y (n × n_components) as the low-dimensional embedding.

Numeric data matrix X ∈ ℝⁿˣᵈ. Can handle large n (millions) with approximate nearest neighbors. Also accepts precomputed distance matrices (metric='precomputed'). Features should be scaled.

Low-dimensional embedding Y ∈ ℝⁿˣᵐ where m is typically 2 or 3. The fitted UMAP object can also transform new data via transform(). Does not natively output explained variance.

01The data lies on (or near) a low-dimensional Riemannian manifold embedded in high-dimensional space.
02The manifold is locally connected — nearby points in high-dimensional space are genuinely similar, not just accidentally close.
03Local metric assumption: the data is locally uniformly distributed on the manifold (uniform manifold approximation). In practice this is violated, but UMAP corrects via the σᵢ bandwidth.
04Euclidean distance is a meaningful measure of similarity in the input space (or you specify a different metric).
05k (n_neighbors) neighbors are sufficient to capture local manifold structure — too few and you miss structure, too many and you blend distinct manifold regions.
  • n_neighbors ≥ n_samples: raises an error. With small datasets, n_neighbors must be set to less than n_samples.
  • Disconnected data components: UMAP handles disconnected graphs by separately embedding each component and then placing them with some global coherence. But distances between components are less reliable.
  • All identical points: distance is zero everywhere — UMAP produces degenerate output. Check for duplicate rows before running.
  • Very high-dimensional sparse data (e.g., bag-of-words): cosine or hellinger metric often works better than Euclidean in sparse spaces.
  • Extremely large datasets (n > 1M): use UMAP with low_memory=True and approximate neighbor search. Consider training on a subsample and using transform() for the rest.
04

UMAP sits after feature preprocessing (scaling, encoding) and before visualization or clustering. Unlike PCA, UMAP is primarily used for visualization and cluster-structure exploration rather than as a preprocessing step for supervised models — embedding distances are not directly meaningful for regression or classification. However, UMAP-reduced features can improve density-based clustering (HDBSCAN) significantly.

  • 01.Scale features: StandardScaler or MinMaxScaler before UMAP. If features are on very different scales, the distance computation is dominated by large-scale features. Exception: for count data (e.g., RNA-seq), log-normalization is preferred over standardization.
  • 02.Handle missing values: UMAP does not tolerate NaN. Impute missing values before running.
  • 03.Remove exact duplicates: duplicate rows cause degenerate nearest-neighbor graphs. Use df.drop_duplicates() first.
  • 04.For text/categorical data: encode to numeric (TF-IDF, embeddings, ordinal) before UMAP. Or use UMAP with metric='cosine' for TF-IDF vectors.
  • 05.For count data (genomics, NLP): consider log1p transform (np.log1p(X)) before UMAP to compress skewed distributions.
  • 06.For pre-computed similarities: pass metric='precomputed' and provide an n×n distance matrix directly.
  • 01.Fit UMAP on training data: reducer = umap.UMAP(...); embedding = reducer.fit_transform(X_train).
  • 02.For reproducibility: always set random_state. UMAP's SGD has randomness from negative sampling and initialization.
  • 03.Explore n_neighbors first: start with 15 (default), try 5 and 50 to understand local vs. global tradeoff.
  • 04.Tune min_dist: try 0.0 (tightest clusters), 0.1 (default), and 0.5 (more spread). Depends on whether you want tight clusters or a smooth layout.
  • 05.Use transform() for new data: reducer.transform(X_new) applies the learned manifold model to new points without refitting.
  • 06.For supervised UMAP: pass y=labels to fit_transform() — UMAP will use label information to pull same-class points together.
  • 07.Validate by overlaying known labels on the embedding: colored scatter plots should show separable, coherent clusters if structure is real.

n_neighbors

Number of nearest neighbors for local manifold approximation. The most important hyperparameter.

5–50. Default: 15. Use 2–10 for fine local structure, 30–100 for broad global structure.

min_dist

Minimum distance between embedded points. Controls cluster compactness.

0.0–0.5. Default: 0.1. Use 0.0 for very tight clusters, 0.5 for smoother layouts.

n_components

Dimensionality of the target embedding space.

2 (visualization), 3 (interactive 3D), 10–50 (preprocessing for clustering).

metric

Distance metric used in high-dimensional space.

'euclidean' (default), 'cosine' (text/sparse), 'correlation' (gene expression), 'manhattan', 'hamming' (binary data).

n_epochs

Number of training epochs for the SGD optimization.

Default: 200 for n > 10K, 500 for smaller datasets.

random_state

Seed for random number generation.

Always set to an integer (e.g., 42) for reproducibility.

  1. 1pip install umap-learn numpy scikit-learn matplotlib
  2. 2Scale your data: StandardScaler().fit_transform(X)
  3. 3Import: import umap
  4. 4Initialize: reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
  5. 5Fit and embed: embedding = reducer.fit_transform(X_scaled)
  6. 6Visualize: plt.scatter(embedding[:,0], embedding[:,1], c=labels, cmap='tab10')
  7. 7Tune: try n_neighbors in [5, 15, 30, 50] and min_dist in [0.0, 0.1, 0.5] — inspect visually
  8. 8For new data: new_embedding = reducer.transform(X_new_scaled)
05
06
python
1import numpy as np
2from sklearn.neighbors import NearestNeighbors
3from scipy.optimize import curve_fit
4from scipy.sparse import csr_matrix
5
6# ─────────────────────────────────────────────────────────────────────────────
7# NOTE: This is a pedagogical implementation — it captures UMAP's core ideas
8# but lacks the C++-level optimizations of the umap-learn library.
9# For production, always use: pip install umap-learn
10# ─────────────────────────────────────────────────────────────────────────────
11
12def compute_high_dim_weights(X: np.ndarray, n_neighbors: int):
13    """Build the fuzzy simplicial set for the high-dimensional data."""
14    n = X.shape[0]
15
16    # Step 1: Find k nearest neighbors
17    nn = NearestNeighbors(n_neighbors=n_neighbors + 1, metric="euclidean")
18    nn.fit(X)
19    distances, indices = nn.kneighbors(X)
20    # Remove self (index 0)
21    distances = distances[:, 1:]   # (n, k)
22    indices   = indices[:, 1:]     # (n, k)
23
24    # Step 2: Compute rho_i = distance to nearest neighbor
25    rho = distances[:, 0]          # (n,)
26
27    # Step 3: Compute sigma_i via binary search
28    target = np.log2(n_neighbors)
29    sigma  = np.ones(n)
30    for i in range(n):
31        lo, hi = 0.0, 1e8
32        d = distances[i]           # distances to k neighbors
33        for _ in range(64):        # binary search iterations
34            mid = (lo + hi) / 2.0
35            val = np.sum(np.exp(-(np.maximum(0, d - rho[i])) / mid))
36            if np.abs(val - target) < 1e-5:
37                sigma[i] = mid
38                break
39            if val > target:
40                hi = mid
41            else:
42                lo = mid
43            sigma[i] = mid
44
45    # Step 4: Compute directed edge weights w_ij
46    rows, cols, data = [], [], []
47    for i in range(n):
48        for j_idx, j in enumerate(indices[i]):
49            w = np.exp(-max(0, distances[i, j_idx] - rho[i]) / sigma[i])
50            rows.append(i); cols.append(j); data.append(w)
51
52    W = csr_matrix((data, (rows, cols)), shape=(n, n))
53
54    # Step 5: Symmetrize via fuzzy set union: w̃_ij = w_ij + w_ji - w_ij * w_ji
55    W_T = W.T
56    W_sym = W + W_T - W.multiply(W_T)
57    return W_sym.toarray()
58
59
60def fit_ab_params(min_dist: float):
61    """Fit the a,b parameters for the low-dimensional similarity curve."""
62    def curve(x, a, b):
63        return 1.0 / (1.0 + a * x ** (2 * b))
64
65    x = np.linspace(0, 3 * min_dist, 300)
66    # Target: 1 if x < min_dist, smooth decay otherwise
67    y = np.where(x < min_dist, 1.0, np.exp(-(x - min_dist)))
68    (a, b), _ = curve_fit(curve, x, y, p0=[1.0, 1.0])
69    return a, b
70
71
72def low_dim_weight(yi: np.ndarray, yj: np.ndarray, a: float, b: float) -> float:
73    """Compute low-dimensional similarity between two embedded points."""
74    d2 = np.sum((yi - yj) ** 2)
75    return 1.0 / (1.0 + a * d2 ** b)
76
77
78def umap_fit(X: np.ndarray, n_neighbors: int = 15, min_dist: float = 0.1,
79             n_components: int = 2, n_epochs: int = 200,
80             learning_rate: float = 1.0, random_state: int = 42):
81    """Simplified UMAP fitting — captures the algorithm structure."""
82    np.random.seed(random_state)
83    n = X.shape[0]
84
85    # Phase 1: Build high-dimensional fuzzy graph
86    print("Building high-dimensional fuzzy graph...")
87    W = compute_high_dim_weights(X, n_neighbors)
88
89    # Phase 2: Initialize embedding (random for simplicity; real UMAP uses spectral)
90    Y = np.random.uniform(-10, 10, size=(n, n_components))
91
92    # Phase 3: Fit a, b parameters
93    a, b = fit_ab_params(min_dist)
94    print(f"Fitted a={a:.3f}, b={b:.3f} for min_dist={min_dist}")
95
96    # Phase 4: SGD optimization (simplified — real UMAP uses negative sampling)
97    n_neg_samples = 5
98    lr = learning_rate
99
100    print(f"Optimizing embedding for {n_epochs} epochs...")
101    for epoch in range(n_epochs):
102        # Decay learning rate
103        lr = learning_rate * (1.0 - epoch / n_epochs)
104
105        for i in range(n):
106            for j in range(n):
107                if W[i, j] == 0:
108                    continue
109                w_ij = W[i, j]
110
111                # Attractive gradient
112                d2 = np.sum((Y[i] - Y[j]) ** 2)
113                v = 1.0 / (1.0 + a * d2 ** b)
114                # d/dY_i of -w_ij * log(v) — attractive push toward j
115                grad = (2 * a * b * d2 ** (b - 1) * w_ij * v) * (Y[i] - Y[j])
116                Y[i] -= lr * grad
117
118                # Repulsive gradient against random non-neighbors
119                for _ in range(n_neg_samples):
120                    k = np.random.randint(0, n)
121                    if k == i or W[i, k] > 0.01:
122                        continue
123                    d2k = np.sum((Y[i] - Y[k]) ** 2) + 1e-6
124                    vk = 1.0 / (1.0 + a * d2k ** b)
125                    # Repulsive: d/dY_i of -(1-0)*log(1-vk)
126                    grad_rep = (2 * b * a * d2k ** (b - 1) * (1 - w_ij) * vk**2)                                / ((1 - vk + 1e-6) * (1 + a * d2k ** b))
127                    Y[i] += lr * (Y[i] - Y[k]) * grad_rep
128
129        if (epoch + 1) % 50 == 0:
130            print(f"  Epoch {epoch + 1}/{n_epochs} complete")
131
132    return Y
133
134
135# ── Demo ──────────────────────────────────────────────────────────────────────
136if __name__ == "__main__":
137    # Use umap-learn for any real work — this is for understanding only
138    from sklearn.datasets import make_blobs
139    X, y = make_blobs(n_samples=150, centers=4, n_features=10, random_state=42)
140
141    from sklearn.preprocessing import StandardScaler
142    X_scaled = StandardScaler().fit_transform(X)
143
144    # This simplified version is very slow — use umap-learn in practice
145    # embedding = umap_fit(X_scaled, n_neighbors=10, min_dist=0.1, n_epochs=100)
146
147    # Equivalent with umap-learn (fast, production-quality):
148    import umap
149    reducer = umap.UMAP(n_neighbors=10, min_dist=0.1, n_components=2, random_state=42)
150    embedding = reducer.fit_transform(X_scaled)
151
152    import matplotlib.pyplot as plt
153    plt.figure(figsize=(7, 5))
154    scatter = plt.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap="tab10", s=30)
155    plt.colorbar(scatter, label="Cluster")
156    plt.title("UMAP: 10D → 2D"); plt.tight_layout()
157    plt.savefig("umap_demo.png", dpi=120)
158    print("Saved umap_demo.png")
The pedagogical implementation shows the three core phases: (1) building the high-dimensional fuzzy graph with per-point bandwidths σᵢ, (2) symmetrizing via fuzzy union, and (3) SGD optimization with attractive and repulsive forces. The real umap-learn library replaces the pure Python loops with C++ extensions and uses the efficient PyNNDescent library for approximate nearest neighbors, making it 10-100× faster.
X_train: shape (1437, 64) — flattened 8×8 digit images, scaled. y_train: integer labels 0–9.
train_embedding: shape (1437, 2) — 2D UMAP coordinates. Digit clusters clearly separated in 2D. reducer.transform(X_test): shape (360, 2) — test points projected into the learned space.
  • Always fit on training data only, then use reducer.transform() for test/new data. Fitting on test data leaks its distribution into the projection.
  • Set random_state for reproducibility — UMAP uses stochastic SGD. Without it, layouts will differ between runs (all valid, but not reproducible).
  • StandardScaler before UMAP is almost always necessary. UMAP's nearest-neighbor search is distance-based — unscaled features dominate.
  • n_neighbors is the most important parameter. Start with 15 and explore 5, 30, 50 to understand the data's structure at different scales.
  • UMAP distances in the embedding are NOT directly interpretable. Two clusters being 'farther apart' in the embedding does NOT reliably indicate they are more different. Only neighborhood structure is preserved.
  • For clustering applications, use n_components=10–50 (not 2) for UMAP before HDBSCAN — higher dimensions preserve more structure for the clusterer.
  • Supervised UMAP (pass y= to fit_transform) dramatically tightens same-class clusters and is useful for metric learning / embedding fine-tuning.
  • Not scaling data before UMAP — nearest neighbors computed on raw features with different scales give meaningless results.
  • Calling reducer.fit_transform(X_test) instead of reducer.transform(X_test) — refitting on test data is data leakage.
  • Interpreting distances between UMAP clusters as meaningful — UMAP is a topological method, not a metric one. Cluster size and distance in the 2D embedding do not reflect true cluster size or dissimilarity.
  • Using UMAP features directly for supervised classification — UMAP was not designed to produce features for regression/classification. Use PCA or raw features instead.
  • Forgetting random_state and being confused by different-looking layouts between runs — all are valid projections of the same manifold.
  • Applying UMAP to categorical features without encoding them numerically first.
  • Using the default metric='euclidean' for cosine-similarity data (text, high-dim sparse embeddings) — use metric='cosine' for these.
07
🚀

Large High-Dimensional Data (n > 10K, d > 100)

Excellent

UMAP's primary use case. Scales to millions of samples with approximate nearest neighbors. Handles hundreds or thousands of dimensions efficiently. Produces interpretable cluster structure in 2D.

💡 For n > 1M, use low_memory=True and consider fitting on a representative subsample then transforming the rest.
🧬

Single-Cell RNA Sequencing

Excellent

The de-facto standard for scRNA-seq visualization. Typically applied after PCA (top 50 PCs) — UMAP(X_pca) not UMAP(X_raw). Reveals cell type clusters, developmental trajectories, and rare populations.

💡 Use metric='correlation' or 'cosine' for raw count data. Apply log-normalization before PCA before UMAP.
📋

Small Dataset (n < 100)

Poor

UMAP requires n_neighbors < n_samples. With very few samples, the manifold estimate is unreliable and the embedding is dominated by initialization noise. Use PCA or t-SNE for small data.

💡 Set n_neighbors to at most n//3. Consider PCA for n < 100.
💬

Text Embeddings (BERT, GPT)

Excellent

High-dimensional dense embeddings (768D from BERT, 1536D from OpenAI) contain semantic structure that UMAP reveals beautifully. Topic clusters, sentiment gradients, and domain clusters emerge.

💡 Use metric='cosine' for sentence embeddings. Normalize vectors to unit length before UMAP for best results.
📈

Time Series Data

Context-Dependent

UMAP treats rows as independent samples — temporal ordering is ignored. If the time series is represented as feature vectors (e.g., statistical features per window), UMAP can visualize clusters of similar patterns.

💡 For sequential data, consider dynamic time warping distance: metric='dtw' (requires tslearn). Or extract features first.
🖼️

Image Data (raw pixels)

Good

Works well for moderate-resolution images. Better to apply UMAP after CNN feature extraction than on raw pixels — CNN features encode semantic similarity that pixel-level UMAP misses.

💡 For raw images > 32×32, use PCA first (top 50–100 PCs) then UMAP on PCs to speed up neighbor computation.
08

Mandatory Visual Blueprint

What should move

At least one parameter, threshold, split, cluster state, or metric should change interactively.

What to observe

The learner should see how the concept affects error, fit, grouping, or decision quality.

Planned visual type

Interactive chart, step animation, or side-by-side failure-mode comparison.

Reference image slot

If no live lab exists yet, attach a relevant diagram/reference image before marking the page complete.

Topic key: umap

UMAP n_neighbors Effect

As n_neighbors increases from 5 to 50, the embedding transitions from fragmented local clusters (high n_neighbors=5) to a smoother layout showing how clusters relate globally (n_neighbors=50). Small n_neighbors reveals fine intra-cluster structure; large n_neighbors shows inter-cluster relationships.

Three side-by-side scatter plots: n_neighbors=5 (scattered, many tiny islands), n_neighbors=15 (distinct but connected clusters), n_neighbors=50 (continuous manifold with fuzzy cluster boundaries).

UMAP min_dist Effect

min_dist controls how tightly points pack inside clusters. At min_dist=0.0 points compress into dense dots. At min_dist=0.5 clusters spread into diffuse clouds. The global arrangement (which cluster is next to which) remains stable across min_dist settings.

Comparison visualization data is documented in this section.

UMAP vs t-SNE: Runtime Scaling

UMAP scales approximately O(n^1.14) while t-SNE scales O(n log n) for the Barnes-Hut approximation but with a much larger constant. For 100K samples UMAP takes ~60s; t-SNE takes ~20 minutes. At 1M samples, t-SNE is infeasible while UMAP completes in ~10 minutes.

Gradient descent convergence — MSE decreasing over iterations

09
  • Preserves both local and global structure

    Unlike t-SNE, which only preserves local neighborhood relationships and places clusters arbitrarily, UMAP's cross-entropy objective has explicit repulsive forces that maintain the relative arrangement of clusters. Clusters that are genuinely distant in high-dimensional space tend to appear more separated in UMAP layouts.

  • Fast — scales to millions of samples

    UMAP uses approximate nearest neighbor search (PyNNDescent) and negative sampling in SGD, making it 10-100x faster than t-SNE for large datasets. Running in O(n^1.14) time, UMAP handles 1M+ samples in minutes where t-SNE would take days.

  • Can transform new data points

    The fitted UMAP object has a transform() method that projects new, unseen data into the learned embedding space. t-SNE has no equivalent — every run is a complete re-computation. This makes UMAP usable in inference pipelines for online visualization or anomaly detection.

  • Supports any distance metric

    UMAP accepts a metric parameter: euclidean, cosine, manhattan, correlation, hamming, jaccard, precomputed (distance matrix), and custom functions. This makes it applicable to genomics (correlation metric), text (cosine), binary data (jaccard), and custom domains.

  • Deterministic with random_state

    Setting random_state produces identical embeddings across runs. This is critical for reproducible research, consistent dashboards, and comparing embeddings before and after data updates.

  • Supports supervised and semi-supervised modes

    Passing y= (labels) to fit_transform() enables supervised UMAP, which pulls same-class points together and separates different classes. A hybrid mode (partial labels) achieves metric learning that t-SNE or PCA cannot.

  • Embedding distances are not interpretable

    UMAP preserves topology (which points are neighbors), not metric distances. Two clusters appearing far apart in the 2D embedding does not reliably mean they are more different than two nearby clusters. Never interpret absolute embedding coordinates or inter-cluster distances as meaningful quantities.

  • Stochastic — different runs produce different-looking layouts

    Without random_state, each fit produces a different valid embedding (all topologically equivalent but visually distinct). This confuses practitioners who expect a unique answer. Always set random_state in any production or publication context.

  • Hyperparameter sensitive — requires tuning

    n_neighbors and min_dist dramatically change the visual appearance. There is no single 'correct' setting — the right choice depends on the data and analysis goal. Multiple runs with different settings are needed to build trust in the structure.

  • No native explained variance metric

    Unlike PCA, UMAP has no explained variance ratio or reconstruction error. There is no quantitative measure of how much structure is preserved — you can use trustworthiness or continuity metrics from sklearn, but they are less interpretable than PCA's EVR.

  • Slower than PCA, not a general preprocessing step

    UMAP takes seconds to minutes even on modest datasets. PCA takes milliseconds. UMAP cannot replace PCA as a general preprocessing step for supervised models — it should be used for visualization and cluster exploration, not as a feature extractor for regression/classification.

  • Black-box manifold — no component interpretability

    UMAP dimensions 1 and 2 have no interpretable meaning (unlike PCA's principal components). You cannot examine 'UMAP-1 loadings' to understand which features drive the layout. The manifold structure is implicit in the optimization.

10
Genomics / Bioinformatics

Single-cell RNA sequencing cell type visualization

Standard scRNA-seq analysis pipeline: normalize → log-transform → PCA (50 PCs) → UMAP(n_components=2). UMAP reveals discrete cell types, developmental trajectories, and rare cell populations in a scatter plot. Tools: Scanpy (Python) and Seurat (R) both use UMAP as the default visualization method.

NLP / Machine Learning

Embedding space exploration for language models

Project 768D BERT embeddings or 1536D OpenAI embeddings to 2D with UMAP. Reveals semantic clusters, cross-lingual alignment, and topic structure. Used to debug embedding quality, identify domain gaps, and understand model behavior before fine-tuning.

Drug Discovery / Chemistry

Chemical space visualization and diversity analysis

Molecular fingerprints (2048-bit binary vectors) are projected with UMAP (metric='jaccard' for bit vectors) to map chemical space. Chemists identify structurally diverse compounds, visualize SAR (structure-activity relationships), and guide library design.

Cybersecurity

Network intrusion detection and anomaly visualization

High-dimensional network flow features (packet sizes, timing, ports) are embedded with UMAP for real-time anomaly detection dashboards. Normal traffic forms a compact cluster; attacks appear as outliers or new clusters. UMAP's transform() enables online monitoring.

Computer Vision

CNN feature space debugging and active learning

Features from a CNN's penultimate layer (512–2048D) are projected with UMAP to visualize which images the network considers similar. Reveals failure modes (misclassified images clustering together), guides active learning sample selection, and helps explain model decisions.

11

UMAP is most often compared to t-SNE (its predecessor for visualization) and PCA (the standard linear method). Understanding the precise tradeoffs determines when to use each.

t-SNE

Both are non-linear methods for visualizing high-dimensional data in 2D/3D

t-SNE uses Gaussian kernel in high-dim and Student-t kernel in low-dim, minimizing KL divergence. It has no repulsive term balancing global structure — clusters are placed arbitrarily. t-SNE is slower (O(n log n) with BH), has no transform() for new data, and produces different layouts for different perplexity values.

t-SNE when you only care about tight local cluster structure and have < 50K samples. UMAP in all other cases — it is faster, more informative about global structure, and can transform new points.

PCA

Both reduce dimensionality from d to k (typically 2-3 for visualization)

PCA is linear, deterministic, fast, and interpretable (components have loadings). UMAP is non-linear, stochastic, and captures curved manifold structure PCA cannot. PCA is invertible with a computable reconstruction error; UMAP is not. PCA scales to any k; UMAP is best at 2-3.

PCA when you need linear compression, interpretability, invertibility, or a preprocessing step for supervised models. UMAP when the data has non-linear manifold structure and visualization is the goal.

TSNE (sklearn)

Same conceptual goal as umap-learn's UMAP

sklearn's t-SNE implementation cannot transform new data and must recompute on any new dataset. UMAP's transform() projects new points without refitting.

Use UMAP whenever you need to project new data, need speed, or want global structure. Use t-SNE only for historical comparison or when a specific publication requires it.

Autoencoder (Variational)

Both learn non-linear compressed representations

VAEs learn a generative model with a probabilistic latent space. They require training data, architecture choices, and significant compute. UMAP is training-free (no neural network), faster, and more reliably produces interpretable 2D visualizations. VAEs can generate new data; UMAP cannot.

VAE when you need generation, interpolation, or a principled probabilistic latent space. UMAP for fast exploration and visualization of existing data.

PropertyUMAPt-SNEPCAVAE
Non-linear✓ Yes✓ Yes✗ No✓ Yes
Preserves global structure✓ Partial✗ No✓ Variance-basedDepends
Scalability (n > 100K)✓ Yes✗ Slow✓ Yes✓ Yes
transform() for new data✓ Yes✗ No✓ Yes✓ Yes
Interpretable components✗ No✗ No✓ YesPartial
Reproducible (same seed)✓ Yes✓ Yes✓ Yes✓ Yes
Supervised mode✓ YesPartial✗ No (→LDA)✓ Yes
Requires training✗ No NN✗ No NN✗ No NN✓ Neural Net

Data has non-linear manifold structure, you need fast visualization of large datasets, you need to project new data into the learned embedding, or global cluster arrangement matters alongside local cluster tightness.

12

Trustworthiness

Measures how many k-nearest neighbors in the embedding were not neighbors in the original space (false positives). Values close to 1 mean the embedding rarely introduces fake neighborhoods. Compute with sklearn.manifold.trustworthiness.

Target: > 0.90 is good; > 0.95 is excellent

Continuity

Measures how many k-nearest neighbors in the original space were lost in the embedding (false negatives). High continuity means the embedding doesn't tear apart genuine neighborhoods.

Target: > 0.90. Trustworthiness and continuity together give a balanced view of embedding quality.

Cluster Separation (Silhouette Score)

After UMAP + clustering, compute silhouette score on the UMAP embedding. Values close to 1 indicate well-separated clusters. Used to compare n_neighbors settings — higher silhouette = better cluster structure revealed.

Target: > 0.5 indicates meaningful cluster structure

Linear Discriminant Accuracy (Label Separation)

Train a simple linear classifier (e.g., LogisticRegression) on the 2D UMAP embedding. If known labels separate well linearly, the embedding preserves class structure. A useful sanity check when ground truth labels are available.

Target: Should approach the best non-linear classifier accuracy on original data if embedding is good

  1. 01.1. Visually inspect the 2D scatter plot with known labels or metadata overlaid — are related points near each other?
  2. 02.2. Compute trustworthiness at k = n_neighbors using sklearn.manifold.trustworthiness(X_original, X_embedded, n_neighbors).
  3. 03.3. Try 2–3 different n_neighbors settings (5, 15, 50) and compare trustworthiness — pick the setting with best score for your k of interest.
  4. 04.4. If you have labels: check silhouette score after UMAP + k-means or HDBSCAN. Compare to silhouette on raw or PCA-reduced features.
  5. 05.5. Check for topology artifacts: isolated singletons, thin 'bridges' connecting clusters, or clusters that fragment with different random_state values.
  6. 06.6. For stability: run UMAP 3 times with different random_state seeds. If cluster membership (which points are nearest neighbors) is highly consistent, the structure is real.
  • Interpreting the absolute positions or inter-cluster distances in the UMAP embedding as meaningful — only local neighborhood structure is reliably preserved.
  • Concluding that two clusters are 'more similar' because they appear close in the UMAP — relative cluster positions in UMAP are only partially meaningful.
  • Running UMAP once and publishing the result without checking stability — run multiple seeds and confirm cluster membership is consistent.
  • Using UMAP embeddings as features for supervised regression or classification — they are optimized for topology preservation, not prediction.
  • Comparing UMAP runs with different n_neighbors directly — they are different projections and cannot be overlaid or compared point-by-point.

scRNA-seq dataset: 8,000 cells, 2,000 highly variable genes. After PCA (50 PCs) + UMAP (n_neighbors=30, min_dist=0.3): trustworthiness=0.94, continuity=0.91. Three major clusters visible in 2D plot, confirmed by marker gene expression (cluster 1 = T-cells, cluster 2 = B-cells, cluster 3 = monocytes). Silhouette score = 0.58 on UMAP embedding, vs. 0.32 on PCA embedding — UMAP better reveals cell type structure. A small isolated cluster (n=45 cells) confirmed as NK cells by CD56 marker — a rare population invisible in PCA.

13
  • ×Treating UMAP as a feature engineering step for supervised models — UMAP creates visualization embeddings, not predictive features. Train classifiers on original features or PCA features, not UMAP coordinates.
  • ×Interpreting axis values: UMAP-1 and UMAP-2 are arbitrary. Unlike PCA's PC1 (most variance), UMAP's axes have no meaning — you can rotate the entire embedding without changing its validity.
  • ×Assuming larger n_neighbors is always better — large n_neighbors blurs local structure and eventually produces a layout similar to PCA. Match n_neighbors to the scale of structure you care about.
  • ×Running UMAP without scaling data first and being confused by results dominated by one high-scale feature.
  • ×Not setting random_state and being surprised by different-looking results in production vs. development.
  • ×Calling reducer.fit_transform(X_test) instead of reducer.transform(X_test) — accidentally refitting the manifold model on test data.
  • ×Using UMAP at default n_components=2 before HDBSCAN clustering — use n_components=10 or higher for clustering applications to preserve more structure.
  • ×Not handling the metric parameter: using Euclidean distance on cosine-similarity data (embeddings, text) leads to suboptimal neighbor graphs.
  • ×Forgetting that umap-learn requires separate installation (pip install umap-learn) — it is not part of scikit-learn.
  • ×Saying 'UMAP preserves distances' — UMAP preserves topology (neighborhood relationships), not distances. A key distinction examiners test.
  • ×Claiming UMAP is deterministic — it is stochastic. Only becomes deterministic with random_state set.
  • ×Confusing n_neighbors with the number of clusters — n_neighbors controls neighborhood size for manifold estimation, not the number of output groups.
  • ×Not knowing the difference between UMAP and t-SNE regarding global structure and transform() capability — these are the most common comparative interview questions.
  • ×Publishing UMAP figures without reporting n_neighbors, min_dist, metric, and random_state — unreproducible results are a growing concern in papers using UMAP.
  • ×Using a single UMAP run to 'confirm' biological or scientific structure without stability analysis.
  • ×Applying UMAP to raw count matrices (scRNA-seq, NLP) without log-normalization — skewed distributions lead to meaningless neighbor graphs.
  • ×Assuming UMAP cluster positions are comparable across different datasets or conditions — UMAP projections are not aligned across fits, so comparing embeddings from Dataset A and Dataset B directly is invalid without alignment.
14

What kind of bias does this model have?

Bias depends on distance and shape assumptions in feature space.

What kind of variance does it have?

Variance increases when cluster structure is unstable or high-dimensional.

How does it overfit?

Overfitting usually appears as strong train performance but weaker validation/test behavior.

How do we regularize it?

Use complexity constraints, robust validation, and data-centric cleanup.

What kind of data does it like?

Prefers scaled features with meaningful geometric distance.

What kind of data breaks it?

Breaks under leakage, severe distribution drift, noisy labels, and poorly engineered features.

14

Quick Revision Reference

  • UMAP builds a fuzzy topological graph of high-dim data (fuzzy simplicial set) and finds a low-dim embedding with matching topology
  • Key formula: wᵢⱼ = exp(-(d(xᵢ,xⱼ) - ρᵢ)/σᵢ), symmetrized by fuzzy union: w̃ᵢⱼ = wᵢⱼ + wⱼᵢ - wᵢⱼwⱼᵢ
  • Objective: cross-entropy between high-dim weights (w̃ᵢⱼ) and low-dim weights (v̂ᵢⱼ) — both attractive and repulsive forces
  • n_neighbors controls local vs global structure; min_dist controls cluster compactness
  • ALWAYS scale data first; ALWAYS set random_state; NEVER call fit_transform on test data
  • Embedding distances are NOT interpretable — only neighborhood structure is preserved
  • UMAP can transform new data via reducer.transform(X_new); t-SNE cannot
  • For clustering: use n_components=10–50 before HDBSCAN, not n_components=2
High-dim edge weight
Fuzzy union (symmetrize)
Low-dim similarity
UMAP loss (cross-entropy)
Bandwidth constraint
  • Visualizing large high-dimensional datasets (n > 10K)
  • Exploring cluster structure in embeddings (NLP, genomics, images)
  • Non-linear manifold data where PCA fails
  • Any setting requiring online transform() of new data points
  • Preprocessing before density-based clustering (HDBSCAN)
  • Features for supervised regression or classification (use PCA or raw features)
  • Exact distance or geometry preservation is required
  • Very small datasets (n < 100)
  • Need for interpretable embedding components (use PCA)
  • Computational resources extremely constrained — PCA is much faster
Explain fuzzy simplicial sets — UMAP builds a probabilistic graph, not a hard k-NN graph
Explain the cross-entropy loss: attractive (pull neighbors close) + repulsive (push non-neighbors apart) = preserves local AND global
Compare UMAP vs. t-SNE: UMAP is faster, has transform(), preserves global structure, uses cross-entropy not KL
n_neighbors tradeoff: small = fine local clusters, large = broader global layout
Why embedding distances are not meaningful: UMAP is topology-preserving, not metric-preserving
15
16

These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.