ML Atlas

Mean Shift

Non-parametric density mode seeking — clusters emerge from the data itself.

IntermediateUnsupervisedClustering
18 min read
k-meanshierarchical-clustering
  • Image segmentation and color quantization
  • Object tracking in computer vision
  • Traffic density analysis
  • Medical image analysis (MRI/CT segmentation)
  • Geospatial hotspot detection
01

In Plain English

Mean Shift finds clusters by sliding a window toward the densest region of data points. Each point repeatedly moves toward the average of its neighbors until it converges — those convergence points become cluster centers.

Why It Exists

K-Means requires you to specify k and assumes spherical clusters. Mean Shift discovers k automatically by following the density gradient — it finds however many modes the data has.

Problem It Solves

Automatic cluster count discovery in datasets where density varies and cluster shapes are irregular.

Real-Life Analogy

"Imagine dropping marbles on a hilly landscape (probability density). Each marble rolls uphill to the nearest peak. Different peaks = different clusters."

When To Use

  • When you don't know how many clusters exist
  • Image segmentation tasks
  • Cluster shapes are not spherical
  • Small to medium datasets (N < 10k)
  • When cluster boundaries should follow density contours

When NOT To Use

  • Large datasets — O(n²) per iteration
  • Very high-dimensional data (bandwidth selection fails)
  • When clusters have uniform density
  • Real-time applications requiring speed
02

Mean Shift treats the data as samples from an unknown probability density function. The algorithm's goal is to find the modes (peaks) of this density.

A kernel function (usually Gaussian) is placed at each data point. The sum of all kernels defines the density surface. Mean Shift iteratively moves each point toward the gradient of this surface.

Points that converge to the same mode are assigned to the same cluster. The number of modes = the number of clusters, discovered automatically.

The Metaphor

"Fireflies on a dark field attracted to light sources. Each firefly moves toward the brightest nearby region. They converge in groups around the light sources — those are the clusters."

Beginner Mental Model

At each step, ask: 'What's the average position of all my neighbors within radius h?' Then move there. Repeat until you stop moving. That resting spot is a cluster center.

03

Mean Shift is a non-parametric iterative algorithm that seeks the modes of a kernel density estimate (KDE). Given bandwidth h and kernel K, the mean shift vector for point x is: m(x) = [Σ K((x-xᵢ)/h)·xᵢ] / [Σ K((x-xᵢ)/h)] - x

Bandwidth (h)
The radius of the search window — controls smoothness of density estimate
Kernel
Weighting function (usually Gaussian) that assigns influence based on distance
Mode
A local maximum in the KDE — a cluster center
Mean Shift Vector
The direction toward increasing density at a given point
KDE
Kernel Density Estimate — a smooth approximation of the data distribution
  1. Initialize: place a window of radius h at every data point
  2. Compute mean: calculate weighted average of all points within the window
  3. Shift: move the window center to this mean
  4. Repeat steps 2-3 until convergence (shift < ε)
  5. Assign cluster: group all points converging to the same mode
  6. Merge nearby modes within distance < h

Feature matrix X ∈ ℝⁿˣᵈ, bandwidth h (or auto-estimated)

Cluster labels ∈ ℤⁿ, cluster centers (modes)

01Data is sampled from a smooth probability density
02Clusters correspond to density modes
03Bandwidth h is chosen appropriately for the data scale
  • Very flat density → everything merges into one cluster
  • Too small h → every point is its own cluster
  • Too large h → clusters merge incorrectly
  • Saddle points can trap the algorithm (rare with Gaussian kernel)
04

Exploratory clustering step. Apply after feature engineering and scaling. Output cluster labels used as features or for segment analysis.

  • 01.Standard scale all features — bandwidth is distance-based
  • 02.Remove extreme outliers that create spurious density peaks
  • 03.Apply PCA if d > 20 (curse of dimensionality)
  • 04.Estimate bandwidth using sklearn's estimate_bandwidth()
  • 01.Estimate bandwidth via cross-validation or Scott's rule
  • 02.Initialize one kernel per data point
  • 03.Iterate mean shift updates until convergence
  • 04.Merge cluster centers within distance h/2 of each other
  • 05.Label each input point by its convergence mode

bandwidth

Search window radius — most critical parameter

Estimated via estimate_bandwidth(); try h = σ·n^(-1/(d+4))

kernel

Weighting function for neighbor influence

Gaussian (default and most common)

  1. 1Scale data with StandardScaler
  2. 2Estimate bandwidth with estimate_bandwidth(X, quantile=0.2)
  3. 3Fit MeanShift(bandwidth=bw)
  4. 4Inspect cluster_centers_ and labels_
  5. 5Validate with silhouette score
05
06
python
1import numpy as np
2
3def gaussian_kernel(dist, bandwidth):
4    return np.exp(-0.5 * (dist / bandwidth) ** 2)
5
6def mean_shift(X, bandwidth, max_iter=300, tol=1e-3):
7    points = X.copy().astype(float)
8
9    for _ in range(max_iter):
10        new_points = np.zeros_like(points)
11        for i, point in enumerate(points):
12            dists = np.linalg.norm(X - point, axis=1)
13            weights = gaussian_kernel(dists, bandwidth)
14            new_points[i] = (weights[:, None] * X).sum(0) / weights.sum()
15
16        shift = np.linalg.norm(new_points - points, axis=1).max()
17        points = new_points
18        if shift < tol:
19            break
20
21    # Merge nearby modes
22    modes = []
23    labels = -np.ones(len(X), dtype=int)
24    for i, pt in enumerate(points):
25        close = [j for j, m in enumerate(modes) if np.linalg.norm(pt - m) < bandwidth / 2]
26        if not close:
27            modes.append(pt)
28            labels[i] = len(modes) - 1
29        else:
30            labels[i] = close[0]
31
32    return np.array(modes), labels
33
34# Example
35from sklearn.datasets import make_blobs
36X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
37centers, labels = mean_shift(X, bandwidth=1.5)
38print(f"Found {len(centers)} clusters")
Pure NumPy implementation showing the kernel weighting and mode-merging logic clearly.
X = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]]
labels = [0, 0, 1, 1, 0, 1], cluster_centers_ = [[1.17, 1.47], [7.33, 9.0]]
  • estimate_bandwidth() with quantile=0.1-0.3 is a reliable starting point
  • bin_seeding=True in sklearn drastically speeds up large datasets
  • Always scale features — bandwidth is meaningless across different scales
  • Convergence is guaranteed for Gaussian kernels (monotone density increase)
  • Forgetting to scale features before setting bandwidth
  • Using too small a quantile in estimate_bandwidth → too many clusters
  • Running on large datasets without bin_seeding=True
  • Not merging nearby modes — duplicating cluster centers
07

Blob clusters (varied density)

Excellent

Mean Shift's home territory — finds modes naturally

💡 Use estimate_bandwidth with quantile matching cluster spread
🖼

Image pixels (color segmentation)

Excellent

Classic application — groups similar colors into segments

💡 Run in Lab color space for perceptual uniformity

Ring / concentric shapes

Poor

Density is uniform around rings — modes smear

💡 Use DBSCAN for ring/shell shapes
📐

High-dimensional (d > 20)

Poor

Curse of dimensionality kills density estimation

💡 Apply PCA first to reduce to 5-10 dimensions
📊

Large datasets (N > 50k)

Context-Dependent

O(n²t) complexity is prohibitive without bin_seeding

💡 Use bin_seeding=True or subsample then predict
08

Mandatory Visual Blueprint

What should move

At least one parameter, threshold, split, cluster state, or metric should change interactively.

What to observe

The learner should see how the concept affects error, fit, grouping, or decision quality.

Planned visual type

Interactive chart, step animation, or side-by-side failure-mode comparison.

Reference image slot

If no live lab exists yet, attach a relevant diagram/reference image before marking the page complete.

Topic key: mean-shift

Mean Shift Convergence

Points iteratively shift toward density modes

Scatter plot showing initial points (cyan), shift trajectories (dashed lines), and final cluster centers (gold stars). Points converging to the same center share a color.

Bandwidth Effect on Cluster Count

How bandwidth h controls the number of discovered clusters

Smaller bandwidth → more clusters. estimate_bandwidth() targets h ≈ 1.5 for this data.
09
  • No k required

    Automatically discovers the number of clusters from data density.

  • Arbitrary cluster shapes

    Follows density contours — handles elongated or irregular clusters.

  • Robust to outliers

    Outliers in sparse regions converge to their own small cluster or get ignored.

  • Single tunable parameter

    Only bandwidth h matters — no distance metric choice, no linkage criterion.

  • Convergence guaranteed

    Provably converges for Gaussian kernels (monotone density increase).

  • O(n²) complexity

    Slow on large datasets — each of n points scans all n neighbors per iteration.

  • Bandwidth sensitivity

    Wrong h → garbage clusters. Estimation works but is imperfect.

  • Fails in high dimensions

    KDE becomes unreliable above ~10-15 dimensions (curse of dimensionality).

  • Uniform density → one cluster

    If data has no density peaks, everything merges.

  • No probabilistic output

    Doesn't produce soft cluster membership or uncertainty estimates.

10
Computer Vision

Image segmentation

Group pixels by color and spatial proximity — used in semantic segmentation pipelines.

Robotics

Object tracking

Track moving objects by finding the mode of their color histogram in successive frames.

Geospatial

Hotspot detection

Find density peaks in crime maps, disease spread, or traffic accident locations.

Retail

Customer spatial clustering

Identify store location hotspots based on customer address density.

11

Mean Shift vs. other unsupervised clustering algorithms:

K-Means

Iterative centroid updates

K-Means requires k; assumes spherical clusters; Mean Shift is non-parametric

K-Means when k is known, clusters are spherical, and speed matters

DBSCAN

Density-based; doesn't require k

DBSCAN labels low-density points as noise; handles ring/shell shapes; faster O(n log n)

DBSCAN for noise-heavy data or non-convex shapes

Gaussian Mixture Model

Finds density modes (Gaussian components)

GMM is probabilistic; gives soft assignments; requires specifying k components

GMM when you need probability of cluster membership

AspectMean ShiftK-MeansDBSCAN
Requires kNoYesNo
ComplexityO(n²t)O(nkt)O(n log n)
Cluster shapeAnySphericalAny
Handles noisePartialNoYes
Parametersbandwidth hkε, minPts

You don't know k, clusters may be non-spherical, dataset is small-to-medium, and you want the simplest possible parameter setup.

12

Silhouette Score

Measures cohesion vs. separation. Range [-1, 1], higher is better.

Target: > 0.5

Davies-Bouldin Index

Average ratio of within-cluster scatter to between-cluster distance. Lower is better.

Target: < 0.5

Number of Clusters

Sanity check — does the discovered k make domain sense?

Target: Matches domain expectation

  1. 01.Run with estimate_bandwidth quantile in [0.1, 0.3, 0.5] — compare cluster counts
  2. 02.Compute silhouette score for each bandwidth
  3. 03.Plot clusters in 2D (PCA if needed) and visually validate
  4. 04.Check if cluster sizes are reasonable — tiny clusters may be artifacts
  5. 05.Compare against DBSCAN or K-Means on same data for sanity
  • Silhouette score can be high even with wrong bandwidth — always visualize
  • Many micro-clusters (bandwidth too small) look like good separation numerically
  • estimate_bandwidth() may fail on tiny datasets — set bandwidth manually

In image segmentation, 3 clusters with silhouette 0.62 and DB-index 0.41 indicate clean color separation. Visual inspection confirms sky, ground, and object regions are correctly separated.

13
  • ×Applying Mean Shift to un-scaled features and wondering why bandwidth fails
  • ×Using quantile=0.5 in estimate_bandwidth → always 1-2 clusters
  • ×Forgetting that high-d KDE is unreliable — applying to 50-feature datasets
  • ×Not using bin_seeding=True — 10x slower for N > 5k
  • ×Treating convergence tolerance as the only parameter to tune
  • ×Not merging very close cluster centers in from-scratch implementations
  • ×Saying 'Mean Shift is just K-Means without k' — it's fundamentally different (density modes vs. centroids)
  • ×Claiming O(n) complexity — it's O(n²) per iteration
  • ×Not knowing that the mean shift vector equals the KDE gradient direction
  • ×Running Mean Shift on images without downsampling first — too slow
  • ×Using Mean Shift for customer segmentation with 100k rows — use Mini-Batch K-Means instead
  • ×Not checking for bandwidth sensitivity — always run at 3 bandwidth values
14

What kind of bias does this model have?

Bias depends on distance and shape assumptions in feature space.

What kind of variance does it have?

Variance increases when cluster structure is unstable or high-dimensional.

How does it overfit?

Overfitting usually appears as strong train performance but weaker validation/test behavior.

How do we regularize it?

Use complexity constraints, robust validation, and data-centric cleanup.

What kind of data does it like?

Prefers scaled features with meaningful geometric distance.

What kind of data breaks it?

Breaks under leakage, severe distribution drift, noisy labels, and poorly engineered features.

14

Quick Revision Reference

  • Finds cluster count automatically by seeking density modes
  • Only parameter: bandwidth h (use estimate_bandwidth for auto-selection)
  • Convergence guaranteed for Gaussian kernels
  • O(n²) — use bin_seeding for speed; not suited for large datasets
  • Best for: image segmentation, small-to-medium arbitrary-shape clusters
Mean Shift
KDE
  • Image/video segmentation
  • Unknown cluster count
  • Non-spherical cluster shapes
  • Small-to-medium datasets
  • N > 50k without bin_seeding
  • d > 15 dimensions
  • Clusters with uniform internal density
  • Real-time clustering
Mean Shift = gradient ascent on KDE — each step moves toward denser region
Bandwidth h is the only parameter — too small → many clusters, too large → one cluster
Unlike K-Means, cluster count is discovered, not specified
Convergence is guaranteed (provable for Gaussian kernel)
15
16

These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.