In Plain English
K-Means partitions a dataset into K non-overlapping clusters by assigning each point to the nearest centroid (cluster center) and then moving each centroid to the average of its assigned points. It repeats until assignments stop changing.
Why It Exists
Real-world data often has natural groupings — customers with similar buying habits, images with similar visual content, genes with similar expression patterns. K-Means gives a computationally cheap, scalable way to discover these groups without any labels.
Problem It Solves
Given n unlabeled points in ℝᵈ, find K cluster centers and a partition of the data such that the total within-cluster variance (sum of squared distances from each point to its cluster center) is minimized.
Real-Life Analogy
"Imagine K post offices trying to serve a city. You want each neighborhood to be served by its closest post office, and you want the post offices to be at the geographic center of the neighborhoods they serve. K-Means iterates between 'assign each person to the nearest post office' and 'move each post office to the center of its customers' until nobody switches post offices."
When To Use
- Data is unlabeled and you need to discover natural groupings
- Clusters are roughly spherical and similar in size and density
- You have a rough idea of how many clusters K to expect
- Dataset is large and you need a scalable, fast algorithm
- You need cluster assignments as features for downstream supervised models
- Image quantization, vector quantization, or codebook learning
When NOT To Use
- Clusters have very different sizes, densities, or elongated (non-spherical) shapes
- Data has meaningful outliers — centroids will be pulled toward them
- Number of clusters K is truly unknown and varying (DBSCAN or hierarchical may be better)
- Features are on very different scales and you cannot normalize
- Data is categorical — Euclidean distance is meaningless for categorical features
- Cluster boundaries are not convex (K-Means always produces convex Voronoi cells)
K-Means alternates between two steps: assignment and update. In the assignment step, each point is labeled with the index of the nearest centroid — this step is exact and deterministic given the current centroids. In the update step, each centroid is moved to the mean of all points assigned to it — this minimizes the sum of squared distances within that cluster.
The objective function J (within-cluster sum of squares, WCSS) is guaranteed to decrease or stay flat at every step. Assignment cannot increase J (you only reassign a point if it moves to a strictly closer center). Update cannot increase J (the mean minimizes sum of squared distances). Since J is bounded below by zero and is always decreasing, the algorithm must converge — but it converges to a local minimum, not necessarily the global minimum.
The choice of K and initialization are the two most critical decisions. K-Means++ initialization selects the first centroid randomly, then greedily selects each subsequent centroid with probability proportional to its squared distance from the nearest already-chosen centroid. This 'spread-out' initialization dramatically reduces the chance of converging to a poor local minimum and gives an O(log K) approximation guarantee.
The Metaphor
"Think of K-Means as K magnets attracting iron filings. Each step, the filings snap to the nearest magnet, then each magnet moves to the center of mass of its filings. Magnets keep drifting until the iron filings are all perfectly positioned relative to their magnet — no filing wants to jump to a different magnet."
Beginner Mental Model
Start with K random dots (centroids) on your scatter plot. Each data point picks the closest dot — those are its cluster colors. Now move each dot to the center of all the points it attracted. Repeat until no point changes its dot. You've found K-Means clusters.
Formal Definition
Given a dataset X = {x⁽¹⁾, ..., x⁽ⁿ⁾} ⊂ ℝᵈ and integer K, K-Means minimizes the within-cluster sum of squares: J = Σᵢ₌₁ᴷ Σₓ∈Cᵢ ||x - μᵢ||², where Cᵢ is the set of points assigned to cluster i and μᵢ = (1/|Cᵢ|) Σₓ∈Cᵢ x is the centroid of cluster i. This is an NP-hard combinatorial optimization problem in general; Lloyd's algorithm finds a local optimum via coordinate descent.
Key Terms
- Centroid (μᵢ)
- The arithmetic mean of all data points assigned to cluster i. It is not required to be an actual data point — it lives in ℝᵈ.
- WCSS (Within-Cluster Sum of Squares)
- The K-Means objective: J = Σᵢ Σₓ∈Cᵢ ||x - μᵢ||². Also called 'inertia' in scikit-learn. Lower is better — minimizing this is the goal.
- Lloyd's Algorithm
- The standard iterative two-step algorithm for K-Means: alternate between assignment (each point goes to the nearest centroid) and update (each centroid moves to the mean of its points). Guaranteed to converge.
- K-Means++ Initialization
- A seeding strategy that selects K initial centroids by sampling with probability proportional to D(x)² — the squared distance from x to the nearest already-selected centroid. Gives O(log K) approximation guarantee.
- Voronoi Partition
- The geometric structure K-Means creates: each cluster is the set of points closer to its centroid than to any other. Voronoi cells are always convex polygons — this is why K-Means cannot model non-convex clusters.
- Elbow Method
- Plotting WCSS vs. K to find the 'elbow' — the value of K beyond which adding more clusters gives diminishing returns in WCSS reduction. The elbow is the heuristic best K.
- Silhouette Score
- A metric from -1 to +1 measuring how well-separated clusters are: (b - a) / max(a, b), where a is mean intra-cluster distance and b is mean nearest-cluster distance. Closer to +1 is better.
Step-by-Step Working
- 1. Choose K (use elbow method or silhouette analysis if unknown).
- 2. Initialize K centroids: random selection from data points, or K-Means++ spread-out seeding.
- 3. Assignment step: for each point x⁽ⁱ⁾, compute distance to all K centroids, assign label cᵢ = argmin_k ||x⁽ⁱ⁾ - μₖ||².
- 4. Update step: for each cluster k, recompute centroid μₖ = (1/|Cₖ|) Σᵢ:cᵢ=k x⁽ⁱ⁾.
- 5. Repeat steps 3–4 until centroids do not change (or change by less than a tolerance ε), or max_iter is reached.
- 6. Compute final WCSS (inertia) and silhouette score for model quality evaluation.
- 7. Run multiple times with different initializations (n_init=10 in sklearn); keep the run with lowest WCSS.
Inputs
Feature matrix X ∈ ℝⁿˣᵈ (n samples, d features). All features must be numeric and ideally standardized.
Outputs
Cluster labels array c ∈ {0, ..., K-1}ⁿ (one label per sample), K centroid vectors μ ∈ ℝᴷˣᵈ, and inertia (WCSS) value.
Model Assumptions
Important Edge Cases
- ▸Empty cluster: all points move away from a centroid, leaving it with zero points. Sklearn re-initializes the empty centroid to a random data point.
- ▸Ties: a point is exactly equidistant from two centroids. Sklearn breaks ties by choosing the smallest cluster index — this is deterministic but arbitrary.
- ▸K > n: more clusters than points. Degenerate — every point becomes its own cluster. Set K ≤ n.
- ▸All points identical: every centroid collapses to the same location. WCSS = 0 trivially — not a useful solution.
- ▸K = 1: single cluster, centroid = dataset mean. WCSS = total variance. Used as the denominator in the elbow ratio.
Role in the ML Pipeline
K-Means sits after data preprocessing (scaling, encoding, dimensionality reduction) and before any downstream use of cluster labels — whether as features for supervised models, for segment-specific analysis, or as the final output.
Data Preprocessing
- 01.StandardScaler: normalize each feature to mean=0, std=1. Critical — K-Means uses Euclidean distance and features with large ranges will dominate.
- 02.Handle missing values: K-Means cannot handle NaN. Impute with mean/median or use KNN imputation before clustering.
- 03.Dimensionality reduction: for d > 50, apply PCA first to reduce noise dimensions and speed up distance computations.
- 04.Encode categoricals with care: K-Means is not designed for categorical data. Use Gower distance with K-Medoids, or apply binary encoding only if the Euclidean interpretation makes sense.
- 05.Remove extreme outliers: a single outlier can pull a centroid far from the genuine cluster center. Inspect and treat outliers before clustering.
Training Process
- 01.Choose K using the elbow method: fit K-Means for K = 1..15, plot inertia vs. K, identify the bend.
- 02.Validate K using silhouette analysis: plot silhouette scores for each K; peak score = best K.
- 03.Set n_init=10 (or higher) to run multiple random initializations; keep the run with the lowest inertia.
- 04.Set init='k-means++' to use the smarter initialization — this is the default in sklearn and should always be used.
- 05.After fitting, examine cluster centroids to understand what each cluster represents in feature space.
- 06.Profile clusters by computing per-cluster statistics (mean, std of each feature) and visualize with radar charts or box plots.
Hyperparameters
Name
n_clusters (K)
Description
Number of clusters to find. The most critical hyperparameter — K-Means is sensitive to this choice.
Typical
3–10 for most business problems; use elbow + silhouette to select
Name
init
Description
Centroid initialization strategy: 'k-means++' or 'random'.
Typical
'k-means++' (default)
Name
n_init
Description
Number of times K-Means is run with different centroid seeds. Best result (lowest inertia) is kept.
Typical
10 (default); increase to 20-50 for high-stakes applications
Name
max_iter
Description
Maximum number of Lloyd's algorithm iterations per run.
Typical
300 (default); rarely needs increasing
Name
tol
Description
Convergence tolerance: stop iterating when centroid movement (in relative Frobenius norm) is smaller than tol.
Typical
1e-4 (default)
Implementation Checklist
- 1
pip install scikit-learn numpy pandas matplotlib - 2
Load and explore data: check dtypes, missing values, feature ranges - 3
StandardScaler: fit on train, transform train and test - 4
Elbow plot: for k in range(1, 15): KMeans(n_clusters=k).fit(X_scaled).inertia_ - 5
Silhouette plot: for k in range(2, 15): silhouette_score(X_scaled, labels) - 6
Fit final model: KMeans(n_clusters=K_best, init='k-means++', n_init=10, random_state=42) - 7
Inspect centroids, assign labels, profile clusters with per-cluster statistics
1import numpy as np
2
3class KMeans:
4 def __init__(self, n_clusters=3, init="kmeans++", n_init=10,
5 max_iter=300, tol=1e-4, random_state=None):
6 self.K = n_clusters
7 self.init = init
8 self.n_init = n_init
9 self.max_iter = max_iter
10 self.tol = tol
11 self.rng = np.random.default_rng(random_state)
12
13 self.centroids_ = None
14 self.labels_ = None
15 self.inertia_ = None
16 self.n_iter_ = 0
17
18 # ── Initialization ────────────────────────────────────────────────────────
19 def _init_centroids_random(self, X):
20 idx = self.rng.choice(len(X), size=self.K, replace=False)
21 return X[idx].copy()
22
23 def _init_centroids_kmeanspp(self, X):
24 n = len(X)
25 # Step 1: choose first centroid uniformly at random
26 first = self.rng.integers(n)
27 centers = [X[first]]
28
29 for _ in range(1, self.K):
30 # Step 2: compute D²(x) = distance² to nearest existing center
31 D2 = np.array([
32 min(np.sum((x - c) ** 2) for c in centers)
33 for x in X
34 ])
35 # Step 3: sample next center proportional to D²
36 probs = D2 / D2.sum()
37 next_idx = self.rng.choice(n, p=probs)
38 centers.append(X[next_idx])
39
40 return np.array(centers)
41
42 # ── Core Lloyd's Algorithm ─────────────────────────────────────────────────
43 def _lloyd(self, X):
44 if self.init == "kmeans++":
45 centroids = self._init_centroids_kmeanspp(X)
46 else:
47 centroids = self._init_centroids_random(X)
48
49 labels = np.zeros(len(X), dtype=int)
50
51 for iteration in range(self.max_iter):
52 # Assignment step: Euclidean distance from each point to each centroid
53 # Shape trick: (n, 1, d) - (1, K, d) → (n, K, d) → (n, K)
54 dists = np.linalg.norm(X[:, np.newaxis, :] - centroids[np.newaxis, :, :], axis=2)
55 new_labels = np.argmin(dists, axis=1)
56
57 # Update step: move each centroid to mean of its cluster
58 new_centroids = np.array([
59 X[new_labels == k].mean(axis=0) if np.any(new_labels == k)
60 else centroids[k] # keep old centroid if cluster is empty
61 for k in range(self.K)
62 ])
63
64 # Check convergence (relative Frobenius norm of centroid shift)
65 shift = np.linalg.norm(new_centroids - centroids) / (np.linalg.norm(centroids) + 1e-10)
66 centroids = new_centroids
67 labels = new_labels
68
69 if shift < self.tol:
70 break
71
72 inertia = sum(
73 np.sum((X[labels == k] - centroids[k]) ** 2)
74 for k in range(self.K)
75 if np.any(labels == k)
76 )
77 return centroids, labels, inertia, iteration + 1
78
79 def fit(self, X):
80 X = np.asarray(X, dtype=float)
81 best_inertia = np.inf
82
83 for _ in range(self.n_init):
84 centroids, labels, inertia, n_iter = self._lloyd(X)
85 if inertia < best_inertia:
86 best_inertia = inertia
87 self.centroids_ = centroids
88 self.labels_ = labels
89 self.inertia_ = inertia
90 self.n_iter_ = n_iter
91
92 return self
93
94 def predict(self, X):
95 X = np.asarray(X, dtype=float)
96 dists = np.linalg.norm(X[:, np.newaxis, :] - self.centroids_[np.newaxis, :, :], axis=2)
97 return np.argmin(dists, axis=1)
98
99 def fit_predict(self, X):
100 return self.fit(X).labels_
101
102
103# ── Elbow Method Helper ────────────────────────────────────────────────────────
104def elbow_plot(X, k_range=range(1, 11), random_state=42):
105 inertias = []
106 for k in k_range:
107 km = KMeans(n_clusters=k, n_init=5, random_state=random_state)
108 km.fit(X)
109 inertias.append(km.inertia_)
110 return list(k_range), inertias
111
112
113# ── Silhouette Score ───────────────────────────────────────────────────────────
114def silhouette_score_scratch(X, labels):
115 """Compute mean silhouette score from scratch."""
116 n = len(X)
117 unique_labels = np.unique(labels)
118 scores = []
119
120 for i in range(n):
121 own_cluster = labels[i]
122 own_pts = X[labels == own_cluster]
123
124 if len(own_pts) == 1:
125 scores.append(0.0)
126 continue
127
128 # a(i): mean distance to other points in same cluster
129 a = np.mean(np.linalg.norm(own_pts - X[i], axis=1)[
130 np.arange(len(own_pts)) != np.where(labels[labels == own_cluster] == own_cluster)[0][0]
131 ]) if len(own_pts) > 1 else 0.0
132
133 # b(i): mean distance to points in nearest other cluster
134 b = np.inf
135 for k in unique_labels:
136 if k == own_cluster:
137 continue
138 other_pts = X[labels == k]
139 mean_dist = np.mean(np.linalg.norm(other_pts - X[i], axis=1))
140 b = min(b, mean_dist)
141
142 scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
143
144 return np.mean(scores)
145
146
147# ── Demo ──────────────────────────────────────────────────────────────────────
148np.random.seed(42)
149# Three clear Gaussian clusters
150X = np.vstack([
151 np.random.randn(100, 2) + [0, 0],
152 np.random.randn(100, 2) + [5, 5],
153 np.random.randn(100, 2) + [0, 8],
154])
155
156from sklearn.preprocessing import StandardScaler
157X_scaled = StandardScaler().fit_transform(X)
158
159km = KMeans(n_clusters=3, init="kmeans++", n_init=10, random_state=42)
160km.fit(X_scaled)
161print(f"Inertia: {km.inertia_:.4f}")
162print(f"Centroids:\n{km.centroids_.round(3)}")
163print(f"Converged in {km.n_iter_} iterations")
164
165k_vals, inertias = elbow_plot(X_scaled, range(1, 8))
166print("\nElbow plot inertias:", [round(v, 2) for v in inertias])Sample Input
X_scaled = StandardScaler().fit_transform(X) # X shape: (600, 2), 3 true clusters
Sample Output
K=3 | Inertia= 572.3 | Silhouette=0.6412 Best K by silhouette: 3 Final Inertia: 572.3 | Silhouette: 0.6412 | Davies-Bouldin: 0.4871 Cluster 0: 200 pts | centroid=[-0.982, -0.981] Cluster 1: 200 pts | centroid=[ 0.982, 1.962] Cluster 2: 200 pts | centroid=[ 1.963, -0.980]
Key Implementation Insights
- →Always run StandardScaler before KMeans — Euclidean distance makes features with large variances dominate entirely.
- →Set n_init=10 (or more) to mitigate poor local minima from unlucky random initialization — sklearn keeps the best run automatically.
- →Silhouette score is more reliable than the elbow method for choosing K — the elbow is often ambiguous in practice.
- →km.inertia_ always decreases with K; never use inertia alone to choose K — compare the rate of decrease (elbow).
- →For very large n, use MiniBatchKMeans which processes random subsets per iteration — 10-100x faster with minor quality loss.
- →The Davies-Bouldin index and Calinski-Harabasz index are complementary to silhouette: use all three for robust K selection.
Common Implementation Mistakes
- ✗Forgetting to scale features — the single most common K-Means error.
- ✗Choosing K by looking only at inertia (always decreases) without silhouette analysis.
- ✗Using K-Means on data with clear non-spherical clusters — GMM or DBSCAN would be far more appropriate.
- ✗Setting n_init=1 and trusting a single run — K-Means is sensitive to initialization and one run is insufficient.
- ✗Not profiling the resulting clusters — fitting K-Means without interpreting what the clusters mean is half the work.
Well-Separated Spherical Clusters
K-Means is the ideal algorithm when clusters are compact, round, and clearly separated. Convergence is fast and the solution is close to the global optimum.
Large Dataset (> 100K samples)
K-Means scales well: O(nKd) per iteration. For very large n, use MiniBatchKMeans which converges faster with comparable quality.
Non-Spherical / Elongated Clusters
K-Means uses Euclidean distance and always creates convex Voronoi cells. Crescent, ring, or elongated clusters will be cut incorrectly regardless of K.
High-Dimensional Data (d > 100)
Distance concentration in high dimensions makes all pairwise distances similar, degrading K-Means performance. PCA to 20-50 dimensions before clustering is standard practice.
Dataset with Outliers
Outliers pull centroids away from the true cluster centers, inflating WCSS and misassigning boundary points. K-Means has no outlier concept — every point is assigned to a cluster.
Image Segmentation / Pixel Data
K-Means on RGB pixel values performs very well for image quantization and segmentation. This is one of its classic applications.
Interactive: K, Initialization, and Centroid Movement
Observation
Stable centroid movement
Cluster Assignment Scatter Plot
Each point is colored by its cluster assignment. Centroids are shown as large stars. Well-separated colored regions with no overlap indicate a good clustering.
Elbow Method: Inertia vs. K
WCSS (inertia) plotted against K. The curve always decreases — the 'elbow' where the rate of decrease slows sharply indicates the optimal K. Beyond that point, extra clusters explain little additional variance.
Gradient descent convergence — MSE decreasing over iterations
Silhouette Score vs. K
Mean silhouette score plotted for each K. The peak silhouette (here at K=3) gives the most reliable automatic selection of the number of clusters — more interpretable than the elbow method.
Advantages
Computationally efficient and scalable
Time complexity O(nKdT) per run where T is iterations (typically < 100). Scales to millions of points with MiniBatchKMeans. Training a 3-cluster model on 100K 50-dimensional points takes under a second.
Simple to understand and implement
The two-step Lloyd's algorithm is conceptually transparent — assign then update. Easy to explain to stakeholders, easy to debug, easy to re-implement in any language.
Guaranteed convergence
The objective J is bounded below by 0 and decreases at every step. The algorithm always terminates — no risk of infinite loops or divergence (unlike gradient descent with a bad learning rate).
Hard cluster assignments are easy to use downstream
K-Means outputs crisp integer labels, perfect for creating cluster-based features for supervised models, segment-specific analysis, or routing different data subsets to different pipelines.
K-Means++ gives near-optimal initialization
With K-Means++ seeding, the expected final WCSS is O(log K) times the global optimum — a provable guarantee. In practice it usually finds near-global optima in the default n_init=10 runs.
Memory footprint is tiny
The trained model stores only K centroid vectors (K×d floats). For K=10, d=100, that is 8KB. Prediction requires only a nearest-centroid lookup — O(Kd) per sample.
Limitations
Assumes spherical, equally-sized clusters
K-Means partitions space into Voronoi regions — always convex. It cannot model clusters that are elongated, crescent-shaped, ring-shaped, or dramatically different in size and density. Using it on such data gives misleading results.
Sensitive to initialization — local optima
Lloyd's algorithm converges to a local minimum that depends on the starting centroids. Two runs with different seeds can give very different clusterings. Mitigated by K-Means++ and running n_init times, but not eliminated.
K must be specified in advance
K-Means does not learn K from the data. Choosing K requires elbow/silhouette analysis or domain knowledge. In many real-world problems, the true number of clusters is unknown and ambiguous.
Sensitive to outliers
Outliers are assigned to a cluster like any other point, and they shift the centroid away from the true cluster center. A single extreme outlier can cause a centroid to be located far outside the main cluster.
Not appropriate for non-Euclidean data
K-Means relies on Euclidean distance and the concept of a mean — neither is meaningful for categorical, graph, or sequence data. For such data, use K-Medoids with an appropriate distance metric.
Customer segmentation for targeted marketing
Cluster customers by RFM features (Recency, Frequency, Monetary value). Typical result: 4-6 segments such as 'VIP loyalists', 'at-risk dormants', 'new buyers', 'bargain hunters'. Each segment receives a tailored email/discount strategy.
Image color quantization and segmentation
Compress a 24-bit image to K colors by clustering the (R,G,B) pixel vectors and replacing each pixel with its centroid color. Used in image compression (JPEG-like) and foreground/background segmentation.
Document clustering and topic discovery
Embed documents with TF-IDF or sentence transformers, then K-Means-cluster the embeddings. Each cluster represents a topic. Used for news categorization, support ticket routing, and content organization.
Cell-type identification from scRNA-seq
In single-cell RNA sequencing, each cell is a 20,000-dimensional gene expression vector. K-Means (on PCA-reduced data) identifies clusters corresponding to distinct cell types in the tissue.
Network intrusion and fraud detection baseline
Cluster normal traffic/transactions into K groups. Points with very high distance to their nearest centroid (>3σ above the mean inertia contribution) are flagged as anomalies.
Facility location and warehouse placement
Given GPS coordinates of customer locations, K-Means finds K cluster centroids that minimize total delivery distance. Centroid positions are candidate warehouse locations.
K-Means is the baseline clustering algorithm. Understanding how it differs from its peers is essential for choosing the right tool.
K-Medoids (PAM)
Similarity
Same K-cluster partitioning objective; also requires K specified in advance
Key Difference
Centers must be actual data points (medoids), not means. Robust to outliers because a single outlier cannot become a center without also minimizing the objective. More expensive: O(K(n-K)²) per iteration.
Choose When
When data has outliers, when you need cluster centers to be real examples, or when using non-Euclidean distance metrics.
DBSCAN
Similarity
Both are unsupervised clustering algorithms for numeric data
Key Difference
DBSCAN discovers K automatically, handles arbitrary cluster shapes, and explicitly labels outliers as noise. No centroids. Requires setting ε and min_samples instead of K. Fails with varying-density clusters.
Choose When
When clusters are non-spherical, when outlier detection matters, or when K is truly unknown.
Gaussian Mixture Models (GMM)
Similarity
Both assign points to K clusters using EM-like iteration
Key Difference
GMM models each cluster as a multivariate Gaussian, supporting elliptical (non-spherical) clusters. Outputs soft probabilistic assignments instead of hard labels. More expressive but slower and can overfit.
Choose When
When clusters overlap or have different covariance structures, or when you need probability of cluster membership.
Hierarchical Clustering (Ward)
Similarity
Both can find compact, globular clusters
Key Difference
Hierarchical produces a dendrogram — you can cut it at any K without re-fitting. No need to specify K upfront. But O(n³) complexity makes it impractical for n > 10,000.
Choose When
When you need to explore multiple K values, data is small, or the hierarchical structure itself is of interest.
| Property | K-Means | K-Medoids | DBSCAN | GMM |
|---|---|---|---|---|
| K required upfront | ✓ Yes | ✓ Yes | ✗ No | ✓ Yes |
| Handles non-spherical | ✗ No | ✗ No | ✓ Yes | ✓ Yes |
| Outlier robust | ✗ No | ✓ Yes | ✓ (labels noise) | Partial |
| Scalability | ✓ O(nKd) | ✗ O(Kn²) | ✓ O(n log n) | Moderate |
| Soft assignments | ✗ Hard | ✗ Hard | ✗ Hard | ✓ Soft |
| Custom distance | ✗ Euclidean | ✓ Any metric | ✓ Any metric | ✗ Euclidean |
Choose K-Means Clustering when:
Clusters are roughly spherical and similar-sized, data is numeric and scaled, K is known or can be estimated with elbow/silhouette analysis, and scalability matters.
Inertia (WCSS)
Total within-cluster variance. Lower is better. Not comparable across datasets or different K values (always decreases with K). Use only to compare models with the same K and same data.
Target: No absolute threshold — use relative comparison (elbow method)
Silhouette Score
Mean per-point silhouette. Ranges [-1, 1]. Values > 0.5 indicate reasonable clusters, > 0.7 is strong structure, < 0.25 suggests the clustering may not be meaningful.
Target: > 0.5 for practical use; > 0.7 for clear cluster structure
Davies-Bouldin Index
Average ratio of within-cluster scatter to between-cluster separation. Lower is better (minimum 0). Similar to silhouette but computed differently — use both for robustness.
Target: < 1.0 is generally good; < 0.5 indicates excellent separation
Calinski-Harabasz Score
Ratio of between-cluster dispersion to within-cluster dispersion. Higher is better (no upper bound). Tends to favor larger K; use alongside silhouette score.
Target: Higher is better — compare across K values for the same dataset
Evaluation Process
- 01.1. Run K-Means for K = 2..15; record inertia and silhouette score for each K.
- 02.2. Plot the elbow curve (inertia vs. K) — note where the curve bends.
- 03.3. Plot silhouette score vs. K — select K at the peak.
- 04.4. For the chosen K, plot per-point silhouette values within each cluster — negative values indicate misassigned points.
- 05.5. Compute Davies-Bouldin and Calinski-Harabasz as corroborating metrics.
- 06.6. Profile each cluster: compute mean and std of each feature per cluster; visualize with heatmaps or radar charts.
- 07.7. If ground truth labels are available (for validation): compute ARI (Adjusted Rand Index) and NMI (Normalized Mutual Information).
Evaluation Traps
- ▸Using inertia alone to choose K — it always decreases with K and will always point to K=n as 'optimal'.
- ▸Not running multiple initializations (n_init=1) — a single run may find a poor local minimum.
- ▸Evaluating cluster quality without scaling — high inertia may be purely due to feature scale, not poor clustering.
- ▸Using accuracy to evaluate clustering — cluster labels are arbitrary integers with no correspondence to ground-truth class labels. Use ARI or NMI if labels are available.
Real-World Interpretation Example
Customer segmentation: K=4, Inertia=1240.5, Silhouette=0.58, Davies-Bouldin=0.71. Interpretation: Silhouette of 0.58 indicates reasonably well-separated clusters. Davies-Bouldin of 0.71 (< 1.0) confirms this. Cluster profiling reveals: Cluster 0 (high frequency, high spend = VIP), Cluster 1 (recent first-time buyers), Cluster 2 (dormant, low spend), Cluster 3 (medium regulars). These are actionable segments.
Students
- ×Not scaling features before clustering — the most common mistake, causing features with large variance to dominate distance calculations completely.
- ×Choosing K based only on the elbow method when the elbow is ambiguous — not using silhouette analysis as a second opinion.
- ×Believing K-Means always finds the global optimum — it finds a local optimum; different random seeds give different results.
- ×Ignoring empty cluster warnings — an empty cluster indicates initialization problems or that K is too large for the data.
Developers
- ×Setting n_init=1 in production — always use n_init≥10 to mitigate local minima risk.
- ×Transforming test data with a new scaler instead of reusing the training scaler — causing distribution mismatch in predictions.
- ×Not fixing random_state for reproducible cluster assignments across runs.
- ×Using K-Means on high-dimensional data without PCA — distance concentration degrades cluster quality severely above ~50 dimensions.
In Interviews
- ×Saying K-Means guarantees the global optimum — it only guarantees convergence to a local minimum.
- ×Not knowing the time complexity O(nKdT) — interviewers frequently probe scalability awareness.
- ×Confusing 'inertia always decreases with K' with 'higher K is always better'.
- ×Not being able to articulate why K-Means fails on non-spherical clusters (Voronoi cells are always convex).
Real Projects
- ×Not doing post-clustering profiling — fitting the model but never investigating what the clusters mean is wasted analysis.
- ×Using K-Means for feature engineering without re-fitting when new data arrives — cluster boundaries shift as distributions change.
- ×Choosing K based on business convenience ('we want 5 segments') without validating it's statistically meaningful.
- ×Not checking silhouette scores per cluster — a good mean silhouette can mask one very poorly defined cluster.
What kind of bias does this model have?
Bias depends on distance and shape assumptions in feature space.
What kind of variance does it have?
Variance increases when cluster structure is unstable or high-dimensional.
How does it overfit?
Overfitting usually appears as strong train performance but weaker validation/test behavior.
How do we regularize it?
Use complexity constraints, robust validation, and data-centric cleanup.
What kind of data does it like?
Prefers scaled features with meaningful geometric distance.
What kind of data breaks it?
Breaks under leakage, severe distribution drift, noisy labels, and poorly engineered features.
Quick Revision Reference
Key Takeaways
- Minimizes WCSS: J = Σᵢ Σₓ∈Cᵢ ||x - μᵢ||² via alternating assignment and centroid update
- Lloyd's algorithm converges to a local minimum — always run with n_init≥10
- K-Means++ initialization: sample centroids proportional to D(x)² — gives O(log K) approximation guarantee
- Must specify K upfront — use elbow method and silhouette score to select K
- Always StandardScale features before clustering — Euclidean distance is scale-sensitive
- Assumes spherical, equal-size clusters — fails on elongated, ring-shaped, or density-varying clusters
- Silhouette ∈ [-1, 1]: (b-a)/max(a,b) per point; mean silhouette > 0.5 indicates reasonable clusters
Critical Formulas
Best For
- ✓Spherical, similarly-sized clusters
- ✓Large-scale datasets (use MiniBatchKMeans)
- ✓When K is known or estimable
- ✓Image quantization and pixel segmentation
- ✓Creating cluster-membership features for downstream supervised learning
Avoid When
- ✗Non-spherical or irregular cluster shapes
- ✗Data with meaningful outliers
- ✗Number of clusters is truly unknown
- ✗Categorical or non-Euclidean data
Interview Must-Know
These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.