In Plain English
K-Medoids finds K clusters where each cluster center (medoid) must be an actual data point — not an abstract average. Every other point is assigned to the nearest medoid. This makes clusters interpretable ('the typical customer in segment 3 is customer #4521') and robust to outliers.
Why It Exists
K-Means centroids are mathematical averages that may not correspond to any real data point — in some domains this is unacceptable. A 'typical patient' with mean age 42.3, mean blood pressure 138.7 may not be a medically meaningful entity. K-Medoids was designed to guarantee that each cluster representative is a real, observable case.
Problem It Solves
Given n data points and a distance function d(·, ·) (not necessarily Euclidean), find K representative points (medoids) from the dataset such that the total cost — sum of distances from each point to its nearest medoid — is minimized.
Real-Life Analogy
"Imagine picking K team captains from a group of players. A captain must be an actual player on the team (not an imaginary average player). You want to choose captains such that each non-captain player is as close as possible to their assigned captain in skill level. K-Medoids does exactly this — except 'close' is measured by a distance function."
When To Use
- Cluster representatives must be real, interpretable examples from the dataset
- Data uses non-Euclidean distance metrics (edit distance, cosine, graph distance, Gower)
- Data contains significant outliers — K-Medoids is inherently robust
- Categorical or mixed-type data where 'mean' is not defined
- Small to medium datasets (n < 10,000) where PAM's O(K(n-K)²) cost is acceptable
- You need a single representative instance per cluster (e.g., a representative email, image, or drug molecule)
When NOT To Use
- Dataset is very large (n > 50,000) without using CLARA/CLARANS approximations
- Features are numeric, Euclidean distance applies, and outliers have been removed — use faster K-Means
- Cluster representatives being real examples is not required and speed is critical
- Very high dimensionality without dimensionality reduction (distance concentration still applies)
K-Means minimizes the sum of squared Euclidean distances and places centroids at arithmetic means — elegant, but the mean of a set of data points is rarely one of those points. K-Medoids changes the constraint: medoids must be chosen from the actual dataset. This is a combinatorial problem with C(n, K) possible configurations — you can't solve it by taking a derivative.
PAM (Partitioning Around Medoids) solves this combinatorially in two phases. BUILD: greedily select K medoids by picking, at each step, the point that most reduces the total cost. SWAP: for each (medoid, non-medoid) pair, evaluate whether swapping them reduces the total cost; execute the swap that gives the largest cost reduction; repeat until no swap improves the objective.
The robustness to outliers is structural. In K-Means, an outlier at position x_outlier shifts the centroid toward it proportionally to 1/n (because the mean includes it). In K-Medoids, the medoid is chosen from among all n points — an outlier can only be chosen as the medoid if it truly minimizes the total distance to its cluster members, which an outlier in an otherwise tight cluster never does. The outlier simply becomes an assigned point with large distance to its medoid — it does not distort the representative.
The Metaphor
"Imagine choosing a spokesperson for each political district. K-Means would create an imaginary average citizen — a statistical ghost who may not represent anyone. K-Medoids insists the spokesperson must be a real constituent. You pick the constituent who best represents the majority, and the spokesperson's eccentricities (outlier positions) don't influence who represents other districts."
Beginner Mental Model
Like K-Means, but with a rule: you can only put cluster centers on top of actual data points. Instead of moving the center to the average of its members, you try every possible 'swap' of a center with a non-center point, and make the swap only if it reduces the total distance. Repeat until no swap helps.
Formal Definition
Given a finite set X = {x⁽¹⁾, ..., x⁽ⁿ⁾} and a distance function d: X × X → ℝ≥0, K-Medoids minimizes the total cost T(M) = Σᵢ min_{mₖ ∈ M} d(xᵢ, mₖ), where M ⊂ X is a set of K medoids chosen from the data. This is equivalent to the K-Median problem in metric spaces. PAM (Partitioning Around Medoids) is the canonical exact algorithm for small-to-medium n.
Key Terms
- Medoid
- A data point m ∈ X that minimizes the sum of distances to all other points assigned to its cluster: m* = argmin_{x ∈ Cₖ} Σ_{x' ∈ Cₖ} d(x, x'). Unlike a centroid, a medoid is always a real observation.
- PAM (Partitioning Around Medoids)
- The standard exact K-Medoids algorithm with two phases: BUILD (greedy initialization) and SWAP (local search to swap medoids and non-medoids to reduce total cost). Proposed by Kaufman and Rousseeuw (1987).
- Total Cost T(M)
- The objective function: T(M) = Σᵢ d(xᵢ, nearest medoid in M). Equivalent to K-Means inertia but without squaring — K-Medoids minimizes the sum of absolute (first-power) distances, not squared distances.
- BUILD phase
- Greedy initialization of K medoids: start with the point minimizing total distance to all others; at each step, add the point that most reduces the current total cost. O(nK) computations.
- SWAP phase
- For each of the K(n-K) medoid/non-medoid pairs, compute the change in total cost ΔT if they were swapped. Execute the swap with the largest cost reduction. Repeat until no improvement. O(K(n-K)n) per pass.
- CLARA (Clustering Large Applications)
- Approximation of PAM for large n: apply PAM to multiple random subsamples of size s ≈ 40+2K; return the medoid set with the lowest total cost on the full dataset. O(Ks²) complexity — practical for large datasets.
- CLARANS (Clustering Large Applications based on Randomized Search)
- Randomized variant of K-Medoids that samples random (medoid, non-medoid) swap candidates instead of evaluating all K(n-K) pairs. Graph-based traversal of medoid configurations. Better solution quality than CLARA at similar cost.
Step-by-Step Working
- 1. Compute the full pairwise distance matrix D ∈ ℝⁿˣⁿ using chosen metric d(·, ·).
- 2. BUILD phase: initialize medoid set M. Select m₁ = argmin_x Σ_x' d(x, x') (most central point). For k=2..K: select the next medoid as the point that maximally reduces total cost when added to M.
- 3. Assign each non-medoid point to its nearest medoid: cᵢ = argmin_{mₖ ∈ M} d(xᵢ, mₖ).
- 4. SWAP phase: for each medoid mₖ ∈ M and each non-medoid xᵢ ∉ M, compute ΔT(mₖ ↔ xᵢ) = total cost if mₖ is replaced by xᵢ.
- 5. Execute the swap (mₖ*, xᵢ*) = argmin ΔT if min ΔT < 0 (cost reduction).
- 6. After each swap, recompute all assignments and ΔT values.
- 7. Repeat steps 4–6 until no improving swap exists (convergence).
- 8. Return final medoids M and assignments c.
Inputs
Pairwise distance matrix D ∈ ℝⁿˣⁿ (or a feature matrix X with a specified metric). Integer K. Distance metric d(·, ·) which can be any valid metric.
Outputs
K medoid indices (actual data point indices from 0 to n-1), cluster labels c ∈ {0,...,K-1}ⁿ, total cost T(M).
Model Assumptions
Important Edge Cases
- ▸K = 1: the single medoid is the most central point (minimizes total distance to all others) — the L1 analogue of the mean.
- ▸K = n: every point is its own medoid. Total cost = 0. Degenerate — identical to K-Means' K=n case.
- ▸Ties in ΔT: multiple swaps with identical cost reduction. Break ties arbitrarily — the final solution may differ, but the objective value is the same.
- ▸Disconnected metric space: if d(xᵢ, xⱼ) = ∞ for some pairs, the algorithm will assign points to medoids within reachable components only.
- ▸Non-metric distances: if d violates the triangle inequality (e.g., some similarity measures), the BUILD phase may produce suboptimal initialization, but SWAP still converges.
Role in the ML Pipeline
K-Medoids sits after distance matrix computation (or feature engineering) and before cluster analysis or downstream use. It is common in pipelines where the distance/similarity is the primary data artifact (e.g., sequence alignment scores, graph edit distances).
Data Preprocessing
- 01.Compute pairwise distance matrix: for Euclidean data use scipy.spatial.distance.pdist; for custom metrics implement pairwise computation.
- 02.Choose the distance metric thoughtfully: Manhattan (L1) for high-dimensional tabular data, cosine for text/embeddings, Gower distance for mixed categorical/numeric data, edit distance for strings, graph edit distance for molecules.
- 03.For CLARA: ensure subsample size s ≥ 40 + 2K to ensure statistical coverage.
- 04.Normalize features if using Euclidean or Manhattan distance — same as K-Means.
- 05.Remove duplicate points: identical points will always be co-assigned and can confuse medoid selection in early BUILD steps.
Training Process
- 01.Compute or obtain the n×n pairwise distance matrix D.
- 02.Run PAM with multiple random seeds (n_init≥5) or use the BUILD initialization and verify with SWAP.
- 03.Select K using silhouette analysis computed with the actual distance metric d (not Euclidean unless that is the chosen metric).
- 04.Inspect medoids: each medoid index points to an actual data point — examine that record to understand what the cluster represents.
- 05.Profile clusters by computing cluster-level statistics on original features.
- 06.For large n: use CLARA with num_samples=10 and subsample_size=max(40+2K, 1000).
Hyperparameters
Name
n_clusters (K)
Description
Number of medoids (clusters). Same role as in K-Means.
Typical
3–8 for most applications; use silhouette score with the chosen metric d
Name
metric / distance function
Description
The distance function d(·, ·) used to compute the cost matrix. This is the most impactful decision in K-Medoids and has no equivalent in K-Means.
Typical
Euclidean for numeric, cosine for text embeddings, Gower for mixed data
Name
init
Description
Initialization method: 'build' (PAM BUILD phase — greedy, better quality) or 'random' (random K medoids).
Typical
'build' for PAM; for CLARA, random subsampling
Name
max_iter
Description
Maximum number of SWAP phase passes.
Typical
300 (default in sklearn-extra); rarely binding — PAM typically converges in < 20 passes
Implementation Checklist
- 1
pip install scikit-learn-extra numpy scipy - 2
Compute pairwise distances: scipy.spatial.distance.cdist(X, X, metric='euclidean') - 3
Instantiate KMedoids(n_clusters=K, metric='euclidean', method='pam', init='build') - 4
Fit: km.fit(X) or km.fit(distance_matrix) with metric='precomputed' - 5
Inspect medoids: km.medoid_indices_ — these are actual dataset row indices - 6
Evaluate: silhouette_score(distance_matrix, labels, metric='precomputed') - 7
For large n: use KMedoids with method='alternate' (faster) or CLARA approximation
1import numpy as np
2from itertools import combinations
3
4class KMedoids:
5 """
6 Partitioning Around Medoids (PAM) - exact K-Medoids implementation.
7 Accepts a precomputed distance matrix for metric flexibility.
8 """
9
10 def __init__(self, n_clusters=3, max_iter=300, random_state=None, init='build'):
11 self.K = n_clusters
12 self.max_iter = max_iter
13 self.rng = np.random.default_rng(random_state)
14 self.init = init
15
16 self.medoid_indices_ = None
17 self.labels_ = None
18 self.inertia_ = None # total cost (sum of distances, not squared)
19
20 # ── Utility ────────────────────────────────────────────────────────────────
21 def _assign(self, D, medoids):
22 """Assign each point to nearest medoid. Returns labels and total cost."""
23 dist_to_medoids = D[:, medoids] # (n, K)
24 labels = np.argmin(dist_to_medoids, axis=1) # cluster index (0..K-1)
25 cost = dist_to_medoids[np.arange(len(D)), labels].sum()
26 return labels, cost
27
28 # ── PAM BUILD Phase ────────────────────────────────────────────────────────
29 def _build(self, D):
30 n = len(D)
31 medoids = []
32
33 # Step 1: pick the most central point
34 first = np.argmin(D.sum(axis=1))
35 medoids.append(first)
36
37 # Step 2: greedily add medoids
38 for _ in range(1, self.K):
39 best_gain = np.inf
40 best_candidate = -1
41 for candidate in range(n):
42 if candidate in medoids:
43 continue
44 # Total cost if we add 'candidate' to medoid set
45 current_costs = D[:, medoids].min(axis=1) # (n,)
46 new_costs = np.minimum(current_costs, D[:, candidate])
47 gain = new_costs.sum()
48 if gain < best_gain:
49 best_gain = gain
50 best_candidate = candidate
51 medoids.append(best_candidate)
52
53 return medoids
54
55 # ── PAM SWAP Phase ─────────────────────────────────────────────────────────
56 def _swap(self, D, medoids):
57 n = len(D)
58 medoids = list(medoids)
59 improved = True
60 n_iter = 0
61
62 while improved and n_iter < self.max_iter:
63 improved = False
64 n_iter += 1
65
66 best_delta = 0
67 best_swap = None
68
69 non_medoids = [i for i in range(n) if i not in medoids]
70
71 for k_idx, mk in enumerate(medoids):
72 for xh in non_medoids:
73 # Compute ΔT for swapping medoid mk with non-medoid xh
74 # For each non-medoid point xi, determine:
75 # - d1(xi): distance to current nearest medoid
76 # - d2(xi): distance to second nearest medoid (next best if mk removed)
77 test_medoids = medoids.copy()
78 test_medoids[k_idx] = xh
79 _, test_cost = self._assign(D, test_medoids)
80 _, current_cost = self._assign(D, medoids)
81 delta = test_cost - current_cost
82
83 if delta < best_delta:
84 best_delta = delta
85 best_swap = (k_idx, xh)
86
87 if best_swap is not None:
88 k_idx, xh = best_swap
89 old_medoid = medoids[k_idx]
90 medoids[k_idx] = xh
91 improved = True
92
93 labels, cost = self._assign(D, medoids)
94 return medoids, labels, cost, n_iter
95
96 def fit(self, D):
97 """
98 Fit K-Medoids to a precomputed distance matrix D (n x n).
99 """
100 D = np.asarray(D, dtype=float)
101 assert D.shape[0] == D.shape[1], "D must be a square distance matrix"
102
103 if self.init == 'build':
104 medoids = self._build(D)
105 else:
106 n = len(D)
107 medoids = list(self.rng.choice(n, size=self.K, replace=False))
108
109 medoids, labels, cost, n_iter = self._swap(D, medoids)
110
111 self.medoid_indices_ = np.array(medoids)
112 self.labels_ = labels
113 self.inertia_ = cost
114 self.n_iter_ = n_iter
115 return self
116
117 def predict(self, D_new):
118 """
119 D_new: (m, n) distances from m new points to all n training points.
120 Returns cluster labels for the m new points.
121 """
122 dist_to_medoids = D_new[:, self.medoid_indices_]
123 return np.argmin(dist_to_medoids, axis=1)
124
125
126# ── Demo ──────────────────────────────────────────────────────────────────────
127np.random.seed(42)
128X = np.vstack([
129 np.random.randn(30, 2) + [0, 0],
130 np.random.randn(30, 2) + [6, 0],
131 np.random.randn(30, 2) + [3, 5],
132])
133# Add a strong outlier
134X = np.vstack([X, [[20, 20]]])
135
136# Compute Euclidean distance matrix
137from scipy.spatial.distance import cdist
138D = cdist(X, X, metric='euclidean')
139
140km = KMedoids(n_clusters=3, random_state=42, init='build')
141km.fit(D)
142
143print(f"Medoid indices: {km.medoid_indices_}")
144print(f"Medoid points:\n{X[km.medoid_indices_].round(3)}")
145print(f"Total cost: {km.inertia_:.4f}")
146print(f"Converged in {km.n_iter_} SWAP passes")
147print(f"Outlier (idx 90) assigned to cluster: {km.labels_[90]}")
148print(f" Distance to its medoid: {D[90, km.medoid_indices_[km.labels_[90]]]:.2f}")Sample Input
X shape: (242, 3), 3 true Gaussian clusters + 2 outliers at [30,30,30] and [-30,-30,-30]
Sample Output
Medoid indices: [42, 165, 213] Medoid coords: [[ 0.1, 0.2, 0.0], [8.0, 0.1, 0.1], [4.1, 7.0, 0.0]] Total cost: 371.42 Silhouette: 0.7812 Outliers remain in clusters but don't distort medoids
Key Implementation Insights
- →K-Medoids accepts precomputed distance matrices — this is the key advantage enabling any metric: cosine, edit distance, Gower, graph-theoretic.
- →For n > 5,000, use method='alternate' (randomized swap) in sklearn-extra — it achieves similar quality to PAM with O(n) per iteration instead of O(Kn²).
- →medoid_indices_ contains actual row indices from your dataset — inspect X[kmed.medoid_indices_] to understand what each cluster represents.
- →Total cost (inertia_) uses raw distances, not squared. Comparing K-Medoids cost to K-Means inertia is apples-to-oranges.
- →Silhouette score for K-Medoids should be computed with metric='precomputed' using the same distance matrix used for clustering — not Euclidean if a different metric was used.
Common Implementation Mistakes
- ✗Using K-Medoids on n > 10,000 with PAM method without switching to CLARA/alternate — O(Kn²) becomes infeasible.
- ✗Not using metric='precomputed' when passing a distance matrix — sklearn will interpret it as a feature matrix and compute Euclidean distances on top of your distances.
- ✗Computing silhouette score with metric='euclidean' when clustering was done with a different metric — inconsistent evaluation.
- ✗Expecting K-Medoids to be faster than K-Means — it is always slower; use K-Medoids only when its properties (real representatives, robustness, custom metrics) are needed.
Data with Outliers
K-Medoids' defining advantage. Outliers contribute linearly to the total cost (not quadratically as in K-Means) and cannot distort medoids — an outlier in a tight cluster will never be selected as the medoid because it has the highest within-cluster distance sum.
Non-Euclidean Distance Data
The primary use case: DNA sequences (edit distance), drug molecules (Tanimoto similarity), text (cosine/Jaccard), social graphs (shortest path distance). K-Means cannot handle these; K-Medoids accepts any metric.
Small Dataset (< 5K samples)
PAM's exact O(K(n-K)²) algorithm is tractable for n < 5,000. Provides exact solutions with interpretable medoid representatives.
Large Dataset (> 50K samples)
PAM scales quadratically with n — infeasible for large datasets. Full pairwise distance matrix alone requires O(n²) memory (50K points = 20GB for float64).
Mixed Categorical and Numeric Data
With Gower distance (gower library), K-Medoids handles mixed data naturally. Gower distance normalizes each feature and uses Manhattan for numeric, Jaccard for binary, and simple matching for categorical.
High-Dimensional Data
Same curse of dimensionality as K-Means — Euclidean distances concentrate. With Manhattan (L1) distance, K-Medoids is somewhat more robust in high dimensions than K-Means with Euclidean.
Mandatory Visual Blueprint
What should move
At least one parameter, threshold, split, cluster state, or metric should change interactively.
What to observe
The learner should see how the concept affects error, fit, grouping, or decision quality.
Planned visual type
Interactive chart, step animation, or side-by-side failure-mode comparison.
Reference image slot
If no live lab exists yet, attach a relevant diagram/reference image before marking the page complete.
Topic key: k-medoid
K-Medoids vs K-Means Cluster Centers with Outlier
Scatter plot showing how K-Means centroids are pulled toward outliers while K-Medoids medoids remain inside the true cluster. Medoid markers (★) sit on actual data points; K-Means centroids (×) are displaced.
PAM SWAP Phase Convergence: Cost vs. Swap Iteration
Total cost decreases monotonically with each SWAP iteration. Unlike K-Means' smooth loss curve, PAM's cost drops in discrete steps — each step is one swap of a medoid with a non-medoid.
Gradient descent convergence — MSE decreasing over iterations
Advantages
Inherent robustness to outliers
Medoids are chosen from the actual dataset to minimize sum of distances — an outlier has a very high average distance to cluster members and will never be selected as a medoid. K-Means centroids, by contrast, are pulled toward outliers proportionally to their distance.
Works with any distance metric
K-Medoids only requires a pairwise distance function, not a Euclidean space or the concept of a mean. This enables clustering of DNA sequences (edit distance), molecules (Tanimoto), text (cosine), or any data for which meaningful similarity is defined.
Cluster representatives are real, interpretable examples
Each medoid is an actual data point from your dataset — a real customer, real transaction, real sequence. This makes cluster profiles concrete and directly actionable, unlike K-Means centroids which may represent no real entity.
Handles mixed-type data via Gower distance
By using Gower distance (which handles numeric, binary, and categorical features jointly), K-Medoids clusters mixed datasets that K-Means cannot handle correctly.
Monotone convergence guaranteed
Each SWAP step strictly reduces or maintains the total cost T(M). The algorithm always terminates in a finite number of steps with a valid local optimum.
Limitations
High computational cost for large n
PAM is O(K(n-K)² per SWAP iteration) and requires O(n²) memory for the pairwise distance matrix. For n=10,000 this means ~800MB just for the distance matrix (float64). Practically infeasible beyond n≈5,000 without approximation.
Slower than K-Means by orders of magnitude
On the same dataset, PAM is typically 10-1000x slower than K-Means. K-Means update is a simple mean computation (O(nd)); PAM's SWAP phase evaluates K(n-K) swap candidates per iteration.
Assumes K specified in advance
Like K-Means, K-Medoids requires K as input. Model selection via silhouette analysis with the custom metric adds another O(n²) computation per K value tested.
Local optimum — not guaranteed globally optimal
PAM's SWAP phase is a greedy local search — it finds a local minimum that may be far from the global optimum with fewer than n! possible configurations. Multiple initializations (n_init) help but don't guarantee global optimality.
Cannot model non-spherical clusters in metric space
K-Medoids still partitions into Voronoi cells (based on the chosen metric) — the regions are convex in metric space. Non-convex cluster shapes require DBSCAN or spectral methods.
Patient stratification with interpretable exemplars
Cluster patients by clinical measurements where each cluster medoid is a real, explainable patient profile. Physicians can examine the representative patient and reason about treatment strategies for the cluster. K-Means centroids (e.g., average age 47.3, average BMI 26.8) may be less actionable than a real patient case.
Gene sequence or protein clustering with edit distance
DNA or amino acid sequences are clustered using edit distance (Levenshtein). Since there is no meaningful 'average sequence', medoids provide interpretable cluster representatives — actual sequences that best characterize each family.
Compound library clustering with Tanimoto similarity
Chemical compounds are represented as molecular fingerprints; Tanimoto similarity is the standard metric. K-Medoids on the distance matrix (1 - Tanimoto) clusters compounds into structural families. Medoid = the most representative compound per family, used to select a diverse assay subset.
Optimal depot/store location from candidate sites
Given a set of candidate store locations and customer GPS coordinates, K-Medoids with Haversine distance finds K candidate store positions that minimize total customer travel distance. Medoids must be actual candidate sites — not arbitrary geographic midpoints.
Fraud cluster exemplar extraction
Cluster historical fraud cases by feature similarity. Each medoid is a real fraud case that best represents a fraud pattern type. New transactions are assigned to the nearest fraud medoid — distance to the medoid indicates similarity to known fraud archetypes.
Document deduplication and representative selection
For large document collections, K-Medoids with cosine distance on TF-IDF or sentence embeddings selects K representative documents. Useful for summarization (pick the most central document per topic) and deduplication (documents near the same medoid are near-duplicates).
K-Medoids occupies a specific niche: when you need K-Means-like partitioning but require real representatives, outlier robustness, or a custom metric.
K-Means
Similarity
Both minimize a within-cluster distance objective; both require K; both alternate between assignment and update
Key Difference
K-Means uses arithmetic means (not necessarily real points); minimizes squared Euclidean distance; no custom metric; faster (O(nKd) vs O(Kn²)). K-Medoids: real-point representatives, robust to outliers, any metric.
Choose When
When data is numeric, scaled, outliers are handled, speed matters, and cluster centers don't need to be real examples.
DBSCAN
Similarity
Both are unsupervised, work with custom distance metrics
Key Difference
DBSCAN discovers K automatically, handles arbitrary shapes, labels outliers explicitly, has no 'cluster representative'. K-Medoids requires K and produces K medoid representatives.
Choose When
DBSCAN: when cluster shape is unknown, outlier identification is the goal, or K is unknown. K-Medoids: when K is known and you need a real representative per cluster.
Hierarchical Clustering (Ward)
Similarity
Both work with precomputed distance matrices for custom metrics
Key Difference
Hierarchical produces a dendrogram — you can inspect the entire merge hierarchy and cut at any K without refitting. But it has no concept of 'cluster representative', and Ward linkage requires Euclidean distances.
Choose When
Hierarchical: when the merge hierarchy is informative and n < 5,000. K-Medoids: when interpretable representatives are needed and K is known.
CLARA / CLARANS
Similarity
Both are approximate versions of K-Medoids for large datasets
Key Difference
CLARA randomly subsamples and runs PAM on each subsample — fast but can miss globally optimal medoids not in any subsample. CLARANS uses a random walk through the swap graph — better quality than CLARA but harder to implement.
Choose When
CLARA/CLARANS when n > 10,000 and PAM is too slow. CLARA for simplicity; CLARANS for better quality. Both sacrifice optimality guarantees.
| Property | K-Means | K-Medoids (PAM) | CLARA | DBSCAN |
|---|---|---|---|---|
| Requires K | ✓ Yes | ✓ Yes | ✓ Yes | ✗ No |
| Outlier robust | ✗ No | ✓ Yes | ✓ Yes | ✓ Labels noise |
| Custom metric | ✗ Euclidean only | ✓ Any metric | ✓ Any metric | ✓ Any metric |
| Real representatives | ✗ Abstract means | ✓ Real points | ✓ Real points | ✗ None |
| Scalability | ✓ O(nKd) | ✗ O(Kn²) | ✓ O(Ks²) | ✓ O(n log n) |
| Non-spherical clusters | ✗ No | ✗ No | ✗ No | ✓ Yes |
Choose K-Medoids Clustering when:
You need cluster representatives to be real data points, data has outliers, you're using a non-Euclidean metric, and n < 5,000 (or you can use CLARA for larger n).
Total Cost (Inertia)
Sum of distances from each point to its assigned medoid. Units are the same as the distance metric. Lower is better. Use only to compare models with the same K and same distance metric.
Target: No absolute threshold — compare relative to K-1 and K+1 solutions
Silhouette Score (with custom metric)
Same interpretation as in K-Means, but computed using the chosen distance metric d — not necessarily Euclidean. Critical: use metric='precomputed' with the same distance matrix used for clustering.
Target: > 0.5 indicates meaningful clusters; > 0.7 indicates strong structure
Medoid Stability (Bootstrap)
Fraction of bootstrap samples in which the same medoid is selected. High stability (> 0.8) means the medoid is genuinely the most central point, not a chance artifact. Low stability → ambiguous cluster structure.
Target: > 0.8 per medoid indicates stable, reliable cluster representatives
Evaluation Process
- 01.1. Run K-Medoids for K=2..10; record total cost and silhouette per K (using the custom metric).
- 02.2. Select K at the silhouette peak — the elbow on the total-cost curve is less reliable since K-Medoids cost doesn't decrease as smoothly as K-Means inertia.
- 03.3. For the chosen K, inspect actual medoid points: retrieve X[km.medoid_indices_] and examine their features.
- 04.4. Run bootstrap stability analysis: fit K-Medoids on 50 bootstrap samples; track which points are selected as medoids.
- 05.5. Compute Davies-Bouldin index using the custom metric to corroborate silhouette.
- 06.6. Profile clusters: per-cluster mean and distribution of all features; plot medoid vs. cluster members in PCA/UMAP 2D space.
Evaluation Traps
- ▸Computing silhouette with metric='euclidean' when clustering was done with a non-Euclidean metric — this evaluates the wrong geometry.
- ▸Comparing K-Medoids total cost to K-Means inertia — these are fundamentally different quantities (distances vs. squared distances).
- ▸Using PAM on n > 10,000 without approximations — runtime will be hours or days.
- ▸Not checking medoid stability — a medoid that changes frequently across bootstrap samples suggests no clear representative exists in that cluster.
Real-World Interpretation Example
Drug compound clustering: K=5, silhouette=0.62, medoid stability=0.91. Interpretation: Silhouette 0.62 indicates well-separated compound families. Medoid stability 0.91 confirms the 5 representative compounds are robustly identified. Inspecting the 5 medoid SMILES strings reveals 5 distinct chemical scaffolds: benzimidazoles, fluoroquinolones, macrolides, aminoglycosides, and tetracyclines.
Students
- ×Confusing medoid with centroid — a medoid is always a real data point, a centroid is a mathematical mean that may not exist in the data.
- ×Using K-Medoids on large datasets without understanding the O(n²) memory and O(Kn²) time cost.
- ×Forgetting that metric='precomputed' must be set when passing a distance matrix — without it, sklearn treats the distance matrix as features.
- ×Computing silhouette score with Euclidean distance when the clustering used cosine distance — inconsistent evaluation.
Developers
- ×Not caching the pairwise distance matrix — recomputing O(n²) distances at each K value during model selection multiplies cost unnecessarily.
- ×Using PAM for n > 5,000 in production without switching to method='alternate' or CLARA — causing multi-hour runtimes.
- ×Not validating that the chosen metric satisfies the triangle inequality before computing silhouette score.
- ×Forgetting to save medoid_indices_ — after deploying, you need the actual data point records to show cluster representatives.
In Interviews
- ×Saying K-Medoids and K-Means have the same computational complexity — PAM is O(Kn²) vs. K-Means' O(nKd), a huge difference.
- ×Not knowing that K-Medoids minimizes sum of distances (L1-like) while K-Means minimizes sum of squared distances (L2) — this is why K-Medoids is more outlier-robust.
- ×Saying K-Medoids works with 'any distance metric' but not being able to name two non-Euclidean examples.
- ×Confusing CLARA (subsample PAM) with CLARANS (random walk search) — both are approximate K-Medoids but with different algorithmic approaches.
Real Projects
- ×Not inspecting the actual medoid data points — running K-Medoids without examining what medoid_indices_ refers to in the original data defeats the purpose.
- ×Using PAM's total cost to compare across different K values without normalizing — larger K always produces lower cost, just like K-Means inertia.
- ×Applying K-Medoids to text data using Euclidean distance on TF-IDF vectors instead of cosine distance — misses the angular nature of text similarity.
- ×Failing to update the distance matrix when new data arrives — medoids found on old data may not represent the updated distribution.
What kind of bias does this model have?
Bias depends on distance and shape assumptions in feature space.
What kind of variance does it have?
Variance increases when cluster structure is unstable or high-dimensional.
How does it overfit?
Overfitting usually appears as strong train performance but weaker validation/test behavior.
How do we regularize it?
Use complexity constraints, robust validation, and data-centric cleanup.
What kind of data does it like?
Prefers scaled features with meaningful geometric distance.
What kind of data breaks it?
Breaks under leakage, severe distribution drift, noisy labels, and poorly engineered features.
Quick Revision Reference
Key Takeaways
- Minimizes total cost T(M) = Σᵢ d(xᵢ, nearest medoid) — medoids must be actual data points
- PAM: BUILD (greedy init) + SWAP (local search) phases; O(K(n-K)²) per SWAP pass
- More robust to outliers than K-Means: outliers contribute linearly (not quadratically) and cannot become medoids in tight clusters
- Works with any distance metric — the key advantage over K-Means
- For large n: use CLARA (subsampling PAM) or method='alternate' (randomized swap)
- Always use metric='precomputed' when passing a distance matrix to sklearn-extra
- Evaluate with silhouette using the same custom metric (not Euclidean if you used another metric)
Critical Formulas
Best For
- ✓Outlier-contaminated datasets
- ✓Non-Euclidean distance metrics (edit, cosine, Gower, Haversine)
- ✓Applications requiring real, interpretable cluster representatives
- ✓Mixed categorical and numeric data (via Gower distance)
Avoid When
- ✗n > 10,000 (use CLARA or K-Means instead)
- ✗Speed is critical and robustness to outliers is not required
- ✗Non-spherical cluster shapes (DBSCAN handles these better)
Interview Must-Know
These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.