K-Medoids Clustering

Concept Overview

In Plain English

K-Medoids finds K clusters where each cluster center (medoid) must be an actual data point — not an abstract average. Every other point is assigned to the nearest medoid. This makes clusters interpretable ('the typical customer in segment 3 is customer #4521') and robust to outliers.

Why It Exists

K-Means centroids are mathematical averages that may not correspond to any real data point — in some domains this is unacceptable. A 'typical patient' with mean age 42.3, mean blood pressure 138.7 may not be a medically meaningful entity. K-Medoids was designed to guarantee that each cluster representative is a real, observable case.

Problem It Solves

Given n data points and a distance function d(·, ·) (not necessarily Euclidean), find K representative points (medoids) from the dataset such that the total cost — sum of distances from each point to its nearest medoid — is minimized.

Real-Life Analogy

"Imagine picking K team captains from a group of players. A captain must be an actual player on the team (not an imaginary average player). You want to choose captains such that each non-captain player is as close as possible to their assigned captain in skill level. K-Medoids does exactly this — except 'close' is measured by a distance function."

When To Use

Cluster representatives must be real, interpretable examples from the dataset
Data uses non-Euclidean distance metrics (edit distance, cosine, graph distance, Gower)
Data contains significant outliers — K-Medoids is inherently robust
Categorical or mixed-type data where 'mean' is not defined
Small to medium datasets (n < 10,000) where PAM's O(K(n-K)²) cost is acceptable
You need a single representative instance per cluster (e.g., a representative email, image, or drug molecule)

When NOT To Use

Dataset is very large (n > 50,000) without using CLARA/CLARANS approximations
Features are numeric, Euclidean distance applies, and outliers have been removed — use faster K-Means
Cluster representatives being real examples is not required and speed is critical
Very high dimensionality without dimensionality reduction (distance concentration still applies)

Core Intuition

K-Means minimizes the sum of squared Euclidean distances and places centroids at arithmetic means — elegant, but the mean of a set of data points is rarely one of those points. K-Medoids changes the constraint: medoids must be chosen from the actual dataset. This is a combinatorial problem with C(n, K) possible configurations — you can't solve it by taking a derivative.

PAM (Partitioning Around Medoids) solves this combinatorially in two phases. BUILD: greedily select K medoids by picking, at each step, the point that most reduces the total cost. SWAP: for each (medoid, non-medoid) pair, evaluate whether swapping them reduces the total cost; execute the swap that gives the largest cost reduction; repeat until no swap improves the objective.

The robustness to outliers is structural. In K-Means, an outlier at position x_outlier shifts the centroid toward it proportionally to 1/n (because the mean includes it). In K-Medoids, the medoid is chosen from among all n points — an outlier can only be chosen as the medoid if it truly minimizes the total distance to its cluster members, which an outlier in an otherwise tight cluster never does. The outlier simply becomes an assigned point with large distance to its medoid — it does not distort the representative.

The Metaphor

"Imagine choosing a spokesperson for each political district. K-Means would create an imaginary average citizen — a statistical ghost who may not represent anyone. K-Medoids insists the spokesperson must be a real constituent. You pick the constituent who best represents the majority, and the spokesperson's eccentricities (outlier positions) don't influence who represents other districts."

Beginner Mental Model

Like K-Means, but with a rule: you can only put cluster centers on top of actual data points. Instead of moving the center to the average of its members, you try every possible 'swap' of a center with a non-center point, and make the swap only if it reduces the total distance. Repeat until no swap helps.

Technical Theory

Formal Definition

Given a finite set X = {x⁽¹⁾, ..., x⁽ⁿ⁾} and a distance function d: X × X → ℝ≥0, K-Medoids minimizes the total cost T(M) = Σᵢ min_{mₖ ∈ M} d(xᵢ, mₖ), where M ⊂ X is a set of K medoids chosen from the data. This is equivalent to the K-Median problem in metric spaces. PAM (Partitioning Around Medoids) is the canonical exact algorithm for small-to-medium n.

Key Terms

Medoid: A data point m ∈ X that minimizes the sum of distances to all other points assigned to its cluster: m* = argmin_{x ∈ Cₖ} Σ_{x' ∈ Cₖ} d(x, x'). Unlike a centroid, a medoid is always a real observation.
PAM (Partitioning Around Medoids): The standard exact K-Medoids algorithm with two phases: BUILD (greedy initialization) and SWAP (local search to swap medoids and non-medoids to reduce total cost). Proposed by Kaufman and Rousseeuw (1987).
Total Cost T(M): The objective function: T(M) = Σᵢ d(xᵢ, nearest medoid in M). Equivalent to K-Means inertia but without squaring — K-Medoids minimizes the sum of absolute (first-power) distances, not squared distances.
BUILD phase: Greedy initialization of K medoids: start with the point minimizing total distance to all others; at each step, add the point that most reduces the current total cost. O(nK) computations.
SWAP phase: For each of the K(n-K) medoid/non-medoid pairs, compute the change in total cost ΔT if they were swapped. Execute the swap with the largest cost reduction. Repeat until no improvement. O(K(n-K)n) per pass.
CLARA (Clustering Large Applications): Approximation of PAM for large n: apply PAM to multiple random subsamples of size s ≈ 40+2K; return the medoid set with the lowest total cost on the full dataset. O(Ks²) complexity — practical for large datasets.
CLARANS (Clustering Large Applications based on Randomized Search): Randomized variant of K-Medoids that samples random (medoid, non-medoid) swap candidates instead of evaluating all K(n-K) pairs. Graph-based traversal of medoid configurations. Better solution quality than CLARA at similar cost.

Step-by-Step Working

1. Compute the full pairwise distance matrix D ∈ ℝⁿˣⁿ using chosen metric d(·, ·).
2. BUILD phase: initialize medoid set M. Select m₁ = argmin_x Σ_x' d(x, x') (most central point). For k=2..K: select the next medoid as the point that maximally reduces total cost when added to M.
3. Assign each non-medoid point to its nearest medoid: cᵢ = argmin_{mₖ ∈ M} d(xᵢ, mₖ).
4. SWAP phase: for each medoid mₖ ∈ M and each non-medoid xᵢ ∉ M, compute ΔT(mₖ ↔ xᵢ) = total cost if mₖ is replaced by xᵢ.
5. Execute the swap (mₖ*, xᵢ*) = argmin ΔT if min ΔT < 0 (cost reduction).
6. After each swap, recompute all assignments and ΔT values.
7. Repeat steps 4–6 until no improving swap exists (convergence).
8. Return final medoids M and assignments c.

Inputs

Pairwise distance matrix D ∈ ℝⁿˣⁿ (or a feature matrix X with a specified metric). Integer K. Distance metric d(·, ·) which can be any valid metric.

Outputs

K medoid indices (actual data point indices from 0 to n-1), cluster labels c ∈ {0,...,K-1}ⁿ, total cost T(M).

Model Assumptions

01Distance function d is a valid metric: non-negativity, symmetry, identity of indiscernibles, and triangle inequality (triangle inequality not strictly required but beneficial).

02Cluster representatives being real data points is meaningful in the domain.

03n is manageable: PAM is O(K(n-K)²) per iteration — for n > 10,000, use CLARA or CLARANS.

04K is specified in advance — use silhouette analysis (with the actual metric d) to select K.

Important Edge Cases

▸K = 1: the single medoid is the most central point (minimizes total distance to all others) — the L1 analogue of the mean.
▸K = n: every point is its own medoid. Total cost = 0. Degenerate — identical to K-Means' K=n case.
▸Ties in ΔT: multiple swaps with identical cost reduction. Break ties arbitrarily — the final solution may differ, but the objective value is the same.
▸Disconnected metric space: if d(xᵢ, xⱼ) = ∞ for some pairs, the algorithm will assign points to medoids within reachable components only.
▸Non-metric distances: if d violates the triangle inequality (e.g., some similarity measures), the BUILD phase may produce suboptimal initialization, but SWAP still converges.

Methodology / Workflow

Role in the ML Pipeline

K-Medoids sits after distance matrix computation (or feature engineering) and before cluster analysis or downstream use. It is common in pipelines where the distance/similarity is the primary data artifact (e.g., sequence alignment scores, graph edit distances).

Data Preprocessing

01.Compute pairwise distance matrix: for Euclidean data use scipy.spatial.distance.pdist; for custom metrics implement pairwise computation.
02.Choose the distance metric thoughtfully: Manhattan (L1) for high-dimensional tabular data, cosine for text/embeddings, Gower distance for mixed categorical/numeric data, edit distance for strings, graph edit distance for molecules.
03.For CLARA: ensure subsample size s ≥ 40 + 2K to ensure statistical coverage.
04.Normalize features if using Euclidean or Manhattan distance — same as K-Means.
05.Remove duplicate points: identical points will always be co-assigned and can confuse medoid selection in early BUILD steps.

Training Process

01.Compute or obtain the n×n pairwise distance matrix D.
02.Run PAM with multiple random seeds (n_init≥5) or use the BUILD initialization and verify with SWAP.
03.Select K using silhouette analysis computed with the actual distance metric d (not Euclidean unless that is the chosen metric).
04.Inspect medoids: each medoid index points to an actual data point — examine that record to understand what the cluster represents.
05.Profile clusters by computing cluster-level statistics on original features.
06.For large n: use CLARA with num_samples=10 and subsample_size=max(40+2K, 1000).

Hyperparameters

Name

n_clusters (K)

Description

Number of medoids (clusters). Same role as in K-Means.

Typical

3–8 for most applications; use silhouette score with the chosen metric d

Name

metric / distance function

Description

The distance function d(·, ·) used to compute the cost matrix. This is the most impactful decision in K-Medoids and has no equivalent in K-Means.

Typical

Euclidean for numeric, cosine for text embeddings, Gower for mixed data

Name

init

Description

Initialization method: 'build' (PAM BUILD phase — greedy, better quality) or 'random' (random K medoids).

Typical

'build' for PAM; for CLARA, random subsampling

Name

max_iter

Description

Maximum number of SWAP phase passes.

Typical

300 (default in sklearn-extra); rarely binding — PAM typically converges in < 20 passes

Implementation Checklist

1pip install scikit-learn-extra numpy scipy
2Compute pairwise distances: scipy.spatial.distance.cdist(X, X, metric='euclidean')
3Instantiate KMedoids(n_clusters=K, metric='euclidean', method='pam', init='build')
4Fit: km.fit(X) or km.fit(distance_matrix) with metric='precomputed'
5Inspect medoids: km.medoid_indices_ — these are actual dataset row indices
6Evaluate: silhouette_score(distance_matrix, labels, metric='precomputed')
7For large n: use KMedoids with method='alternate' (faster) or CLARA approximation

Mathematical Chamber

Implementation

python

1import numpy as np
2from itertools import combinations
3
4class KMedoids:
5    """
6    Partitioning Around Medoids (PAM) - exact K-Medoids implementation.
7    Accepts a precomputed distance matrix for metric flexibility.
8    """
9
10    def __init__(self, n_clusters=3, max_iter=300, random_state=None, init='build'):
11        self.K = n_clusters
12        self.max_iter = max_iter
13        self.rng = np.random.default_rng(random_state)
14        self.init = init
15
16        self.medoid_indices_ = None
17        self.labels_ = None
18        self.inertia_ = None   # total cost (sum of distances, not squared)
19
20    # ── Utility ────────────────────────────────────────────────────────────────
21    def _assign(self, D, medoids):
22        """Assign each point to nearest medoid. Returns labels and total cost."""
23        dist_to_medoids = D[:, medoids]          # (n, K)
24        labels = np.argmin(dist_to_medoids, axis=1)   # cluster index (0..K-1)
25        cost = dist_to_medoids[np.arange(len(D)), labels].sum()
26        return labels, cost
27
28    # ── PAM BUILD Phase ────────────────────────────────────────────────────────
29    def _build(self, D):
30        n = len(D)
31        medoids = []
32
33        # Step 1: pick the most central point
34        first = np.argmin(D.sum(axis=1))
35        medoids.append(first)
36
37        # Step 2: greedily add medoids
38        for _ in range(1, self.K):
39            best_gain = np.inf
40            best_candidate = -1
41            for candidate in range(n):
42                if candidate in medoids:
43                    continue
44                # Total cost if we add 'candidate' to medoid set
45                current_costs = D[:, medoids].min(axis=1)          # (n,)
46                new_costs = np.minimum(current_costs, D[:, candidate])
47                gain = new_costs.sum()
48                if gain < best_gain:
49                    best_gain = gain
50                    best_candidate = candidate
51            medoids.append(best_candidate)
52
53        return medoids
54
55    # ── PAM SWAP Phase ─────────────────────────────────────────────────────────
56    def _swap(self, D, medoids):
57        n = len(D)
58        medoids = list(medoids)
59        improved = True
60        n_iter = 0
61
62        while improved and n_iter < self.max_iter:
63            improved = False
64            n_iter += 1
65
66            best_delta = 0
67            best_swap = None
68
69            non_medoids = [i for i in range(n) if i not in medoids]
70
71            for k_idx, mk in enumerate(medoids):
72                for xh in non_medoids:
73                    # Compute ΔT for swapping medoid mk with non-medoid xh
74                    # For each non-medoid point xi, determine:
75                    #   - d1(xi): distance to current nearest medoid
76                    #   - d2(xi): distance to second nearest medoid (next best if mk removed)
77                    test_medoids = medoids.copy()
78                    test_medoids[k_idx] = xh
79                    _, test_cost = self._assign(D, test_medoids)
80                    _, current_cost = self._assign(D, medoids)
81                    delta = test_cost - current_cost
82
83                    if delta < best_delta:
84                        best_delta = delta
85                        best_swap = (k_idx, xh)
86
87            if best_swap is not None:
88                k_idx, xh = best_swap
89                old_medoid = medoids[k_idx]
90                medoids[k_idx] = xh
91                improved = True
92
93        labels, cost = self._assign(D, medoids)
94        return medoids, labels, cost, n_iter
95
96    def fit(self, D):
97        """
98        Fit K-Medoids to a precomputed distance matrix D (n x n).
99        """
100        D = np.asarray(D, dtype=float)
101        assert D.shape[0] == D.shape[1], "D must be a square distance matrix"
102
103        if self.init == 'build':
104            medoids = self._build(D)
105        else:
106            n = len(D)
107            medoids = list(self.rng.choice(n, size=self.K, replace=False))
108
109        medoids, labels, cost, n_iter = self._swap(D, medoids)
110
111        self.medoid_indices_ = np.array(medoids)
112        self.labels_ = labels
113        self.inertia_ = cost
114        self.n_iter_ = n_iter
115        return self
116
117    def predict(self, D_new):
118        """
119        D_new: (m, n) distances from m new points to all n training points.
120        Returns cluster labels for the m new points.
121        """
122        dist_to_medoids = D_new[:, self.medoid_indices_]
123        return np.argmin(dist_to_medoids, axis=1)
124
125
126# ── Demo ──────────────────────────────────────────────────────────────────────
127np.random.seed(42)
128X = np.vstack([
129    np.random.randn(30, 2) + [0, 0],
130    np.random.randn(30, 2) + [6, 0],
131    np.random.randn(30, 2) + [3, 5],
132])
133# Add a strong outlier
134X = np.vstack([X, [[20, 20]]])
135
136# Compute Euclidean distance matrix
137from scipy.spatial.distance import cdist
138D = cdist(X, X, metric='euclidean')
139
140km = KMedoids(n_clusters=3, random_state=42, init='build')
141km.fit(D)
142
143print(f"Medoid indices: {km.medoid_indices_}")
144print(f"Medoid points:\n{X[km.medoid_indices_].round(3)}")
145print(f"Total cost:    {km.inertia_:.4f}")
146print(f"Converged in {km.n_iter_} SWAP passes")
147print(f"Outlier (idx 90) assigned to cluster: {km.labels_[90]}")
148print(f"  Distance to its medoid: {D[90, km.medoid_indices_[km.labels_[90]]]:.2f}")

The from-scratch implementation uses precomputed distance matrix D for metric flexibility. The SWAP phase loops over all K(n-K) swap candidates — the exact PAM algorithm. For n=91 this is fast; for n>1000 use the sklearn-extra implementation with FastPAM.

Sample Input

X shape: (242, 3), 3 true Gaussian clusters + 2 outliers at [30,30,30] and [-30,-30,-30]

Sample Output

Medoid indices: [42, 165, 213]
Medoid coords: [[ 0.1, 0.2, 0.0], [8.0, 0.1, 0.1], [4.1, 7.0, 0.0]]
Total cost: 371.42
Silhouette: 0.7812
Outliers remain in clusters but don't distort medoids

Key Implementation Insights

→K-Medoids accepts precomputed distance matrices — this is the key advantage enabling any metric: cosine, edit distance, Gower, graph-theoretic.
→For n > 5,000, use method='alternate' (randomized swap) in sklearn-extra — it achieves similar quality to PAM with O(n) per iteration instead of O(Kn²).
→medoid_indices_ contains actual row indices from your dataset — inspect X[kmed.medoid_indices_] to understand what each cluster represents.
→Total cost (inertia_) uses raw distances, not squared. Comparing K-Medoids cost to K-Means inertia is apples-to-oranges.
→Silhouette score for K-Medoids should be computed with metric='precomputed' using the same distance matrix used for clustering — not Euclidean if a different metric was used.

Common Implementation Mistakes

✗Using K-Medoids on n > 10,000 with PAM method without switching to CLARA/alternate — O(Kn²) becomes infeasible.
✗Not using metric='precomputed' when passing a distance matrix — sklearn will interpret it as a feature matrix and compute Euclidean distances on top of your distances.
✗Computing silhouette score with metric='euclidean' when clustering was done with a different metric — inconsistent evaluation.
✗Expecting K-Medoids to be faster than K-Means — it is always slower; use K-Medoids only when its properties (real representatives, robustness, custom metrics) are needed.

Dataset Applicability

📉

Data with Outliers

Excellent

K-Medoids' defining advantage. Outliers contribute linearly to the total cost (not quadratically as in K-Means) and cannot distort medoids — an outlier in a tight cluster will never be selected as the medoid because it has the highest within-cluster distance sum.

💡 Quantify outlier robustness by comparing K-Means and K-Medoids centroids/medoids on contaminated data — medoids remain stable while centroids shift.

🧬

Non-Euclidean Distance Data

Excellent

The primary use case: DNA sequences (edit distance), drug molecules (Tanimoto similarity), text (cosine/Jaccard), social graphs (shortest path distance). K-Means cannot handle these; K-Medoids accepts any metric.

💡 Pass precomputed distance matrix with metric='precomputed'. Ensure distance matrix satisfies metric properties for best results.

📊

Small Dataset (< 5K samples)

Excellent

PAM's exact O(K(n-K)²) algorithm is tractable for n < 5,000. Provides exact solutions with interpretable medoid representatives.

💡 For n < 1,000, PAM is fast and exact. Prefer it over approximate methods when exact medoids matter.

🗄️

Large Dataset (> 50K samples)

Poor

PAM scales quadratically with n — infeasible for large datasets. Full pairwise distance matrix alone requires O(n²) memory (50K points = 20GB for float64).

💡 Use CLARA (sklearn-extra's method='clara') or K-Means as substitute. CLARANS offers better quality-speed trade-off than CLARA.

⚖️

Mixed Categorical and Numeric Data

Good

With Gower distance (gower library), K-Medoids handles mixed data naturally. Gower distance normalizes each feature and uses Manhattan for numeric, Jaccard for binary, and simple matching for categorical.

💡 pip install gower; compute D = gower.gower_matrix(df); use KMedoids(metric='precomputed').

📐

High-Dimensional Data

Context-Dependent

Same curse of dimensionality as K-Means — Euclidean distances concentrate. With Manhattan (L1) distance, K-Medoids is somewhat more robust in high dimensions than K-Means with Euclidean.

💡 Apply PCA before clustering, or use cosine distance (for embeddings) which doesn't concentrate as severely as Euclidean in high dimensions.

Visualizations

Mandatory Visual Blueprint

What should move

At least one parameter, threshold, split, cluster state, or metric should change interactively.

What to observe

The learner should see how the concept affects error, fit, grouping, or decision quality.

Planned visual type

Interactive chart, step animation, or side-by-side failure-mode comparison.

Reference image slot

If no live lab exists yet, attach a relevant diagram/reference image before marking the page complete.

Topic key: k-medoid

K-Medoids vs K-Means Cluster Centers with Outlier

Scatter plot showing how K-Means centroids are pulled toward outliers while K-Medoids medoids remain inside the true cluster. Medoid markers (★) sit on actual data points; K-Means centroids (×) are displaced.

Medoids (★) stay within the main cluster body. K-Means centroids (×) shift toward the outlier. This illustrates K-Medoids' outlier robustness.

PAM SWAP Phase Convergence: Cost vs. Swap Iteration

Total cost decreases monotonically with each SWAP iteration. Unlike K-Means' smooth loss curve, PAM's cost drops in discrete steps — each step is one swap of a medoid with a non-medoid.

Gradient descent convergence — MSE decreasing over iterations

Advantages & Limitations

Advantages

Inherent robustness to outliers
Medoids are chosen from the actual dataset to minimize sum of distances — an outlier has a very high average distance to cluster members and will never be selected as a medoid. K-Means centroids, by contrast, are pulled toward outliers proportionally to their distance.
Works with any distance metric
K-Medoids only requires a pairwise distance function, not a Euclidean space or the concept of a mean. This enables clustering of DNA sequences (edit distance), molecules (Tanimoto), text (cosine), or any data for which meaningful similarity is defined.
Cluster representatives are real, interpretable examples
Each medoid is an actual data point from your dataset — a real customer, real transaction, real sequence. This makes cluster profiles concrete and directly actionable, unlike K-Means centroids which may represent no real entity.
Handles mixed-type data via Gower distance
By using Gower distance (which handles numeric, binary, and categorical features jointly), K-Medoids clusters mixed datasets that K-Means cannot handle correctly.
Monotone convergence guaranteed
Each SWAP step strictly reduces or maintains the total cost T(M). The algorithm always terminates in a finite number of steps with a valid local optimum.

Limitations

High computational cost for large n
PAM is O(K(n-K)² per SWAP iteration) and requires O(n²) memory for the pairwise distance matrix. For n=10,000 this means ~800MB just for the distance matrix (float64). Practically infeasible beyond n≈5,000 without approximation.
Slower than K-Means by orders of magnitude
On the same dataset, PAM is typically 10-1000x slower than K-Means. K-Means update is a simple mean computation (O(nd)); PAM's SWAP phase evaluates K(n-K) swap candidates per iteration.
Assumes K specified in advance
Like K-Means, K-Medoids requires K as input. Model selection via silhouette analysis with the custom metric adds another O(n²) computation per K value tested.
Local optimum — not guaranteed globally optimal
PAM's SWAP phase is a greedy local search — it finds a local minimum that may be far from the global optimum with fewer than n! possible configurations. Multiple initializations (n_init) help but don't guarantee global optimality.
Cannot model non-spherical clusters in metric space
K-Medoids still partitions into Voronoi cells (based on the chosen metric) — the regions are convex in metric space. Non-convex cluster shapes require DBSCAN or spectral methods.

Practical Use Cases

Healthcare / Clinical Research

Patient stratification with interpretable exemplars

Cluster patients by clinical measurements where each cluster medoid is a real, explainable patient profile. Physicians can examine the representative patient and reason about treatment strategies for the cluster. K-Means centroids (e.g., average age 47.3, average BMI 26.8) may be less actionable than a real patient case.

Bioinformatics

Gene sequence or protein clustering with edit distance

DNA or amino acid sequences are clustered using edit distance (Levenshtein). Since there is no meaningful 'average sequence', medoids provide interpretable cluster representatives — actual sequences that best characterize each family.

Drug Discovery / Cheminformatics

Compound library clustering with Tanimoto similarity

Chemical compounds are represented as molecular fingerprints; Tanimoto similarity is the standard metric. K-Medoids on the distance matrix (1 - Tanimoto) clusters compounds into structural families. Medoid = the most representative compound per family, used to select a diverse assay subset.

Retail / Logistics

Optimal depot/store location from candidate sites

Given a set of candidate store locations and customer GPS coordinates, K-Medoids with Haversine distance finds K candidate store positions that minimize total customer travel distance. Medoids must be actual candidate sites — not arbitrary geographic midpoints.

Fraud Detection

Fraud cluster exemplar extraction

Cluster historical fraud cases by feature similarity. Each medoid is a real fraud case that best represents a fraud pattern type. New transactions are assigned to the nearest fraud medoid — distance to the medoid indicates similarity to known fraud archetypes.

NLP / Information Retrieval

Document deduplication and representative selection

For large document collections, K-Medoids with cosine distance on TF-IDF or sentence embeddings selects K representative documents. Useful for summarization (pick the most central document per topic) and deduplication (documents near the same medoid are near-duplicates).

Comparison

K-Medoids occupies a specific niche: when you need K-Means-like partitioning but require real representatives, outlier robustness, or a custom metric.

K-Means

Similarity

Both minimize a within-cluster distance objective; both require K; both alternate between assignment and update

Key Difference

K-Means uses arithmetic means (not necessarily real points); minimizes squared Euclidean distance; no custom metric; faster (O(nKd) vs O(Kn²)). K-Medoids: real-point representatives, robust to outliers, any metric.

Choose When

When data is numeric, scaled, outliers are handled, speed matters, and cluster centers don't need to be real examples.

DBSCAN

Similarity

Both are unsupervised, work with custom distance metrics

Key Difference

DBSCAN discovers K automatically, handles arbitrary shapes, labels outliers explicitly, has no 'cluster representative'. K-Medoids requires K and produces K medoid representatives.

Choose When

DBSCAN: when cluster shape is unknown, outlier identification is the goal, or K is unknown. K-Medoids: when K is known and you need a real representative per cluster.

Hierarchical Clustering (Ward)

Similarity

Both work with precomputed distance matrices for custom metrics

Key Difference

Hierarchical produces a dendrogram — you can inspect the entire merge hierarchy and cut at any K without refitting. But it has no concept of 'cluster representative', and Ward linkage requires Euclidean distances.

Choose When

Hierarchical: when the merge hierarchy is informative and n < 5,000. K-Medoids: when interpretable representatives are needed and K is known.

CLARA / CLARANS

Similarity

Both are approximate versions of K-Medoids for large datasets

Key Difference

CLARA randomly subsamples and runs PAM on each subsample — fast but can miss globally optimal medoids not in any subsample. CLARANS uses a random walk through the swap graph — better quality than CLARA but harder to implement.

Choose When

CLARA/CLARANS when n > 10,000 and PAM is too slow. CLARA for simplicity; CLARANS for better quality. Both sacrifice optimality guarantees.

Property	K-Means	K-Medoids (PAM)	CLARA	DBSCAN
Requires K	✓ Yes	✓ Yes	✓ Yes	✗ No
Outlier robust	✗ No	✓ Yes	✓ Yes	✓ Labels noise
Custom metric	✗ Euclidean only	✓ Any metric	✓ Any metric	✓ Any metric
Real representatives	✗ Abstract means	✓ Real points	✓ Real points	✗ None
Scalability	✓ O(nKd)	✗ O(Kn²)	✓ O(Ks²)	✓ O(n log n)
Non-spherical clusters	✗ No	✗ No	✗ No	✓ Yes

Choose K-Medoids Clustering when:

You need cluster representatives to be real data points, data has outliers, you're using a non-Euclidean metric, and n < 5,000 (or you can use CLARA for larger n).

Evaluation

Total Cost (Inertia)

Sum of distances from each point to its assigned medoid. Units are the same as the distance metric. Lower is better. Use only to compare models with the same K and same distance metric.

Target: No absolute threshold — compare relative to K-1 and K+1 solutions

Silhouette Score (with custom metric)

Same interpretation as in K-Means, but computed using the chosen distance metric d — not necessarily Euclidean. Critical: use metric='precomputed' with the same distance matrix used for clustering.

Target: > 0.5 indicates meaningful clusters; > 0.7 indicates strong structure

Medoid Stability (Bootstrap)

Fraction of bootstrap samples in which the same medoid is selected. High stability (> 0.8) means the medoid is genuinely the most central point, not a chance artifact. Low stability → ambiguous cluster structure.

Target: > 0.8 per medoid indicates stable, reliable cluster representatives

Evaluation Process

01.1. Run K-Medoids for K=2..10; record total cost and silhouette per K (using the custom metric).
02.2. Select K at the silhouette peak — the elbow on the total-cost curve is less reliable since K-Medoids cost doesn't decrease as smoothly as K-Means inertia.
03.3. For the chosen K, inspect actual medoid points: retrieve X[km.medoid_indices_] and examine their features.
04.4. Run bootstrap stability analysis: fit K-Medoids on 50 bootstrap samples; track which points are selected as medoids.
05.5. Compute Davies-Bouldin index using the custom metric to corroborate silhouette.
06.6. Profile clusters: per-cluster mean and distribution of all features; plot medoid vs. cluster members in PCA/UMAP 2D space.

Evaluation Traps

▸Computing silhouette with metric='euclidean' when clustering was done with a non-Euclidean metric — this evaluates the wrong geometry.
▸Comparing K-Medoids total cost to K-Means inertia — these are fundamentally different quantities (distances vs. squared distances).
▸Using PAM on n > 10,000 without approximations — runtime will be hours or days.
▸Not checking medoid stability — a medoid that changes frequently across bootstrap samples suggests no clear representative exists in that cluster.

Real-World Interpretation Example

Drug compound clustering: K=5, silhouette=0.62, medoid stability=0.91. Interpretation: Silhouette 0.62 indicates well-separated compound families. Medoid stability 0.91 confirms the 5 representative compounds are robustly identified. Inspecting the 5 medoid SMILES strings reveals 5 distinct chemical scaffolds: benzimidazoles, fluoroquinolones, macrolides, aminoglycosides, and tetracyclines.

Common Mistakes

Students

×Confusing medoid with centroid — a medoid is always a real data point, a centroid is a mathematical mean that may not exist in the data.
×Using K-Medoids on large datasets without understanding the O(n²) memory and O(Kn²) time cost.
×Forgetting that metric='precomputed' must be set when passing a distance matrix — without it, sklearn treats the distance matrix as features.
×Computing silhouette score with Euclidean distance when the clustering used cosine distance — inconsistent evaluation.

Developers

×Not caching the pairwise distance matrix — recomputing O(n²) distances at each K value during model selection multiplies cost unnecessarily.
×Using PAM for n > 5,000 in production without switching to method='alternate' or CLARA — causing multi-hour runtimes.
×Not validating that the chosen metric satisfies the triangle inequality before computing silhouette score.
×Forgetting to save medoid_indices_ — after deploying, you need the actual data point records to show cluster representatives.

In Interviews

×Saying K-Medoids and K-Means have the same computational complexity — PAM is O(Kn²) vs. K-Means' O(nKd), a huge difference.
×Not knowing that K-Medoids minimizes sum of distances (L1-like) while K-Means minimizes sum of squared distances (L2) — this is why K-Medoids is more outlier-robust.
×Saying K-Medoids works with 'any distance metric' but not being able to name two non-Euclidean examples.
×Confusing CLARA (subsample PAM) with CLARANS (random walk search) — both are approximate K-Medoids but with different algorithmic approaches.

Real Projects

×Not inspecting the actual medoid data points — running K-Medoids without examining what medoid_indices_ refers to in the original data defeats the purpose.
×Using PAM's total cost to compare across different K values without normalizing — larger K always produces lower cost, just like K-Means inertia.
×Applying K-Medoids to text data using Euclidean distance on TF-IDF vectors instead of cosine distance — misses the angular nature of text similarity.
×Failing to update the distance matrix when new data arrives — medoids found on old data may not represent the updated distribution.

Core ML Thinking Lens

What kind of bias does this model have?

Bias depends on distance and shape assumptions in feature space.

What kind of variance does it have?

Variance increases when cluster structure is unstable or high-dimensional.

How does it overfit?

Overfitting usually appears as strong train performance but weaker validation/test behavior.

How do we regularize it?

Use complexity constraints, robust validation, and data-centric cleanup.

What kind of data does it like?

Prefers scaled features with meaningful geometric distance.

What kind of data breaks it?

Breaks under leakage, severe distribution drift, noisy labels, and poorly engineered features.

Summary Cheat Sheet

Quick Revision Reference

Key Takeaways

Minimizes total cost T(M) = Σᵢ d(xᵢ, nearest medoid) — medoids must be actual data points
PAM: BUILD (greedy init) + SWAP (local search) phases; O(K(n-K)²) per SWAP pass
More robust to outliers than K-Means: outliers contribute linearly (not quadratically) and cannot become medoids in tight clusters
Works with any distance metric — the key advantage over K-Means
For large n: use CLARA (subsampling PAM) or method='alternate' (randomized swap)
Always use metric='precomputed' when passing a distance matrix to sklearn-extra
Evaluate with silhouette using the same custom metric (not Euclidean if you used another metric)

Critical Formulas

Total Cost

Medoid

Assignment

SWAP Condition

Best For

✓Outlier-contaminated datasets
✓Non-Euclidean distance metrics (edit, cosine, Gower, Haversine)
✓Applications requiring real, interpretable cluster representatives
✓Mixed categorical and numeric data (via Gower distance)

Avoid When

✗n > 10,000 (use CLARA or K-Means instead)
✗Speed is critical and robustness to outliers is not required
✗Non-spherical cluster shapes (DBSCAN handles these better)

Interview Must-Know

★Explain medoid vs. centroid and the representability constraint

★Describe PAM's BUILD and SWAP phases and their complexities

★Explain why K-Medoids is robust to outliers (linear vs. quadratic cost, medoid selection criterion)

★Give two examples of non-Euclidean metrics where K-Medoids excels

★Explain CLARA and when to use it instead of PAM

Interview Questions

Tricky Questions

These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.