In Plain English
Affinity Propagation clusters data by passing 'responsibility' and 'availability' messages between every pair of points until a set of representative exemplars emerges. It determines both the number of clusters and the exemplars simultaneously, without requiring k upfront.
Why It Exists
Most clustering algorithms require you to specify the number of clusters k before running, which is often unknown. Affinity Propagation, introduced by Frey and Dueck (Science, 2007), eliminates this requirement by letting all data points simultaneously compete to be exemplars through a message-passing process, where the structure of the data determines k automatically.
Problem It Solves
Given a dataset and a similarity matrix, find a set of exemplars (representative points) and assign every other point to the nearest exemplar, minimizing total negative similarity (maximizing total similarity) between points and their exemplars, without knowing k in advance.
Real-Life Analogy
"Imagine a group of people in a village deciding on community representatives. Each person sends two messages: 'I think you'd make a good representative' (responsibility) and 'I'm available to support you being a representative' (availability). Through repeated rounds of message-passing, natural leaders emerge based on who is well-suited and broadly supported. The number of representatives is determined organically by the community structure."
When To Use
- Number of clusters k is unknown and hard to estimate
- You need actual exemplar data points (not just centroids) as cluster representatives
- Dataset is small to medium (< 5K points) — algorithm is O(n²) in memory and time
- Similarity between points is non-Euclidean or domain-specific
- You want a reproducible deterministic result (no random initialization unlike k-means)
When NOT To Use
- Dataset is large (> 5K points) — O(n²) memory and O(n² T) time becomes prohibitive
- You already know k — k-means or GMM will be faster and often better
- Clusters are highly imbalanced — AP tends to produce roughly equal-sized clusters due to the symmetric message-passing
- Preference parameter tuning is not feasible — AP's output is sensitive to this hyperparameter
- You need soft/probabilistic cluster assignments
Every pair of points (i, k) exchanges two types of messages iteratively. The responsibility r(i, k) is sent from point i to candidate exemplar k and represents 'how well-suited k is to serve as the exemplar for i, relative to other candidate exemplars.' The availability a(i, k) is sent from candidate exemplar k to point i and represents 'how appropriate it would be for i to choose k as its exemplar, given how much support k receives from other points.'
These two messages have a push-pull dynamic: responsibility is driven by similarity (high s(i,k) → high responsibility) but suppressed by competition (if other candidates k' have higher similarity and high availability, r(i,k) is lowered). Availability is driven by how many other points support k (sum of positive responsibilities), but is self-limited to prevent over-claiming. The self-responsibility r(k,k) measures whether k has accumulated enough self-support to declare itself an exemplar.
The algorithm converges when exemplar assignments stabilize across iterations. Damping (a mixing factor λ) prevents oscillation by blending new message values with previous ones. After convergence, the exemplar for each point i is argmax_k [a(i,k) + r(i,k)] — the candidate k that maximizes the combined incoming evidence of being a good exemplar for i.
The Metaphor
"Think of responsibility and availability as votes in an election. Responsibility is like: 'I, as a voter (i), nominate you (k) for office, based on your qualifications minus the appeal of other candidates.' Availability is like: 'I, as a candidate (k), acknowledge the nominations I've received and signal to each voter how viable a choice I am.' Through repeated rounds, a natural set of winners (exemplars) emerges without pre-specifying how many seats there are."
Beginner Mental Model
Initially, every point thinks it could be an exemplar. After each iteration, points with many similar neighbors accumulate high self-responsibility (their neighbors say 'you'd be a great exemplar for me'), while isolated points lose self-responsibility. Points with high self-availability also reinforce themselves. Eventually, a handful of points — those well-suited to represent many others — stabilize as exemplars. Everyone else is assigned to their nearest exemplar.
Formal Definition
Given n data points and an n×n real-valued similarity matrix S where s(i,k) represents how well point k serves as an exemplar for point i, Affinity Propagation finds a set of exemplar indices E ⊆ {1,...,n} and an assignment function c: {1,...,n} → E such that Σᵢ s(i, c(i)) is maximized. The exemplar set is found by passing responsibility and availability messages: r(i,k) = s(i,k) - max_{k'≠k}[a(i,k') + s(i,k')], and a(i,k) = min(0, r(k,k) + Σ_{i'≠i,i'≠k} max(0, r(i',k))) for i ≠ k.
Key Terms
- Exemplar
- A data point selected as a cluster representative. Unlike k-means centroids, exemplars are actual data points. Every cluster has exactly one exemplar.
- Similarity s(i,k)
- A scalar measuring how well point k serves as an exemplar for point i. Often set to negative squared Euclidean distance: s(i,k) = -||xᵢ - xₖ||². Higher (less negative) = more similar.
- Responsibility r(i,k)
- Message from i to k representing the accumulated evidence that k should be the exemplar for i, relative to other candidates. Can be positive or negative.
- Availability a(i,k)
- Message from k to i representing the accumulated evidence that i should choose k as its exemplar, based on how many other points support k. Non-positive for i ≠ k.
- Preference p(k)
- Prior probability that point k is an exemplar. Set as s(k,k) — the diagonal of the similarity matrix. Higher preference → more likely to become an exemplar → fewer but larger clusters.
- Damping (λ)
- Mixing factor ∈ (0.5, 1) that blends old and new message values: r_new = (1-λ)r_computed + λ r_old. Prevents oscillation. Higher λ = more conservative updates.
- Net Availability + Responsibility
- After convergence, the exemplar for point i is argmax_k [a(i,k) + r(i,k)]. For self-exemplar check: k is an exemplar if a(k,k) + r(k,k) > 0.
Step-by-Step Working
- 1. Compute similarity matrix S (n×n). If using negative squared Euclidean: s(i,k) = -||xᵢ - xₖ||². Set diagonal s(k,k) = preference (typically median or min of S).
- 2. Initialize responsibility R = 0 (n×n), availability A = 0 (n×n).
- 3. Update responsibilities: for each (i,k), r(i,k) = s(i,k) - max_{k'≠k}[a(i,k') + s(i,k')].
- 4. Apply damping: R_new = (1-λ) × R_computed + λ × R_old.
- 5. Update availabilities: for k ≠ i: a(i,k) = min(0, r(k,k) + Σ_{i'∉{i,k}} max(0, r(i',k))). For diagonal: a(k,k) = Σ_{i'≠k} max(0, r(i',k)).
- 6. Apply damping to A.
- 7. Identify exemplars: E = {k : a(k,k) + r(k,k) > 0}.
- 8. Check convergence: if exemplar set E is unchanged for max_iter_converge iterations, stop. Otherwise go to step 3.
- 9. Assign each point i to nearest exemplar: c(i) = argmax_{k ∈ E} s(i,k).
Inputs
Similarity matrix S ∈ ℝⁿˣⁿ (can be asymmetric) or feature matrix X from which similarities are computed. Parameters: preference (scalar or per-point), damping λ, max_iter.
Outputs
Cluster labels for all points, indices of exemplar points, and the final A + R matrix encoding cluster structure.
Model Assumptions
Important Edge Cases
- ▸Preference too high: all points become exemplars — n clusters, each point its own cluster.
- ▸Preference too low: one or few points become exemplars — all points in one or two large clusters.
- ▸No convergence: oscillation between multiple valid exemplar sets. Increase damping λ toward 1.0.
- ▸Singular similarity (all similarities equal): no differentiation between exemplar candidates, algorithm may not converge meaningfully.
- ▸Asymmetric S: fully supported — AP does not require symmetric similarities, making it applicable to directed graphs.
Role in the ML Pipeline
Affinity Propagation sits in the unsupervised clustering stage, consuming a precomputed similarity matrix or raw features. Its output — cluster labels and exemplar indices — feeds downstream summarization, representative selection, or further analysis steps.
Data Preprocessing
- 01.Feature scaling is important when computing Euclidean similarities: StandardScaler or MinMaxScaler prevents high-range features from dominating distances.
- 02.Compute similarity matrix: s(i,k) = -||xᵢ - xₖ||² for Euclidean (default in sklearn). For text: use cosine similarity; for sequences: use edit distance negated.
- 03.Set the diagonal (preferences) carefully: sklearn defaults to the median of all pairwise similarities. Lower → fewer clusters. Higher → more clusters.
- 04.Remove or impute NaN — the similarity matrix must be fully defined.
- 05.For very large datasets: compute approximate similarities using random subsets or approximate NN and run AP on the representative subset.
Training Process
- 01.Start with default preference (median of similarities) and damping=0.5.
- 02.Check the number of exemplars produced — if too few (1–2) or too many (n), adjust preference.
- 03.If not converging, increase damping toward 0.9 or 0.95.
- 04.Sweep preference from min(S) to median(S) to max(S) and plot resulting n_clusters — find the plateau that matches domain knowledge.
- 05.Evaluate cluster quality via silhouette score for each preference setting.
Hyperparameters
Name
preference
Description
The diagonal of the similarity matrix s(k,k). Controls prior probability of each point being an exemplar. Scalar (same for all) or array (per-point).
Typical
median(S) for moderate clusters; min(S) for few large clusters; max(S) for many small clusters
Name
damping
Description
Mixing factor λ ∈ [0.5, 1) for message updates. Prevents oscillation between conflicting exemplar sets.
Typical
0.5 (minimum allowed). Increase to 0.7–0.9 if not converging.
Name
max_iter
Description
Maximum number of message-passing iterations.
Typical
200 (sklearn default). Increase to 500–2000 if convergence is slow.
Name
convergence_iter
Description
Number of iterations with no change in exemplar set to declare convergence.
Typical
15 (sklearn default)
Implementation Checklist
- 1
pip install scikit-learn numpy - 2
Scale features: StandardScaler().fit_transform(X) - 3
Compute similarities: S = -euclidean_distances(X_scaled, squared=True) or custom kernel - 4
Set preference: S[np.diag_indices_from(S)] = np.median(S) (or custom value) - 5
Fit: ap = AffinityPropagation(preference=np.median(S), damping=0.5).fit(X_scaled) - 6
Extract: labels=ap.labels_, exemplars=ap.cluster_centers_indices_ - 7
Evaluate: silhouette_score(X_scaled, labels)
1import numpy as np
2
3class AffinityPropagation:
4 """
5 Affinity Propagation clustering via message passing.
6 """
7
8 def __init__(self, damping=0.5, max_iter=200, convergence_iter=15,
9 preference=None):
10 self.damping = damping
11 self.max_iter = max_iter
12 self.convergence_iter = convergence_iter
13 self.preference = preference
14
15 # Outputs
16 self.cluster_centers_indices_ = None
17 self.labels_ = None
18 self.n_iter_ = 0
19
20 def fit(self, X):
21 n = X.shape[0]
22
23 # ── Build similarity matrix (negative squared Euclidean) ───────────────
24 # S[i,k] = -||xi - xk||^2
25 S = -np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=2)
26
27 # ── Set preference (diagonal of S) ─────────────────────────────────────
28 pref = self.preference if self.preference is not None else np.median(S)
29 np.fill_diagonal(S, pref)
30
31 # ── Initialize messages ────────────────────────────────────────────────
32 R = np.zeros((n, n)) # Responsibility matrix
33 A = np.zeros((n, n)) # Availability matrix
34
35 e = np.zeros((n, self.convergence_iter), dtype=bool) # exemplar history
36 converged = False
37
38 for it in range(self.max_iter):
39 # ── Responsibility update ──────────────────────────────────────────
40 # For each (i,k): r(i,k) = S[i,k] - max_{k'≠k}[A[i,k'] + S[i,k']]
41 AS = A + S # (n, n) — availability + similarity
42
43 # Find max over k' for each i, excluding diagonal (current k)
44 # Trick: find top-2 values per row; use 2nd-best when k == argmax
45 indices = np.argmax(AS, axis=1) # argmax per row
46 max1 = AS[np.arange(n), indices] # best value per row
47
48 # Second best: temporarily set best to -inf, take argmax again
49 AS_copy = AS.copy()
50 AS_copy[np.arange(n), indices] = -np.inf
51 max2 = AS_copy.max(axis=1) # 2nd best per row
52
53 # R_new[i,k] = S[i,k] - max_{k'≠k}(A+S)[i,k']
54 # If k == argmax: subtract max2; otherwise subtract max1
55 max_matrix = np.where(np.arange(n)[:, np.newaxis] == indices[:, np.newaxis],
56 max2[:, np.newaxis], max1[:, np.newaxis])
57 # Note: need to broadcast correctly
58 # max_matrix[i,k] = max2[i] if k == indices[i] else max1[i]
59 col_indices = np.tile(np.arange(n), (n, 1))
60 is_best = (col_indices == indices[:, np.newaxis])
61 max_matrix = np.where(is_best, max2[:, np.newaxis], max1[:, np.newaxis])
62
63 R_new = S - max_matrix
64
65 # ── Damping ────────────────────────────────────────────────────────
66 R = self.damping * R + (1 - self.damping) * R_new
67
68 # ── Availability update ────────────────────────────────────────────
69 # Positive responsibilities from non-self points
70 Rp = np.maximum(R, 0) # (n, n) — clip at 0
71 np.fill_diagonal(Rp, np.diag(R)) # diagonal uses R (unclipped)
72
73 # a(k,k) = sum_{i'≠k} max(0, r(i',k))
74 A_diag = np.sum(Rp, axis=0) - np.diag(Rp) # (n,) — column sums minus diag
75
76 # a(i,k) for i≠k = min(0, r(k,k) + sum_{i'≠i,i'≠k} max(0,r(i',k)))
77 # = min(0, A_diag[k] - max(0, R[i,k]) + R[k,k] - R[k,k])
78 # Simplified: a(i,k) = min(0, r(k,k) + col_sum_pos[k] - max(0,r(i,k)))
79 # where col_sum_pos[k] = sum of max(0, r(i',k)) over ALL i'
80 col_sum_Rp = np.sum(Rp, axis=0) # sum of clipped R per column
81 diag_R = np.diag(R) # r(k,k) for each k
82
83 # A_new[i,k] = min(0, r(k,k) + col_sum_Rp[k] - max(0, R[i,k]) - Rp[k,k])
84 # Note: Rp[k,k] = R[k,k] (diagonal is unclipped)
85 A_new = diag_R[np.newaxis, :] + col_sum_Rp[np.newaxis, :] - Rp - np.diag(Rp)[np.newaxis, :]
86 # Set diagonal: a(k,k) = A_diag[k]
87 np.fill_diagonal(A_new, A_diag)
88 # Cap off-diagonal at 0
89 A_new_off = np.minimum(A_new, 0)
90 np.fill_diagonal(A_new_off, A_diag)
91 A_new = A_new_off
92
93 # ── Damping ────────────────────────────────────────────────────────
94 A = self.damping * A + (1 - self.damping) * A_new
95
96 # ── Check for exemplars ────────────────────────────────────────────
97 D = A + R # combined evidence matrix
98 exemplars = np.where(np.diag(D) > 0)[0]
99
100 # Store exemplar set in convergence buffer
101 e[:, it % self.convergence_iter] = (np.diag(D) > 0)
102 self.n_iter_ = it + 1
103
104 # ── Convergence check ──────────────────────────────────────────────
105 if it >= self.convergence_iter:
106 # Check if exemplar set has been stable for convergence_iter steps
107 se = e.sum(axis=1)
108 if se.min() == se.max() == 0 or (e[:, 0] == e[:, -1]).all():
109 converged = True
110 break
111
112 # ── Final assignment ───────────────────────────────────────────────────
113 D = A + R
114 exemplars = np.where(np.diag(D) > 0)[0]
115
116 if len(exemplars) == 0:
117 # No exemplars found — likely all noise or bad parameters
118 self.labels_ = np.full(n, -1, dtype=int)
119 self.cluster_centers_indices_ = np.array([], dtype=int)
120 return self
121
122 # Assign each point to nearest exemplar
123 assignment = np.argmax(S[:, exemplars], axis=1)
124 self.labels_ = assignment
125 self.cluster_centers_indices_ = exemplars
126
127 # Re-map labels to 0, 1, 2, ... cluster IDs
128 for idx, ex in enumerate(exemplars):
129 self.labels_[self.labels_ == idx] = idx
130
131 if not converged:
132 print(f"Warning: did not converge in {self.max_iter} iterations.")
133
134 return self
135
136
137# ── Demo ──────────────────────────────────────────────────────────────────────
138from sklearn.datasets import make_blobs
139from sklearn.preprocessing import StandardScaler
140from sklearn.metrics import silhouette_score
141import numpy as np
142
143np.random.seed(42)
144X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)
145X_scaled = StandardScaler().fit_transform(X)
146
147ap = AffinityPropagation(damping=0.5, max_iter=200, preference=None)
148ap.fit(X_scaled)
149
150print(f"Converged in : {ap.n_iter_} iterations")
151print(f"Exemplar idx : {ap.cluster_centers_indices_}")
152print(f"N clusters : {len(ap.cluster_centers_indices_)}")
153print(f"Exemplar pts :")
154for idx in ap.cluster_centers_indices_:
155 print(f" Point {idx}: {X_scaled[idx].round(3)}")
156
157if len(set(ap.labels_)) > 1:
158 sil = silhouette_score(X_scaled, ap.labels_)
159 print(f"Silhouette : {sil:.4f}")
160Sample Input
X: 400 points, 2D, 5 true clusters damping=0.5, preference=median(S), max_iter=300
Sample Output
N clusters: 5 Converged in 87 iterations Exemplar indices: [12, 87, 143, 234, 371] Silhouette: 0.612 ARI: 0.943
Key Implementation Insights
- →Passing affinity='precomputed' and fitting on the similarity matrix S directly is the most flexible approach — it allows any custom similarity function (cosine, RBF kernel, edit distance, etc.).
- →The exemplar points (cluster_centers_indices_) are actual indices into the input data — you can access the original data points directly, unlike k-means centroids which may not correspond to any real point.
- →Preference tuning is the primary lever for controlling n_clusters. A preference sweep with silhouette evaluation is the systematic way to choose it.
- →Damping must be at least 0.5 by mathematical requirement. Increase toward 0.9 if convergence_iter is consistently hit without convergence.
- →For n > 5K, AP becomes impractical due to the O(n²) similarity matrix — consider using AP on a sample or using a sparse similarity matrix.
Common Implementation Mistakes
- ✗Not sweeping preference — the default median preference rarely matches the true k. Always sweep.
- ✗Using affinity='euclidean' when you have a custom similarity — must use affinity='precomputed' with a precomputed similarity matrix.
- ✗Forgetting that exemplars are point indices, not feature values — access exemplar features as X[ap.cluster_centers_indices_].
- ✗Not increasing damping when not converging — the most common convergence issue is resolved by increasing λ.
- ✗Applying AP to n > 10K without subsampling — the O(n²) memory requirement causes OOM errors silently or explicitly.
Small Dataset (< 1K rows)
AP's ideal regime. O(n²T) computation and O(n²) memory are manageable. Convergence is fast and the message-passing produces stable, high-quality exemplars.
Medium Dataset (1K–5K rows)
Still tractable. 5K×5K similarity matrix = 25M floats ≈ 200MB. Runtime is 1–10 minutes with default settings. Quality remains high.
Large Dataset (> 10K rows)
10K×10K similarity matrix = 800MB. 100K×100K = 80GB — completely infeasible. AP is not designed for large-scale clustering.
Custom Similarity (Non-Euclidean)
AP's key strength over centroid-based methods. Any precomputed similarity (cosine, kernel, edit distance, biological sequence similarity) can be used. AP is agnostic to the metric.
High-Dimensional Data (d > 100)
AP itself handles any d since it works on the precomputed similarity matrix (n×n), not directly on d-dimensional features. Quality depends on whether the chosen similarity is meaningful in high dimensions.
Imbalanced Cluster Sizes
AP's message-passing tends to produce roughly equal-sized clusters because availability messages accumulate based on the count of supporters. Large clusters dominate and may split, while tiny natural clusters may be absorbed into larger ones.
Mandatory Visual Blueprint
What should move
At least one parameter, threshold, split, cluster state, or metric should change interactively.
What to observe
The learner should see how the concept affects error, fit, grouping, or decision quality.
Planned visual type
Interactive chart, step animation, or side-by-side failure-mode comparison.
Reference image slot
If no live lab exists yet, attach a relevant diagram/reference image before marking the page complete.
Topic key: affinity-propagation
Message Passing Convergence Over Iterations
Tracks the number of exemplars identified after each iteration. Initially volatile as messages are random, then stabilizes to the final exemplar count as responsibility and availability values converge. Damping smooths the trajectory.
Gradient descent convergence — MSE decreasing over iterations
Clusters vs. Preference Parameter
Number of clusters produced by AP as a function of the preference parameter (diagonal of S). Lower preference (more negative) → fewer, larger clusters. Higher preference (less negative / positive) → more, smaller clusters. The knee in the silhouette curve identifies the optimal preference.
Responsibility Matrix R Heatmap (After Convergence)
The n×n responsibility matrix at convergence. Strong positive values on the diagonal (r(k,k) > 0) identify exemplars. Off-diagonal entries r(i,k) show which non-exemplar points i support exemplar k. The block structure mirrors cluster assignments.
Advantages
No need to specify k
AP determines the number of clusters from data structure via the preference parameter. Unlike k-means, you don't need to know k upfront — the algorithm discovers it. Preference can be swept systematically with silhouette evaluation.
Returns actual exemplar data points
Cluster representatives are real data points (exemplars), not synthetic centroids. This is critical in applications like image retrieval (find the representative image), text summarization (find the representative document), or sensor selection (find the best sensor location from candidates).
Deterministic results
Unlike k-means with random initialization, AP's message-passing is deterministic given fixed parameters. No need for multiple restarts to avoid bad local minima. Random seeds only affect tie-breaking in sklearn.
Handles arbitrary similarity functions
AP accepts any precomputed n×n similarity matrix — cosine similarity, RBF kernel, biological sequence similarity, edit distance, or domain-specific metrics. This makes it applicable wherever a meaningful similarity is defined.
Globally optimized objective
AP's message-passing provably maximizes the total similarity between points and their exemplars (the net similarity objective). It finds the best exemplar set jointly, not greedily.
Limitations
O(n²) memory and time
The similarity matrix requires O(n²) memory — 10K points need 800MB, 100K points need 80GB. Each iteration requires O(n²) computations. This makes AP impractical for n > 5K–10K without approximations.
Sensitive to preference parameter
The preference parameter dramatically affects the number of clusters. The default (median similarity) is a reasonable starting point but rarely optimal. Sweeping preference and evaluating silhouette adds significant experimental overhead.
Convergence not guaranteed
AP may oscillate between multiple valid exemplar configurations rather than converging, especially with low damping or degenerate similarity matrices. Increasing damping helps but slows convergence.
Biased toward equal cluster sizes
The availability update accumulates supporters linearly — large clusters give their exemplar strong availability boosts, overpowering small clusters. AP tends to equalize cluster sizes, which is problematic for naturally imbalanced data.
No noise/outlier detection
Unlike DBSCAN, AP assigns every point to some exemplar. There is no concept of noise. Outliers become singleton exemplars (if preference is high) or are absorbed into the nearest real cluster (if preference is low).
Gene expression exemplar identification
Find representative gene expression profiles from microarray data. Exemplars are real genes (not synthetic centroids), enabling biological interpretation. AP with correlation-based similarity outperforms k-means on the actual Science 2007 paper's biological datasets.
Face clustering and representative selection
Cluster face images in a photo collection and find the most representative face per cluster (the exemplar). The exemplar is an actual photo that can be displayed as the cluster thumbnail — not a synthetic average face.
Document exemplar selection for summarization
Given a large document collection, find exemplar documents that best represent each topic cluster. The exemplar document is an actual article — useful for search result deduplication, news clustering, and extractive summarization.
Hub airport identification in airline networks
Similarity between airports based on shared route profiles. AP identifies hub airports as exemplars — the most central, well-connected airports in each geographic cluster. Used in airline route optimization and slot allocation.
Patient phenotype exemplar discovery
Find representative patient cases from EHR data using clinical feature similarity. Exemplar patients are real historical cases that clinicians can review — more interpretable than synthetic cluster centroids. Used in clinical decision support.
Affinity Propagation is compared most naturally to k-means (both minimize distance to cluster representatives) and DBSCAN (both avoid specifying k). Its exemplar-based approach is unique among mainstream algorithms.
K-Means
Similarity
Both assign each point to one representative; both minimize total squared distance to representatives
Key Difference
K-Means requires k upfront, uses synthetic centroids (may not be real data points), is sensitive to initialization (random restarts), and is much faster O(nkT). AP determines k, uses real exemplars, is deterministic, but is O(n²T).
Choose When
K-Means for large datasets with known k and spherical clusters. AP for small datasets where k is unknown and exemplar data points are needed.
K-Medoids (PAM)
Similarity
Both select actual data points as cluster representatives; both minimize total distance to representatives
Key Difference
K-Medoids requires k upfront and searches for medoids via combinatorial optimization. AP automatically determines k via message passing. K-Medoids is O(k(n-k)²) per iteration — also expensive but different scaling.
Choose When
K-Medoids when you know k and want medoids (guaranteed global optimum via PAM). AP when k is unknown.
DBSCAN
Similarity
Both avoid specifying k; both can produce variable numbers of clusters based on parameters
Key Difference
DBSCAN finds density-connected clusters of arbitrary shape and labels noise. AP finds exemplar-based clusters (roughly spherical if using Euclidean similarity), assigns all points, and is deterministic. DBSCAN is O(n log n); AP is O(n²).
Choose When
DBSCAN for arbitrary-shape clusters with noise on large datasets. AP for small datasets where actual exemplar data points are needed and cluster shape is roughly spherical.
Hierarchical Agglomerative Clustering
Similarity
Both avoid specifying k and produce cluster structure from pairwise distances
Key Difference
Hierarchical clustering builds a full dendrogram — you choose k by cutting at any level. AP produces a flat partition. AP is O(n²T); hierarchical is O(n² log n). AP uses message-passing; hierarchical uses bottom-up merging.
Choose When
Hierarchical when you want to explore multiple k values from a single run (dendrogram). AP when you want automatic k selection and exemplar data points.
| Property | AP | K-Means | K-Medoids | DBSCAN |
|---|---|---|---|---|
| Requires k upfront | ✗ No | ✓ Yes | ✓ Yes | ✗ No |
| Real-point exemplars | ✓ Yes | ✗ No (centroids) | ✓ Yes | N/A |
| Deterministic | ✓ Yes | ✗ No (random init) | ✗ Partial | ✓ Yes |
| Arbitrary similarity | ✓ Yes | ✗ No (Euclidean) | ✓ Yes | ✓ Yes |
| Noise detection | ✗ No | ✗ No | ✗ No | ✓ Yes |
| Time complexity | O(n²T) | O(nkT) | O(k(n-k)²) | O(n log n) |
| Memory | O(n²) | O(n+k) | O(n²) | O(n) |
Choose Affinity Propagation when:
Dataset is small (< 5K), k is unknown, you need actual data points as cluster representatives (not synthetic centroids), and you have a meaningful custom similarity function or need deterministic results.
Silhouette Score
Mean cohesion vs. separation per point. Use silhouette to evaluate different preference settings and choose the best. AP with the correct k will typically achieve silhouette > 0.5 on clean datasets.
Target: > 0.5 for well-separated clusters; use to compare preference settings
Net Similarity Objective
The sum of similarities between each point and its assigned exemplar. This is the objective AP directly maximizes. Higher is better (less negative for Euclidean-based similarity).
Target: Increases monotonically as preference increases — not useful for comparing across preference values without normalizing
Number of Exemplars
The number of clusters found. Use preference sweep to control this. Validate against domain knowledge — too few indicates over-merging, too many indicates over-fragmentation.
Target: Should match domain expectation; silhouette maximization identifies the sweet spot
Convergence Iterations
How many iterations before the exemplar set stabilized. Should be well below max_iter. If equal to max_iter, the algorithm didn't converge — increase damping.
Target: < 80% of max_iter for clean convergence
Evaluation Process
- 01.1. Compute similarity matrix S and note range: S_min, S_median, S_max.
- 02.2. Sweep preference over [S_min, S_median] in 20 logarithmically spaced steps.
- 03.3. For each preference, compute silhouette score on the resulting labels.
- 04.4. Select the preference that maximizes silhouette while producing a sensible number of clusters.
- 05.5. Check convergence: ensure n_iter_ << max_iter for the chosen preference.
- 06.6. Validate exemplars with domain knowledge — are they meaningful representatives?
- 07.7. If ground truth is available, compute ARI and NMI.
Evaluation Traps
- ▸Trusting the default preference (median) — it rarely gives the optimal k. Always sweep.
- ▸Ignoring convergence — if n_iter_ == max_iter, the results are unreliable. Increase max_iter or damping.
- ▸Computing silhouette with n=1 cluster (all labels the same) — this will raise a ValueError. Check len(unique_labels) > 1 first.
- ▸Comparing AP's net similarity objective across different preference values — it always increases with preference by construction, making it useless for selecting preference.
Real-World Interpretation Example
Gene expression dataset, n=500 genes, 50 conditions: S = correlation matrix (transformed to [-1,1] similarities). Default preference=median=0.12 → 47 clusters (too many). Sweep to preference=-0.2 → 8 clusters, silhouette=0.58. Domain experts validate: the 8 exemplar genes are known regulatory hub genes in the relevant pathways. ARI vs. known gene families = 0.74. Convergence in 43/200 iterations with damping=0.7.
Students
- ×Thinking preference is the number of clusters — it's a similarity value that biases which points become exemplars.
- ×Confusing exemplar indices (integers indexing into X) with cluster labels — ap.cluster_centers_indices_ are row indices, not cluster IDs.
- ×Expecting AP to scale to large datasets — at n=50K, the 20GB similarity matrix will crash most machines.
- ×Not checking convergence — silently using results from a non-converged run.
Developers
- ×Using affinity='euclidean' when they mean affinity='precomputed' — sklearn's affinity='euclidean' computes negative squared distances internally, but any custom similarity requires 'precomputed'.
- ×Passing the feature matrix X to fit() when affinity='precomputed' — must pass the similarity matrix S.
- ×Not increasing max_iter when warnings appear — the default 200 is often insufficient for complex similarity structures.
- ×Forgetting that A+R matrix must be checked, not just labels_, to understand convergence quality.
In Interviews
- ×Saying AP is 'like k-means but for arbitrary k' — AP's exemplar-finding mechanism is fundamentally different from k-means centroid optimization.
- ×Not knowing that responsibility is sent from point to exemplar candidate, and availability from exemplar candidate back to point.
- ×Claiming AP always finds the global optimum — it finds a local optimum of the net similarity objective via message passing, which may not be globally optimal.
- ×Not knowing the O(n²) memory limitation — this is the primary practical constraint.
Real Projects
- ×Running AP on n > 5K without checking available RAM — OOM crashes or swap thrashing.
- ×Using Euclidean similarity for text embeddings — cosine similarity is far more appropriate and often gives dramatically better clusters.
- ×Not validating exemplars with domain knowledge — the algorithm may select exemplars that are technically optimal but not meaningful to end users.
- ×Reporting n_clusters from AP without noting that it's preference-dependent — always report preference used.
What kind of bias does this model have?
Bias depends on distance and shape assumptions in feature space.
What kind of variance does it have?
Variance increases when cluster structure is unstable or high-dimensional.
How does it overfit?
Overfitting usually appears as strong train performance but weaker validation/test behavior.
How do we regularize it?
Use complexity constraints, robust validation, and data-centric cleanup.
What kind of data does it like?
Prefers scaled features with meaningful geometric distance.
What kind of data breaks it?
Breaks under leakage, severe distribution drift, noisy labels, and poorly engineered features.
Quick Revision Reference
Key Takeaways
- AP passes responsibility (from point to candidate exemplar) and availability (from candidate back to point) messages iteratively
- Responsibility r(i,k) = s(i,k) - max_{k'≠k}[a(i,k') + s(i,k')] — similarity to k minus best competition
- Availability a(i,k) = min(0, r(k,k) + Σ_{i'} max(0, r(i',k))) — capped at 0 for non-self
- Exemplar of i = argmax_k [a(i,k) + r(i,k)] after convergence
- Preference = s(k,k) = diagonal of S. Higher → more clusters. Default = median(S)
- Damping λ ∈ [0.5, 1) prevents oscillation — increase to 0.9 if not converging
- O(n²) memory and time — limited to n < 5K in practice
Critical Formulas
Best For
- ✓Small datasets where k is unknown
- ✓Applications requiring actual exemplar data points (not synthetic centroids)
- ✓Custom/non-Euclidean similarity matrices
- ✓Deterministic clustering without random restarts
Avoid When
- ✗n > 5K (O(n²) memory/time is prohibitive)
- ✗You already know k (k-means is faster)
- ✗Highly imbalanced natural cluster sizes
- ✗Convergence cannot be achieved (check n_iter_ == max_iter)
Interview Must-Know
These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.