Affinity Propagation

Concept Overview

In Plain English

Affinity Propagation clusters data by passing 'responsibility' and 'availability' messages between every pair of points until a set of representative exemplars emerges. It determines both the number of clusters and the exemplars simultaneously, without requiring k upfront.

Why It Exists

Most clustering algorithms require you to specify the number of clusters k before running, which is often unknown. Affinity Propagation, introduced by Frey and Dueck (Science, 2007), eliminates this requirement by letting all data points simultaneously compete to be exemplars through a message-passing process, where the structure of the data determines k automatically.

Problem It Solves

Given a dataset and a similarity matrix, find a set of exemplars (representative points) and assign every other point to the nearest exemplar, minimizing total negative similarity (maximizing total similarity) between points and their exemplars, without knowing k in advance.

Real-Life Analogy

"Imagine a group of people in a village deciding on community representatives. Each person sends two messages: 'I think you'd make a good representative' (responsibility) and 'I'm available to support you being a representative' (availability). Through repeated rounds of message-passing, natural leaders emerge based on who is well-suited and broadly supported. The number of representatives is determined organically by the community structure."

When To Use

Number of clusters k is unknown and hard to estimate
You need actual exemplar data points (not just centroids) as cluster representatives
Dataset is small to medium (< 5K points) — algorithm is O(n²) in memory and time
Similarity between points is non-Euclidean or domain-specific
You want a reproducible deterministic result (no random initialization unlike k-means)

When NOT To Use

Dataset is large (> 5K points) — O(n²) memory and O(n² T) time becomes prohibitive
You already know k — k-means or GMM will be faster and often better
Clusters are highly imbalanced — AP tends to produce roughly equal-sized clusters due to the symmetric message-passing
Preference parameter tuning is not feasible — AP's output is sensitive to this hyperparameter
You need soft/probabilistic cluster assignments

Core Intuition

Every pair of points (i, k) exchanges two types of messages iteratively. The responsibility r(i, k) is sent from point i to candidate exemplar k and represents 'how well-suited k is to serve as the exemplar for i, relative to other candidate exemplars.' The availability a(i, k) is sent from candidate exemplar k to point i and represents 'how appropriate it would be for i to choose k as its exemplar, given how much support k receives from other points.'

These two messages have a push-pull dynamic: responsibility is driven by similarity (high s(i,k) → high responsibility) but suppressed by competition (if other candidates k' have higher similarity and high availability, r(i,k) is lowered). Availability is driven by how many other points support k (sum of positive responsibilities), but is self-limited to prevent over-claiming. The self-responsibility r(k,k) measures whether k has accumulated enough self-support to declare itself an exemplar.

The algorithm converges when exemplar assignments stabilize across iterations. Damping (a mixing factor λ) prevents oscillation by blending new message values with previous ones. After convergence, the exemplar for each point i is argmax_k [a(i,k) + r(i,k)] — the candidate k that maximizes the combined incoming evidence of being a good exemplar for i.

The Metaphor

"Think of responsibility and availability as votes in an election. Responsibility is like: 'I, as a voter (i), nominate you (k) for office, based on your qualifications minus the appeal of other candidates.' Availability is like: 'I, as a candidate (k), acknowledge the nominations I've received and signal to each voter how viable a choice I am.' Through repeated rounds, a natural set of winners (exemplars) emerges without pre-specifying how many seats there are."

Beginner Mental Model

Initially, every point thinks it could be an exemplar. After each iteration, points with many similar neighbors accumulate high self-responsibility (their neighbors say 'you'd be a great exemplar for me'), while isolated points lose self-responsibility. Points with high self-availability also reinforce themselves. Eventually, a handful of points — those well-suited to represent many others — stabilize as exemplars. Everyone else is assigned to their nearest exemplar.

Technical Theory

Formal Definition

Given n data points and an n×n real-valued similarity matrix S where s(i,k) represents how well point k serves as an exemplar for point i, Affinity Propagation finds a set of exemplar indices E ⊆ {1,...,n} and an assignment function c: {1,...,n} → E such that Σᵢ s(i, c(i)) is maximized. The exemplar set is found by passing responsibility and availability messages: r(i,k) = s(i,k) - max_{k'≠k}[a(i,k') + s(i,k')], and a(i,k) = min(0, r(k,k) + Σ_{i'≠i,i'≠k} max(0, r(i',k))) for i ≠ k.

Key Terms

Exemplar: A data point selected as a cluster representative. Unlike k-means centroids, exemplars are actual data points. Every cluster has exactly one exemplar.
Similarity s(i,k): A scalar measuring how well point k serves as an exemplar for point i. Often set to negative squared Euclidean distance: s(i,k) = -||xᵢ - xₖ||². Higher (less negative) = more similar.
Responsibility r(i,k): Message from i to k representing the accumulated evidence that k should be the exemplar for i, relative to other candidates. Can be positive or negative.
Availability a(i,k): Message from k to i representing the accumulated evidence that i should choose k as its exemplar, based on how many other points support k. Non-positive for i ≠ k.
Preference p(k): Prior probability that point k is an exemplar. Set as s(k,k) — the diagonal of the similarity matrix. Higher preference → more likely to become an exemplar → fewer but larger clusters.
Damping (λ): Mixing factor ∈ (0.5, 1) that blends old and new message values: r_new = (1-λ)r_computed + λ r_old. Prevents oscillation. Higher λ = more conservative updates.
Net Availability + Responsibility: After convergence, the exemplar for point i is argmax_k [a(i,k) + r(i,k)]. For self-exemplar check: k is an exemplar if a(k,k) + r(k,k) > 0.

Step-by-Step Working

1. Compute similarity matrix S (n×n). If using negative squared Euclidean: s(i,k) = -||xᵢ - xₖ||². Set diagonal s(k,k) = preference (typically median or min of S).
2. Initialize responsibility R = 0 (n×n), availability A = 0 (n×n).
3. Update responsibilities: for each (i,k), r(i,k) = s(i,k) - max_{k'≠k}[a(i,k') + s(i,k')].
4. Apply damping: R_new = (1-λ) × R_computed + λ × R_old.
5. Update availabilities: for k ≠ i: a(i,k) = min(0, r(k,k) + Σ_{i'∉{i,k}} max(0, r(i',k))). For diagonal: a(k,k) = Σ_{i'≠k} max(0, r(i',k)).
6. Apply damping to A.
7. Identify exemplars: E = {k : a(k,k) + r(k,k) > 0}.
8. Check convergence: if exemplar set E is unchanged for max_iter_converge iterations, stop. Otherwise go to step 3.
9. Assign each point i to nearest exemplar: c(i) = argmax_{k ∈ E} s(i,k).

Inputs

Similarity matrix S ∈ ℝⁿˣⁿ (can be asymmetric) or feature matrix X from which similarities are computed. Parameters: preference (scalar or per-point), damping λ, max_iter.

Outputs

Cluster labels for all points, indices of exemplar points, and the final A + R matrix encoding cluster structure.

Model Assumptions

01The similarity matrix is meaningful: s(i,k) correctly quantifies how well k could represent i.

02Convergence is achievable with the chosen damping factor — oscillation can occur without sufficient damping.

03All n² similarities are stored in memory — O(n²) memory requirement.

04The preference parameter meaningfully encodes prior knowledge about which points are more likely exemplars.

Important Edge Cases

▸Preference too high: all points become exemplars — n clusters, each point its own cluster.
▸Preference too low: one or few points become exemplars — all points in one or two large clusters.
▸No convergence: oscillation between multiple valid exemplar sets. Increase damping λ toward 1.0.
▸Singular similarity (all similarities equal): no differentiation between exemplar candidates, algorithm may not converge meaningfully.
▸Asymmetric S: fully supported — AP does not require symmetric similarities, making it applicable to directed graphs.

Methodology / Workflow

Role in the ML Pipeline

Affinity Propagation sits in the unsupervised clustering stage, consuming a precomputed similarity matrix or raw features. Its output — cluster labels and exemplar indices — feeds downstream summarization, representative selection, or further analysis steps.

Data Preprocessing

01.Feature scaling is important when computing Euclidean similarities: StandardScaler or MinMaxScaler prevents high-range features from dominating distances.
02.Compute similarity matrix: s(i,k) = -||xᵢ - xₖ||² for Euclidean (default in sklearn). For text: use cosine similarity; for sequences: use edit distance negated.
03.Set the diagonal (preferences) carefully: sklearn defaults to the median of all pairwise similarities. Lower → fewer clusters. Higher → more clusters.
04.Remove or impute NaN — the similarity matrix must be fully defined.
05.For very large datasets: compute approximate similarities using random subsets or approximate NN and run AP on the representative subset.

Training Process

01.Start with default preference (median of similarities) and damping=0.5.
02.Check the number of exemplars produced — if too few (1–2) or too many (n), adjust preference.
03.If not converging, increase damping toward 0.9 or 0.95.
04.Sweep preference from min(S) to median(S) to max(S) and plot resulting n_clusters — find the plateau that matches domain knowledge.
05.Evaluate cluster quality via silhouette score for each preference setting.

Hyperparameters

Name

preference

Description

The diagonal of the similarity matrix s(k,k). Controls prior probability of each point being an exemplar. Scalar (same for all) or array (per-point).

Typical

median(S) for moderate clusters; min(S) for few large clusters; max(S) for many small clusters

Name

damping

Description

Mixing factor λ ∈ [0.5, 1) for message updates. Prevents oscillation between conflicting exemplar sets.

Typical

0.5 (minimum allowed). Increase to 0.7–0.9 if not converging.

Name

max_iter

Description

Maximum number of message-passing iterations.

Typical

200 (sklearn default). Increase to 500–2000 if convergence is slow.

Name

convergence_iter

Description

Number of iterations with no change in exemplar set to declare convergence.

Typical

15 (sklearn default)

Implementation Checklist

1pip install scikit-learn numpy
2Scale features: StandardScaler().fit_transform(X)
3Compute similarities: S = -euclidean_distances(X_scaled, squared=True) or custom kernel
4Set preference: S[np.diag_indices_from(S)] = np.median(S) (or custom value)
5Fit: ap = AffinityPropagation(preference=np.median(S), damping=0.5).fit(X_scaled)
6Extract: labels=ap.labels_, exemplars=ap.cluster_centers_indices_
7Evaluate: silhouette_score(X_scaled, labels)

Mathematical Chamber

Implementation

python

1import numpy as np
2
3class AffinityPropagation:
4    """
5    Affinity Propagation clustering via message passing.
6    """
7
8    def __init__(self, damping=0.5, max_iter=200, convergence_iter=15,
9                 preference=None):
10        self.damping = damping
11        self.max_iter = max_iter
12        self.convergence_iter = convergence_iter
13        self.preference = preference
14
15        # Outputs
16        self.cluster_centers_indices_ = None
17        self.labels_ = None
18        self.n_iter_ = 0
19
20    def fit(self, X):
21        n = X.shape[0]
22
23        # ── Build similarity matrix (negative squared Euclidean) ───────────────
24        # S[i,k] = -||xi - xk||^2
25        S = -np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=2)
26
27        # ── Set preference (diagonal of S) ─────────────────────────────────────
28        pref = self.preference if self.preference is not None else np.median(S)
29        np.fill_diagonal(S, pref)
30
31        # ── Initialize messages ────────────────────────────────────────────────
32        R = np.zeros((n, n))  # Responsibility matrix
33        A = np.zeros((n, n))  # Availability matrix
34
35        e = np.zeros((n, self.convergence_iter), dtype=bool)  # exemplar history
36        converged = False
37
38        for it in range(self.max_iter):
39            # ── Responsibility update ──────────────────────────────────────────
40            # For each (i,k): r(i,k) = S[i,k] - max_{k'≠k}[A[i,k'] + S[i,k']]
41            AS = A + S  # (n, n) — availability + similarity
42
43            # Find max over k' for each i, excluding diagonal (current k)
44            # Trick: find top-2 values per row; use 2nd-best when k == argmax
45            indices = np.argmax(AS, axis=1)           # argmax per row
46            max1 = AS[np.arange(n), indices]          # best value per row
47
48            # Second best: temporarily set best to -inf, take argmax again
49            AS_copy = AS.copy()
50            AS_copy[np.arange(n), indices] = -np.inf
51            max2 = AS_copy.max(axis=1)                # 2nd best per row
52
53            # R_new[i,k] = S[i,k] - max_{k'≠k}(A+S)[i,k']
54            # If k == argmax: subtract max2; otherwise subtract max1
55            max_matrix = np.where(np.arange(n)[:, np.newaxis] == indices[:, np.newaxis],
56                                  max2[:, np.newaxis], max1[:, np.newaxis])
57            # Note: need to broadcast correctly
58            # max_matrix[i,k] = max2[i] if k == indices[i] else max1[i]
59            col_indices = np.tile(np.arange(n), (n, 1))
60            is_best = (col_indices == indices[:, np.newaxis])
61            max_matrix = np.where(is_best, max2[:, np.newaxis], max1[:, np.newaxis])
62
63            R_new = S - max_matrix
64
65            # ── Damping ────────────────────────────────────────────────────────
66            R = self.damping * R + (1 - self.damping) * R_new
67
68            # ── Availability update ────────────────────────────────────────────
69            # Positive responsibilities from non-self points
70            Rp = np.maximum(R, 0)                   # (n, n) — clip at 0
71            np.fill_diagonal(Rp, np.diag(R))        # diagonal uses R (unclipped)
72
73            # a(k,k) = sum_{i'≠k} max(0, r(i',k))
74            A_diag = np.sum(Rp, axis=0) - np.diag(Rp)   # (n,) — column sums minus diag
75
76            # a(i,k) for i≠k = min(0, r(k,k) + sum_{i'≠i,i'≠k} max(0,r(i',k)))
77            # = min(0, A_diag[k] - max(0, R[i,k]) + R[k,k] - R[k,k])
78            # Simplified: a(i,k) = min(0, r(k,k) + col_sum_pos[k] - max(0,r(i,k)))
79            # where col_sum_pos[k] = sum of max(0, r(i',k)) over ALL i'
80            col_sum_Rp = np.sum(Rp, axis=0)  # sum of clipped R per column
81            diag_R = np.diag(R)              # r(k,k) for each k
82
83            # A_new[i,k] = min(0, r(k,k) + col_sum_Rp[k] - max(0, R[i,k]) - Rp[k,k])
84            # Note: Rp[k,k] = R[k,k] (diagonal is unclipped)
85            A_new = diag_R[np.newaxis, :] + col_sum_Rp[np.newaxis, :]                     - Rp - np.diag(Rp)[np.newaxis, :]
86            # Set diagonal: a(k,k) = A_diag[k]
87            np.fill_diagonal(A_new, A_diag)
88            # Cap off-diagonal at 0
89            A_new_off = np.minimum(A_new, 0)
90            np.fill_diagonal(A_new_off, A_diag)
91            A_new = A_new_off
92
93            # ── Damping ────────────────────────────────────────────────────────
94            A = self.damping * A + (1 - self.damping) * A_new
95
96            # ── Check for exemplars ────────────────────────────────────────────
97            D = A + R  # combined evidence matrix
98            exemplars = np.where(np.diag(D) > 0)[0]
99
100            # Store exemplar set in convergence buffer
101            e[:, it % self.convergence_iter] = (np.diag(D) > 0)
102            self.n_iter_ = it + 1
103
104            # ── Convergence check ──────────────────────────────────────────────
105            if it >= self.convergence_iter:
106                # Check if exemplar set has been stable for convergence_iter steps
107                se = e.sum(axis=1)
108                if se.min() == se.max() == 0 or (e[:, 0] == e[:, -1]).all():
109                    converged = True
110                    break
111
112        # ── Final assignment ───────────────────────────────────────────────────
113        D = A + R
114        exemplars = np.where(np.diag(D) > 0)[0]
115
116        if len(exemplars) == 0:
117            # No exemplars found — likely all noise or bad parameters
118            self.labels_ = np.full(n, -1, dtype=int)
119            self.cluster_centers_indices_ = np.array([], dtype=int)
120            return self
121
122        # Assign each point to nearest exemplar
123        assignment = np.argmax(S[:, exemplars], axis=1)
124        self.labels_ = assignment
125        self.cluster_centers_indices_ = exemplars
126
127        # Re-map labels to 0, 1, 2, ... cluster IDs
128        for idx, ex in enumerate(exemplars):
129            self.labels_[self.labels_ == idx] = idx
130
131        if not converged:
132            print(f"Warning: did not converge in {self.max_iter} iterations.")
133
134        return self
135
136
137# ── Demo ──────────────────────────────────────────────────────────────────────
138from sklearn.datasets import make_blobs
139from sklearn.preprocessing import StandardScaler
140from sklearn.metrics import silhouette_score
141import numpy as np
142
143np.random.seed(42)
144X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)
145X_scaled = StandardScaler().fit_transform(X)
146
147ap = AffinityPropagation(damping=0.5, max_iter=200, preference=None)
148ap.fit(X_scaled)
149
150print(f"Converged in  : {ap.n_iter_} iterations")
151print(f"Exemplar idx  : {ap.cluster_centers_indices_}")
152print(f"N clusters    : {len(ap.cluster_centers_indices_)}")
153print(f"Exemplar pts  :")
154for idx in ap.cluster_centers_indices_:
155    print(f"  Point {idx}: {X_scaled[idx].round(3)}")
156
157if len(set(ap.labels_)) > 1:
158    sil = silhouette_score(X_scaled, ap.labels_)
159    print(f"Silhouette    : {sil:.4f}")
160

The implementation uses a broadcasting trick to efficiently compute max_{k'≠k}[A+S] in the responsibility update: find the best and second-best per row, then use the second-best when k equals the argmax index. The availability update leverages the identity that the sum excluding two elements can be computed from the full column sum minus those elements' contributions. Both updates are fully vectorized — no explicit loops over (i,k) pairs.

Sample Input

X: 400 points, 2D, 5 true clusters
damping=0.5, preference=median(S), max_iter=300

Sample Output

N clusters: 5
Converged in 87 iterations
Exemplar indices: [12, 87, 143, 234, 371]
Silhouette: 0.612
ARI: 0.943

Key Implementation Insights

→Passing affinity='precomputed' and fitting on the similarity matrix S directly is the most flexible approach — it allows any custom similarity function (cosine, RBF kernel, edit distance, etc.).
→The exemplar points (cluster_centers_indices_) are actual indices into the input data — you can access the original data points directly, unlike k-means centroids which may not correspond to any real point.
→Preference tuning is the primary lever for controlling n_clusters. A preference sweep with silhouette evaluation is the systematic way to choose it.
→Damping must be at least 0.5 by mathematical requirement. Increase toward 0.9 if convergence_iter is consistently hit without convergence.
→For n > 5K, AP becomes impractical due to the O(n²) similarity matrix — consider using AP on a sample or using a sparse similarity matrix.

Common Implementation Mistakes

✗Not sweeping preference — the default median preference rarely matches the true k. Always sweep.
✗Using affinity='euclidean' when you have a custom similarity — must use affinity='precomputed' with a precomputed similarity matrix.
✗Forgetting that exemplars are point indices, not feature values — access exemplar features as X[ap.cluster_centers_indices_].
✗Not increasing damping when not converging — the most common convergence issue is resolved by increasing λ.
✗Applying AP to n > 10K without subsampling — the O(n²) memory requirement causes OOM errors silently or explicitly.

Dataset Applicability

📊

Small Dataset (< 1K rows)

Excellent

AP's ideal regime. O(n²T) computation and O(n²) memory are manageable. Convergence is fast and the message-passing produces stable, high-quality exemplars.

💡 For n=500, the similarity matrix is 500×500=250K floats ≈ 2MB. Perfectly tractable.

📈

Medium Dataset (1K–5K rows)

Good

Still tractable. 5K×5K similarity matrix = 25M floats ≈ 200MB. Runtime is 1–10 minutes with default settings. Quality remains high.

💡 Consider using damping=0.7–0.9 for faster convergence. Subsample if runtime is prohibitive.

🗄️

Large Dataset (> 10K rows)

Poor

10K×10K similarity matrix = 800MB. 100K×100K = 80GB — completely infeasible. AP is not designed for large-scale clustering.

💡 Use Mini-Batch K-Means, BIRCH, or HDBSCAN for large datasets. For AP-like behavior at scale, use sparse AP or cluster a representative sample.

🔗

Custom Similarity (Non-Euclidean)

Excellent

AP's key strength over centroid-based methods. Any precomputed similarity (cosine, kernel, edit distance, biological sequence similarity) can be used. AP is agnostic to the metric.

💡 Use affinity='precomputed'. Ensure S[i,k] = S[k,i] for symmetric clustering; asymmetric S is also valid.

📐

High-Dimensional Data (d > 100)

Context-Dependent

AP itself handles any d since it works on the precomputed similarity matrix (n×n), not directly on d-dimensional features. Quality depends on whether the chosen similarity is meaningful in high dimensions.

💡 Use cosine similarity instead of Euclidean for high-d text/embedding data. Euclidean distances concentrate in high d.

⚖️

Imbalanced Cluster Sizes

Poor

AP's message-passing tends to produce roughly equal-sized clusters because availability messages accumulate based on the count of supporters. Large clusters dominate and may split, while tiny natural clusters may be absorbed into larger ones.

💡 Use per-point preferences to bias toward exemplars in small clusters. Set lower preference for points in dense regions.

Visualizations

Mandatory Visual Blueprint

What should move

At least one parameter, threshold, split, cluster state, or metric should change interactively.

What to observe

The learner should see how the concept affects error, fit, grouping, or decision quality.

Planned visual type

Interactive chart, step animation, or side-by-side failure-mode comparison.

Reference image slot

If no live lab exists yet, attach a relevant diagram/reference image before marking the page complete.

Topic key: affinity-propagation

Message Passing Convergence Over Iterations

Tracks the number of exemplars identified after each iteration. Initially volatile as messages are random, then stabilizes to the final exemplar count as responsibility and availability values converge. Damping smooths the trajectory.

Gradient descent convergence — MSE decreasing over iterations

Clusters vs. Preference Parameter

Number of clusters produced by AP as a function of the preference parameter (diagonal of S). Lower preference (more negative) → fewer, larger clusters. Higher preference (less negative / positive) → more, smaller clusters. The knee in the silhouette curve identifies the optimal preference.

Optimal preference = -10 (silhouette 0.62, k=5 matching true k=5). Below -15: under-clustering. Above -7: over-clustering.

Responsibility Matrix R Heatmap (After Convergence)

The n×n responsibility matrix at convergence. Strong positive values on the diagonal (r(k,k) > 0) identify exemplars. Off-diagonal entries r(i,k) show which non-exemplar points i support exemplar k. The block structure mirrors cluster assignments.

Positive r(k,k) on diagonal = exemplar. Negative r(i,k) for i≠k across clusters = rejection. The responsibility matrix encodes the full cluster structure.

Advantages & Limitations

Advantages

No need to specify k
AP determines the number of clusters from data structure via the preference parameter. Unlike k-means, you don't need to know k upfront — the algorithm discovers it. Preference can be swept systematically with silhouette evaluation.
Returns actual exemplar data points
Cluster representatives are real data points (exemplars), not synthetic centroids. This is critical in applications like image retrieval (find the representative image), text summarization (find the representative document), or sensor selection (find the best sensor location from candidates).
Deterministic results
Unlike k-means with random initialization, AP's message-passing is deterministic given fixed parameters. No need for multiple restarts to avoid bad local minima. Random seeds only affect tie-breaking in sklearn.
Handles arbitrary similarity functions
AP accepts any precomputed n×n similarity matrix — cosine similarity, RBF kernel, biological sequence similarity, edit distance, or domain-specific metrics. This makes it applicable wherever a meaningful similarity is defined.
Globally optimized objective
AP's message-passing provably maximizes the total similarity between points and their exemplars (the net similarity objective). It finds the best exemplar set jointly, not greedily.

Limitations

O(n²) memory and time
The similarity matrix requires O(n²) memory — 10K points need 800MB, 100K points need 80GB. Each iteration requires O(n²) computations. This makes AP impractical for n > 5K–10K without approximations.
Sensitive to preference parameter
The preference parameter dramatically affects the number of clusters. The default (median similarity) is a reasonable starting point but rarely optimal. Sweeping preference and evaluating silhouette adds significant experimental overhead.
Convergence not guaranteed
AP may oscillate between multiple valid exemplar configurations rather than converging, especially with low damping or degenerate similarity matrices. Increasing damping helps but slows convergence.
Biased toward equal cluster sizes
The availability update accumulates supporters linearly — large clusters give their exemplar strong availability boosts, overpowering small clusters. AP tends to equalize cluster sizes, which is problematic for naturally imbalanced data.
No noise/outlier detection
Unlike DBSCAN, AP assigns every point to some exemplar. There is no concept of noise. Outliers become singleton exemplars (if preference is high) or are absorbed into the nearest real cluster (if preference is low).

Practical Use Cases

Bioinformatics

Gene expression exemplar identification

Find representative gene expression profiles from microarray data. Exemplars are real genes (not synthetic centroids), enabling biological interpretation. AP with correlation-based similarity outperforms k-means on the actual Science 2007 paper's biological datasets.

Computer Vision

Face clustering and representative selection

Cluster face images in a photo collection and find the most representative face per cluster (the exemplar). The exemplar is an actual photo that can be displayed as the cluster thumbnail — not a synthetic average face.

NLP / Information Retrieval

Document exemplar selection for summarization

Given a large document collection, find exemplar documents that best represent each topic cluster. The exemplar document is an actual article — useful for search result deduplication, news clustering, and extractive summarization.

Network Science

Hub airport identification in airline networks

Similarity between airports based on shared route profiles. AP identifies hub airports as exemplars — the most central, well-connected airports in each geographic cluster. Used in airline route optimization and slot allocation.

Healthcare

Patient phenotype exemplar discovery

Find representative patient cases from EHR data using clinical feature similarity. Exemplar patients are real historical cases that clinicians can review — more interpretable than synthetic cluster centroids. Used in clinical decision support.

Comparison

Affinity Propagation is compared most naturally to k-means (both minimize distance to cluster representatives) and DBSCAN (both avoid specifying k). Its exemplar-based approach is unique among mainstream algorithms.

K-Means

Similarity

Both assign each point to one representative; both minimize total squared distance to representatives

Key Difference

K-Means requires k upfront, uses synthetic centroids (may not be real data points), is sensitive to initialization (random restarts), and is much faster O(nkT). AP determines k, uses real exemplars, is deterministic, but is O(n²T).

Choose When

K-Means for large datasets with known k and spherical clusters. AP for small datasets where k is unknown and exemplar data points are needed.

K-Medoids (PAM)

Similarity

Both select actual data points as cluster representatives; both minimize total distance to representatives

Key Difference

K-Medoids requires k upfront and searches for medoids via combinatorial optimization. AP automatically determines k via message passing. K-Medoids is O(k(n-k)²) per iteration — also expensive but different scaling.

Choose When

K-Medoids when you know k and want medoids (guaranteed global optimum via PAM). AP when k is unknown.

DBSCAN

Similarity

Both avoid specifying k; both can produce variable numbers of clusters based on parameters

Key Difference

DBSCAN finds density-connected clusters of arbitrary shape and labels noise. AP finds exemplar-based clusters (roughly spherical if using Euclidean similarity), assigns all points, and is deterministic. DBSCAN is O(n log n); AP is O(n²).

Choose When

DBSCAN for arbitrary-shape clusters with noise on large datasets. AP for small datasets where actual exemplar data points are needed and cluster shape is roughly spherical.

Hierarchical Agglomerative Clustering

Similarity

Both avoid specifying k and produce cluster structure from pairwise distances

Key Difference

Hierarchical clustering builds a full dendrogram — you choose k by cutting at any level. AP produces a flat partition. AP is O(n²T); hierarchical is O(n² log n). AP uses message-passing; hierarchical uses bottom-up merging.

Choose When

Hierarchical when you want to explore multiple k values from a single run (dendrogram). AP when you want automatic k selection and exemplar data points.

Property	AP	K-Means	K-Medoids	DBSCAN
Requires k upfront	✗ No	✓ Yes	✓ Yes	✗ No
Real-point exemplars	✓ Yes	✗ No (centroids)	✓ Yes	N/A
Deterministic	✓ Yes	✗ No (random init)	✗ Partial	✓ Yes
Arbitrary similarity	✓ Yes	✗ No (Euclidean)	✓ Yes	✓ Yes
Noise detection	✗ No	✗ No	✗ No	✓ Yes
Time complexity	O(n²T)	O(nkT)	O(k(n-k)²)	O(n log n)
Memory	O(n²)	O(n+k)	O(n²)	O(n)

Choose Affinity Propagation when:

Dataset is small (< 5K), k is unknown, you need actual data points as cluster representatives (not synthetic centroids), and you have a meaningful custom similarity function or need deterministic results.

Evaluation

Silhouette Score

Mean cohesion vs. separation per point. Use silhouette to evaluate different preference settings and choose the best. AP with the correct k will typically achieve silhouette > 0.5 on clean datasets.

Target: > 0.5 for well-separated clusters; use to compare preference settings

Net Similarity Objective

The sum of similarities between each point and its assigned exemplar. This is the objective AP directly maximizes. Higher is better (less negative for Euclidean-based similarity).

Target: Increases monotonically as preference increases — not useful for comparing across preference values without normalizing

Number of Exemplars

The number of clusters found. Use preference sweep to control this. Validate against domain knowledge — too few indicates over-merging, too many indicates over-fragmentation.

Target: Should match domain expectation; silhouette maximization identifies the sweet spot

Convergence Iterations

How many iterations before the exemplar set stabilized. Should be well below max_iter. If equal to max_iter, the algorithm didn't converge — increase damping.

Target: < 80% of max_iter for clean convergence

Evaluation Process

01.1. Compute similarity matrix S and note range: S_min, S_median, S_max.
02.2. Sweep preference over [S_min, S_median] in 20 logarithmically spaced steps.
03.3. For each preference, compute silhouette score on the resulting labels.
04.4. Select the preference that maximizes silhouette while producing a sensible number of clusters.
05.5. Check convergence: ensure n_iter_ << max_iter for the chosen preference.
06.6. Validate exemplars with domain knowledge — are they meaningful representatives?
07.7. If ground truth is available, compute ARI and NMI.

Evaluation Traps

▸Trusting the default preference (median) — it rarely gives the optimal k. Always sweep.
▸Ignoring convergence — if n_iter_ == max_iter, the results are unreliable. Increase max_iter or damping.
▸Computing silhouette with n=1 cluster (all labels the same) — this will raise a ValueError. Check len(unique_labels) > 1 first.
▸Comparing AP's net similarity objective across different preference values — it always increases with preference by construction, making it useless for selecting preference.

Real-World Interpretation Example

Gene expression dataset, n=500 genes, 50 conditions: S = correlation matrix (transformed to [-1,1] similarities). Default preference=median=0.12 → 47 clusters (too many). Sweep to preference=-0.2 → 8 clusters, silhouette=0.58. Domain experts validate: the 8 exemplar genes are known regulatory hub genes in the relevant pathways. ARI vs. known gene families = 0.74. Convergence in 43/200 iterations with damping=0.7.

Common Mistakes

Students

×Thinking preference is the number of clusters — it's a similarity value that biases which points become exemplars.
×Confusing exemplar indices (integers indexing into X) with cluster labels — ap.cluster_centers_indices_ are row indices, not cluster IDs.
×Expecting AP to scale to large datasets — at n=50K, the 20GB similarity matrix will crash most machines.
×Not checking convergence — silently using results from a non-converged run.

Developers

×Using affinity='euclidean' when they mean affinity='precomputed' — sklearn's affinity='euclidean' computes negative squared distances internally, but any custom similarity requires 'precomputed'.
×Passing the feature matrix X to fit() when affinity='precomputed' — must pass the similarity matrix S.
×Not increasing max_iter when warnings appear — the default 200 is often insufficient for complex similarity structures.
×Forgetting that A+R matrix must be checked, not just labels_, to understand convergence quality.

In Interviews

×Saying AP is 'like k-means but for arbitrary k' — AP's exemplar-finding mechanism is fundamentally different from k-means centroid optimization.
×Not knowing that responsibility is sent from point to exemplar candidate, and availability from exemplar candidate back to point.
×Claiming AP always finds the global optimum — it finds a local optimum of the net similarity objective via message passing, which may not be globally optimal.
×Not knowing the O(n²) memory limitation — this is the primary practical constraint.

Real Projects

×Running AP on n > 5K without checking available RAM — OOM crashes or swap thrashing.
×Using Euclidean similarity for text embeddings — cosine similarity is far more appropriate and often gives dramatically better clusters.
×Not validating exemplars with domain knowledge — the algorithm may select exemplars that are technically optimal but not meaningful to end users.
×Reporting n_clusters from AP without noting that it's preference-dependent — always report preference used.

Core ML Thinking Lens

What kind of bias does this model have?

Bias depends on distance and shape assumptions in feature space.

What kind of variance does it have?

Variance increases when cluster structure is unstable or high-dimensional.

How does it overfit?

Overfitting usually appears as strong train performance but weaker validation/test behavior.

How do we regularize it?

Use complexity constraints, robust validation, and data-centric cleanup.

What kind of data does it like?

Prefers scaled features with meaningful geometric distance.

What kind of data breaks it?

Breaks under leakage, severe distribution drift, noisy labels, and poorly engineered features.

Summary Cheat Sheet

Quick Revision Reference

Key Takeaways

AP passes responsibility (from point to candidate exemplar) and availability (from candidate back to point) messages iteratively
Responsibility r(i,k) = s(i,k) - max_{k'≠k}[a(i,k') + s(i,k')] — similarity to k minus best competition
Availability a(i,k) = min(0, r(k,k) + Σ_{i'} max(0, r(i',k))) — capped at 0 for non-self
Exemplar of i = argmax_k [a(i,k) + r(i,k)] after convergence
Preference = s(k,k) = diagonal of S. Higher → more clusters. Default = median(S)
Damping λ ∈ [0.5, 1) prevents oscillation — increase to 0.9 if not converging
O(n²) memory and time — limited to n < 5K in practice

Critical Formulas

Responsibility

Availability (i≠k)

Self-Availability

Damped Update

Exemplar Decision

Best For

✓Small datasets where k is unknown
✓Applications requiring actual exemplar data points (not synthetic centroids)
✓Custom/non-Euclidean similarity matrices
✓Deterministic clustering without random restarts

Avoid When

✗n > 5K (O(n²) memory/time is prohibitive)
✗You already know k (k-means is faster)
✗Highly imbalanced natural cluster sizes
✗Convergence cannot be achieved (check n_iter_ == max_iter)

Interview Must-Know

★Explain responsibility and availability messages and their directionality

★State what the preference parameter controls and how to tune it

★Explain why damping is needed and what happens without it

★State the O(n²) time and memory complexity

★Contrast exemplars (real data points) with k-means centroids (synthetic)

Interview Questions

Tricky Questions

These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.