ML Atlas

Gaussian Mixture Models

Probabilistic clustering via the EM algorithm — K-Means with uncertainty.

Advanced · Unsupervised · Math Heavy
26 min read
k-means · naive-bayes · pca
  • Speaker diarization — which speaker said what
  • Image segmentation by color/texture mixture
  • Anomaly detection via low-likelihood samples
  • Financial return distribution modeling
  • Density estimation for generative modeling
01

In Plain English

A Gaussian Mixture Model assumes data was generated by a mixture of several Gaussian distributions. Each Gaussian is a cluster. GMM learns the mean, covariance, and weight of each Gaussian from data — and tells you the probability that each point belongs to each cluster.

Why It Exists

K-Means makes hard assignments (each point belongs to exactly one cluster) and assumes spherical clusters. GMM makes soft probabilistic assignments and allows elliptical clusters of any size and orientation (via covariance matrices). It's K-Means generalized to a proper probabilistic framework.

Problem It Solves

Clusters that overlap, have different shapes/orientations, or need uncertainty estimates. GMM provides: (1) soft cluster membership, (2) full covariance modeling, (3) a principled density model.

Real-Life Analogy

"Imagine heights in a population that includes both men and women. The height distribution has two humps — a mixture of two Gaussians. GMM fits those two Gaussians simultaneously, and for any height tells you: 'probably male with 70% confidence, female with 30%.'"

When To Use

  • Clusters have different shapes, sizes, or orientations
  • You need soft cluster membership (probability per cluster)
  • The data is genuinely generated by a mixture of distributions
  • Density estimation for anomaly detection
  • As a generative model for sampling new data

When NOT To Use

  • Very high-dimensional data without regularization (singular covariance)
  • When you need simple, fast clustering (use K-Means)
  • Heavily non-Gaussian cluster shapes (use DBSCAN or spectral clustering)
  • Very large datasets — EM is iterative and slow
  • When k is truly unknown (GMM still requires specifying k, use BIC to choose)
02

GMM is a latent variable model: imagine each data point was generated by first picking a cluster (with probability πₖ), then drawing from the Gaussian for that cluster (with mean μₖ and covariance Σₖ). We observe x but not which cluster generated it.
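The two-stage generative story above can be sketched in a few lines of NumPy. This is only an illustration: the weights, means, and covariances below are hypothetical values, not parameters from any dataset discussed here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-component mixture in 2-D (illustrative values only).
weights = np.array([0.7, 0.3])                    # pi_k, must sum to 1
means = np.array([[0.0, 0.0], [4.0, 4.0]])        # mu_k
covs = np.array([[[1.0, 0.5], [0.5, 1.0]],
                 [[0.5, 0.0], [0.0, 2.0]]])       # Sigma_k

n = 1000
# Stage 1: pick the latent cluster z_i for each point with probability pi_k.
z = rng.choice(len(weights), size=n, p=weights)

# Stage 2: draw x_i from the Gaussian belonging to the chosen cluster.
X = np.empty((n, 2))
for k in range(len(weights)):
    idx = z == k
    X[idx] = rng.multivariate_normal(means[k], covs[k], size=int(idx.sum()))

# In a real problem only X is observed; z is the hidden variable EM reasons about.
empirical_weights = np.bincount(z, minlength=len(weights)) / n
print(X.shape, empirical_weights)
```

The empirical cluster frequencies should land close to the chosen mixture weights, which is what "picking a cluster with probability πₖ" means in practice.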

The EM algorithm alternates: E-step computes the probability that each cluster generated each point (soft assignment). M-step updates the Gaussian parameters to maximize the likelihood given those soft assignments. Repeat until convergence.

The result is a proper probability density over all of data space. Any point can be scored — low-likelihood points are anomalies. This is the key advantage over K-Means: GMM is a full generative model.

The Metaphor

"Imagine sorting a shuffled deck of cards from multiple decks mixed together. You don't know how many decks or their compositions. GMM is like figuring out: 'this card probably came from a red deck (70%) or a blue deck (30%)' — and updating your belief about each deck as you process more cards."

Beginner Mental Model

Start with k Gaussians placed randomly. For each point, compute how likely each Gaussian generated it (E-step). Then pull each Gaussian toward the points it likely generated, weighted by those probabilities (M-step). Repeat until stable.

03

A GMM defines the density p(x) = Σₖ πₖ 𝒩(x; μₖ, Σₖ) where πₖ are mixture weights (Σπₖ=1, πₖ≥0), μₖ are cluster means, and Σₖ are covariance matrices. Parameters are fit by maximizing the log-likelihood via EM.
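As a quick sanity check on the density formula, the sketch below evaluates p(x) for a one-dimensional mixture and confirms numerically that it integrates to about 1. The three components are made-up values chosen for illustration, and the helper names `normal_pdf` and `mixture_pdf` are hypothetical.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Univariate Gaussian density N(x; mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def mixture_pdf(x, weights, means, variances):
    """p(x) = sum_k pi_k N(x; mu_k, var_k), evaluated for every x."""
    x = np.asarray(x, dtype=float)[..., None]   # add a trailing axis for the components
    return np.sum(weights * normal_pdf(x, means, variances), axis=-1)

# Hypothetical 3-component mixture (illustrative values only).
weights = np.array([0.5, 0.3, 0.2])
means = np.array([-2.0, 0.0, 3.0])
variances = np.array([0.5, 1.0, 2.0])

grid = np.linspace(-15.0, 15.0, 20001)
density = mixture_pdf(grid, weights, means, variances)

# Riemann-sum approximation of the integral of p(x) over the grid.
area = density.sum() * (grid[1] - grid[0])
print(area)
```

Because the weights sum to 1 and each Gaussian integrates to 1, the mixture is itself a valid density.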

Mixture weight πₖ
Prior probability that a data point belongs to component k
Responsibility rᵢₖ
Posterior probability that component k generated point xᵢ (computed in E-step)
E-step
Expectation step: compute responsibilities rᵢₖ using current parameters
M-step
Maximization step: update μₖ, Σₖ, πₖ to maximize weighted log-likelihood
ELBO
Evidence Lower BOund: the quantity EM actually maximizes; it equals the log-likelihood when the responsibilities are the exact posteriors under the current parameters
Covariance type
Constraint on Σₖ: full (any), tied (shared), diagonal, spherical — controls flexibility vs. overfitting
Degeneracy/singularity
When a Gaussian collapses onto a single point — covariance → 0, likelihood → ∞ — a pathological failure
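The relationship between the ELBO and the log-likelihood can be checked numerically. The sketch below assumes the standard form ELBO(q) = Σᵢ Σₖ qᵢₖ [log(πₖ 𝒩(xᵢ; μₖ, σₖ²)) − log qᵢₖ] on a one-dimensional toy problem; the data and parameters are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 50), rng.normal(3, 1, 50)])

# Arbitrary (not fitted) parameters for a 2-component 1-D mixture.
weights = np.array([0.4, 0.6])
means = np.array([-1.0, 2.0])
variances = np.array([1.5, 1.5])

# log(pi_k * N(x_i; mu_k, var_k)), shape (n, k)
log_joint = (np.log(weights)
             - 0.5 * np.log(2 * np.pi * variances)
             - 0.5 * (x[:, None] - means) ** 2 / variances)

# Log-likelihood: sum_i log sum_k pi_k N(x_i; mu_k, var_k)
log_lik = np.sum(np.log(np.exp(log_joint).sum(axis=1)))

def elbo(q):
    """Lower bound for any row-stochastic assignment matrix q with positive entries."""
    return np.sum(q * (log_joint - np.log(q)))

# Exact responsibilities (the E-step posterior) make the bound tight.
r = np.exp(log_joint)
r /= r.sum(axis=1, keepdims=True)

# Any other assignment, here a uniform 50/50 split, gives a looser bound.
q_uniform = np.full_like(r, 0.5)

print(elbo(r), log_lik, elbo(q_uniform))
```

The first two printed values should agree up to floating-point error, while the third should be lower, which is why the E-step sets q to the responsibilities.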
  1. Initialize: set k, initialize μₖ (e.g., from K-Means), Σₖ = I, πₖ = 1/k
  2. E-step: compute rᵢₖ = πₖ 𝒩(xᵢ; μₖ, Σₖ) / Σⱼ πⱼ 𝒩(xᵢ; μⱼ, Σⱼ)
  3. M-step: update Nₖ = Σᵢ rᵢₖ, then μₖ = (1/Nₖ) Σᵢ rᵢₖ xᵢ
  4. M-step: update Σₖ = (1/Nₖ) Σᵢ rᵢₖ (xᵢ-μₖ)(xᵢ-μₖ)ᵀ
  5. M-step: update πₖ = Nₖ/n
  6. Compute log-likelihood: L = Σᵢ log(Σₖ πₖ 𝒩(xᵢ; μₖ, Σₖ))
  7. Repeat E/M until ΔL < ε (convergence)
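To make steps 2 to 6 concrete, here is one EM iteration worked through on six one-dimensional points. It is a minimal sketch with hand-picked starting parameters, so scalar variances stand in for the covariance matrices Σₖ.

```python
import numpy as np

x = np.array([1.0, 1.2, 0.8, 5.0, 5.3, 4.9])   # toy 1-D data
pi = np.array([0.5, 0.5])                       # step 1: pi_k = 1/k
mu = np.array([0.0, 6.0])                       # hand-picked initial means
var = np.array([1.0, 1.0])                      # unit variances (1-D analogue of Sigma_k = I)

def weighted_pdfs(x, pi, mu, var):
    """pi_k * N(x_i; mu_k, var_k), shape (n, k)."""
    return pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def log_likelihood(x, pi, mu, var):
    """Step 6: L = sum_i log sum_k pi_k N(x_i; mu_k, var_k)."""
    return np.sum(np.log(weighted_pdfs(x, pi, mu, var).sum(axis=1)))

ll_before = log_likelihood(x, pi, mu, var)

# Step 2 (E-step): responsibilities r_ik, each row sums to 1.
p = weighted_pdfs(x, pi, mu, var)
r = p / p.sum(axis=1, keepdims=True)

# Steps 3-5 (M-step): responsibility-weighted mean, variance, and weight.
Nk = r.sum(axis=0)
mu = (r * x[:, None]).sum(axis=0) / Nk
var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
pi = Nk / len(x)

ll_after = log_likelihood(x, pi, mu, var)
print(ll_before, ll_after)
print(mu, var, pi)
```

After a single iteration the means move to roughly the centers of the two groups of points, and the log-likelihood does not decrease.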

Input: feature matrix X ∈ ℝⁿˣᵈ, number of components k, covariance_type

Output: cluster labels (argmax of responsibilities), soft probabilities per cluster, fitted density p(x)

  1. Data is generated by a finite mixture of Gaussian distributions
  2. k is specified by the user (use BIC/AIC to select)
  3. Observations are i.i.d.
  4. No degeneracy: clusters don't collapse to single points
  • Singular covariance: component collapses onto a point — likelihood explodes (use regularization)
  • k too large: components merge or become degenerate
  • EM convergence to local optimum: restart with different initializations
  • All points assigned to one component: k too small or bad initialization
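The singular-covariance failure in the first bullet can be reproduced on purpose. In this sketch (a constructed toy case, not a fitted model) one component is centered exactly on a data point and its variance is shrunk, and a fixed floor of 1e-6 on the variance plays the role of regularization.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

def log_likelihood(x, spike_var):
    # Component 0: a broad Gaussian covering all of the data.
    # Component 1: a "spike" centered exactly on the data point x[2].
    pi = np.array([0.5, 0.5])
    mu = np.array([x.mean(), x[2]])
    var = np.array([x.var(), spike_var])
    pdfs = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.sum(np.log((pi * pdfs).sum(axis=1)))

# Shrinking the spike's variance makes the log-likelihood grow without bound.
spike_vars = [1e-4, 1e-8, 1e-16]
lls = [log_likelihood(x, v) for v in spike_vars]
print(lls)

# Adding a small constant to the variance (the idea behind reg_covar) caps the growth.
floor = 1e-6
regularized = [log_likelihood(x, v + floor) for v in spike_vars]
print(regularized)
```

The unregularized values keep increasing as the variance shrinks, while the floored values stay bounded by the likelihood at the floor itself.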
04

Unsupervised clustering or density estimation step. Apply after feature scaling. Use BIC/AIC to select k. Output: cluster labels OR soft probabilities for downstream use.

  1. Standardize all features with StandardScaler; an unscaled large-range feature can dominate the initialization and the covariance estimates
  2. Apply PCA if d > 20 to reduce the risk of singular covariance matrices
  3. Remove extreme outliers, since they can hijack a Gaussian component
  4. For categorical features: encode and scale; GMM is natively for continuous data
  1. Scale data with StandardScaler
  2. Fit GMM with n_init=10 (multiple restarts to avoid local optima)
  3. Use init_params='kmeans' for stable initialization
  4. Sweep k from 1 to K_max, compute BIC for each
  5. Select k at the BIC elbow
  6. Inspect predict_proba() for soft assignments

n_components (k)

Number of Gaussian components

Use BIC/AIC sweep from 1 to 15

covariance_type

Constraint on covariance matrices

'full' for flexibility; 'diag' for high-d; 'tied' for shared shape

n_init

Number of EM restarts from different initializations

10 (default 1 in sklearn — increase it)

reg_covar

Regularization added to diagonal of covariance (prevents singularity)

1e-6 (default); increase to 1e-3 if singular covariance errors occur

  1. Scale: StandardScaler().fit_transform(X)
  2. BIC sweep: [GaussianMixture(k).fit(X).bic(X) for k in range(1, 16)]
  3. Pick k at BIC elbow
  4. Final fit: GaussianMixture(k, n_init=10, covariance_type='full')
  5. Soft labels: gmm.predict_proba(X); hard labels: gmm.predict(X)
  6. Anomaly scores: -gmm.score_samples(X); a high score means a likely anomaly
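The six steps above could be strung together roughly as follows with scikit-learn. This is a sketch under assumptions: the data is synthetic (make_blobs), the sweep stops at k = 7 to keep it fast, and k is taken at the BIC minimum as a simple stand-in for reading the elbow off a plot.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with your own feature matrix.
X_raw, _ = make_blobs(n_samples=600, centers=3, cluster_std=1.0, random_state=0)

# 1. Scale
X = StandardScaler().fit_transform(X_raw)

# 2. BIC sweep over candidate k values
ks = range(1, 8)
bics = [GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X).bic(X)
        for k in ks]

# 3. Pick k (assumption: BIC minimum used in place of a visual elbow)
best_k = ks[int(np.argmin(bics))]

# 4. Final fit with more restarts
gmm = GaussianMixture(n_components=best_k, n_init=10,
                      covariance_type="full", random_state=0).fit(X)

# 5. Soft and hard labels
proba = gmm.predict_proba(X)    # shape (n, best_k), rows sum to 1
labels = gmm.predict(X)         # most probable component for each point

# 6. Anomaly scores: negative log-density, higher means more unusual
anomaly_score = -gmm.score_samples(X)

print(best_k, proba[:3].round(3), anomaly_score[:3].round(3))
```

Swapping the argmin for a manual elbow choice only changes step 3; the rest of the pipeline stays the same.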
05
06
python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


class GaussianMixture:
    def __init__(self, k=3, n_iter=100, tol=1e-4, reg=1e-6):
        self.k = k            # number of components
        self.n_iter = n_iter  # maximum number of EM iterations
        self.tol = tol        # convergence threshold on the log-likelihood change
        self.reg = reg        # added to each covariance diagonal to avoid singularity

    def fit(self, X):
        n, d = X.shape
        # Initialize means from K-Means, covariances to I, weights to 1/k
        km = KMeans(self.k, n_init=5, random_state=0).fit(X)
        self.means = km.cluster_centers_.copy()
        self.covs = [np.eye(d) for _ in range(self.k)]
        self.weights = np.ones(self.k) / self.k
        self.log_likelihoods = []

        for _ in range(self.n_iter):
            # E-step: responsibilities R[i, k]
            R = self._e_step(X)

            # M-step: responsibility-weighted maximum-likelihood updates
            Nk = R.sum(axis=0) + 1e-8  # effective number of points per component
            self.weights = Nk / n
            self.means = (R.T @ X) / Nk[:, None]
            for k in range(self.k):
                diff = X - self.means[k]
                self.covs[k] = (R[:, k:k+1] * diff).T @ diff / Nk[k]
                self.covs[k] += self.reg * np.eye(d)

            # Stop once the log-likelihood has plateaued
            ll = self._log_likelihood(X)
            self.log_likelihoods.append(ll)
            if len(self.log_likelihoods) > 1 and abs(ll - self.log_likelihoods[-2]) < self.tol:
                break
        return self

    def _gauss_pdf(self, X, mean, cov):
        # Multivariate normal density evaluated for every row of X
        d = X.shape[1]
        diff = X - mean
        _, logdet = np.linalg.slogdet(cov)
        inv_cov = np.linalg.inv(cov)
        exponent = -0.5 * np.sum(diff @ inv_cov * diff, axis=1)
        return np.exp(exponent - 0.5 * (d * np.log(2 * np.pi) + logdet))

    def _weighted_pdfs(self, X):
        # Column k holds pi_k * N(x_i; mu_k, Sigma_k)
        return np.column_stack([
            self.weights[k] * self._gauss_pdf(X, self.means[k], self.covs[k])
            for k in range(self.k)
        ])

    def _e_step(self, X):
        R = self._weighted_pdfs(X)
        R /= R.sum(axis=1, keepdims=True) + 1e-300  # guard against division by zero
        return R

    def _log_likelihood(self, X):
        return np.sum(self.score_samples(X))

    def predict_proba(self, X):
        return self._e_step(X)

    def predict(self, X):
        return self.predict_proba(X).argmax(axis=1)

    def score_samples(self, X):
        # Per-sample log p(x)
        return np.log(self._weighted_pdfs(X).sum(axis=1) + 1e-300)


# Usage
X, y = make_blobs(n_samples=300, centers=3, random_state=42)
gmm = GaussianMixture(k=3).fit(X)
print(gmm.predict_proba(X[:5]).round(3))
Full EM with K-Means initialization, covariance regularization, and log-likelihood convergence tracking.
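X shape: (300, 2); three Gaussian blobs generated by make_blobs
predict_proba: one row per point with k probabilities that sum to 1, e.g. [[0.97, 0.02, 0.01], [0.05, 0.93, 0.02], ...] (illustrative values); a common anomaly threshold is the 5th percentile of log p(x)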
  • Always use n_init > 1 — EM converges to local optima, multiple restarts help
  • Use K-Means or 'kmeans' init — random init often leads to degenerate solutions
  • BIC penalizes complexity more than AIC — prefer BIC for model selection
  • reg_covar prevents singular covariance; sklearn defaults to 1e-6, so raise it if components still collapse
  • score_samples() gives log p(x); negate it to get an anomaly score with no extra model needed
  • Using n_init=1 (sklearn default) — often converges to a local optimum
  • Not scaling features — one large-scale feature dominates covariance
  • Using 'full' covariance with d >> n — singular matrices, use 'diag' instead
  • Choosing k by eyeballing the plot instead of BIC/AIC
  • Forgetting that hard labels from predict() discard all uncertainty information
07

Low-d continuous features (d < 20)

Excellent

GMM's home territory — full covariance can be estimated reliably

💡 Use covariance_type='full'; sweep k with BIC

High-dimensional data (d > 50)

Context-Dependent

Full covariance needs O(d²) parameters per component — often singular

💡 Apply PCA first; use covariance_type='diag' or 'tied'
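The O(d²) growth is easy to see by counting free parameters for each covariance_type. The helper below is a hypothetical function written for this illustration; it counts covariance entries, means, and the k − 1 free mixture weights.

```python
def gmm_n_parameters(k, d, covariance_type="full"):
    """Number of free parameters of a k-component GMM in d dimensions."""
    cov_params = {
        "full": k * d * (d + 1) // 2,   # one symmetric d x d matrix per component
        "tied": d * (d + 1) // 2,       # one symmetric matrix shared by all components
        "diag": k * d,                  # one variance per feature per component
        "spherical": k,                 # one variance per component
    }[covariance_type]
    mean_params = k * d
    weight_params = k - 1               # weights sum to 1, so one is determined
    return cov_params + mean_params + weight_params

for d in (2, 20, 100):
    counts = {ct: gmm_n_parameters(5, d, ct)
              for ct in ("full", "tied", "diag", "spherical")}
    print(d, counts)
```

With 5 components in 100 dimensions, the full model has tens of thousands of parameters while the diagonal one has about a thousand, which is why 'diag' or 'tied' is the safer choice when d is large relative to n.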

Overlapping clusters

Excellent

GMM's key advantage — soft assignments handle overlap naturally

💡 K-Means fails here; GMM gives probability split between overlapping clusters

Non-Gaussian cluster shapes

Poor

Ring, moon, or fractal shapes — no Gaussian can fit these

💡 Use DBSCAN or spectral clustering for non-convex shapes

Anomaly detection

Good

score_samples gives log p(x) — low score = anomalous

💡 Works well when normal data is Gaussian-like; fails when the normal data is far from Gaussian
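A minimal sketch of density-based anomaly flagging with scikit-learn follows. It assumes synthetic "normal" data drawn from two Gaussians and uses the 5th percentile of the training log-density as the cutoff; both the data and the cutoff are illustrative choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic "normal" data: two Gaussian clusters (illustrative only).
X_train = np.vstack([
    rng.normal([0.0, 0.0], 1.0, size=(300, 2)),
    rng.normal([6.0, 6.0], 1.0, size=(300, 2)),
])

gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X_train)

# Threshold: the 5th percentile of log p(x) on the training data.
log_density = gmm.score_samples(X_train)
threshold = np.percentile(log_density, 5)

X_new = np.array([
    [0.2, -0.1],     # close to the first cluster center
    [20.0, -20.0],   # far from both clusters
])
is_anomaly = gmm.score_samples(X_new) < threshold
print(threshold, is_anomaly)
```

By construction about 5% of the training points fall below the threshold, so the percentile directly sets the expected false-alarm rate on data like the training set.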
08

Log-Likelihood Convergence During EM

EM guarantees that the log-likelihood never decreases from one iteration to the next; convergence typically takes 50-100 iterations

BIC Score vs. Number of Components

BIC penalizes complexity — the elbow indicates the optimal k

09
  • Soft probabilistic assignments

    Every point gets a probability vector over clusters — not just a hard label.

  • Flexible cluster shapes

    Full covariance allows elliptical clusters of any orientation — not just spherical like K-Means.

  • Full density model

    p(x) is defined everywhere — enables anomaly scoring, sampling, and density estimation.

  • Principled model selection

    BIC and AIC give theoretically grounded ways to choose k.

  • Generalizes K-Means

    K-Means is the limiting case of a GMM with equal weights and a shared spherical covariance σ²I as σ² → 0, where soft assignments become hard; GMM is strictly more expressive.

  • Local optima

    EM is not guaranteed to find the global maximum — multiple restarts (n_init > 1) are essential.

  • Singular covariance risk

    A component can collapse onto a point (Σ → 0, likelihood → ∞) — requires regularization.

  • Must specify k

    Unlike DBSCAN or Mean Shift, GMM requires you to choose the number of components.

  • Assumes Gaussian components

    Severely non-Gaussian data (ring shapes, heavy tails) is poorly modeled.

  • Slow for large datasets

    Full covariance EM is O(nd²k) per iteration — slow for high-d or large n.

10
Audio/Speech

Speaker diarization

Model each speaker's voice features as a Gaussian; EM assigns speech segments to speakers.

Finance

Return distribution modeling

Asset returns have fat tails and regimes — a GMM with 2-3 components captures bull/bear/crisis regimes.

Computer Vision

Background subtraction

Model background pixel distributions as a GMM; foreground = low-probability pixels.

Genomics

Cell type discovery

Gene expression profiles modeled as Gaussian mixtures to discover cell type clusters.

11

GMM vs. other clustering and density estimation methods:

K-Means

Both partition data into k clusters

K-Means: hard assignments, spherical clusters, minimizes inertia. GMM: soft assignments, any covariance shape, maximizes likelihood. K-Means is a special case of GMM.

K-Means for speed and simplicity; GMM when cluster shapes vary or uncertainty matters.
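The "special case" relationship can be illustrated numerically: with equal weights and a shared spherical variance, GMM responsibilities approach K-Means hard assignments as the variance shrinks. The points and centers below are arbitrary one-dimensional values chosen for the sketch.

```python
import numpy as np

x = np.array([0.0, 0.9, 1.1, 2.0])     # toy 1-D points
centers = np.array([0.0, 2.0])         # fixed cluster centers

def responsibilities(x, centers, var):
    """E-step for equal weights and a shared spherical variance `var`."""
    sq_dist = (x[:, None] - centers) ** 2
    logits = -0.5 * sq_dist / var
    logits -= logits.max(axis=1, keepdims=True)   # stabilize the exponentials
    r = np.exp(logits)
    return r / r.sum(axis=1, keepdims=True)

# As the variance shrinks, the soft assignments sharpen toward 0 or 1.
for var in (10.0, 1.0, 0.1, 0.001):
    print(var, responsibilities(x, centers, var)[:, 0].round(3))

# K-Means assignment: nearest center by squared distance.
kmeans_labels = np.argmin((x[:, None] - centers) ** 2, axis=1)
hard = responsibilities(x, centers, 1e-3)
print(kmeans_labels, hard.argmax(axis=1))
```

At large variance the points near 1.0 are split almost evenly between the two centers; at small variance each point's responsibility concentrates on its nearest center, matching the K-Means labels.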

DBSCAN

Both are unsupervised clustering

DBSCAN: no k required, handles arbitrary shapes and noise, no probabilistic model. GMM: requires k, assumes Gaussian, gives probabilities.

DBSCAN for non-Gaussian shapes or noise-heavy data. GMM for well-separated, roughly Gaussian clusters with overlap.

Naive Bayes (generative)

Both use Gaussian density per class

Naive Bayes is supervised (uses labels); GMM is unsupervised (discovers labels). Naive Bayes assumes feature independence (diagonal covariance); GMM can use full covariance.

Naive Bayes with labels; GMM without labels.

Aspect        | GMM                | K-Means   | DBSCAN
Assignment    | Soft (probability) | Hard      | Hard + noise
Cluster shape | Elliptical         | Spherical | Any
Requires k    | Yes (use BIC)      | Yes       | No
Density model | Yes                | No        | No
Handles noise | Partial            | No        | Yes
Speed         | Medium             | Fast      | Medium

You need probabilistic cluster membership, elliptical cluster shapes, a full density model, or anomaly scoring — and the data is roughly Gaussian within each cluster.

12

BIC (Bayesian Information Criterion)

Penalized log-likelihood — lower is better. Use to select k.

Target: Minimize over k; pick k at the elbow
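For reference, BIC can be recomputed by hand as −2·log L + p·log n, where p is the number of free parameters. The sketch below does this for a full-covariance model on synthetic data and compares the result with scikit-learn's bic(); the parameter count written out here assumes covariance_type='full'.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data for illustration.
X, _ = make_blobs(n_samples=400, centers=3, random_state=1)
n, d = X.shape
k = 3

gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(X)

# score() returns the mean per-sample log-likelihood, so multiply by n for the total.
total_log_lik = gmm.score(X) * n

# Free parameters: k full covariances + k means + (k - 1) mixture weights.
n_params = k * d * (d + 1) // 2 + k * d + (k - 1)

bic_manual = -2 * total_log_lik + n_params * np.log(n)
print(bic_manual, gmm.bic(X))
```

The log n factor is what makes BIC penalize extra components more heavily than AIC once n is larger than about 8, since AIC uses a constant factor of 2 per parameter.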

Log-likelihood

How well the model explains the data — higher is better

Target: Should increase monotonically during EM and plateau at convergence

Silhouette Score (on hard labels)

Cluster separation quality for the hard-assigned labels

Target: > 0.5

  1. Sweep k from 1 to 15 and compute BIC and AIC for each
  2. Select k at BIC elbow (not necessarily the minimum; use domain knowledge)
  3. Check log-likelihood convergence curve: it should be smooth and plateau
  4. Inspect predict_proba(): many 50/50 splits may indicate k too large
  5. Visualize with scatter plot colored by hard labels (first 2 PCs)
  • Using n_init=1 — EM local optima cause poor reproducibility
  • Choosing k at the BIC minimum without domain validation
  • Using 'full' covariance with d > n/10: covariance estimates become ill-conditioned or singular
  • Ignoring reg_covar — silent numerical instability
  • Evaluating only on hard labels — you lose GMM's key advantage (soft probabilities)

On a customer dataset (n=2000, d=8): BIC selects k=4. Log-likelihood converges in 45 iterations. Silhouette=0.52. predict_proba() shows three well-defined clusters (>90% one component) and one transition cluster with 60/40 split between two segments — GMM correctly identifies the ambiguous customers, unlike K-Means which would force-assign them.

13
  • × Treating GMM's hard labels as no better than K-Means output, which misses the point of soft assignments
  • × Not using BIC/AIC for k selection and picking k visually instead
  • × Forgetting to scale features before fitting, so high-variance features dominate
  • × Using n_init=1 (sklearn default); set n_init ≥ 5 for reliability
  • × Using full covariance on high-d data without PCA, which makes singular covariance errors very likely
  • × Not adding reg_covar regularization on real-world noisy data
  • × Saying GMM is the same as K-Means; it's a strict generalization
  • × Not knowing what EM stands for or what the E and M steps do
  • × Claiming GMM can handle any cluster shape; it assumes Gaussian components
  • × Using GMM for anomaly detection without understanding the Gaussian assumption
  • × Running EM once and trusting the result; always use multiple restarts
  • × Interpreting all components as meaningful; some may be artifacts of noise or poor k selection
14

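What kind of bias does this model have?

Bias comes from the Gaussian assumption: each cluster is modeled as an ellipsoid, and a restricted covariance_type (tied, diagonal, spherical) adds further shape bias.

What kind of variance does it have?

Variance grows with the number of components k and with full covariances in high dimensions, where each component needs O(d²) parameters; different EM initializations can also produce different fits.

How does it overfit?

With too many components or unconstrained covariances, components shrink onto a handful of points (or a single one), which inflates the training likelihood without improving held-out likelihood.

How do we regularize it?

Increase reg_covar, restrict covariance_type, reduce dimensionality with PCA, and choose k with BIC/AIC.

What kind of data does it like?

Scaled, continuous, low-dimensional features whose clusters are roughly Gaussian.

What kind of data breaks it?

Non-convex cluster shapes (rings, moons), extreme outliers, and d large relative to n, where covariance matrices become singular.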

14

Quick Revision Reference

  • p(x) = Σₖ πₖ 𝒩(x; μₖ, Σₖ) — weighted sum of k Gaussians
  • EM alternates E-step (responsibilities = soft assignments) and M-step (parameter update)
  • Generalizes K-Means: soft assignments + any covariance shape
  • Use BIC to select k; n_init ≥ 5 to avoid local optima
  • score_samples() gives log p(x) — use for anomaly detection
GMM density
E-step
BIC
  • Overlapping cluster separation
  • Soft/probabilistic cluster assignments
  • Density estimation and anomaly detection
  • Data known to be Gaussian-distributed per cluster
  • Non-Gaussian cluster shapes
  • Very high-dimensional data without PCA
  • Need automatic k selection (use DBSCAN/Mean Shift)
  • Large n + large d (too slow)
EM = E-step (compute responsibilities via Bayes) + M-step (weighted MLE update)
GMM is K-Means with soft assignments and full covariance
BIC selects k by penalizing model complexity
Singular covariance is the main failure mode — reg_covar fixes it
15
16

These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.