Linear Discriminant Analysis

Concept Overview

In Plain English

LDA finds the directions in feature space that best separate two or more classes. Unlike PCA, which asks 'where does the data spread most?', LDA asks 'where do the classes differ most?' It simultaneously maximizes the distance between class means and minimizes the spread within each class, then projects your data onto these discriminant directions.

Why It Exists

PCA ignores class labels during compression. A direction of high variance might mix all classes together, while a low-variance direction might perfectly separate them. LDA was developed to use label information to find projections that are genuinely useful for classification, not just for reconstructing the raw data.

Problem It Solves

Given a labeled dataset with C classes and d features, find k <= C-1 directions that maximize the ratio of between-class scatter to within-class scatter. Project data onto these directions to get a low-dimensional representation optimally designed for class separation.

Real-Life Analogy

"Imagine two groups of people on a field — short vs. tall. Viewed from directly overhead (projection onto the ground), everyone overlaps. But viewed from the side, the groups are perfectly separated by height. LDA finds the 'side view' — the viewing angle where the groups are most spread apart relative to how spread out each group is on its own. PCA might choose the overhead view because it captures the most total movement (people walking around), but LDA chooses the side view because it reveals what actually distinguishes the groups."

When To Use

You have class labels and want a compressed representation optimized for classification
You want to reduce to at most C-1 dimensions before a classifier (where C = number of classes)
Features are approximately Gaussian within each class and covariances are similar across classes
You want a linear classifier with built-in dimensionality reduction
You need a fast, interpretable classification model with few parameters
You want to visualize class separability in 1D or 2D (binary → 1D, 3-class → 2D)

When NOT To Use

Data is highly non-Gaussian within classes (use kernel methods or tree-based models)
Class covariances are very different from each other (use QDA instead)
n << d (more features than samples) — LDA's within-class scatter matrix becomes singular (use shrinkage LDA)
You have no class labels (use PCA or other unsupervised methods)
Decision boundaries are non-linear (use logistic regression with polynomial features, SVMs, or neural nets)

Core Intuition

LDA is built on one key idea: a good projection for classification should spread class means apart while keeping the samples of each class tightly clustered around their mean. If you project onto a direction where class means are far from each other but each class is compact, you get a clean separation. If you project onto a direction where classes overlap heavily, the projection is useless for classification.

Formally, LDA defines two scatter matrices. The within-class scatter matrix S_W measures how much the data within each class varies around its own mean; you want this small, meaning each class is compact. The between-class scatter matrix S_B measures how far the class means are from the global mean; you want this large, meaning classes are well-separated. Because S_W and S_B are matrices, the ratio is taken after projection: LDA finds the direction w that maximizes (wᵀ S_B w) / (wᵀ S_W w), known as Fisher's criterion.
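
To make the criterion concrete, the following minimal sketch builds both scatter matrices on assumed toy data (two classes with wide spread along x and a mean shift along y) and evaluates Fisher's criterion for two candidate directions. The helper names scatter_matrices and fisher_criterion are illustrative, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: both classes spread widely along x, but differ in mean along y.
X0 = rng.normal(loc=[0.0, 0.0], scale=[3.0, 0.5], size=(200, 2))
X1 = rng.normal(loc=[0.0, 2.0], scale=[3.0, 0.5], size=(200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)


def scatter_matrices(X, y):
    """Return (S_W, S_B): within-class and between-class scatter matrices."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(y):
        X_c = X[y == c]
        mu_c = X_c.mean(axis=0)
        centered = X_c - mu_c
        S_W += centered.T @ centered                 # scatter around the class mean
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += len(X_c) * (diff @ diff.T)            # weighted class-mean deviation
    return S_W, S_B


def fisher_criterion(w, S_W, S_B):
    """J(w) = (w^T S_B w) / (w^T S_W w)."""
    w = np.asarray(w, dtype=float)
    return float(w @ S_B @ w) / float(w @ S_W @ w)


S_W, S_B = scatter_matrices(X, y)
J_x = fisher_criterion([1.0, 0.0], S_W, S_B)   # high-variance direction
J_y = fisher_criterion([0.0, 1.0], S_W, S_B)   # direction of the mean shift
print(f"J along x: {J_x:.4f}")
print(f"J along y: {J_y:.4f}")
```

On data like this, the y direction should score far higher than the x direction even though x carries most of the variance.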

LDA also works as a generative classifier. It models each class as a Gaussian distribution with a class-specific mean but a shared covariance matrix (pooled from all classes). Using Bayes' rule, the log-posterior ratio between two classes is a linear function of the input — hence 'linear' discriminant analysis. This dual interpretation (dimensionality reduction + generative classifier) makes LDA both powerful and analytically tractable.
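
The linearity claim can be checked numerically. The sketch below assumes a two-class model with made-up means, a shared covariance, and priors (none of these values come from the text), and compares the log-posterior ratio computed directly from the Gaussian densities with the linear form wᵀx + b, where w = Σ⁻¹(μ₁ − μ₀).

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy parameters for a two-class, shared-covariance Gaussian model.
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
pi0, pi1 = 0.6, 0.4
Sigma_inv = np.linalg.inv(Sigma)


def log_gaussian(x, mu):
    """Log density of N(mu, Sigma) at x, using the shared covariance."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    d = len(mu)
    return -0.5 * (diff @ Sigma_inv @ diff) - 0.5 * logdet - 0.5 * d * np.log(2 * np.pi)


# Linear form: the quadratic terms x^T Sigma^{-1} x cancel between the two classes.
w = Sigma_inv @ (mu1 - mu0)
b = -0.5 * (mu1 @ Sigma_inv @ mu1 - mu0 @ Sigma_inv @ mu0) + np.log(pi1 / pi0)

points = rng.normal(size=(5, 2)) * 3.0
direct = np.array([
    (log_gaussian(x, mu1) + np.log(pi1)) - (log_gaussian(x, mu0) + np.log(pi0))
    for x in points
])
linear = points @ w + b
max_gap = np.max(np.abs(direct - linear))
print("max |direct - linear|:", max_gap)
```

The two computations should agree up to floating-point error, because the quadratic term is identical for both classes and drops out of the ratio.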

The Metaphor

"Think of LDA as finding the best angle to photograph two overlapping crowds. PCA would find the angle where both crowds together look most spread out. LDA finds the angle where the two crowds' centers are farthest apart relative to how wide each crowd is individually. It tilts the camera until the gap between the crowds is maximized compared to the blur of each crowd — the clearest distinguishing shot."

Beginner Mental Model

LDA = 'find the direction where class averages are far apart AND each class is tightly packed.' Project your data onto that direction. Now even a simple threshold rule separates the classes. It's like finding the dimension along which the classes differ most, then squishing your data down to just that dimension (or the C-1 most discriminating dimensions for C classes).

Technical Theory

Formal Definition

Given labeled data {(xᵢ, yᵢ)} with C classes and d features, LDA finds projection vectors W = [w₁, ..., w_{C-1}] ∈ ℝᵈˣ⁽ᶜ⁻¹⁾ that maximize Fisher's criterion J(W) = det(WᵀS_B W) / det(WᵀS_W W), where S_W is the within-class scatter matrix and S_B is the between-class scatter matrix. The solution is given by the generalized eigenvalue problem S_W⁻¹ S_B w = λw — the top eigenvectors form the discriminant directions. Projected data: Z = XW ∈ ℝⁿˣ⁽ᶜ⁻¹⁾.

Key Terms

Within-Class Scatter Matrix (S_W): A d×d matrix summing the scatter of each class around its own mean: S_W = Σ_c Σ_{i∈c} (xᵢ - μ_c)(xᵢ - μ_c)ᵀ. Small S_W means each class is compact. LDA wants S_W small in the projection direction.
Between-Class Scatter Matrix (S_B): A d×d matrix measuring how far class means deviate from the global mean: S_B = Σ_c n_c (μ_c - μ)(μ_c - μ)ᵀ, where n_c is the number of samples in class c. Large S_B means class means are spread apart. LDA wants S_B large in the projection direction.
Fisher's Criterion (J): The objective function LDA maximizes: J(w) = (wᵀ S_B w) / (wᵀ S_W w). This is the ratio of projected between-class variance to projected within-class variance. Maximizing J finds the direction where classes are most separated relative to their internal spread.
Generalized Eigenvalue Problem: The equation S_W⁻¹ S_B w = λw whose solutions are the discriminant directions. Equivalent to finding the eigenvectors of the matrix S_W⁻¹ S_B. The eigenvalue λ equals the value of Fisher's criterion J(w) for the corresponding eigenvector w.
Linear Discriminant (LD): A projection direction (eigenvector of S_W⁻¹ S_B). LD1 maximizes Fisher's criterion; LD2 is the second-best direction orthogonal in a generalized sense; and so on. For C classes, at most C-1 non-trivial LDs exist.
Pooled Covariance Matrix: The shared covariance estimate used in LDA's generative model: Σ_pooled = S_W / (n - C). LDA assumes all classes share this single covariance. If this assumption fails, QDA (which fits per-class covariances) is more appropriate.
Shrinkage / Regularization: A technique to handle ill-conditioned S_W by replacing it with (1-α)S_W + α·tr(S_W)/d·I, blending the empirical scatter with a scaled identity matrix. Controls the condition number of S_W and stabilizes LDA when n is small relative to d.
QDA (Quadratic Discriminant Analysis): The generalization of LDA that allows each class to have its own covariance matrix. Produces quadratic (not linear) decision boundaries. More flexible than LDA but requires estimating C separate covariance matrices — needs more data per class.
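
As an illustration of the shrinkage formula in the Key Terms, the sketch below applies (1-α)S_W + α·tr(S_W)/d·I to a deliberately rank-deficient scatter matrix. The data sizes and the function name shrink_scatter are assumptions chosen for the example.

```python
import numpy as np


def shrink_scatter(S_W, alpha):
    """Blend S_W with a scaled identity: (1 - alpha) * S_W + alpha * tr(S_W)/d * I."""
    d = S_W.shape[0]
    target = (np.trace(S_W) / d) * np.eye(d)
    return (1.0 - alpha) * S_W + alpha * target


rng = np.random.default_rng(2)

# Assumed toy setting: 8 samples in 20 dimensions, so the scatter matrix is rank-deficient.
X = rng.normal(size=(8, 20))
Xc = X - X.mean(axis=0)
S_W = Xc.T @ Xc

ranks = {}
min_eigs = {}
for alpha in (0.0, 0.1, 0.5):
    S = shrink_scatter(S_W, alpha)
    ranks[alpha] = int(np.linalg.matrix_rank(S))
    min_eigs[alpha] = float(np.linalg.eigvalsh(S).min())
    print(f"alpha={alpha}: rank={ranks[alpha]}, smallest eigenvalue={min_eigs[alpha]:.3e}")
```

Any α > 0 lifts every eigenvalue by at least α·tr(S_W)/d, which is why the shrunk matrix becomes invertible.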

Step-by-Step Working

1. Compute class statistics: for each class c, compute the class mean μ_c = (1/n_c) Σ_{i∈c} xᵢ. Compute the global mean μ = (1/n) Σᵢ xᵢ.
2. Compute within-class scatter: S_W = Σ_c Σ_{i∈c} (xᵢ - μ_c)(xᵢ - μ_c)ᵀ. Sum scatter matrices of each class.
3. Compute between-class scatter: S_B = Σ_c n_c (μ_c - μ)(μ_c - μ)ᵀ. Weighted sum of outer products of class-mean deviations.
4. Form the generalized eigenvalue problem: find eigenvectors of A = S_W⁻¹ S_B (or solve the generalized form S_B w = λ S_W w).
5. Sort eigenvectors by descending eigenvalue (descending Fisher criterion value).
6. Choose k <= C-1 discriminant directions. For binary classification, k=1 is typical.
7. Form projection matrix W = [w₁, ..., wₖ] ∈ ℝᵈˣᵏ.
8. Project data: Z = XW ∈ ℝⁿˣᵏ. For classification, assign class by nearest projected class mean or fit a classifier in Z-space.

Inputs

Labeled feature matrix X ∈ ℝⁿˣᵈ with class labels y ∈ {1, ..., C}. All features must be numeric. Features should be on comparable scales (standardize if units differ). Requires at least C+d samples for non-singular S_W.

Outputs

Projection matrix W ∈ ℝᵈˣᵏ (k <= C-1 discriminant directions), projected data Z ∈ ℝⁿˣᵏ, class-conditional means in projected space, and (as a classifier) predicted class probabilities via Bayes' rule.

Model Assumptions

01. Gaussian class-conditionals: each class follows a multivariate Gaussian distribution p(x|y=c) = N(μ_c, Σ). If this fails badly, LDA's generative model makes poor predictions.

02. Shared covariance (homoscedasticity): all classes have the same covariance matrix Σ. LDA pools class scatter to estimate this. If covariances differ substantially, use QDA.

03. Linearity: decision boundaries between classes are linear hyperplanes. Non-linear boundaries require kernel LDA or non-linear classifiers.

04. No perfect multicollinearity: S_W must be invertible. If d > n - C, S_W is singular. Fix with shrinkage LDA or by reducing d first (e.g., with PCA).

05. Sufficient data per class: each class needs at least d+1 samples to estimate its scatter contribution to S_W reliably. With few samples, use shrinkage.
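
The invertibility assumption can be checked directly. The sketch below uses assumed sizes (n = 12 samples, d = 15 features, C = 3 classes, so d > n - C) and random data to show that the rank of S_W is capped at n - C. The helper within_class_scatter is illustrative.

```python
import numpy as np


def within_class_scatter(X, y):
    """S_W = sum over classes of the scatter around each class mean."""
    d = X.shape[1]
    S_W = np.zeros((d, d))
    for c in np.unique(y):
        X_c = X[y == c]
        centered = X_c - X_c.mean(axis=0)
        S_W += centered.T @ centered
    return S_W


rng = np.random.default_rng(3)
n, d, C = 12, 15, 3                     # assumed sizes with d > n - C
X = rng.normal(size=(n, d))
y = np.repeat(np.arange(C), n // C)     # 4 samples per class

S_W = within_class_scatter(X, y)
rank = int(np.linalg.matrix_rank(S_W))
print(f"d = {d}, n - C = {n - C}, rank(S_W) = {rank}")
```

Centering each class removes one degree of freedom per class, which is where the n - C bound comes from.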

Important Edge Cases

▸ d > n - C (singular S_W): the within-class scatter matrix has rank at most n - C, so it is rank-deficient and S_W⁻¹ does not exist. Fix: apply PCA first to reduce to at most n - C dimensions, then LDA. Or use shrinkage LDA (sklearn's solver='lsqr' with shrinkage='auto').
▸ Binary classification (C=2): only one discriminant direction exists (k=1). LDA reduces data to a single number, which is convenient for plotting and thresholding.
▸ Perfectly separated classes: the projected within-class scatter is small relative to the between-class scatter, so Fisher's criterion is large but finite (it is unbounded only if the within-class scatter along some direction is exactly zero). This is not a problem for LDA: its estimates stay finite, whereas unregularized logistic regression has no finite maximum-likelihood solution under perfect separation.
▸ Imbalanced classes: LDA weights between-class scatter by n_c (class size). Rare classes have small weight in S_B, so their mean deviations contribute little. Use class weights or oversample minority classes.
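
For the binary edge case, the single discriminant has a standard closed form, w ∝ S_W⁻¹(μ₁ − μ₀), which follows because S_B has rank one when C = 2. This closed form is not stated in the text above; the sketch below, on assumed toy data, checks that it matches the top eigenvector of S_W⁻¹ S_B up to sign and scale.

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed toy data: two classes sharing one covariance, shifted means.
cov = np.array([[2.0, 0.8], [0.8, 1.0]])
L = np.linalg.cholesky(cov)
X0 = rng.normal(size=(150, 2)) @ L.T
X1 = rng.normal(size=(150, 2)) @ L.T + np.array([2.0, 1.0])
X = np.vstack([X0, X1])

mu0, mu1, mu = X0.mean(axis=0), X1.mean(axis=0), X.mean(axis=0)
S_W = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
S_B = (len(X0) * np.outer(mu0 - mu, mu0 - mu)
       + len(X1) * np.outer(mu1 - mu, mu1 - mu))

# Closed form for C = 2: solve S_W w = (mu1 - mu0) instead of inverting S_W.
w_closed = np.linalg.solve(S_W, mu1 - mu0)
w_closed /= np.linalg.norm(w_closed)

# Generic route: top eigenvector of S_W^{-1} S_B.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
w_eig = eigvecs[:, np.argmax(eigvals.real)].real
w_eig /= np.linalg.norm(w_eig)

cosine = abs(float(w_closed @ w_eig))   # 1.0 means same direction up to sign
print("|cos(angle)| between the two directions:", cosine)

# Project to 1D; each row of X becomes a single discriminant score.
z = X @ w_closed
print("projected shape:", z.shape)
```

Solving the linear system rather than forming an explicit inverse is a design choice for numerical stability, not a requirement of the method.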

Methodology / Workflow

Role in the ML Pipeline

LDA sits between feature engineering and the final classifier. It simultaneously performs dimensionality reduction and learns a discriminative projection. Often used as: (1) a preprocessing step — compress to k=C-1 dimensions before a simple classifier; or (2) the classifier itself — use projected class means and Gaussian posteriors for prediction. In sklearn, LinearDiscriminantAnalysis can do both in a single estimator.

Data Preprocessing

01. Standardize features if they are on different scales. In exact arithmetic, unregularized LDA is invariant to invertible rescaling of the features (S_W and S_B rescale together, so Fisher's ratio is unchanged), but a feature in the millions makes S_W badly conditioned, and shrinkage toward a scaled identity is scale-dependent. Use StandardScaler before LDA.
02. Handle missing values: impute before LDA. Any NaN breaks the scatter matrix computation.
03. Check class balance: severely imbalanced classes will make the majority class dominate S_B. Consider oversampling (SMOTE) or class weights.
04. Verify Gaussian assumption within each class: plot per-class histograms of each feature. Heavy skew may warrant log-transformation before LDA.
05. Check for near-zero-variance features: these contribute zero to S_B but non-zero (noise) to S_W, hurting the ratio. Drop constant or near-constant features first.
06. If d > n - C: apply PCA first to reduce to at most n-C dimensions, then apply LDA. This is the PCA+LDA pipeline used in Fisherfaces.
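
The preprocessing steps for the high-dimensional case can be chained in a single scikit-learn pipeline. This is a minimal sketch on assumed synthetic data (30 samples, 100 features, 3 classes); the choice of n - C PCA components follows the Fisherfaces-style recipe described above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)

# Assumed toy sizes: n = 30 samples, d = 100 features, so d > n - C.
n_per_class, d, C = 10, 100, 3
n = n_per_class * C
centers = rng.normal(scale=2.0, size=(C, d))
X = np.vstack([rng.normal(size=(n_per_class, d)) + centers[c] for c in range(C)])
y = np.repeat(np.arange(C), n_per_class)

pipe = make_pipeline(
    StandardScaler(),                                   # put features on one scale
    PCA(n_components=n - C),                            # reduce so S_W can be full rank
    LinearDiscriminantAnalysis(n_components=C - 1),     # then find C-1 discriminants
)
Z = pipe.fit_transform(X, y)
print("projected shape:", Z.shape)
```

Wrapping the steps in a pipeline also ensures the scaler and PCA are fitted on training folds only when the pipeline is used inside cross-validation.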

Training Process

01. Fit LDA on training data only: lda.fit(X_train, y_train). This computes S_W, S_B, and the eigenvectors from training data.
02. Inspect lda.explained_variance_ratio_ to see how much each discriminant captures of the total between-class scatter.
03. Transform both train and test sets with the same fitted LDA: lda.transform(X_train), lda.transform(X_test). Never refit on test.
04. Visualize: for C <= 4 classes, plot the first 1-2 discriminants. Well-separated clusters confirm LDA found meaningful directions.
05. If using LDA as a classifier: lda.predict() uses the Gaussian generative model with pooled covariance. Compare to lda.transform() + logistic regression for flexible decision boundaries.
06. Tune shrinkage if n is small: set shrinkage='auto' for Ledoit-Wolf automatic shrinkage estimation. Shrinkage requires solver='lsqr' or solver='eigen'.
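
The training steps above can be sketched end to end with scikit-learn on its bundled wine dataset. The split ratio and random_state are arbitrary assumptions made for this example.

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Fit the scaler on the training split only, then reuse it for the test split.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# 3 classes, so at most 2 discriminant directions.
lda = LinearDiscriminantAnalysis(n_components=2, solver="svd")
lda.fit(X_train_s, y_train)

# Same fitted object transforms both splits; never refit on test data.
Z_train = lda.transform(X_train_s)
Z_test = lda.transform(X_test_s)
evr = lda.explained_variance_ratio_
accuracy = lda.score(X_test_s, y_test)

print("projected train shape:", Z_train.shape)
print("explained_variance_ratio_:", evr)
print("test accuracy:", accuracy)
```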

Hyperparameters

n_components: Number of discriminant directions to retain. Must be <= C-1. Typical: C-1 for maximum discrimination; 1-2 for visualization. Often C-1 is already small.
solver: Algorithm for computing the discriminants. 'svd' (default) uses an SVD of the centered class data, avoids forming explicit scatter matrices, and does not support shrinkage. 'lsqr' is a least-squares solution that supports shrinkage but, in scikit-learn, only classifies (transform is not implemented for it). 'eigen' uses an eigendecomposition of S_W⁻¹ S_B and supports shrinkage. Typical: 'svd' for most cases; 'lsqr' or 'eigen' when using shrinkage.
shrinkage: Regularization parameter. None (default): no shrinkage, uses standard S_W. 'auto': Ledoit-Wolf optimal shrinkage. Float in [0,1]: manual blend coefficient (0 = no shrinkage, 1 = diagonal S_W). Typical: 'auto' when n < 10*d or when getting singular matrix errors; None when n >> d.
priors: Class prior probabilities used in the Bayes decision rule. None (default): estimated from class frequencies in training data. Typical: None unless you have domain knowledge of true class priors that differ from the training data frequencies.
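
The sketch below exercises the hyperparameters listed above on assumed synthetic data (3 classes, 20 samples each, 10 features). It is only meant to show which solver and shrinkage combinations scikit-learn accepts; the shrinkage value and priors are arbitrary choices.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(6)

# Assumed toy data: 3 classes, 20 samples each, 10 features.
n_per_class, d, C = 20, 10, 3
centers = rng.normal(scale=2.0, size=(C, d))
X = np.vstack([rng.normal(size=(n_per_class, d)) + centers[c] for c in range(C)])
y = np.repeat(np.arange(C), n_per_class)

# 'svd' (default): no shrinkage, supports transform.
lda_svd = LinearDiscriminantAnalysis(solver="svd", n_components=2).fit(X, y)
Z_svd = lda_svd.transform(X)

# 'eigen' with Ledoit-Wolf shrinkage: supports both shrinkage and transform.
lda_eigen = LinearDiscriminantAnalysis(
    solver="eigen", shrinkage="auto", n_components=2
).fit(X, y)
Z_eigen = lda_eigen.transform(X)

# 'lsqr' with a manual shrinkage value and explicit priors: classification only.
lda_lsqr = LinearDiscriminantAnalysis(
    solver="lsqr", shrinkage=0.2, priors=[0.5, 0.25, 0.25]
).fit(X, y)
pred_lsqr = lda_lsqr.predict(X)
try:
    lda_lsqr.transform(X)
    lsqr_can_transform = True
except NotImplementedError:
    lsqr_can_transform = False

print("svd projection shape:  ", Z_svd.shape)
print("eigen projection shape:", Z_eigen.shape)
print("lsqr supports transform:", lsqr_can_transform)
```

If both shrinkage and a low-dimensional projection are needed, 'eigen' is the solver that provides both.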

Implementation Checklist

1. pip install scikit-learn numpy matplotlib
2. Load data, check shapes and class balance
3. StandardScaler: fit on X_train, transform train and test
4. If d > n-C: apply PCA first (n_components <= n-C), then LDA
5. Fit: lda = LinearDiscriminantAnalysis(n_components=C-1, solver='svd')
6. lda.fit(X_train_scaled, y_train)
7. Plot explained_variance_ratio_ to understand discriminant quality
8. Transform and visualize: Z_train = lda.transform(X_train_scaled); scatter plot colored by class
9. Evaluate as classifier: lda.predict(X_test_scaled), lda.predict_proba(X_test_scaled)

Mathematical Chamber

Implementation

python

import numpy as np
import matplotlib.pyplot as plt


class LDA:
    '''
    Linear Discriminant Analysis: dimensionality reduction and classifier.
    Supports multi-class via the generalized eigenvalue problem.
    '''
    def __init__(self, n_components: int = None):
        self.n_components = n_components
        self.scalings_ = None      # discriminant directions W (d x k)
        self.means_ = None         # class means (C x d)
        self.priors_ = None        # class priors (C,)
        self.classes_ = None       # unique class labels
        self.xbar_ = None          # global mean (d,)

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'LDA':
        n_samples, n_features = X.shape
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)

        # At most min(C-1, d) discriminant directions are meaningful
        max_components = min(n_classes - 1, n_features)
        if self.n_components is None:
            self.n_components = max_components
        elif self.n_components > max_components:
            raise ValueError(f'n_components must be <= {max_components}')

        # ── Compute class statistics ──────────────────────────────────────────
        counts = np.array([np.sum(y == c) for c in self.classes_])
        self.priors_ = counts / n_samples
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.xbar_ = X.mean(axis=0)

        # ── Within-class scatter S_W ──────────────────────────────────────────
        S_W = np.zeros((n_features, n_features))
        for c, mu_c in zip(self.classes_, self.means_):
            X_c = X[y == c] - mu_c          # centered class data (n_c x d)
            S_W += X_c.T @ X_c              # accumulate scatter

        # ── Between-class scatter S_B ─────────────────────────────────────────
        S_B = np.zeros((n_features, n_features))
        for n_c, mu_c in zip(counts, self.means_):
            diff = (mu_c - self.xbar_).reshape(-1, 1)   # (d, 1)
            S_B += n_c * (diff @ diff.T)                # rank-1 update

        # ── Generalized eigenvalue problem: S_W^{-1} S_B w = lambda w ─────────
        # Tiny ridge on S_W for numerical stability; solve() avoids an explicit inverse
        eps = 1e-8
        S_W_reg = S_W + eps * np.eye(n_features)
        A = np.linalg.solve(S_W_reg, S_B)

        eigenvalues, eigenvectors = np.linalg.eig(A)
        # A is not symmetric, so eig may return tiny imaginary parts; keep real parts
        eigenvalues = eigenvalues.real
        eigenvectors = eigenvectors.real

        # Sort by descending eigenvalue (descending Fisher criterion)
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]

        # Keep top-k directions and rescale each so that w^T Sigma_pooled w = 1.
        # With distinct eigenvalues the directions are S_W-orthogonal, so Euclidean
        # distance in LD space then matches Mahalanobis distance under the pooled covariance.
        W = eigenvectors[:, :self.n_components]                 # (d, k)
        pooled_cov = S_W_reg / (n_samples - n_classes)
        W = W / np.sqrt(np.sum(W * (pooled_cov @ W), axis=0))
        self.scalings_ = W
        self.eigenvalues_ = eigenvalues[:self.n_components]
        total = eigenvalues[:n_classes - 1].sum()
        if total > 1e-12:
            self.explained_variance_ratio_ = self.eigenvalues_ / total
        else:
            self.explained_variance_ratio_ = np.zeros(self.n_components)
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        return X @ self.scalings_      # (n, k)

    def fit_transform(self, X: np.ndarray, y: np.ndarray) -> np.ndarray:
        return self.fit(X, y).transform(X)

    def _class_log_likelihood(self, X: np.ndarray) -> np.ndarray:
        '''Log prior + Gaussian log-likelihood per class, up to a class-independent constant.'''
        z = self.transform(X)                        # (n, k)
        means_z = self.means_ @ self.scalings_       # projected class means (C, k)
        # Squared distance from every sample to every projected class mean: (n, C)
        sq_dist = ((z[:, None, :] - means_z[None, :, :]) ** 2).sum(axis=2)
        return -0.5 * sq_dist + np.log(self.priors_ + 1e-12)

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        log_liks = self._class_log_likelihood(X)
        # Softmax, subtracting the row maximum for numerical stability
        log_liks -= log_liks.max(axis=1, keepdims=True)
        probs = np.exp(log_liks)
        probs /= probs.sum(axis=1, keepdims=True)
        return probs

    def predict(self, X: np.ndarray) -> np.ndarray:
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]


# ── Demo: 3-class synthetic data ─────────────────────────────────────────────
np.random.seed(42)

def make_3class(n=150):
    X0 = np.random.randn(n, 2) @ [[1.2, 0.5], [0.2, 0.8]] + [0, 0]
    X1 = np.random.randn(n, 2) @ [[0.8, 0.2], [0.3, 1.1]] + [4, 2]
    X2 = np.random.randn(n, 2) @ [[1.0, 0.4], [0.1, 0.9]] + [2, 5]
    X = np.vstack([X0, X1, X2])
    y = np.array([0]*n + [1]*n + [2]*n)
    return X, y

X, y = make_3class()

lda = LDA(n_components=2)
Z = lda.fit_transform(X, y)

print('Original shape:', X.shape)   # (450, 2)
print('Reduced shape: ', Z.shape)   # (450, 2) -- 3 classes -> 2 LDs
print('Explained variance ratio:', lda.explained_variance_ratio_.round(4))
print('Training accuracy:', (lda.predict(X) == y).mean())

# Plot original features next to the LDA projection
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
colors = ['red', 'blue', 'green']
labels = ['Class 0', 'Class 1', 'Class 2']

for ax, data, title in zip(axes, [X, Z], ['Original 2D', 'LDA Projection']):
    for c, color, label in zip([0, 1, 2], colors, labels):
        mask = y == c
        ax.scatter(data[mask, 0], data[mask, 1],
                   c=color, label=label, alpha=0.5, s=20)
    ax.set_title(title)
    ax.legend()

plt.tight_layout()
plt.savefig('lda_demo.png', dpi=120)
print('Saved lda_demo.png')

The from-scratch implementation explicitly builds S_W and S_B, solves S_W^{-1} S_B via np.linalg.eig, and sorts eigenvectors by descending eigenvalue (= descending Fisher criterion). A small ridge (eps * I) is added for numerical stability when S_W is near-singular. Classification uses the projected class means — for production, sklearn's implementation is preferable as it handles edge cases more robustly.

Sample Input

X_train: shape (150, 13) wine features, y_train: 3 classes [0, 1, 2]. After StandardScaler.

Sample Output

Z_train: shape (150, 2) — 2 LDA dimensions (C-1=2). explained_variance_ratio_: [0.6875, 0.3125]. LDA classifier test accuracy: ~0.97 on wine dataset.

Key Implementation Insights

→ LDA can produce at most C-1 discriminant directions. For binary classification, this means exactly 1: you're projecting to a single number. For 10-class problems, you get at most 9 directions regardless of how many features you have.
→ Standardize before LDA. In exact arithmetic, unregularized LDA gives the same projections and predictions under any invertible rescaling of the features, because S_W and S_B are rescaled together. Standardizing still matters in practice: it keeps S_W well-conditioned, makes shrinkage (which blends toward a scaled identity) treat features evenly, and makes discriminant coefficients comparable across features.
→ When n < d (more features than samples), S_W is singular and S_W⁻¹ does not exist. Fix: apply PCA first (reduce to at most n-C dimensions), then LDA. Or use shrinkage='auto' which regularizes S_W.
→ LDA as a classifier (sklearn's predict) uses the pooled Gaussian model. If you want a more flexible boundary in the LDA-projected space, use lda.transform() then train logistic regression or SVM on the projected features.
→ The Ledoit-Wolf shrinkage estimator (shrinkage='auto') is almost always safe to use: it applies little shrinkage when n >> d and regularizes more strongly when n is small. Use solver='lsqr' or 'eigen' to enable it.

Common Implementation Mistakes

✗ Requesting n_components > C-1. LDA can only produce C-1 meaningful directions (S_B has rank at most C-1). Recent scikit-learn versions raise a ValueError when n_components exceeds min(C-1, d); older versions warned and capped it.
✗ Forgetting to scale features before using shrinkage or a ridge term. Plain LDA is scale-invariant in exact arithmetic, but regularizers that blend S_W with a scaled identity are not, and features with very large scale can make S_W numerically ill-conditioned.
✗ Using LDA when classes have very different covariances. If Σ₁ ≠ Σ₂ substantially, the shared covariance assumption fails and decision boundaries curve. Use QDA instead.
✗ Applying LDA transform to test data without using the fitted object. Always use lda.transform(X_test). Never refit LDA on test data.
✗ Not checking whether S_W is invertible. When d > n-C, a naive S_W⁻¹ fails, and solver='svd' without shrinkage falls back to a rank-truncated solution that may generalize poorly. Use shrinkage or PCA first.

Dataset Applicability

🏷️

Multi-class Classification with High-D Features

Excellent

LDA is purpose-built for this case. It compresses to C-1 dimensions while preserving all class-discriminating structure. The resulting representation is optimally designed for linear classifiers.

💡 Ensure Gaussian within-class distributions and similar covariances. If violated, QDA or kernel methods perform better.

📋

Small Sample Size (n < 10*d)

Context-Dependent

Standard LDA fails when S_W is singular (d > n-C). With shrinkage LDA (shrinkage='auto'), performance recovers significantly. Shrinkage stabilizes S_W and often improves generalization.

💡 Always use shrinkage='auto' with solver='lsqr' when n is small relative to d. PCA+LDA pipeline also works well.

🔵

Well-Separated Gaussian Classes

Excellent

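LDA is statistically optimal (Bayes-optimal) when classes are truly Gaussian with shared covariance and the means and covariance are well estimated. In this setting, the Bayes decision rule is itself linear, so no classifier can achieve lower expected error than the rule LDA estimates.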

💡 Check normality with QQ-plots per class. If classes are clearly Gaussian and covariances similar, LDA is the theoretically ideal choice.

🌀

Non-Gaussian or Non-Linear Data

Poor

LDA assumes Gaussian class-conditionals and linear decision boundaries. For ring-shaped classes, XOR patterns, or heavy-tailed data, LDA boundaries are fundamentally misspecified.

💡 Use kernel discriminant analysis, SVMs with an RBF kernel, or neural networks for non-linear boundaries.

⚖️

Binary Classification (2 classes)

Excellent

Binary LDA reduces to exactly 1 discriminant — a single scalar projection of all features. This makes it highly interpretable and computationally trivial. Competitive with logistic regression under Gaussian assumptions.

💡 LDA handles perfect separation (unlike unregularized logistic regression, whose coefficients diverge). But logistic regression is more robust to non-Gaussian features.

📊

Severely Imbalanced Classes

Poor

S_B weights class mean deviations by class size n_c. Rare classes contribute little to S_B, so their separating direction gets low weight. LDA tends to ignore minority classes.

💡 Use class weights (priors parameter), SMOTE oversampling, or custom S_B weighting before LDA on imbalanced data.

Visualizations

LDA 1D Projection — Binary Classification

Two Gaussian classes projected onto the single Fisher discriminant direction (LD1). Before projection (2D), classes overlap in both dimensions. After LDA projection (1D), class histograms separate clearly with a natural threshold between them. The threshold is set by the intersection of the two Gaussians — the Bayes decision boundary.

LDA vs PCA 2D Projection — 3 Classes

The same 3-class dataset projected by PCA (left) and LDA (right). PCA projections maximize total variance — classes may overlap if discriminative directions have low variance. LDA projections maximize between-class scatter relative to within-class scatter — classes are maximally separated. LDA is the correct choice when the goal is classification, not reconstruction.
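
The contrast between the two projections can be reproduced numerically. The sketch below uses assumed two-class toy data (a simpler setting than the 3-class figure) in which the high-variance axis carries no class information, and it uses the standard binary closed form w ∝ S_W⁻¹(μ₁ − μ₀) for the LDA direction. The separation helper is an illustrative measure, not a library function.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300

# Assumed toy data: large shared variance along x, small class shift along y.
X0 = rng.normal(loc=[0.0, 0.0], scale=[5.0, 0.4], size=(n, 2))
X1 = rng.normal(loc=[0.0, 1.5], scale=[5.0, 0.4], size=(n, 2))
X = np.vstack([X0, X1])

# PCA direction: top eigenvector of the total covariance (labels are ignored).
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / (len(X) - 1))
pca_dir = eigvecs[:, -1]                 # eigh sorts eigenvalues in ascending order

# LDA direction for two classes: w proportional to S_W^{-1} (mu1 - mu0).
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S_W = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
lda_dir = np.linalg.solve(S_W, mu1 - mu0)
lda_dir /= np.linalg.norm(lda_dir)


def separation(direction):
    """Gap between projected class means, in units of pooled projected std."""
    z0, z1 = X0 @ direction, X1 @ direction
    pooled_std = np.sqrt(0.5 * (z0.var(ddof=1) + z1.var(ddof=1)))
    return abs(z1.mean() - z0.mean()) / pooled_std


sep_pca = separation(pca_dir)
sep_lda = separation(lda_dir)
print("PCA direction:", pca_dir.round(3), "separation:", round(sep_pca, 3))
print("LDA direction:", lda_dir.round(3), "separation:", round(sep_lda, 3))
```

PCA should lock onto the x axis because that is where the variance is, while LDA should pick a direction close to the y axis, where the class means differ.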

Cross-Validation Accuracy vs. Shrinkage Alpha

Cross-validation accuracy as the shrinkage parameter alpha varies from 0 (standard LDA) to 1 (fully diagonal within-class scatter). With small n, alpha=0 overfits due to singular or ill-conditioned S_W. The optimal alpha (found by Ledoit-Wolf or grid search) gives the best generalization. The curve typically peaks at an intermediate alpha when n is moderate.
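
A curve like the one described can be generated with a short sweep. This is a minimal sketch on assumed synthetic data (50 samples, 40 features, two classes); the alpha grid, the fold count, and the data-generating shift are arbitrary choices, and the resulting accuracies will depend on them.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)

# Assumed toy data: small n relative to d, with signal in the first 4 features only.
n_per_class, d = 25, 40
shift = np.zeros(d)
shift[:4] = 1.2
X = np.vstack([
    rng.normal(size=(n_per_class, d)),
    rng.normal(size=(n_per_class, d)) + shift,
])
y = np.repeat([0, 1], n_per_class)

alphas = np.linspace(0.0, 1.0, 6)
cv_acc = []
for alpha in alphas:
    # Manual shrinkage requires the 'lsqr' or 'eigen' solver.
    model = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=float(alpha))
    cv_acc.append(cross_val_score(model, X, y, cv=5).mean())

for alpha, acc in zip(alphas, cv_acc):
    print(f"alpha={alpha:.1f}  mean CV accuracy={acc:.3f}")

best_alpha = float(alphas[int(np.argmax(cv_acc))])
print("best alpha on this grid:", best_alpha)
```

Plotting cv_acc against alphas gives the curve; setting shrinkage='auto' instead lets the Ledoit-Wolf estimate pick the blend without a grid search.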

Advantages & Limitations

Advantages

Uses label information — supervised dimensionality reduction
Unlike PCA, LDA leverages class labels to find directions that are genuinely useful for discrimination. The projection is guaranteed to maximize class separation rather than just raw variance, making the compressed representation directly useful for classification.
Hard theoretical upper bound: at most C-1 dimensions
For a C-class problem, LDA produces at most C-1 discriminant directions. For binary classification, this is 1D — a single interpretable score. For 10 classes, it's at most 9 dimensions regardless of d. This provides automatic, principled dimensionality reduction.
Simultaneously a dimensionality reducer and a classifier
LDA can be used in two ways: (1) transform features to a C-1 dimensional discriminant space, then apply any classifier; (2) directly predict class labels and probabilities via the Gaussian generative model. One fitted model serves both purposes.
Bayes-optimal under Gaussian shared-covariance assumptions
When classes are truly Gaussian with shared covariance, the Bayes-optimal decision rule is linear and has exactly the form LDA fits. With enough data to estimate the means and pooled covariance well, no classifier, linear or otherwise, achieves lower expected error.
Handles perfect class separation
Unregularized logistic regression has no finite maximum-likelihood estimate when classes are perfectly separated: its coefficients grow without bound. LDA handles perfect separation gracefully. The Fisher criterion simply becomes very large, and the discriminant direction defines a clear separating hyperplane.
Efficient inference
At prediction time, LDA computes a matrix-vector product (z = xW) plus a nearest-class-mean comparison. This is O(d * (C-1)) operations — extremely fast for any d or C. Suitable for real-time applications.

Limitations

Assumes Gaussian class-conditionals
LDA's generative model assumes each class follows a multivariate Gaussian. For non-Gaussian distributions (e.g., multi-modal classes, heavy tails, binary features), the model is misspecified. Decision boundaries may be suboptimal and probability estimates unreliable.
Assumes shared covariance across classes
LDA pools within-class scatter into a single covariance estimate. If classes have very different shapes (e.g., one class is spherical, another is elongated), the pooled estimate is a poor approximation for both. QDA relaxes this but requires more data.
Singular S_W when d > n-C
The within-class scatter matrix has rank at most n-C. When d > n-C (high-dimensional small-sample problems), S_W is singular and cannot be inverted, so standard LDA based on S_W⁻¹ fails. Fix with shrinkage LDA or a PCA preprocessing step.
Hard limit of C-1 dimensions can be too few
For binary classification, LDA produces a single number. If the true separation requires a 2D decision region (e.g., XOR-like patterns), LDA cannot capture it regardless of how sophisticated the model is. The C-1 limit is a feature but also a constraint.
Linear decision boundaries only
LDA decision boundaries are always hyperplanes. Non-linear class boundaries (spiral data, ring data, XOR) cannot be captured. Kernel LDA extends to non-linear boundaries but adds hyperparameter complexity.
Sensitive to outliers via scatter matrices
Like PCA's covariance matrix, LDA's scatter matrices are sensitive to outliers. A single extreme sample can inflate S_W in a direction, depressing the Fisher criterion and distorting the discriminant direction. Winsorize or use robust scatter estimates before LDA.

Practical Use Cases

Computer Vision

Fisherfaces — face recognition under varying lighting

Eigenfaces (PCA) fail when lighting varies within a class because illumination changes occupy high-variance directions that PCA preserves. LDA suppresses lighting variation (within-class scatter) and amplifies identity information (between-class scatter). Fisherfaces (PCA to n-C dimensions, then LDA to C-1) achieves significantly better recognition rates under varying illumination conditions.

Medicine / Genomics

Cancer subtype classification from gene expression

Tumor samples are characterized by tens of thousands of gene expression measurements (d >> n). PCA+LDA or shrinkage LDA compresses these to 2-5 discriminant directions separating cancer subtypes. Clinically, this enables a simple linear score from gene expression to predict subtype — interpretable and deployable in diagnostic labs.

Finance

Credit scoring — linear discriminant score

LDA applied to hundreds of financial features (income, debt ratios, payment history) produces a single linear discriminant score separating defaulters from non-defaulters. This score is easily interpretable, auditable (required in regulated industries), and computationally trivial to deploy. LDA-derived scores remain competitive with more complex models.

Natural Language Processing

Document topic discrimination

TF-IDF or embedding features (high-dimensional) are projected via LDA to separate document topics. LDA finds the linear combination of word frequencies or embedding dimensions that best separates topic classes. The resulting low-dimensional space enables fast similarity search and visualization of document separability.

Neuroscience

Brain-Computer Interface — decoding neural states

EEG or fMRI recordings have hundreds of electrode or voxel features. LDA finds discriminant directions separating mental states (imagined left-hand vs. right-hand movement). The 1D or 2D projection enables real-time classification of neural signals for brain-computer interface control, where speed and simplicity are critical.

Comparison

LDA's closest relatives are PCA (unsupervised counterpart), QDA (relaxes shared covariance), and logistic regression (discriminative counterpart); the linear SVM is included as a margin-based alternative. Understanding the distinctions is essential for model selection.

PCA

Similarity

Both are linear dimensionality reduction methods that project data to a lower-dimensional subspace

Key Difference

PCA is unsupervised — it maximizes variance without using class labels. LDA is supervised — it maximizes between-class scatter relative to within-class scatter. PCA can keep up to d components; LDA at most C-1. A low-variance direction in PCA might be the most discriminative direction for LDA.

Choose When

PCA when no labels are available, or when reconstruction quality matters. LDA when labels are available and the goal is classification or class visualization.
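
The point that a low-variance direction can be the most discriminative one is easy to demonstrate. In this sketch (synthetic data, feature scales chosen by assumption), feature 0 has large variance but no class signal, while feature 1 has small variance and separates the classes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n = 200
y = np.repeat([0, 1], n)
x0 = rng.normal(0.0, 10.0, size=2 * n)             # high variance, uninformative
x1 = rng.normal(0.0, 0.5, size=2 * n) + 2.0 * y    # low variance, class-dependent mean
X = np.column_stack([x0, x1])

pca_dir = PCA(n_components=1).fit(X).components_[0]
lda_dir = LinearDiscriminantAnalysis(n_components=1).fit(X, y).scalings_[:, 0]
lda_dir = lda_dir / np.linalg.norm(lda_dir)

print(np.abs(pca_dir))  # weight concentrated on feature 0
print(np.abs(lda_dir))  # weight concentrated on feature 1
```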

QDA (Quadratic Discriminant Analysis)

Similarity

Both are generative classifiers modeling Gaussian class-conditionals with Bayes' rule

Key Difference

LDA assumes shared covariance (Σ₁ = Σ₂ = ... = Σ_C = Σ_pooled), which gives linear boundaries. QDA fits a separate covariance matrix per class, which gives quadratic (curved) boundaries. QDA has more covariance parameters (C · d(d+1)/2 vs d(d+1)/2, since each matrix is symmetric) and needs more data per class. QDA is more flexible; LDA's pooled estimate has lower variance and tends to generalize better with small n.

Choose When

LDA when n is small or covariances are similar across classes. QDA when covariances differ substantially and n is large enough to estimate C covariance matrices reliably.
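
To get a feel for the data requirements, the number of free covariance parameters can be counted directly. The helper below is a hypothetical illustration that counts only covariance entries (class means and priors are the same for both models).

```python
def covariance_parameter_counts(n_classes: int, n_features: int) -> dict:
    """Free covariance parameters estimated by LDA (pooled) and QDA (per class)."""
    per_matrix = n_features * (n_features + 1) // 2  # symmetric d x d matrix
    return {"lda": per_matrix, "qda": n_classes * per_matrix}

counts = covariance_parameter_counts(n_classes=3, n_features=50)
print(counts)  # {'lda': 1275, 'qda': 3825}
```

With 50 features and 3 classes, QDA must estimate three times as many covariance entries from per-class subsets of the data.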

Logistic Regression

Similarity

Both produce linear decision boundaries; both can be used for binary and multi-class classification

Key Difference

LDA is generative (models p(x|y) then applies Bayes' rule). Logistic regression is discriminative (directly models p(y|x)). LDA assumes Gaussian features; logistic regression makes no distributional assumption on x. Logistic regression is more robust to non-Gaussian features. LDA simultaneously reduces dimensionality; logistic regression operates in original feature space.

Choose When

LDA when Gaussian assumption is plausible and dimensionality reduction is desired. Logistic regression for robust classification without distributional assumptions, especially with binary or count features.

SVM (Linear Kernel)

Similarity

Both produce linear decision boundaries for classification

Key Difference

SVM maximizes the margin between support vectors, a geometric criterion based on specific training points. LDA maximizes the Fisher ratio, a statistical criterion based on all scatter. SVM ignores points far from the boundary (only support vectors matter), so distant outliers on the correct side have no effect; LDA is influenced by all points via scatter matrices. LDA provides class probabilities; standard SVMs output only decision scores unless calibrated separately.

Choose When

LDA when probabilistic outputs are needed and Gaussian assumptions hold. Linear SVM when margin maximization is preferred and probabilistic calibration is not required.

Property	LDA	PCA	QDA	Logistic Regression
Supervised	✓ Yes	✗ No	✓ Yes	✓ Yes
Linear boundary	✓ Yes	N/A	✗ Quadratic	✓ Yes
Max components	C-1	d	N/A	N/A
Generative model	✓ Yes	✗ No	✓ Yes	✗ No
Shared covariance	✓ Required	N/A	✗ Per-class	N/A
Handles d > n	Shrinkage	PCA trick	Rarely	Regularized

Choose Linear Discriminant Analysis when:

You have class labels, want dimensionality reduction and classification in one step, your data is approximately Gaussian within each class with similar covariances, and you need an interpretable linear score. Especially strong for visualization of class separability in 1-2 dimensions.

Evaluation

Explained Variance Ratio (Between-Class)

Fraction of total between-class scatter captured by each discriminant. Unlike PCA's EVR (which uses total variance), LDA's EVR describes what fraction of class-discriminating information each LD captures.

Target: LD1 often captures 70-90% in well-structured data. Cumulative EVR for retained LDs should approach 1.0.
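
In scikit-learn this quantity is exposed as `explained_variance_ratio_` on a fitted model. A minimal sketch using the bundled wine dataset:

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_wine(return_X_y=True)           # 3 classes, so at most 2 discriminants
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)

evr = lda.explained_variance_ratio_         # one entry per retained discriminant
print(evr, evr.sum())
```

Because all C-1 discriminants are retained here, the ratios sum to 1 up to rounding.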

Classification Accuracy / F1

Primary downstream evaluation. Compare LDA+classifier vs. PCA+classifier vs. raw features+classifier via cross-validation to verify LDA actually improves performance.

Target: Context-dependent. LDA is worthwhile only if its cross-validated accuracy matches or beats the PCA-based and raw-feature baselines.

Mahalanobis Distance (Class Separation)

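For binary LDA, the squared Mahalanobis distance between the class means in the original space: Δ² = (μ₁ − μ₂)ᵀ Σ⁻¹ (μ₁ − μ₂), with Σ the pooled within-class covariance. A large Δ² means the classes are well separated after accounting for covariance. Projecting onto the binary LDA direction preserves this separation: no other single direction gives a larger ratio of mean difference to within-class spread.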

Target: Δ² > 2 generally indicates good separability; depends on problem.
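
A sketch of the computation with NumPy, assuming a two-class problem labeled 0 and 1 and a pooled covariance with n - 2 degrees of freedom (the function name `mahalanobis_sq` is hypothetical):

```python
import numpy as np

def mahalanobis_sq(X, y):
    """Squared Mahalanobis distance between two class means (pooled covariance)."""
    X0, X1 = X[y == 0], X[y == 1]
    diff = X1.mean(axis=0) - X0.mean(axis=0)
    pooled = ((len(X0) - 1) * np.cov(X0, rowvar=False)
              + (len(X1) - 1) * np.cov(X1, rowvar=False)) / (len(X) - 2)
    # Solve pooled @ z = diff instead of forming the explicit inverse
    return float(diff @ np.linalg.solve(pooled, diff))

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 500)
X = rng.normal(size=(1000, 2))
X[y == 1, 0] += 2.0          # means differ by 2 standard deviations along feature 0

delta_sq = mahalanobis_sq(X, y)
print(delta_sq)              # close to the population value of 4
```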

Between-class to Within-class Scatter Ratio (Fisher Criterion)

The maximum Fisher criterion value (largest eigenvalue of S_W⁻¹ S_B) is a dataset-level measure of LDA-separability. High values mean the best discriminant direction is very clean. Low values mean classes are inherently hard to separate linearly.

Target: J* > 1 suggests clear discriminative structure. J* < 0.1 suggests classes barely separate linearly.
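
The largest eigenvalue of S_W⁻¹ S_B can be computed without forming the inverse by solving the generalized eigenproblem S_B w = λ S_W w. The sketch below assumes unnormalized scatter sums and a positive-definite S_W; the absolute scale of J* depends on how the scatter matrices are normalized, so compare values across datasets only under the same convention.

```python
import numpy as np
from scipy.linalg import eigh

def scatter_matrices(X, y):
    """Within-class (S_W) and between-class (S_B) scatter as unnormalized sums."""
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        S_W += (Xc - mean_c).T @ (Xc - mean_c)
        diff = (mean_c - mean_all)[:, None]
        S_B += len(Xc) * (diff @ diff.T)
    return S_W, S_B

def max_fisher_criterion(X, y):
    S_W, S_B = scatter_matrices(X, y)
    eigvals = eigh(S_B, S_W, eigvals_only=True)  # ascending order
    return float(eigvals[-1])

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 100)
noise = rng.normal(size=(300, 4))
offsets = np.array([[0, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)

J_far = max_fisher_criterion(noise + 4.0 * offsets[y], y)    # well-separated means
J_near = max_fisher_criterion(noise + 0.2 * offsets[y], y)   # nearly overlapping means
print(J_far, J_near)
```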

Evaluation Process

1. Compute LDA and inspect lda.explained_variance_ratio_: verify LD1 captures substantial between-class scatter.
2. Plot 2D scatter of LD1 vs LD2 colored by class: visually verify class clusters are well-separated.
3. Evaluate as classifier: lda.score(X_test, y_test) and classification_report for per-class metrics.
4. Compare via cross-validation: LDA vs. PCA+LR vs. raw+LR. LDA should win when Gaussian assumptions hold.
5. Check confusion matrix: identify which classes LDA struggles to separate; these may need more data or a non-linear boundary.
6. If using shrinkage: sweep shrinkage values and plot CV accuracy vs. alpha to validate the chosen regularization.
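
The shrinkage sweep in the last step could be sketched as follows. The data are synthetic and the grid of six values is an arbitrary assumption; in scikit-learn the shrinkage intensity is passed as `shrinkage` and requires `solver='lsqr'` or `solver='eigen'`.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, d = 80, 100
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, d))
X[:, :5] += 1.0 * y[:, None]          # weak signal in the first 5 features

alphas = np.linspace(0.0, 1.0, 6)     # candidate shrinkage intensities
cv_acc = []
for alpha in alphas:
    lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=float(alpha))
    cv_acc.append(cross_val_score(lda, X, y, cv=5).mean())

best_alpha = alphas[int(np.argmax(cv_acc))]
for alpha, acc in zip(alphas, cv_acc):
    print(f"shrinkage={alpha:.1f}  cv_accuracy={acc:.3f}")
```

Plotting `cv_acc` against `alphas` gives the validation curve described in the step; `best_alpha` is simply its maximizer on this grid.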

Evaluation Traps

▸Evaluating LDA on training data only: the class means and pooled covariance are estimated from those same samples, so training accuracy is optimistically high (especially when d is large relative to n). Always use held-out test data or cross-validation.
▸Ignoring the Gaussian assumption: if class-conditional histograms are clearly non-Gaussian (bimodal, skewed), LDA's probability estimates are poorly calibrated even if the decision boundary happens to work.
▸Choosing n_components based only on explained variance ratio without checking downstream accuracy — sometimes LD2 has small EVR but is critical for separating one pair of classes.
▸Forgetting that LDA decision boundaries are linear in the original feature space as well as in discriminant space. Two LDs give a 2D linear projection, not a 2D non-linear embedding, so classes that are not linearly separable stay that way after projection.

Real-World Interpretation Example

Wine classification (3 classes, 13 features): LDA reduces to 2 discriminants. LD1 explains 68.7% of between-class scatter, LD2 explains 31.3%. 2D plot shows near-perfect separation of all 3 wine classes. LDA classifier achieves 97.8% CV accuracy vs. 95.4% for logistic regression and 96.2% for PCA(2)+LR. The Gaussian assumption is plausible for wine measurements, and LDA's supervised reduction outperforms unsupervised PCA for this classification task.
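
A comparison of this kind can be set up as below. This is a sketch, not the procedure behind the figures quoted above: the fold count, random seed, feature scaling, and logistic regression settings are all assumptions, so the printed accuracies need not match those numbers.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "raw + LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "PCA(2) + LR": make_pipeline(StandardScaler(), PCA(n_components=2),
                                 LogisticRegression(max_iter=1000)),
}
# Every model is refit inside each fold, so no test data leaks into fitting
results = {name: cross_val_score(m, X, y, cv=cv).mean() for name, m in models.items()}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```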

Common Mistakes

Students

×Confusing LDA (Linear Discriminant Analysis) with LDA (Latent Dirichlet Allocation) — both are abbreviated LDA. In ML/statistics, LDA most commonly means Fisher's Linear Discriminant Analysis (this topic). Latent Dirichlet Allocation is a topic model for text.
×Thinking LDA finds up to d components like PCA. LDA is hard-capped at C-1 discriminant directions regardless of d. For 2 classes, you always get 1 component.
×Applying LDA without class labels. LDA is supervised — it requires y. Without labels, use PCA or other unsupervised methods.
×Not realizing that LDA's between-class scatter S_B has rank at most C-1 (the class-mean deviations from the overall mean are linearly dependent), so any eigenvalues beyond the (C-1)-th are zero and non-informative.
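
The rank limit on S_B is easy to confirm numerically. A small sketch on synthetic data with C = 4 classes and d = 10 features (sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, per_class, d = 4, 30, 10
y = np.repeat(np.arange(n_classes), per_class)
X = rng.normal(size=(n_classes * per_class, d)) + 3.0 * rng.normal(size=(n_classes, d))[y]

mean_all = X.mean(axis=0)
S_B = np.zeros((d, d))
for c in range(n_classes):
    diff = (X[y == c].mean(axis=0) - mean_all)[:, None]
    S_B += per_class * (diff @ diff.T)     # one rank-1 term per class

rank_S_B = np.linalg.matrix_rank(S_B)
print(rank_S_B)  # 3, i.e. C - 1, even though S_B is 10 x 10
```

The four rank-1 terms are built from mean deviations whose size-weighted sum is zero, which removes one degree of freedom.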

Developers

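×Not checking whether S_W is invertible before running LDA. If d >= n-C, S_W is singular; sklearn's default solver='svd' still returns a result (at most warning about collinear variables), but that result can be badly overfit. Always check X.shape relative to n_classes and, when needed, apply shrinkage, which requires solver='lsqr' or solver='eigen'.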
×Using lda.transform(X_test) without first fitting on X_train — the transform applies the projection from the fitted model. A common pipeline error is accidentally calling fit_transform on test data.
×Forgetting that sklearn's LinearDiscriminantAnalysis predict() uses the generative model (nearest projected class mean with Gaussian likelihood). If you want a different classifier in the projected space, call transform() then fit a separate model.
×Setting n_components = C-1 explicitly when C is unknown at definition time. Use n_components=None and let sklearn set it automatically to min(C-1, d).
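
A leak-free usage pattern covering several of these points might look like the sketch below. The synthetic data and the choice of a k-nearest-neighbors classifier in the projected space are illustrative assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 50)
X = rng.normal(size=(150, 6)) + 2.0 * rng.normal(size=(3, 6))[y]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

lda = LinearDiscriminantAnalysis(n_components=None)  # resolved at fit time
Z_train = lda.fit_transform(X_train, y_train)        # fit on training data only
Z_test = lda.transform(X_test)                       # reuse the fitted projection

# A separate classifier in the projected space, instead of lda.predict()
knn = KNeighborsClassifier(n_neighbors=5).fit(Z_train, y_train)
print(Z_train.shape[1], knn.score(Z_test, y_test))
```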

In Interviews

×Saying LDA is unsupervised. LDA is a supervised method — it uses class labels. This is the fundamental difference from PCA. Getting this wrong in an interview is a red flag.
×Not knowing the maximum number of components. 'LDA can produce at most C-1 discriminant directions' is a critical fact — forgetting it suggests surface-level knowledge.
×Confusing LDA with logistic regression. Both produce linear boundaries but logistic regression is discriminative (models p(y|x) directly), while LDA is generative (models p(x|y) with Gaussian assumption). They make different assumptions and have different optimality properties.
×Not knowing when LDA fails — specifically the singular S_W problem when d >= n-C. An interviewer asking 'what are LDA's limitations?' expects this answer.

Real Projects

×Ignoring class imbalance. S_B is weighted by class size — rare classes contribute little. Minority class decision regions get pulled toward majority class. Use priors or resampling.
×Using LDA for non-Gaussian data without validation. Running LDA on count data, binary features, or heavily skewed distributions produces a valid projection but the probability estimates are miscalibrated.
×Not checking for distribution shift at inference time. LDA's scatter matrices and class means are estimated on training data. If test distribution shifts (new class means, different covariance), LDA degrades silently.
×Using LDA in isolation without comparing to simpler (raw features + logistic regression) or more complex (QDA, SVM) baselines in cross-validation.
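
For the class imbalance point, scikit-learn's `priors` argument offers one adjustment. In this sketch (synthetic data with a 95/5 split, chosen by assumption), overriding the empirical priors with equal ones shifts the decision threshold toward the majority class and raises minority recall, at the cost of more false positives.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(950, 2)),   # majority class 0
               rng.normal(1.5, 1.0, size=(50, 2))])   # minority class 1
y = np.array([0] * 950 + [1] * 50)

default_lda = LinearDiscriminantAnalysis().fit(X, y)                 # priors = class frequencies
equal_lda = LinearDiscriminantAnalysis(priors=[0.5, 0.5]).fit(X, y)  # equal priors

minority = X[y == 1]
recall_default = (default_lda.predict(minority) == 1).mean()
recall_equal = (equal_lda.predict(minority) == 1).mean()
print(recall_default, recall_equal)
```

Whether equal priors are appropriate depends on the relative cost of the two error types; the priors should reflect the deployment population or the chosen cost trade-off.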

Core ML Thinking Lens

What kind of bias does this model have?

Linear assumptions create bias when relationships are strongly non-linear.

What kind of variance does it have?

Usually lower variance than high-capacity non-linear models.

How does it overfit?

Overfitting usually appears as strong train performance but weaker validation/test behavior.

How do we regularize it?

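Use covariance shrinkage (e.g. shrinkage='auto' with solver='lsqr' or 'eigen' in sklearn), PCA preprocessing or feature pruning, and stronger validation controls. L1/L2 penalties apply to logistic regression and SVMs, not to standard LDA.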

What kind of data does it like?

Works best with clean, informative features and stable train/serve distributions.

What kind of data breaks it?

Breaks under leakage, severe distribution drift, noisy labels, and poorly engineered features.

Summary Cheat Sheet

Quick Revision Reference

Key Takeaways

LDA is supervised dimensionality reduction: it uses class labels to find directions that maximize between-class scatter / within-class scatter (Fisher's criterion)
Maximum components = C-1 for C classes. Binary LDA: 1 component. 10-class LDA: at most 9 components.
Two scatter matrices: S_W (within-class, want small) and S_B (between-class, want large)
Discriminant directions are eigenvectors of S_W⁻¹ S_B, sorted by descending eigenvalue (= Fisher criterion value)
LDA also works as a generative classifier: Gaussian class-conditionals with shared covariance → linear decision boundaries via Bayes' rule
When S_W is singular (d >= n-C): use shrinkage LDA (shrinkage='auto') or apply PCA first
LDA vs PCA: supervised vs unsupervised, class-discriminative vs variance-preserving, C-1 vs d max components
QDA relaxes shared covariance (per-class Σ_c) at the cost of more parameters — use when covariances differ substantially

Critical Formulas

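Within-Class Scatter

$$S_W = \sum_{c=1}^{C} \sum_{i:\, y_i = c} (x_i - \mu_c)(x_i - \mu_c)^\top$$

Between-Class Scatter

$$S_B = \sum_{c=1}^{C} n_c\, (\mu_c - \mu)(\mu_c - \mu)^\top$$

Fisher's Criterion

$$J(w) = \frac{w^\top S_B\, w}{w^\top S_W\, w}$$

Discriminant Directions

$$S_W^{-1} S_B\, w_k = \lambda_k\, w_k, \qquad k = 1, \dots, C-1$$

Binary Optimal Direction

$$w^* \propto S_W^{-1} (\mu_1 - \mu_2)$$

Shrinkage Regularization

$$\hat{\Sigma}(\alpha) = (1 - \alpha)\, \hat{\Sigma} + \alpha\, \frac{\operatorname{tr}(\hat{\Sigma})}{d}\, I, \qquad 0 \le \alpha \le 1$$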

Best For

✓Multi-class classification with Gaussian, similarly-covaried classes
✓Visualizing class separability in 1-2 dimensions when C is small
✓Supervised compression before a simple linear classifier
✓When you need interpretable class-discriminating linear combinations of features
✓When classes are perfectly separable (unregularized logistic regression's coefficients diverge in this case, while LDA's estimates stay finite)

Avoid When

✗No class labels available (use PCA)
✗Data is highly non-Gaussian within classes (use logistic regression, SVM, or neural nets; QDA shares the Gaussian assumption)
✗Class covariances differ substantially (use QDA)
✗d >= n-C without applying shrinkage or PCA first
✗Non-linear decision boundaries are needed (use kernel LDA or kernel SVMs)

Interview Must-Know

★LDA maximizes Fisher's criterion: between-class scatter / within-class scatter — know both scatter matrices

★Maximum discriminant directions = C-1 — explain why (rank of S_B)

★LDA as a generative classifier: Gaussian class-conditionals, shared covariance, Bayes' rule → linear boundary

★LDA vs PCA: supervised vs unsupervised, discriminative vs variance-preserving

★LDA vs QDA: shared vs per-class covariance, linear vs quadratic boundary

★Singular S_W problem and fix: shrinkage LDA or PCA preprocessing

★Connection to logistic regression: same boundary form, different derivation (generative vs discriminative)

Interview Questions

Tricky Questions

These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.