Principal Component Analysis

Concept Overview

In Plain English

PCA finds the directions in your data where variance is maximized, then projects your data onto those directions. It's a rotation that reframes your data so the first axis captures the most spread, the second captures the next most, and so on — letting you drop later axes with minimal information loss.

Why It Exists

High-dimensional data is simultaneously expensive to compute on, hard to visualize, and prone to the curse of dimensionality. PCA was developed to find compact, uncorrelated representations that preserve as much statistical structure (variance) as possible.

Problem It Solves

Given a dataset with d features that are often correlated, find k < d orthogonal directions (principal components) that together explain the majority of variance in the data, enabling dimensionality reduction without discarding critical information.

Real-Life Analogy

"Imagine photographing a 3D object. Depending on the angle, you capture more or less of the object's shape. PCA finds the camera angle from which the object casts the most spread-out shadow — that 2D shadow is your compressed representation. It throws away depth (a third dimension), but you chose the angle where depth was least important."

When To Use

Features are highly correlated and you want to decorrelate them before modeling
You need to visualize high-dimensional data in 2D or 3D
Training is slow due to too many features and you need to reduce dimensionality
You're building a feature extraction pipeline for images or text embeddings
Removing noise from data where signal is concentrated in a low-dimensional subspace
Before clustering (e.g., k-means) to avoid the curse of dimensionality

When NOT To Use

Features are already uncorrelated — PCA provides no benefit and costs interpretability
You need interpretable features — principal components are linear combinations with no semantic meaning
Data has non-linear structure (use t-SNE, UMAP, or autoencoders instead)
Your dataset is tiny (< 50 samples): component estimates are unreliable when there are so few samples
Target variable matters for compression — PCA is unsupervised and ignores labels (use LDA instead)

Core Intuition

Every dataset lives in a d-dimensional space, but the data usually doesn't fill that space uniformly — it's stretched in some directions and compressed in others. PCA identifies those stretched directions. Mathematically, it solves: 'Find the unit vector w such that the variance of the projection Xw is maximized.' This vector is the first principal component.

After finding the first component, PCA finds the second: the direction of maximum remaining variance, constrained to be orthogonal (perpendicular) to the first. Then the third, orthogonal to both previous ones. And so on. The result is a new coordinate system — a rotation of the original — where axes are uncorrelated and ordered by importance.
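
The variance-maximization claim can be checked numerically. The following is a minimal sketch on synthetic data (the mixing matrix and sample size are illustrative assumptions, not from the text): it compares the variance of the projection onto the top eigenvector with the variance along many random unit directions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated 2D data (mixing matrix chosen only for illustration)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [1.0, 1.0]])
X = X - X.mean(axis=0)                      # center

C = X.T @ X / (len(X) - 1)                  # sample covariance
eigvals, eigvecs = np.linalg.eigh(C)        # eigenvalues in ascending order
w1 = eigvecs[:, -1]                         # first principal component
var_pc1 = np.var(X @ w1, ddof=1)

# Variance of the projection onto 1000 random unit directions
angles = rng.uniform(0, np.pi, size=1000)
dirs = np.column_stack([np.cos(angles), np.sin(angles)])
var_random = np.var(X @ dirs.T, axis=0, ddof=1)

print("variance along PC1:", var_pc1)
print("best random direction:", var_random.max())
```

No random direction should beat PC1, and the variance along PC1 should equal the largest eigenvalue of C.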

The magic is that in practice, most real datasets have their variance concentrated in far fewer dimensions than the raw feature count. Image patches live on a low-dimensional manifold. Stock returns are driven by a handful of macro factors. Gene expression is controlled by a few regulatory pathways. PCA exposes this hidden low-dimensionality.

The Metaphor

"Imagine a cloud of points shaped like a squashed football (American football / rugby ball). The football points mostly left-right (high variance axis) and barely top-to-bottom (low variance axis). PCA finds the 'long axis' of the football and the 'short axis.' If you project every point onto just the long axis, you lose very little information — the short axis was barely doing anything. That's PCA: find the long axis, project, discard the short axis."

Beginner Mental Model

Think of PCA as rotating your data to find the 'most spread-out' direction, calling that the first axis (PC1), then finding the next most spread-out direction that's perpendicular to PC1 (that's PC2), and so on. Once you've found these new axes, you can keep only the first few — the ones that explain most variance — and get a compressed, decorrelated dataset.

Technical Theory

Formal Definition

Given a centered data matrix X ∈ ℝⁿˣᵈ (n samples, d features, zero mean), PCA finds an orthonormal basis W = [w₁, w₂, ..., wₖ] ∈ ℝᵈˣᵏ where each wⱼ is an eigenvector of the sample covariance matrix C = XᵀX/(n-1). The columns of W are ordered by descending eigenvalue λ₁ ≥ λ₂ ≥ ... ≥ λₖ. The projected (reduced) data is Z = XW ∈ ℝⁿˣᵏ.
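
Why the principal components are eigenvectors of C follows from a short Lagrange multiplier argument. The sketch below uses the same symbols as the definition above (w for a unit direction, C for the sample covariance, λ for the multiplier).

```latex
\begin{aligned}
&\max_{w}\; w^\top C w \quad \text{subject to} \quad w^\top w = 1 \\
&\mathcal{L}(w, \lambda) = w^\top C w - \lambda\,(w^\top w - 1) \\
&\frac{\partial \mathcal{L}}{\partial w} = 2Cw - 2\lambda w = 0
  \;\;\Longrightarrow\;\; Cw = \lambda w \\
&\operatorname{Var}(Xw) = w^\top C w = \lambda\, w^\top w = \lambda
\end{aligned}
```

Every stationary point is an eigenvector of C, and the projected variance at that point equals its eigenvalue, so the maximum is attained at the eigenvector with the largest eigenvalue. Repeating the argument with the added constraint of orthogonality to earlier components yields the remaining eigenvectors in descending order.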

Key Terms

Principal Component (PC): A direction in feature space (a d-dimensional unit vector) along which variance is maximized subject to orthogonality with all previous PCs. The j-th PC is the j-th eigenvector of the covariance matrix.
Eigenvector: A vector v such that Mv = λv for a matrix M. In PCA, eigenvectors of C define the principal component directions; they don't change direction when the covariance matrix acts on them, only scale.
Eigenvalue (λⱼ): The scalar associated with an eigenvector. In PCA, λⱼ equals the variance of the data projected onto the j-th principal component. Larger eigenvalue = more variance explained.
Covariance Matrix (C): A d×d symmetric positive semi-definite matrix where Cᵢⱼ = cov(xᵢ, xⱼ). Its diagonal entries are variances; off-diagonal entries are covariances. PCA diagonalizes C in the PC basis.
Explained Variance Ratio: The fraction of total variance explained by the j-th PC: λⱼ / Σᵢλᵢ. Summing the top-k ratios gives the total variance retained by a k-component PCA.
Scree Plot: A bar chart or line plot of eigenvalues (or explained variance ratios) in descending order. The 'elbow' point — where the curve bends — suggests the optimal number of components to retain.
Reconstruction: Approximating the original (centered) data from reduced components: X̂ = ZWᵀ. Reconstruction error = ||X - X̂||²_F = (n - 1) × sum of discarded eigenvalues, because C uses the n - 1 denominator.
Loading Vector: The weights in a principal component direction — each element of wⱼ tells you how much the corresponding original feature contributes to the j-th PC.
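
Several of these definitions are identities that can be verified directly. The following sketch uses synthetic data (sizes and random mixing are assumptions for illustration) to check that eigenvalues equal the variance of the scores, that explained variance ratios sum to 1, and that the squared reconstruction error equals (n - 1) times the sum of discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 300, 5, 2

# Synthetic correlated data, then centered
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
Xc = X - X.mean(axis=0)

C = Xc.T @ Xc / (n - 1)
lam, V = np.linalg.eigh(C)
lam, V = lam[::-1], V[:, ::-1]          # sort descending

W = V[:, :k]                            # loading vectors as columns
Z = Xc @ W                              # scores
X_hat = Z @ W.T                         # reconstruction of centered data

score_var = Z.var(axis=0, ddof=1)       # should equal lam[:k]
evr = lam / lam.sum()                   # explained variance ratios
frob_err = np.sum((Xc - X_hat) ** 2)    # squared Frobenius error

print("score variances:", score_var, "eigenvalues:", lam[:k])
print("Frobenius error:", frob_err, "(n-1)*discarded:", (n - 1) * lam[k:].sum())
```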

Step-by-Step Working

1. Center the data: subtract the column mean from each feature. X_centered = X - mean(X, axis=0). (Optional: also standardize by dividing by std if features are on different scales.)
2. Compute the sample covariance matrix: C = XᵀX / (n - 1). Shape: d×d.
3. Compute eigenvectors and eigenvalues of C: C = VΛVᵀ where V contains eigenvectors as columns and Λ is diagonal with eigenvalues.
4. Sort eigenvectors by descending eigenvalue: λ₁ ≥ λ₂ ≥ ... ≥ λd.
5. Choose k: inspect scree plot, select k components explaining ≥ 85–95% of variance.
6. Form projection matrix W = [v₁, v₂, ..., vₖ] ∈ ℝᵈˣᵏ (first k eigenvectors).
7. Project data: Z = X_centered @ W. Shape: n×k. These are the principal component scores.
8. (Optional) Reconstruct: X̂ = Z @ Wᵀ + mean(X). Compute reconstruction error to validate k.

Inputs

Numeric feature matrix X ∈ ℝⁿˣᵈ. All features must be numeric and complete. Features should be standardized if they have different units or scales.

Outputs

Projected matrix Z ∈ ℝⁿˣᵏ (k << d) containing principal component scores. Also: component vectors W ∈ ℝᵈˣᵏ, explained variance ratios, and optional reconstruction X̂.

Model Assumptions

01. Linearity: PCA finds linear combinations of features. Non-linear structure (spirals, curved manifolds) requires Kernel PCA or UMAP.

02. Variance = information: PCA equates high variance with high information. If signal has low variance relative to noise, PCA may compress signal and keep noise.

03. Features are on comparable scales (or are standardized): otherwise features with large absolute variance dominate all PCs.

04. No missing values: PCA requires complete data. Use imputation before applying PCA.

05. Gaussian-like marginals: PCA uses only means and covariances. For Gaussian data, uncorrelated components are also independent, so nothing is lost by ignoring higher-order structure. For heavy-tailed or skewed features, results may be suboptimal.

Important Edge Cases

▸ d > n (more features than samples): The covariance matrix C (d×d) has rank ≤ n-1, so at most n-1 eigenvalues are non-zero. Use SVD of X directly rather than eigendecomposition of C for numerical stability.
▸ Perfectly correlated features: the covariance matrix is singular. Some eigenvalues are zero; the corresponding PCs explain 0 variance and should be discarded.
▸ Isotropic data (features uncorrelated with equal variance, C = σ²I): all eigenvalues are equal, so there is no dominant direction. All PCs explain equal variance and the PCA rotation is arbitrary. Equal feature variances alone are not enough: correlated features with equal variances still have unequal eigenvalues.
▸ k = d: No dimensionality reduction. Retaining all PCs is equivalent to a pure rotation (lossless). Total explained variance = 100%.
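
The d > n case can be sketched with NumPy's SVD. In this illustrative example (random data, sizes chosen arbitrarily), the squared singular values of the centered matrix divided by n - 1 match the covariance eigenvalues, and only n - 1 of them are non-zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 20, 100                               # more features than samples
X = rng.normal(size=(n, d))
Xc = X - X.mean(axis=0)

# Thin SVD of the centered data: rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals_svd = S**2 / (n - 1)

# Same eigenvalues via the (much larger) d x d covariance matrix
C = Xc.T @ Xc / (n - 1)
eigvals_cov = np.linalg.eigh(C)[0][::-1]

rank = int(np.sum(eigvals_svd > 1e-10))
print("non-zero eigenvalues:", rank)         # at most n - 1
```

The SVD route never forms the 100×100 covariance matrix, which is the reason it is preferred when d is large.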

Methodology / Workflow

Role in the ML Pipeline

PCA typically sits between feature engineering and model training. After encoding and scaling features, PCA reduces dimensionality before passing data to classifiers, regressors, or clustering algorithms. It can also be used at the end of a pipeline for visualization (project to 2D after training).

Data Preprocessing

01. Handle missing values: impute before PCA, since any NaN causes PCA to fail. KNN imputation or median imputation are common choices.
02. Encode categorical features: PCA operates on numeric data. One-hot encode categoricals first.
03. StandardScaler: CRITICAL before PCA unless all features share the same unit. PCA finds directions of maximum variance; unscaled features with large absolute ranges will dominate.
04. Check for outliers: extreme outliers can pull principal components. Consider winsorizing (capping) outliers or using Robust PCA.
05. Remove zero-variance features: features with variance zero (constants) contribute nothing and can cause numerical issues.
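
The effect of skipping the scaling step is easy to demonstrate. The sketch below uses three made-up, independent features on very different scales (the names and distributions are assumptions for illustration) and compares PC1 with and without StandardScaler.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 500
# Hypothetical features measured in very different units
height_m = rng.normal(1.7, 0.1, n)
income = rng.normal(50_000, 15_000, n)
age = rng.normal(40, 12, n)
X = np.column_stack([height_m, income, age])

raw = PCA(n_components=1).fit(X)
scaled = PCA(n_components=1).fit(StandardScaler().fit_transform(X))

print("raw PC1 loadings:   ", raw.components_[0].round(3))
print("raw PC1 EVR:        ", raw.explained_variance_ratio_[0])
print("scaled PC1 loadings:", scaled.components_[0].round(3))
print("scaled PC1 EVR:     ", scaled.explained_variance_ratio_[0])
```

Without scaling, PC1 is essentially the income axis and appears to explain nearly all variance; after scaling, no single feature dominates.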

Training Process

01. Fit PCA on training data only (call fit on X_train). This captures the training data's covariance structure.
02. Transform both train and test sets using the same fitted PCA (call transform). Never refit on test, because that would be data leakage.
03. Inspect explained_variance_ratio_: plot the cumulative sum to choose k (find the elbow or cross the 95% threshold).
04. Validate k: fit the downstream model at several k values and compare cross-validation performance. Wrapping PCA and the model in a single sklearn Pipeline keeps the fit confined to each training fold.
05. Examine component loadings (pca.components_) to understand what each PC represents.
06. Compute reconstruction error at chosen k to quantify information loss: ||X_train - X̂_train||_F².
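
The fit-on-train, transform-on-test discipline can be sketched with a scikit-learn pipeline. The dataset (load_digits) and the logistic regression classifier are illustrative choices, not part of the original workflow.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaler and PCA are fit on X_train only; the test set is only transformed
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),              # keep enough PCs for 95% variance
    LogisticRegression(max_iter=2000),
)
pipe.fit(X_train, y_train)

k = pipe.named_steps["pca"].n_components_
test_acc = pipe.score(X_test, y_test)
print(f"components kept: {k}, test accuracy: {test_acc:.3f}")
```

Because the pipeline owns the scaler and the PCA, calling pipe.score on the test set can only call transform, never fit, on those steps.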

Hyperparameters

Name	Description	Typical
n_components (k)	Number of principal components to retain. Most important hyperparameter.	Choose via scree plot or cross-validation. Often 10–50 for images, 2–3 for visualization.
svd_solver	Algorithm used to compute SVD. 'full' uses LAPACK; 'randomized' uses Halko et al. approximation.	'randomized' for n > 500 and d > 500; 'full' for small data
whiten	If True, divides each PC by its standard deviation, making components unit variance.	False (default). True useful before algorithms like ICA or k-means.
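
A small sketch of how whiten and svd_solver behave in scikit-learn, on synthetic data (the shapes and seed are assumptions for illustration): whitened scores have unit variance, and the randomized solver closely reproduces the exact explained variances on a small problem.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 10)) @ rng.normal(size=(10, 10))

plain = PCA(n_components=3, svd_solver="full").fit(X)
white = PCA(n_components=3, svd_solver="full", whiten=True).fit(X)
rand = PCA(n_components=3, svd_solver="randomized", random_state=0).fit(X)

Z_plain = plain.transform(X)
Z_white = white.transform(X)

print("plain score variances:   ", Z_plain.var(axis=0, ddof=1).round(3))
print("whitened score variances:", Z_white.var(axis=0, ddof=1).round(3))
print("full vs randomized:", plain.explained_variance_.round(3),
      rand.explained_variance_.round(3))
```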

Implementation Checklist

1. pip install scikit-learn numpy matplotlib
2. Load data and check for missing values, dtypes
3. Fit StandardScaler on training features, transform train and test
4. Fit PCA on X_train_scaled (start with n_components=d or n_components=0.95 for auto-selection)
5. Plot cumulative explained variance and pick k at the elbow or 95% threshold
6. Transform data: X_train_pca = pca.transform(X_train_scaled)
7. Feed X_train_pca into downstream model. Wrap in sklearn Pipeline for cleanliness.

Mathematical Chamber

Implementation

python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA as skPCA   # used only for the scree plot

class PCA:
    def __init__(self, n_components: int):
        self.n_components = n_components
        self.components_ = None       # W: shape (n_components, d)
        self.explained_variance_ = None          # eigenvalues
        self.explained_variance_ratio_ = None
        self.mean_ = None

    def fit(self, X: np.ndarray) -> "PCA":
        n_samples, n_features = X.shape

        # Step 1: Center
        self.mean_ = X.mean(axis=0)                          # (d,)
        X_c = X - self.mean_                                 # (n, d)

        # Step 2: Covariance matrix
        # Use (n-1) denominator (unbiased)
        C = (X_c.T @ X_c) / (n_samples - 1)                 # (d, d)

        # Step 3: Eigendecomposition
        eigenvalues, eigenvectors = np.linalg.eigh(C)
        # eigh returns ascending order; reverse to descending
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues  = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]                  # (d, d)

        # Step 4: Store top-k
        self.components_ = eigenvectors[:, :self.n_components].T  # (k, d)
        self.explained_variance_ = eigenvalues[:self.n_components]
        total_var = eigenvalues.sum()
        self.explained_variance_ratio_ = self.explained_variance_ / total_var
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        X_c = X - self.mean_
        return X_c @ self.components_.T                      # (n, k)

    def fit_transform(self, X: np.ndarray) -> np.ndarray:
        return self.fit(X).transform(X)

    def inverse_transform(self, Z: np.ndarray) -> np.ndarray:
        return Z @ self.components_ + self.mean_             # (n, d)

    def reconstruction_error(self, X: np.ndarray) -> float:
        Z = self.transform(X)
        X_hat = self.inverse_transform(Z)
        return float(np.mean((X - X_hat) ** 2))             # MSE


# Demo: 3D data → 2D
np.random.seed(42)
n = 200
# True signal lives on a 2D plane embedded in 3D
t = np.random.randn(n, 2)
noise = np.random.randn(n, 3) * 0.1
A = np.array([[1, 0.8, 0.2],
              [0, 0.6, 0.9]])          # mixing matrix
X = t @ A + noise                     # (200, 3)

pca = PCA(n_components=2)
Z = pca.fit_transform(X)

print("Original shape:", X.shape)     # (200, 3)
print("Reduced shape: ", Z.shape)     # (200, 2)
print("Explained variance ratios:", pca.explained_variance_ratio_.round(4))
print("Cumulative variance: {:.1f}%".format(
    pca.explained_variance_ratio_.sum() * 100))
print("Reconstruction MSE:", round(pca.reconstruction_error(X), 6))

# Scree plot
pca_full = skPCA().fit(X)
plt.figure(figsize=(8, 3))
plt.subplot(1, 2, 1)
plt.bar(range(1, 4), pca_full.explained_variance_ratio_ * 100)
plt.xlabel("Principal Component"); plt.ylabel("% Variance Explained")
plt.title("Scree Plot")
plt.subplot(1, 2, 2)
plt.scatter(Z[:, 0], Z[:, 1], alpha=0.6)
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.title("2D Projection")
plt.tight_layout(); plt.savefig("pca_demo.png", dpi=120)

np.linalg.eigh is used instead of eig because C is symmetric positive semi-definite — eigh exploits symmetry for faster, numerically stable decomposition and guarantees real eigenvalues. The from-scratch version and sklearn produce identical results (up to sign flip on component vectors, which is arbitrary).

Sample Input

X_train: shape (1000, 64) — 64-pixel flattened image patches, values in [0, 255]. After StandardScaler.

Sample Output

Z_train: shape (1000, 29) — 29 principal components retaining 95% variance. explained_variance_ratio_: [0.148, 0.102, 0.078, ...]. Reconstruction MSE: 0.023.

Key Implementation Insights

→ ALWAYS scale before PCA when features have different units. Without scaling, the feature with the largest variance (often just the one measured in bigger numbers) will dominate all principal components.
→ Use n_components=0.95 in sklearn's PCA to automatically select k so that 95% of variance is retained. This is much more robust than hardcoding k.
→ Fit PCA on training data only. Transform both train and test. Fitting on all data leaks test distribution info into the transformation.
→ Principal components can have arbitrary sign: PC1 in one run might point opposite to another run. Signs are not meaningful; only directions and magnitudes matter.
→ When n >> d (many more samples than features), eigendecomposition of the d×d covariance matrix is cheap. When d >> n, use SVD of X directly. sklearn's svd_solver='auto' picks a solver based on the data shape and n_components.

Common Implementation Mistakes

✗ Fitting StandardScaler or PCA on the combined (train+test) data before splitting. This is data leakage.
✗ Not centering data before computing the covariance matrix. The covariance formula assumes zero mean.
✗ Interpreting PCA components as original features. They are linear combinations with no direct semantic meaning.
✗ Using PCA to remove outliers. PCA is itself sensitive to outliers. For outlier-contaminated data use Robust PCA (sklearn has no built-in implementation, so this means a third-party or manual RPCA implementation).
✗ Forgetting to apply the same transformation to test data. Always use pca.transform(X_test), never pca.fit_transform(X_test).

Dataset Applicability

📐

High-Dimensional Tabular (d > 100)

Excellent

PCA shines when many features are correlated. It compresses correlated features into uncorrelated PCs, dramatically reducing input size while retaining structure.

💡 Check VIF or correlation matrix first — if features aren't correlated, PCA offers little compression.

🖼️

Image Data (pixels)

Excellent

Natural images have strong correlations between adjacent pixels. PCA (Eigenfaces) reduces 10,000-pixel faces to ~100 dimensions while preserving identity-discriminating structure.

💡 Apply to flattened or patch representations. For convolutional features, PCA is often applied post-CNN embedding.

📋

Small Dataset (< 100 rows)

Poor

With few samples, estimated covariance matrix is unreliable. Eigenvectors computed from n=50 samples in d=100 dimensions are mostly noise.

💡 If n < d, use SVD directly and limit k severely. Consider simpler feature selection instead.

📈

Time Series Data

Context-Dependent

PCA can extract common factors across multiple time series (e.g., stock returns). But temporal structure (autocorrelation) is ignored — PCA treats rows as i.i.d.

💡 Use with care. Consider Dynamic Factor Models or sliding-window PCA for temporal data.

💬

Sparse Data (NLP bag-of-words)

Good

Truncated SVD (sklearn's TruncatedSVD) is the PCA variant for sparse matrices — equivalent to PCA but skips centering (centering destroys sparsity). Also known as LSA.

💡 Do NOT use standard PCA on sparse data — it will densify the matrix and blow up memory. Use TruncatedSVD.
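
A minimal sketch of the sparse route, assuming a randomly generated sparse matrix as a stand-in for a real bag-of-words matrix: TruncatedSVD accepts the SciPy sparse input directly and returns a dense, low-dimensional representation.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Stand-in for a document-term matrix: 1000 docs, 5000 terms, 0.1% non-zero
X_sparse = sparse.random(1000, 5000, density=0.001, format="csr", random_state=0)

svd = TruncatedSVD(n_components=20, random_state=0)
Z = svd.fit_transform(X_sparse)          # input is never densified or centered

print("input:", X_sparse.shape, "non-zeros:", X_sparse.nnz)
print("output:", Z.shape)
```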

🌀

Non-Linear Manifold Data (Swiss roll, circles)

Poor

PCA can only find linear structure. Data on non-linear manifolds (curved surfaces, concentric rings) will not compress well — PCA preserves global distances, not local manifold structure.

💡 Use UMAP, t-SNE, or Kernel PCA (sklearn.decomposition.KernelPCA with RBF kernel) for non-linear dimensionality reduction.
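
The contrast between linear PCA and Kernel PCA can be sketched on concentric circles. The gamma value, the sample size, and the logistic regression used to measure linear separability are illustrative assumptions.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

Z_lin = PCA(n_components=2).fit_transform(X)
Z_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# A linear classifier on each embedding measures how linearly separable it is
clf = LogisticRegression(C=100.0, max_iter=1000)
acc_lin = clf.fit(Z_lin, y).score(Z_lin, y)
acc_rbf = clf.fit(Z_rbf, y).score(Z_rbf, y)

print(f"linear PCA accuracy: {acc_lin:.2f}")
print(f"kernel PCA accuracy: {acc_rbf:.2f}")
```

Linear PCA only rotates the rings, so they stay inseparable by a line; the RBF embedding pulls the inner and outer ring apart.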

Visualizations

Interactive: Projection Direction and Explained Variance


Scree Plot — Explained Variance per Component

Bar chart showing how much variance each principal component explains. Look for the 'elbow' — the point where adding more components gives diminishing returns. The red dashed line shows the cumulative threshold for 95% variance.


2D PCA Projection — Iris Dataset

Each point is a flower sample projected onto its first two principal components. Colors represent the three species. Well-separated clusters indicate that PC1 and PC2 capture the species-discriminating variance, even though PCA used no label information.


Reconstruction Error vs. Number of Components

Mean squared reconstruction error decreases as more components are retained. The sharp drop in the first few components confirms most variance is concentrated there. After the elbow, adding components provides diminishing error reduction.
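
The curve described above can be reproduced with a short loop. This sketch uses the Iris data after standardization and, for brevity, measures the error on the same data the PCA was fit on.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

errors = []
for k in range(1, X.shape[1] + 1):
    pca = PCA(n_components=k).fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))
    errors.append(float(np.mean((X - X_hat) ** 2)))   # reconstruction MSE at k

for k, err in enumerate(errors, start=1):
    print(f"k={k}: MSE={err:.4f}")
```

The error never increases with k and reaches zero (up to floating-point precision) at k = d, where PCA is a pure rotation.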


Advantages & Limitations

Advantages

Removes multicollinearity
Principal components are by construction orthogonal (uncorrelated). Any downstream linear model applied to PC scores will never suffer from multicollinearity — VIFs are exactly 1 for all components.
Reduces overfitting in downstream models
Compressing d features to k << d principal components reduces the effective number of model parameters. The regularization effect is strongest when discarded PCs contain mostly noise.
Noise reduction
When signal is concentrated in the top PCs and noise is spread across all directions uniformly, discarding bottom PCs removes noise. This is why PCA-preprocessed data can give better downstream model accuracy than raw features.
Fast and exact (closed-form)
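PCA has an analytically computed solution, unique up to the sign of each component when eigenvalues are distinct. No iterative optimization, no local minima, no random initialization (with an exact solver). For small to medium datasets the SVD typically completes in seconds; very large matrices may call for the randomized solver.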
Excellent for visualization
Projecting to 2D or 3D with PCA is the fastest way to sanity-check clustering structure, class separability, or outliers in a high-dimensional dataset. Standard first step in any EDA workflow.
Information-theoretically optimal for linear compression
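No other linear projection of dimension k has lower reconstruction error in the MSE sense, and this holds for any data distribution; the Eckart-Young theorem formalizes it. Under Gaussian assumptions, minimizing MSE also means losing the least information, which makes PCA the optimal linear autoencoder.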

Limitations

Linear only — misses non-linear structure
PCA can only find linear subspaces. Swiss roll, concentric circles, or any manifold with curvature will not be compressed effectively. UMAP and t-SNE handle these but are slower and stochastic.
Destroys interpretability
Principal components are linear combinations of ALL original features. You can no longer say 'this variable caused this prediction.' PC1 might mix age + income + education — a black box direction.
Sensitive to feature scaling
Without StandardScaler, features with large absolute variance dominate all PCs. This is a silent failure mode — you think you're doing PCA on all features but you're effectively PCA-ing only the largest-scale feature.
Discards low-variance directions that may be class-discriminating
PCA is unsupervised — it ignores labels. A direction with low variance might be exactly what separates classes. Use Linear Discriminant Analysis (LDA) when labels are available and discrimination is the goal.
Not robust to outliers
Outliers inflate covariance estimates and pull principal component directions toward themselves. A single extreme point can dramatically rotate PC1 away from the true signal direction.
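
The outlier sensitivity is easy to reproduce. In this sketch (synthetic data, one made-up extreme point), a single outlier rotates PC1 by close to 90 degrees.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)

# Clean data stretched along the x-axis
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])
pc1_clean = PCA(n_components=1).fit(X).components_[0]

# Add one extreme point far out along the y-axis
X_out = np.vstack([X, [[0.0, 100.0]]])
pc1_out = PCA(n_components=1).fit(X_out).components_[0]

# Angle between the two directions (sign of a PC is arbitrary, hence abs)
cos = np.clip(abs(pc1_clean @ pc1_out), 0.0, 1.0)
angle = np.degrees(np.arccos(cos))
print(f"PC1 rotated by {angle:.1f} degrees after adding one outlier")
```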

Practical Use Cases

Computer Vision

Eigenfaces — face recognition preprocessing

Each 100×100 face image is a 10,000-dimensional vector. PCA reduces this to ~150 principal components (Eigenfaces) that capture lighting, orientation, and identity. A nearest-neighbor classifier in this 150D space achieves competitive face recognition accuracy.

Finance

Factor model construction for portfolio optimization

PCA on a correlation matrix of 500 stock returns extracts a market factor (PC1, explains ~40% variance), industry sectors (PC2-PC5), and style factors. Statistical factors of this kind are a common ingredient of risk models at quantitative asset managers.

Genomics / Bioinformatics

Population stratification in GWAS studies

Genome-wide association studies have ~1M SNP features per person. PCA of the genotype matrix reveals population clusters (European, African, Asian ancestry) as distinct regions in PC1-PC2 space. These PCs are used as covariates to control for ancestry confounding.

NLP / Information Retrieval

Latent Semantic Analysis (LSA)

TF-IDF document-term matrix (sparse, 50K×100K) is decomposed via truncated SVD (equivalent to PCA without centering) into 300-dimensional topic vectors. Documents with similar topics cluster in this space regardless of exact word choice.

Quality Control / Manufacturing

Process monitoring with multivariate control charts

Semiconductor fabrication involves 100+ process parameters measured per wafer. PCA reduces these to ~5 PCs representing 'normal variation modes.' A wafer deviating from normal in PC space triggers an alert — much more sensitive than univariate control charts.

Comparison

PCA is the workhorse of linear dimensionality reduction. Here's how it compares to the most important alternatives:

t-SNE

Similarity

Both reduce dimensionality for visualization

Key Difference

t-SNE is non-linear, preserves local neighborhood structure, and is only suitable for 2D/3D visualization (not general compression). PCA preserves global variance structure and can produce any k dimensions.

Choose When

t-SNE for visualization of cluster structure; PCA for general compression and preprocessing.

UMAP

Similarity

Both project high-dimensional data to lower dimensions

Key Difference

UMAP preserves both local and global structure better than t-SNE. It's much faster and produces stable results usable for downstream tasks. But it's non-linear and stochastic.

Choose When

UMAP when data has non-linear manifold structure; PCA when you need linear, deterministic, invertible compression.

Linear Discriminant Analysis (LDA)

Similarity

Both are linear dimensionality reduction methods

Key Difference

LDA uses class labels to find directions that maximize class separation (between-class variance / within-class variance). PCA ignores labels and maximizes total variance. LDA is supervised; PCA is unsupervised.

Choose When

LDA when labels are available and classification is the downstream task. PCA for unsupervised scenarios or when labels are unavailable.
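
The difference between the two objectives shows up clearly when the class signal lies along a low-variance direction. This sketch builds such a dataset synthetically (all distributions are assumptions for illustration) and compares the direction PCA picks with the one LDA picks.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(6)
n = 300
y = np.repeat([0, 1], n)

# Feature 0: high variance, no class information
# Feature 1: low variance, but its mean differs between classes
x_noise = rng.normal(0, 10, 2 * n)
x_signal = rng.normal(0, 0.5, 2 * n) + np.where(y == 0, -1.0, 1.0)
X = np.column_stack([x_noise, x_signal])

pc1 = PCA(n_components=1).fit(X).components_[0]

lda = LinearDiscriminantAnalysis().fit(X, y)
lda_dir = lda.coef_[0] / np.linalg.norm(lda.coef_[0])

print("PCA direction:", pc1.round(3))      # follows the noisy feature
print("LDA direction:", lda_dir.round(3))  # follows the discriminative feature
```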

Autoencoder (Neural)

Similarity

Both learn compressed representations

Key Difference

Autoencoders learn non-linear compressions. A linear autoencoder with MSE loss provably learns the same subspace as PCA. Neural autoencoders are more powerful but require more data and tuning.

Choose When

PCA for tabular data up to ~10K dimensions. Autoencoders for images, sequences, or when non-linear compression is needed.

Property	PCA	t-SNE	UMAP	LDA
Linear	✓ Yes	✗ No	✗ No	✓ Yes
Supervised	✗ No	✗ No	✗ No	✓ Yes
Invertible	✓ Yes	✗ No	Partial	✓ Yes
Any k	✓ Yes	2-3 only	2-3 best	≤ C-1
Speed (large d)	⚡ Fast	🐢 Slow	🚀 Fast	⚡ Fast
Handles non-linear	✗ No	✓ Yes	✓ Yes	✗ No

Choose Principal Component Analysis when:

Features are correlated, you need linear/invertible reduction, you want a preprocessing step before a downstream model, or you need fast visualization of high-dimensional data.

Evaluation

Explained Variance Ratio (Cumulative)

Primary metric for choosing k. A cumulative ratio of 0.95 means 5% of variance is discarded. The right threshold is domain-specific: some applications retain 99%, while lossy compression may accept 80%.

Target: > 0.90 for most applications; > 0.95 for conservative compression

Reconstruction MSE

Average squared deviation between original and reconstructed values. Low MSE = good compression. Measure this on held-out test data (not training data).

Target: Domain-dependent — compare to variance of raw features

Downstream Task Performance

The real measure of PCA quality: does compressing improve or hurt the final task? Sweep k from 1 to d and plot CV performance vs. k.

Target: Performance should plateau near optimal k, drop for too few PCs
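
Sweeping k against downstream performance can be sketched with GridSearchCV. The dataset (load_wine), the classifier, and the grid of k values are illustrative assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
k_values = [1, 2, 4, 8, 13]

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Scaler and PCA are refit inside every CV fold, so there is no leakage
search = GridSearchCV(pipe, {"pca__n_components": k_values}, cv=5)
search.fit(X, y)

scores = dict(zip(k_values, search.cv_results_["mean_test_score"]))
best_k = search.best_params_["pca__n_components"]
for k, s in scores.items():
    print(f"k={k:2d}: CV accuracy={s:.3f}")
print("best k:", best_k)
```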

KL Divergence (for generative applications)

Measures distributional difference between original and PCA-compressed data. Important when PCA is used in generative pipelines.

Target: Smaller is better; 0 means the two distributions are identical

Evaluation Process

01. Plot scree plot (individual and cumulative explained variance ratios).
02. Identify the elbow in the scree plot: the k after which each additional PC explains < 1% variance.
03. Measure reconstruction MSE on test data at chosen k.
04. If PCA is a preprocessing step: sweep k values and evaluate downstream model performance via cross-validation.
05. Inspect top 3-5 component loading vectors. Do they correspond to interpretable patterns in the original features?
06. Check that the transformation is fit only on training data, then applied to test data.

Evaluation Traps

▸ Using explained variance ratio as the only criterion. The optimal k for reconstruction differs from the optimal k for downstream classification.
▸ Fitting PCA on the whole dataset (train + test) before train/test split. This is data leakage that inflates test performance.
▸ Comparing absolute reconstruction MSE values without normalizing by feature variance. A high MSE in raw pixel units may be fine if pixel variance is also high.
▸ Forgetting that PCA component signs are arbitrary. PC1 from one fit may be the negative of PC1 from another fit on the same data (due to SVD implementation details).

Real-World Interpretation Example

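Gene expression dataset: 500 samples, 20,000 genes. After StandardScaler, PCA with k=50 explains 73% of variance on the training data. Reconstruction MSE on test = 0.84 (vs. feature variance ≈ 1.0, so only about 16% of held-out variance is reconstructed; a gap this large relative to the training figure is expected when n is far smaller than d). Downstream survival model (Cox regression) with all 20K genes: C-index = 0.61. With top 50 PCs: C-index = 0.67. PCA both compressed the data and improved the model by removing noise dimensions, which is why downstream performance, not reconstruction alone, should drive the choice of k.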

Common Mistakes

Students

×Not centering the data before computing the covariance matrix — centering is mandatory, the formula C = XᵀX/(n-1) assumes X is already zero-mean.
×Confusing eigenvalues with explained variance ratios — eigenvalues are absolute variance amounts; divide by their sum to get proportions.
×Thinking PCA 'removes features' — it doesn't remove any original feature; it creates new features (linear combinations) and you choose how many new ones to keep.
×Applying PCA to binary or one-hot encoded features without understanding the consequence — PCA on binary features doesn't preserve binary structure in components.

Developers

×Calling pca.fit_transform(X_test) instead of pca.transform(X_test) — refitting on test data leaks its distribution into the transformation.
×Not wrapping PCA and downstream model in a sklearn Pipeline — risks fitting the scaler or PCA on test data in a loop.
×Using PCA with n_components > min(n_samples, n_features) — sklearn raises an error, but attempting too large k relative to data size is conceptually wrong anyway.
×Forgetting to invert the StandardScaler when reconstructing data from PCA for interpretability.
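
The last point is worth a concrete sketch: reconstruction has to undo both steps, PCA first and then the scaler. The Iris data and k = 2 are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data

scaler = StandardScaler().fit(X)
pca = PCA(n_components=2).fit(scaler.transform(X))

Z = pca.transform(scaler.transform(X))
X_scaled_hat = pca.inverse_transform(Z)          # still in standardized units
X_hat = scaler.inverse_transform(X_scaled_hat)   # back in original units

print("original column means:     ", X.mean(axis=0).round(2))
print("after PCA inverse only:    ", X_scaled_hat.mean(axis=0).round(2))
print("after PCA + scaler inverse:", X_hat.mean(axis=0).round(2))
```

Stopping after pca.inverse_transform leaves values in standardized units (column means near zero), which is easy to misread as a poor reconstruction.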

In Interviews

×Saying 'PCA maximizes correlation between components' — PCA maximizes variance (not correlation) and ensures zero correlation between components.
×Confusing eigenvectors of C with the projected data — eigenvectors (components_) are direction vectors in original feature space; projected data (Z) is the actual reduced-dimension representation.
×Not knowing that PCA requires centering — a very common interview question.
×Saying 'PCA is supervised' — PCA is entirely unsupervised. It does not use label information.

Real Projects

×Standardizing features that are already on the same scale (like pixel values 0-255) — sometimes unnecessary scaling distorts PCA results.
×Not checking whether PCA actually helps downstream task performance before deploying — sometimes raw features outperform PCA-transformed features.
×Using PCA on time series without accounting for temporal structure — shuffling samples doesn't affect PCA, but time series have autocorrelation that PCA ignores.
×Applying PCA transformation from a model trained on old data distribution to new data without monitoring for distribution shift in PC space.

Core ML Thinking Lens

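What kind of bias does this model have?

PCA assumes that useful structure is linear and that high variance means high information. Low-variance directions are discarded even when they matter.

What kind of variance does it have?

Component estimates become unstable when n is small relative to d or when neighboring eigenvalues are nearly equal.

How does it overfit?

PCA uses no labels, but with few samples the components fit sample-specific noise: reconstruction error on held-out data is much worse than on training data, and keeping too many components passes that noise to the downstream model.

How do we regularize it?

Keep fewer components, choose k by cross-validating the downstream model, and handle outliers before fitting (or use Robust PCA).

What kind of data does it like?

Complete, numeric, centered and scaled features that are correlated, with signal concentrated in a low-dimensional linear subspace.

What kind of data breaks it?

Breaks under unscaled features, extreme outliers, missing values, non-linear manifold structure, and distribution shift between fit time and transform time.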

Summary Cheat Sheet

Quick Revision Reference

Key Takeaways

PCA finds orthogonal directions of maximum variance (principal components) via eigendecomposition of the covariance matrix
ALWAYS center data (subtract mean) before PCA; also standardize if features have different scales
Covariance matrix: C = XᵀX/(n-1) where X is already centered
Projection: Z = XW where W contains the top-k eigenvectors as columns
Explained variance ratio of j-th PC = λⱼ / Σλᵢ; choose k at the scree plot elbow or where cumulative EVR ≥ 0.95
Reconstruction: X̂ = ZWᵀ + mean(X); squared reconstruction error = (n - 1) × sum of discarded eigenvalues
PCA is linear, unsupervised, and invertible — use Kernel PCA / UMAP for non-linear structure

Critical Formulas

Covariance Matrix

C = XᵀX / (n - 1), with X centered

Eigenvector Equation

Cwⱼ = λⱼwⱼ

Projection

Z = XW

Explained Variance Ratio

EVRⱼ = λⱼ / Σᵢλᵢ

Reconstruction

X̂ = ZWᵀ + mean(X)

Best For

✓ Preprocessing correlated high-dimensional features before linear models
✓ Visualization of high-dimensional data in 2D/3D
✓ Noise filtering when signal is low-rank
✓ Removing multicollinearity before regression

Avoid When

✗ Data has non-linear manifold structure (use UMAP/t-SNE)
✗ Labels are available and discrimination is the goal (use LDA)
✗ Features are already uncorrelated or sparse (use TruncatedSVD for sparse)
✗ Interpretability of features is required downstream

Interview Must-Know

★ Derive why eigenvectors maximize projected variance (Lagrange multiplier argument)

★ Explain the connection between PCA, SVD, and the covariance matrix

★ Know why centering is required and what happens if you skip it

★ Explain the scree plot and how to choose k

★ Compare PCA vs. LDA vs. t-SNE: supervision, linearity, when to use each

Interview Questions

Tricky Questions

These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.