In Plain English
PCA finds the directions in your data where variance is maximized, then projects your data onto those directions. It's a rotation that reframes your data so the first axis captures the most spread, the second captures the next most, and so on — letting you drop later axes with minimal information loss.
Why It Exists
High-dimensional data is simultaneously expensive to compute on, hard to visualize, and prone to the curse of dimensionality. PCA was developed to find compact, uncorrelated representations that preserve as much statistical structure (variance) as possible.
Problem It Solves
Given a dataset with d features that are often correlated, find k < d orthogonal directions (principal components) that together explain the majority of variance in the data, enabling dimensionality reduction without discarding critical information.
Real-Life Analogy
"Imagine photographing a 3D object. Depending on the angle, you capture more or less of the object's shape. PCA finds the camera angle from which the object casts the most spread-out shadow — that 2D shadow is your compressed representation. It throws away depth (a third dimension), but you chose the angle where depth was least important."
When To Use
- Features are highly correlated and you want to decorrelate them before modeling
- You need to visualize high-dimensional data in 2D or 3D
- Training is slow due to too many features and you need to reduce dimensionality
- You're building a feature extraction pipeline for images or text embeddings
- Removing noise from data where signal is concentrated in a low-dimensional subspace
- Before clustering (e.g., k-means) to avoid the curse of dimensionality
When NOT To Use
- Features are already uncorrelated — PCA provides no benefit and costs interpretability
- You need interpretable features — principal components are linear combinations with no semantic meaning
- Data has non-linear structure (use t-SNE, UMAP, or autoencoders instead)
- Your dataset is tiny (< 50 samples): components estimated from so few samples are unreliable
- Target variable matters for compression — PCA is unsupervised and ignores labels (use LDA instead)
Every dataset lives in a d-dimensional space, but the data usually doesn't fill that space uniformly — it's stretched in some directions and compressed in others. PCA identifies those stretched directions. Mathematically, it solves: 'Find the unit vector w such that the variance of the projection Xw is maximized.' This vector is the first principal component.
After finding the first component, PCA finds the second: the direction of maximum remaining variance, constrained to be orthogonal (perpendicular) to the first. Then the third, orthogonal to both previous ones. And so on. The result is a new coordinate system — a rotation of the original — where axes are uncorrelated and ordered by importance.
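To make the "direction of maximum variance" idea concrete, here is a minimal sketch, assuming a small synthetic 2D dataset (the data and the helper `projected_variance` are illustrative, not part of any library). It compares the variance along PC1 with the variance along many random unit directions and checks that PC1 and PC2 are perpendicular.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2D data: a correlated Gaussian cloud, then centered.
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])
X = X - X.mean(axis=0)

# PC1 is the eigenvector of the covariance matrix with the largest eigenvalue.
C = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(C)  # ascending order
pc1, pc2 = eigenvectors[:, -1], eigenvectors[:, -2]

def projected_variance(w):
    """Sample variance of the data projected onto unit vector w."""
    return np.var(X @ w, ddof=1)

# Compare against many random unit directions.
angles = rng.uniform(0.0, np.pi, size=1000)
random_dirs = np.column_stack([np.cos(angles), np.sin(angles)])
best_random = max(projected_variance(w) for w in random_dirs)

print("variance along PC1:   ", projected_variance(pc1))
print("best random direction:", best_random)
print("PC1 . PC2:            ", pc1 @ pc2)
```

No random direction should beat PC1, and the dot product of PC1 and PC2 should be zero up to floating-point error.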
The magic is that in practice, most real datasets have their variance concentrated in far fewer dimensions than the raw feature count. Image patches live on a low-dimensional manifold. Stock returns are driven by a handful of macro factors. Gene expression is controlled by a few regulatory pathways. PCA exposes this hidden low-dimensionality.
The Metaphor
"Imagine a cloud of points shaped like a squashed football (American football / rugby ball). The football points mostly left-right (high variance axis) and barely top-to-bottom (low variance axis). PCA finds the 'long axis' of the football and the 'short axis.' If you project every point onto just the long axis, you lose very little information — the short axis was barely doing anything. That's PCA: find the long axis, project, discard the short axis."
Beginner Mental Model
Think of PCA as rotating your data to find the 'most spread-out' direction, calling that the first axis (PC1), then finding the next most spread-out direction that's perpendicular to PC1 (that's PC2), and so on. Once you've found these new axes, you can keep only the first few — the ones that explain most variance — and get a compressed, decorrelated dataset.
Formal Definition
Given a centered data matrix X ∈ ℝⁿˣᵈ (n samples, d features, zero mean), PCA finds an orthonormal basis W = [w₁, w₂, ..., wₖ] ∈ ℝᵈˣᵏ where each wⱼ is an eigenvector of the sample covariance matrix C = XᵀX/(n-1). The columns of W are ordered by descending eigenvalue λ₁ ≥ λ₂ ≥ ... ≥ λₖ. The projected (reduced) data is Z = XW ∈ ℝⁿˣᵏ.
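Why eigenvectors? A short derivation sketch for the first component, using the symbols defined above. The Lagrange multiplier λ is introduced here for the derivation and turns out to be the eigenvalue.

```latex
\begin{aligned}
\operatorname{Var}(Xw) &= \frac{1}{n-1}\,\lVert Xw \rVert^2 = w^\top C w,
  \qquad C = \frac{X^\top X}{n-1} \\
w_1 &= \arg\max_{w}\; w^\top C w \quad \text{subject to } w^\top w = 1 \\
\mathcal{L}(w, \lambda) &= w^\top C w - \lambda\,(w^\top w - 1) \\
\nabla_w \mathcal{L} = 2Cw - 2\lambda w = 0 \;&\Longrightarrow\; C w = \lambda w \\
w^\top C w &= \lambda\, w^\top w = \lambda
\end{aligned}
```

Every stationary point is an eigenvector of C, and the variance it captures equals its eigenvalue, so the maximizer is the eigenvector with the largest eigenvalue. Repeating the argument with an added orthogonality constraint yields the remaining components in descending eigenvalue order.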
Key Terms
- Principal Component (PC)
- A direction in feature space (a d-dimensional unit vector) along which variance is maximized subject to orthogonality with all previous PCs. The j-th PC is the j-th eigenvector of the covariance matrix.
- Eigenvector
- A vector v such that Mv = λv for a matrix M. In PCA, eigenvectors of C define the principal component directions; they don't change direction when the covariance matrix acts on them, only scale.
- Eigenvalue (λⱼ)
- The scalar associated with an eigenvector. In PCA, λⱼ equals the variance of the data projected onto the j-th principal component. Larger eigenvalue = more variance explained.
- Covariance Matrix (C)
- A d×d symmetric positive semi-definite matrix where Cᵢⱼ = cov(xᵢ, xⱼ). Its diagonal entries are variances; off-diagonal entries are covariances. PCA diagonalizes C in the PC basis.
- Explained Variance Ratio
- The fraction of total variance explained by the j-th PC: λⱼ / Σᵢλᵢ. Summing the top-k ratios gives the total variance retained by a k-component PCA.
- Scree Plot
- A bar chart or line plot of eigenvalues (or explained variance ratios) in descending order. The 'elbow' point — where the curve bends — suggests the optimal number of components to retain.
- Reconstruction
- Approximating the original data from reduced components: X̂ = ZWᵀ. With C = XᵀX/(n-1), the squared reconstruction error is ||X - X̂||²_F = (n-1) × (sum of discarded eigenvalues).
- Loading Vector
- The weights in a principal component direction — each element of wⱼ tells you how much the corresponding original feature contributes to the j-th PC.
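The relationships between these terms can be checked numerically. The sketch below assumes synthetic correlated data and the n-1 covariance denominator. It verifies that each eigenvalue equals the variance of the corresponding score column, that the explained variance ratios sum to 1, and that the squared Frobenius reconstruction error equals (n-1) times the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 300, 5, 2

# Hypothetical correlated data, centered.
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
X = X - X.mean(axis=0)

C = X.T @ X / (n - 1)
eigenvalues, V = np.linalg.eigh(C)
eigenvalues, V = eigenvalues[::-1], V[:, ::-1]  # descending order

W = V[:, :k]        # loading vectors as columns
Z = X @ W           # principal component scores
X_hat = Z @ W.T     # rank-k reconstruction of the centered data

score_variances = Z.var(axis=0, ddof=1)
explained_variance_ratio = eigenvalues / eigenvalues.sum()
frobenius_error = np.sum((X - X_hat) ** 2)

print("eigenvalues (top k):   ", eigenvalues[:k])
print("variance of scores:    ", score_variances)
print("cumulative EVR at k:   ", explained_variance_ratio[:k].sum())
print("||X - X_hat||_F^2:     ", frobenius_error)
print("(n-1) * discarded sum: ", (n - 1) * eigenvalues[k:].sum())
```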
Step-by-Step Working
- 1. Center the data: subtract the column mean from each feature. X_centered = X - mean(X, axis=0). (Optional: also standardize by dividing by std if features are on different scales.)
- 2. Compute the sample covariance matrix: C = XᵀX / (n - 1). Shape: d×d.
- 3. Compute eigenvectors and eigenvalues of C: C = VΛVᵀ where V contains eigenvectors as columns and Λ is diagonal with eigenvalues.
- 4. Sort eigenvectors by descending eigenvalue: λ₁ ≥ λ₂ ≥ ... ≥ λd.
- 5. Choose k: inspect scree plot, select k components explaining ≥ 85–95% of variance.
- 6. Form projection matrix W = [v₁, v₂, ..., vₖ] ∈ ℝᵈˣᵏ (first k eigenvectors).
- 7. Project data: Z = X_centered @ W. Shape: n×k. These are the principal component scores.
- 8. (Optional) Reconstruct: X̂ = Z @ Wᵀ + mean(X). Compute reconstruction error to validate k.
Inputs
Numeric feature matrix X ∈ ℝⁿˣᵈ. All features must be numeric and complete. Features should be standardized if they have different units or scales.
Outputs
Projected matrix Z ∈ ℝⁿˣᵏ (k << d) containing principal component scores. Also: component vectors W ∈ ℝᵈˣᵏ, explained variance ratios, and optional reconstruction X̂.
Model Assumptions
Important Edge Cases
- ▸d > n (more features than samples): The covariance matrix C (d×d) has rank ≤ n-1 — at most n-1 non-zero eigenvalues. Use SVD of X directly rather than eigendecomposition of C for numerical stability.
- ▸Perfectly correlated features: covariance matrix is singular. Some eigenvalues are zero — the corresponding PCs explain 0 variance and should be discarded.
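- ▸Isotropic data (features uncorrelated with equal variance, so C is a multiple of the identity): all eigenvalues are equal and there is no dominant direction. Every PC explains the same variance, and the PCA rotation is arbitrary. Equal feature variances alone are not enough; correlated features with equal variances still produce unequal eigenvalues.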
- ▸k = d: No dimensionality reduction — retaining all PCs is equivalent to a pure rotation (lossless). Total explained variance = 100%.
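The d > n case can be illustrated with a thin SVD of the centered data, which avoids forming the d×d covariance matrix. This is a sketch on random data, assuming the n-1 denominator. The eigenvalues of C are recovered as s²/(n-1), and at most n-1 of them are non-zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 20, 100                      # more features than samples

X = rng.normal(size=(n, d))
X = X - X.mean(axis=0)

# Thin SVD of the centered data: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

eigenvalues = s**2 / (n - 1)        # eigenvalues of C = X^T X / (n-1)
components = Vt                     # rows are principal directions
n_nonzero = int(np.sum(eigenvalues > 1e-10 * eigenvalues[0]))

print("singular values returned:", s.shape[0])
print("non-zero eigenvalues:    ", n_nonzero)   # at most n - 1
```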
Role in the ML Pipeline
PCA typically sits between feature engineering and model training. After encoding and scaling features, PCA reduces dimensionality before passing data to classifiers, regressors, or clustering algorithms. It can also be used at the end of a pipeline for visualization (project to 2D after training).
Data Preprocessing
- 01.Handle missing values: impute before PCA — any NaN causes PCA to fail. KNN imputation or median imputation are common choices.
- 02.Encode categorical features: PCA operates on numeric data. One-hot encode categoricals first.
- 03.StandardScaler: CRITICAL before PCA unless all features share the same unit. PCA finds directions of maximum variance; unscaled features with large absolute ranges will dominate.
- 04.Check for outliers: extreme outliers can pull principal components. Consider winsorizing (capping) outliers or using Robust PCA.
- 05.Remove zero-variance features: features with variance zero (constants) contribute nothing and can cause numerical issues.
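The scaling step is the one most often skipped, so here is a small sketch of what goes wrong without it. The three features are hypothetical: two correlated measurements on a small scale and one unrelated feature recorded in much larger units. The helper `first_component` is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Hypothetical features: a and b are strongly correlated, c is unrelated
# but measured in units roughly 1000 times larger.
a = rng.normal(size=n)
b = 0.9 * a + 0.1 * rng.normal(size=n)
c = 1000.0 * rng.normal(size=n)
X = np.column_stack([a, b, c])

def first_component(X):
    """Unit eigenvector of the covariance matrix with the largest eigenvalue."""
    Xc = X - X.mean(axis=0)
    _, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return V[:, -1]

pc1_raw = first_component(X)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
pc1_scaled = first_component(X_scaled)

print("PC1 loadings, raw:   ", np.abs(pc1_raw).round(3))
print("PC1 loadings, scaled:", np.abs(pc1_scaled).round(3))
```

On the raw data PC1 is almost entirely the large-unit feature. After standardization PC1 loads on the two correlated features instead.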
Training Process
- 01.Fit PCA on training data only (call fit on X_train). This captures the training data's covariance structure.
- 02.Transform both train and test sets using the same fitted PCA (call transform). Never refit on test — that would be data leakage.
- 03.Inspect explained_variance_ratio_: plot cumulative sum to choose k (find the elbow or cross the 95% threshold).
- 04.Validate k: fit downstream model at several k values, compare cross-validation performance. PCA+model is a single pipeline in sklearn.
- 05.Examine component loadings (pca.components_) to understand what each PC represents.
- 06.Compute reconstruction error at chosen k to quantify information loss: ||X_train - X̂_train||_F².
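One way to follow these steps without leaking test data is to put the scaler, PCA, and model in a single sklearn Pipeline and let cross-validation choose k. The sketch below assumes the digits dataset bundled with scikit-learn and a logistic regression classifier. Both are illustrative choices, as is the grid of k values.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)          # 1797 samples, 64 features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=2000)),
])

# Cross-validation refits the scaler and PCA inside each training fold,
# so no held-out data leaks into the transformation.
search = GridSearchCV(pipe, {"pca__n_components": [5, 10, 20, 40]}, cv=5)
search.fit(X_train, y_train)

print("best k:", search.best_params_["pca__n_components"])
print("held-out accuracy:", search.score(X_test, y_test))
```

Because the fitted pipeline is applied as a whole, the test set only ever passes through `transform`, never `fit`.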
Hyperparameters
| Name | Description | Typical |
|---|---|---|
| n_components (k) | Number of principal components to retain. Most important hyperparameter. | Choose via scree plot or cross-validation. Often 10–50 for images, 2–3 for visualization. |
| svd_solver | Algorithm used to compute SVD. 'full' uses LAPACK; 'randomized' uses Halko et al. approximation. | 'randomized' for n > 500 and d > 500; 'full' for small data |
| whiten | If True, divides each PC by its standard deviation, making components unit variance. | False (default). True useful before algorithms like ICA or k-means. |
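A short sketch of two of these hyperparameters in scikit-learn, assuming synthetic data driven by three latent factors. Passing a float to `n_components` selects k from the cumulative explained variance, and `whiten=True` rescales the scores to unit variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Hypothetical data: 10 features driven by 3 latent factors plus small noise.
latent = rng.normal(size=(400, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(400, 10))

# A float in (0, 1) asks for the smallest k whose cumulative
# explained variance ratio reaches that fraction.
pca_auto = PCA(n_components=0.95, svd_solver="full").fit(X)
print("k chosen for 95% variance:", pca_auto.n_components_)

# whiten=True rescales each score column to unit variance.
Z_plain = PCA(n_components=3).fit_transform(X)
Z_white = PCA(n_components=3, whiten=True).fit_transform(X)
print("score variances, plain:   ", Z_plain.var(axis=0, ddof=1).round(3))
print("score variances, whitened:", Z_white.var(axis=0, ddof=1).round(3))
```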
Implementation Checklist
- 1. pip install scikit-learn numpy matplotlib
- 2. Load data and check for missing values, dtypes
- 3. Fit StandardScaler on training features, transform train and test
- 4. Fit PCA on X_train_scaled (start with n_components=d or n_components=0.95 for auto-selection)
- 5. Plot cumulative explained variance; pick k at the elbow or the 95% threshold
- 6. Transform data: X_train_pca = pca.transform(X_train_scaled)
- 7. Feed X_train_pca into downstream model. Wrap in sklearn Pipeline for cleanliness.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA as skPCA  # used only for the scree plot


class PCA:
    def __init__(self, n_components: int):
        self.n_components = n_components
        self.components_ = None  # W: shape (n_components, d)
        self.explained_variance_ = None  # eigenvalues
        self.explained_variance_ratio_ = None
        self.mean_ = None

    def fit(self, X: np.ndarray) -> "PCA":
        n_samples, n_features = X.shape

        # Step 1: Center
        self.mean_ = X.mean(axis=0)  # (d,)
        X_c = X - self.mean_  # (n, d)

        # Step 2: Covariance matrix
        # Use (n-1) denominator (unbiased)
        C = (X_c.T @ X_c) / (n_samples - 1)  # (d, d)

        # Step 3: Eigendecomposition
        eigenvalues, eigenvectors = np.linalg.eigh(C)
        # eigh returns ascending order; reverse to descending
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]  # (d, d)

        # Step 4: Store top-k
        self.components_ = eigenvectors[:, :self.n_components].T  # (k, d)
        self.explained_variance_ = eigenvalues[:self.n_components]
        total_var = eigenvalues.sum()
        self.explained_variance_ratio_ = self.explained_variance_ / total_var
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        X_c = X - self.mean_
        return X_c @ self.components_.T  # (n, k)

    def fit_transform(self, X: np.ndarray) -> np.ndarray:
        return self.fit(X).transform(X)

    def inverse_transform(self, Z: np.ndarray) -> np.ndarray:
        return Z @ self.components_ + self.mean_  # (n, d)

    def reconstruction_error(self, X: np.ndarray) -> float:
        Z = self.transform(X)
        X_hat = self.inverse_transform(Z)
        return float(np.mean((X - X_hat) ** 2))  # MSE


# --- Demo: 3D data → 2D ---
np.random.seed(42)
n = 200
# True signal lives on a 2D plane embedded in 3D
t = np.random.randn(n, 2)
noise = np.random.randn(n, 3) * 0.1
A = np.array([[1, 0.8, 0.2],
              [0, 0.6, 0.9]])  # mixing matrix
X = t @ A + noise  # (200, 3)

pca = PCA(n_components=2)
Z = pca.fit_transform(X)

print("Original shape:", X.shape)  # (200, 3)
print("Reduced shape: ", Z.shape)  # (200, 2)
print("Explained variance ratios:", pca.explained_variance_ratio_.round(4))
print("Cumulative variance: {:.1f}%".format(
    pca.explained_variance_ratio_.sum() * 100))
print("Reconstruction MSE:", round(pca.reconstruction_error(X), 6))

# Scree plot
pca_full = skPCA().fit(X)
plt.figure(figsize=(8, 3))
plt.subplot(1, 2, 1)
plt.bar(range(1, 4), pca_full.explained_variance_ratio_ * 100)
plt.xlabel("Principal Component")
plt.ylabel("% Variance Explained")
plt.title("Scree Plot")
plt.subplot(1, 2, 2)
plt.scatter(Z[:, 0], Z[:, 1], alpha=0.6)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("2D Projection")
plt.tight_layout()
plt.savefig("pca_demo.png", dpi=120)
```
Sample Input
X_train: shape (1000, 64) — 64-pixel flattened image patches, values in [0, 255]. After StandardScaler.
Sample Output
Z_train: shape (1000, 29) — 29 principal components retaining 95% variance. explained_variance_ratio_: [0.148, 0.102, 0.078, ...]. Reconstruction MSE: 0.023.
Key Implementation Insights
- →ALWAYS scale before PCA when features have different units. Without scaling, the feature with the largest variance (often just the one measured in bigger numbers) will dominate all principal components.
- →Use n_components=0.95 in sklearn's PCA to automatically select k so that 95% of variance is retained — much more robust than hardcoding k.
- →Fit PCA on training data only. Transform both train and test. Fitting on all data leaks test distribution info into the transformation.
- →Principal components can have arbitrary sign — PC1 in one run might point opposite to another run. Signs are not meaningful; only directions and magnitudes matter.
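- →For n >> d (many more samples than features), eigendecomposition of the d×d covariance matrix is cheap. For d >> n, work from the SVD of X directly instead of forming the d×d matrix. With sklearn you never build C yourself; svd_solver='auto' picks a solver based on the data shape and n_components.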
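The sign ambiguity mentioned above is easy to confirm. In this sketch on synthetic data, flipping the sign of PC1 flips the sign of the PC1 scores but leaves the reconstruction unchanged, which is why signs should not be interpreted.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
mean = X.mean(axis=0)
Xc = X - mean

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:2].T                              # top-2 components as columns
W_flipped = W * np.array([-1.0, 1.0])     # flip the sign of PC1 only

Z, Z_flipped = Xc @ W, Xc @ W_flipped
X_hat = Z @ W.T + mean
X_hat_flipped = Z_flipped @ W_flipped.T + mean

print("PC1 scores differ only by sign:", np.allclose(Z[:, 0], -Z_flipped[:, 0]))
print("reconstructions identical:     ", np.allclose(X_hat, X_hat_flipped))
```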
Common Implementation Mistakes
- ✗Fitting StandardScaler or PCA on the combined (train+test) data before splitting — this is data leakage.
- ✗Not centering data before computing the covariance matrix — the covariance formula assumes zero mean.
- ✗Interpreting PCA components as original features — they're linear combinations with no direct semantic meaning.
- ✗Using PCA to remove outliers. PCA is itself sensitive to outliers; for outlier-contaminated data use Robust PCA (scikit-learn has no built-in estimator, so this means a third-party or hand-written RPCA implementation).
- ✗Forgetting to apply the same transformation to test data — always use pca.transform(X_test), never pca.fit_transform(X_test).
High-Dimensional Tabular (d > 100)
PCA shines when many features are correlated. It compresses correlated features into uncorrelated PCs, dramatically reducing input size while retaining structure.
Image Data (pixels)
Natural images have strong correlations between adjacent pixels. PCA (Eigenfaces) reduces 10,000-pixel faces to ~100 dimensions while preserving identity-discriminating structure.
Small Dataset (< 100 rows)
With few samples, estimated covariance matrix is unreliable. Eigenvectors computed from n=50 samples in d=100 dimensions are mostly noise.
Time Series Data
PCA can extract common factors across multiple time series (e.g., stock returns). But temporal structure (autocorrelation) is ignored — PCA treats rows as i.i.d.
Sparse Data (NLP bag-of-words)
Truncated SVD (sklearn's TruncatedSVD) is the PCA-like option for sparse matrices: it computes the same kind of low-rank projection but skips centering, because centering would destroy sparsity. Applied to document-term matrices it is also known as LSA.
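A minimal sketch of the sparse case, assuming a random sparse matrix as a stand-in for a real bag-of-words matrix (the shape, density, and number of components are arbitrary illustrative values).

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Hypothetical sparse matrix standing in for a document-term matrix:
# 1000 "documents", 5000 "terms", about 1% non-zero entries.
X = sparse.random(1000, 5000, density=0.01, format="csr", random_state=0)

svd = TruncatedSVD(n_components=50, random_state=0)
Z = svd.fit_transform(X)   # the input stays sparse and is not centered

print("input is sparse:  ", sparse.issparse(X))
print("reduced shape:    ", Z.shape)
print("variance retained:", svd.explained_variance_ratio_.sum())
```

Random data has no low-rank structure, so the retained variance here will be small. On real text the leading components usually capture far more.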
Non-Linear Manifold Data (Swiss roll, circles)
PCA can only find linear structure. Data on non-linear manifolds (curved surfaces, concentric rings) will not compress well: PCA preserves global variance along straight directions, not local manifold structure.
Interactive: Projection Direction and Explained Variance
Scree Plot — Explained Variance per Component
Bar chart showing how much variance each principal component explains. Look for the 'elbow' — the point where adding more components gives diminishing returns. The red dashed line shows the cumulative threshold for 95% variance.
2D PCA Projection — Iris Dataset
Each point is a flower sample projected onto its first two principal components. Colors represent the three species. Well-separated clusters indicate that PC1 and PC2 capture the species-discriminating variance, even though PCA used no label information.
Reconstruction Error vs. Number of Components
Mean squared reconstruction error decreases as more components are retained. The sharp drop in the first few components confirms most variance is concentrated there. After the elbow, adding components provides diminishing error reduction.
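The curve described here can be reproduced in a few lines. This sketch assumes the Iris dataset bundled with scikit-learn and, for brevity, measures the error on the same data PCA was fit on. For model selection, measure it on held-out data instead.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)   # (150, 4)

errors = []
for k in range(1, X.shape[1] + 1):
    pca = PCA(n_components=k).fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))
    errors.append(float(np.mean((X - X_hat) ** 2)))

for k, err in enumerate(errors, start=1):
    print(f"k={k}: reconstruction MSE = {err:.4f}")
```

The error never increases as k grows and reaches zero (up to floating-point error) at k = d, where PCA is a pure rotation.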
Advantages
Removes multicollinearity
Principal components are by construction orthogonal (uncorrelated). Any downstream linear model applied to PC scores will never suffer from multicollinearity — VIFs are exactly 1 for all components.
Reduces overfitting in downstream models
Compressing d features to k << d principal components reduces the effective number of model parameters. The regularization effect is strongest when discarded PCs contain mostly noise.
Noise reduction
When signal is concentrated in the top PCs and noise is spread across all directions uniformly, discarding bottom PCs removes noise. This is why PCA-preprocessed data can give better downstream model accuracy than raw features.
Fast and exact (closed-form)
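PCA has a closed-form solution, unique up to the sign of each component when the eigenvalues are distinct. With the exact solvers there is no iterative optimization, no local minima, and no random initialization (the 'randomized' solver is the approximate exception). For moderate n and d the SVD typically completes in seconds; for very large matrices the randomized solver keeps it tractable.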
Excellent for visualization
Projecting to 2D or 3D with PCA is the fastest way to sanity-check clustering structure, class separability, or outliers in a high-dimensional dataset. Standard first step in any EDA workflow.
Information-theoretically optimal for linear compression
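PCA is the optimal linear autoencoder in the MSE sense: no other rank-k linear projection has a smaller reconstruction error. The Eckart-Young theorem formalizes this, and the result holds without any Gaussian assumption. Under Gaussian assumptions, maximizing retained variance can also be read as maximizing retained information.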
Limitations
Linear only — misses non-linear structure
PCA can only find linear subspaces. Swiss roll, concentric circles, or any manifold with curvature will not be compressed effectively. UMAP and t-SNE handle these but are slower and stochastic.
Destroys interpretability
Principal components are linear combinations of ALL original features. You can no longer say 'this variable caused this prediction.' PC1 might mix age + income + education — a black box direction.
Sensitive to feature scaling
Without StandardScaler, features with large absolute variance dominate all PCs. This is a silent failure mode — you think you're doing PCA on all features but you're effectively PCA-ing only the largest-scale feature.
Discards low-variance directions that may be class-discriminating
PCA is unsupervised — it ignores labels. A direction with low variance might be exactly what separates classes. Use Linear Discriminant Analysis (LDA) when labels are available and discrimination is the goal.
Not robust to outliers
Outliers inflate covariance estimates and pull principal component directions toward themselves. A single extreme point can dramatically rotate PC1 away from the true signal direction.
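The effect of a single outlier can be demonstrated directly. This sketch assumes a synthetic 2D cloud elongated along the x-axis, then adds one extreme point far off that axis. The helper `pc1_angle_degrees` is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical clean data: strongly elongated along the x-axis.
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.3])

def pc1_angle_degrees(X):
    """Angle between PC1 and the x-axis, in [0, 90] degrees."""
    Xc = X - X.mean(axis=0)
    _, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
    pc1 = V[:, -1]
    return float(np.degrees(np.arctan2(abs(pc1[1]), abs(pc1[0]))))

X_outlier = np.vstack([X, [[0.0, 100.0]]])   # one extreme point off-axis

print("PC1 angle, clean data:  ", round(pc1_angle_degrees(X), 1))
print("PC1 angle, with outlier:", round(pc1_angle_degrees(X_outlier), 1))
```

With clean data PC1 stays close to the x-axis. One added point out of 201 is enough to swing it toward the y-axis.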
Eigenfaces — face recognition preprocessing
Each 100×100 face image is a 10,000-dimensional vector. PCA reduces this to ~150 principal components (Eigenfaces) that capture lighting, orientation, and identity. A nearest-neighbor classifier in this 150D space achieves competitive face recognition accuracy.
Factor model construction for portfolio optimization
PCA on a correlation matrix of 500 stock returns extracts a market factor (PC1, which typically explains the largest single share of variance), industry sectors (PC2-PC5), and style factors. Statistical factors of this kind are a common input to risk models at quantitative asset managers.
Population stratification in GWAS studies
Genome-wide association studies have ~1M SNP features per person. PCA of the genotype matrix reveals population clusters (European, African, Asian ancestry) as distinct regions in PC1-PC2 space. These PCs are used as covariates to control for ancestry confounding.
Latent Semantic Analysis (LSA)
TF-IDF document-term matrix (sparse, 50K×100K) is decomposed via truncated SVD (equivalent to PCA without centering) into 300-dimensional topic vectors. Documents with similar topics cluster in this space regardless of exact word choice.
Process monitoring with multivariate control charts
Semiconductor fabrication involves 100+ process parameters measured per wafer. PCA reduces these to ~5 PCs representing 'normal variation modes.' A wafer deviating from normal in PC space triggers an alert — much more sensitive than univariate control charts.
PCA is the workhorse of linear dimensionality reduction. Here's how it compares to the most important alternatives:
t-SNE
Similarity
Both reduce dimensionality for visualization
Key Difference
t-SNE is non-linear, preserves local neighborhood structure, and is only suitable for 2D/3D visualization (not general compression). PCA preserves global variance structure and can produce any k dimensions.
Choose When
t-SNE for visualization of cluster structure; PCA for general compression and preprocessing.
UMAP
Similarity
Both project high-dimensional data to lower dimensions
Key Difference
UMAP preserves both local and global structure better than t-SNE. It's much faster and produces stable results usable for downstream tasks. But it's non-linear and stochastic.
Choose When
UMAP when data has non-linear manifold structure; PCA when you need linear, deterministic, invertible compression.
Linear Discriminant Analysis (LDA)
Similarity
Both are linear dimensionality reduction methods
Key Difference
LDA uses class labels to find directions that maximize class separation (between-class variance / within-class variance). PCA ignores labels and maximizes total variance. LDA is supervised; PCA is unsupervised.
Choose When
LDA when labels are available and classification is the downstream task. PCA for unsupervised scenarios or when labels are unavailable.
Autoencoder (Neural)
Similarity
Both learn compressed representations
Key Difference
Autoencoders learn non-linear compressions. A linear autoencoder with MSE loss provably learns the same subspace as PCA. Neural autoencoders are more powerful but require more data and tuning.
Choose When
PCA for tabular data up to ~10K dimensions. Autoencoders for images, sequences, or when non-linear compression is needed.
| Property | PCA | t-SNE | UMAP | LDA |
|---|---|---|---|---|
| Linear | ✓ Yes | ✗ No | ✗ No | ✓ Yes |
| Supervised | ✗ No | ✗ No | ✗ No | ✓ Yes |
| Invertible | ✓ Yes | ✗ No | Partial | ✓ Yes |
| Any k | ✓ Yes | 2-3 only | 2-3 best | ≤ C-1 |
| Speed (large d) | ⚡ Fast | 🐢 Slow | 🚀 Fast | ⚡ Fast |
| Handles non-linear | ✗ No | ✓ Yes | ✓ Yes | ✗ No |
Choose Principal Component Analysis when:
Features are correlated, you need linear/invertible reduction, you want a preprocessing step before a downstream model, or you need fast visualization of high-dimensional data.
Explained Variance Ratio (Cumulative)
Primary metric for choosing k. A cumulative ratio of 0.95 means 5% of variance is discarded. The right threshold is domain-specific: applications that need high fidelity keep more (for example 99%), while aggressive compression may accept less (for example 80%).
Target: > 0.90 for most applications; > 0.95 for conservative compression
Reconstruction MSE
Average squared deviation between original and reconstructed values. Low MSE = good compression. Measure this on held-out test data (not training data).
Target: Domain-dependent — compare to variance of raw features
Downstream Task Performance
The real measure of PCA quality: does compressing improve or hurt the final task? Sweep k from 1 to d and plot CV performance vs. k.
Target: Performance should plateau near optimal k, drop for too few PCs
KL Divergence (for generative applications)
Measures distributional difference between original and PCA-compressed data. Important when PCA is used in generative pipelines.
Target: Smaller is better; 0 means the original and reconstructed distributions are identical
Evaluation Process
- 01.Plot scree plot (individual and cumulative explained variance ratios).
- 02.Identify the elbow in the scree plot: the k after which each additional PC explains < 1% variance.
- 03.Measure reconstruction MSE on test data at chosen k.
- 04.If PCA is a preprocessing step: sweep k values and evaluate downstream model performance via cross-validation.
- 05.Inspect top 3-5 component loading vectors: do they correspond to interpretable patterns in the original features?
- 06.Check that the transformation is fit only on training data, then applied to test data.
Evaluation Traps
- ▸Using explained variance ratio as the only criterion — the optimal k for reconstruction differs from the optimal k for downstream classification.
- ▸Fitting PCA on the whole dataset (train + test) before train/test split — this is data leakage that inflates test performance.
- ▸Comparing absolute reconstruction MSE values without normalizing by feature variance — a high MSE in raw pixel units may be fine if pixel variance is also high.
- ▸Forgetting that PCA component signs are arbitrary — PC1 from one fit may be the negative of PC1 from another fit on the same data (due to SVD implementation details).
Real-World Interpretation Example
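Gene expression dataset: 500 samples, 20,000 genes. After StandardScaler, PCA with k=50 explains 73% of training variance. Reconstruction MSE on test = 0.84 against a feature variance of about 1.0, so only about 16% of held-out variance is reconstructed, not 84%. A gap like this between training and test is plausible when d >> n, because many components partly fit sample-specific noise. Downstream survival model (Cox regression) with all 20K genes: C-index = 0.61. With top 50 PCs: C-index = 0.67. Even with a lossy reconstruction, PCA both compressed the input and improved the model by removing noise dimensions.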
Students
- ×Not centering the data before computing the covariance matrix — centering is mandatory, the formula C = XᵀX/(n-1) assumes X is already zero-mean.
- ×Confusing eigenvalues with explained variance ratios — eigenvalues are absolute variance amounts; divide by their sum to get proportions.
- ×Thinking PCA 'removes features' — it doesn't remove any original feature; it creates new features (linear combinations) and you choose how many new ones to keep.
- ×Applying PCA to binary or one-hot encoded features without understanding the consequence — PCA on binary features doesn't preserve binary structure in components.
Developers
- ×Calling pca.fit_transform(X_test) instead of pca.transform(X_test) — refitting on test data leaks its distribution into the transformation.
- ×Not wrapping PCA and downstream model in a sklearn Pipeline — risks fitting the scaler or PCA on test data in a loop.
- ×Using PCA with n_components > min(n_samples, n_features) — sklearn raises an error, but attempting too large k relative to data size is conceptually wrong anyway.
- ×Forgetting to invert the StandardScaler when reconstructing data from PCA for interpretability.
In Interviews
- ×Saying 'PCA maximizes correlation between components' — PCA maximizes variance (not correlation) and ensures zero correlation between components.
- ×Confusing eigenvectors of C with the projected data — eigenvectors (components_) are direction vectors in original feature space; projected data (Z) is the actual reduced-dimension representation.
- ×Not knowing that PCA requires centering — a very common interview question.
- ×Saying 'PCA is supervised' — PCA is entirely unsupervised. It does not use label information.
Real Projects
- ×Standardizing features that are already on the same scale (like pixel values 0-255) — sometimes unnecessary scaling distorts PCA results.
- ×Not checking whether PCA actually helps downstream task performance before deploying — sometimes raw features outperform PCA-transformed features.
- ×Using PCA on time series without accounting for temporal structure — shuffling samples doesn't affect PCA, but time series have autocorrelation that PCA ignores.
- ×Applying PCA transformation from a model trained on old data distribution to new data without monitoring for distribution shift in PC space.
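What kind of bias does this model have?
PCA assumes the important structure is linear and that high variance means high importance. Both assumptions bias it against curved manifolds and against low-variance directions that happen to separate classes.
What kind of variance does it have?
Estimated components become unstable when samples are few relative to features (small n, large d), when eigenvalues are close together, or when outliers are present.
How does it overfit?
By keeping too many components: trailing PCs fit sample-specific noise, which shows up as low reconstruction error on training data but higher error, or weaker downstream performance, on validation and test data.
How do we regularize it?
Keep fewer components, choose k by cross-validated downstream performance or held-out reconstruction error, and clean or cap outliers before fitting.
What kind of data does it like?
Numeric, complete, standardized features that are strongly correlated, with most variance concentrated in a low-dimensional linear subspace.
What kind of data breaks it?
Unscaled features in mixed units, extreme outliers, non-linear manifold structure, train/test leakage during fitting, and distribution shift between the data PCA was fit on and the data it transforms.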
Quick Revision Reference
Key Takeaways
- PCA finds orthogonal directions of maximum variance (principal components) via eigendecomposition of the covariance matrix
- ALWAYS center data (subtract mean) before PCA; also standardize if features have different scales
- Covariance matrix: C = XᵀX/(n-1) where X is already centered
- Projection: Z = XW where W contains the top-k eigenvectors as columns
- Explained variance ratio of j-th PC = λⱼ / Σλᵢ; choose k at the scree plot elbow or where cumulative EVR ≥ 0.95
- Reconstruction: X̂ = ZWᵀ + mean(X); squared reconstruction error (Frobenius norm) = (n-1) × sum of discarded eigenvalues
- PCA is linear, unsupervised, and approximately invertible (exactly invertible only when k = d); use Kernel PCA / UMAP for non-linear structure
Critical Formulas
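For quick reference, the formulas used throughout this page are collected below. This is a summary reconstructed from the definitions above, with X denoting the centered data matrix and μ (introduced here) denoting the vector of column means that was subtracted.

```latex
\begin{aligned}
C &= \frac{X^\top X}{n-1} && \text{(sample covariance of centered } X\text{)} \\
C\,w_j &= \lambda_j\, w_j, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d && \text{(principal components)} \\
Z &= XW, \quad W = [w_1, \dots, w_k] && \text{(projection)} \\
\mathrm{EVR}_j &= \frac{\lambda_j}{\sum_{i=1}^{d} \lambda_i} && \text{(explained variance ratio)} \\
\hat{X} &= ZW^\top \;(+\,\mu \text{ to return to original coordinates}) && \text{(reconstruction)} \\
\lVert X - ZW^\top \rVert_F^2 &= (n-1)\sum_{j=k+1}^{d} \lambda_j && \text{(reconstruction error)}
\end{aligned}
```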
Best For
- ✓Preprocessing correlated high-dimensional features before linear models
- ✓Visualization of high-dimensional data in 2D/3D
- ✓Noise filtering when signal is low-rank
- ✓Removing multicollinearity before regression
Avoid When
- ✗Data has non-linear manifold structure (use UMAP/t-SNE)
- ✗Labels are available and discrimination is the goal (use LDA)
- ✗Features are already uncorrelated or sparse (use TruncatedSVD for sparse)
- ✗Interpretability of features is required downstream
Interview Must-Know
These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.