ML Atlas

Linear Regression

Fit a line. Predict a number. Understand everything.

Beginner · Supervised · Regression · Math Heavy
28 min read
Basic calculus (derivatives) · Matrix multiplication · Statistical concepts: mean, variance
  • House price prediction engines (Zillow, Redfin)
  • Ad spend → revenue forecasting at every major tech company
  • Medical dosage estimation and clinical trial analysis
  • Feature importance baseline in Kaggle competitions
  • Baseline model in any regression pipeline before trying complex models
01

In Plain English

Linear Regression finds the best-fit straight line through your data so you can predict a continuous number (like price or temperature) from one or more input features.

Why It Exists

Humans noticed that many real-world relationships are approximately linear — more hours studied → higher grade, more square footage → higher price. We needed a principled mathematical way to find and quantify that relationship.

Problem It Solves

Given a dataset of (input, output) pairs, find the linear equation that best explains the output. Then use that equation to predict outputs for new, unseen inputs.

Real-Life Analogy

"Imagine you're buying groceries. You notice apples cost roughly ₹5 each. You have 7 apples — you mentally multiply 7 × 5 = ₹35. Linear regression is exactly that: finding that '5 per apple' constant from past receipts, then using it to predict future bills."

When To Use

  • Target variable is continuous (price, temperature, salary, demand)
  • You suspect a roughly linear relationship between features and target
  • You need an interpretable model (coefficients have clear meaning)
  • Dataset is small to medium sized and noise is Gaussian
  • You want a fast baseline before trying complex models
  • Feature importance or effect size matters as much as accuracy

When NOT To Use

  • Target is categorical (use Logistic Regression or tree-based methods)
  • Relationship is clearly non-linear (polynomial, exponential, etc.)
  • Features have severe multicollinearity without regularization
  • You have massive outliers that haven't been cleaned
  • Number of features >> number of samples (use Ridge/Lasso instead)
  • You need to capture complex interactions without feature engineering
02

Imagine plotting a scatter of points on graph paper: x-axis is house size, y-axis is price. You eyeball a line through the cloud of points that 'fits best'. Linear regression formalizes what 'best' means: minimize the total squared distance between each point and the line.

Every predicted value is just a weighted sum of input features. The model learns the weights. Once learned, prediction is a single dot product — O(features) time, essentially free.

The loss surface of linear regression is a convex bowl, so there are no local minima to get trapped in. When the features are not perfectly collinear the minimum is unique, and you can reach it analytically (closed-form OLS) or iteratively (gradient descent). Both converge to the same answer.

The Metaphor

"Think of it like adjusting a seesaw: you have data points sitting at different heights along the beam. Linear regression finds the exact fulcrum position and angle that minimizes how far each person is from sitting level."

Beginner Mental Model

Start with y = mx + b from school. Linear regression finds the exact m (slope) and b (intercept) that makes this equation best fit all your data points simultaneously. For multiple features, it extends to y = w1·x1 + w2·x2 + ... + b — one coefficient per feature.
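As a minimal sketch of this mental model, the snippet below recovers m and b from four made-up points that lie exactly on y = 2x + 1. It assumes NumPy is available and uses `np.polyfit` with degree 1, which performs a least-squares line fit; the data values are illustrative, not taken from any real dataset.

```python
import numpy as np

# Illustrative data lying exactly on y = 2x + 1 (values are made up).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# A degree-1 polyfit is a least-squares line fit; it returns [slope, intercept].
m, b = np.polyfit(x, y, deg=1)

# Prediction for a new input is just m * x_new + b.
x_new = 5.0
y_new = m * x_new + b
print(f"m={m:.2f}, b={b:.2f}, prediction at x=5: {y_new:.2f}")
```

With noisy data the recovered m and b would only approximate the true values; the noise-free case is used here so the result is easy to check by hand.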

03

Given a dataset {(x⁽ⁱ⁾, y⁽ⁱ⁾)}ᵢ₌₁ⁿ where x⁽ⁱ⁾ ∈ ℝᵈ and y⁽ⁱ⁾ ∈ ℝ, linear regression models the conditional expectation E[Y|X] = Xw + b, finding parameters w ∈ ℝᵈ and b ∈ ℝ that minimize the Mean Squared Error (MSE) loss: L(w,b) = (1/n) Σᵢ (y⁽ⁱ⁾ - (w·x⁽ⁱ⁾ + b))².

Coefficient / Weight (w)
The slope of the hyperplane in a particular feature dimension. Tells you how much y changes per unit increase in xⱼ, all else equal.
Intercept / Bias (b)
The predicted value of y when all features are zero. Often not meaningful alone but crucial for the model's calibration.
Residual
The difference between actual y and predicted ŷ for a training sample: eᵢ = yᵢ - ŷᵢ. Linear regression minimizes the sum of squared residuals.
OLS (Ordinary Least Squares)
The analytical solution to linear regression that directly computes optimal w via the normal equations: w* = (XᵀX)⁻¹Xᵀy.
MSE (Mean Squared Error)
The loss function: average of squared residuals. Squaring penalizes large errors more, giving the model incentive to avoid big misses.
R² (Coefficient of Determination)
Measures what fraction of variance in y is explained by the model. R²=1 means perfect fit; R²=0 means model is no better than predicting the mean.
Homoscedasticity
Assumption that residuals have constant variance across all values of X. Violated by heteroscedastic data (e.g., variance increases with feature value).
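To make the terms above concrete, here is a small sketch that computes residuals, MSE, and R² by hand. The target and prediction values are hypothetical and were chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical targets and predictions, chosen for easy arithmetic.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true - y_pred                      # e_i = y_i - y_hat_i
mse = np.mean(residuals ** 2)                    # average squared residual
ss_res = np.sum(residuals ** 2)                  # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # variation around the mean
r2 = 1.0 - ss_res / ss_tot

print("residuals:", residuals)
print("MSE:", mse)
print("R²:", r2)
```

Here the residuals are [1, 0, -1, 0], so the sum of squared residuals is 2, the MSE is 2/4 = 0.5, and R² = 1 - 2/20 = 0.9.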
  1. Collect training data: n samples of (features x⁽ⁱ⁾, target y⁽ⁱ⁾).
  2. Represent data as matrix X (n×d) and vector y (n×1).
  3. Add a column of ones to X for the intercept term (or keep bias separate).
  4. Define the prediction: ŷ = Xw.
  5. Define the loss: L = (1/n)||y - Xw||².
  6a. (OLS path) Solve analytically: w* = (XᵀX)⁻¹Xᵀy.
  6b. (GD path) Initialize w=0, iterate: w ← w - α·∇L = w - α·(2/n)Xᵀ(Xw - y), where α is the learning rate.
  7. Prediction: ŷ_new = X_new · w*.
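The closed form in the OLS path follows from setting the gradient of the loss to zero. A short derivation, assuming X already contains the column of ones and XᵀX is invertible:

```latex
\begin{aligned}
L(w) &= \tfrac{1}{n}\,\lVert y - Xw \rVert^{2}
      = \tfrac{1}{n}\,(y - Xw)^{\top}(y - Xw) \\
\nabla_{w} L &= \tfrac{2}{n}\, X^{\top}(Xw - y) \\
\nabla_{w} L = 0 \;&\Longrightarrow\; X^{\top}X\, w = X^{\top}y
  \;\Longrightarrow\; w^{*} = (X^{\top}X)^{-1}X^{\top}y
\end{aligned}
```

The same gradient, (2/n)Xᵀ(Xw - y), is what the gradient descent path follows step by step, which is why both paths end at the same w*.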

Feature matrix X ∈ ℝⁿˣᵈ (n samples, d features). Each feature should be numeric; categorical features must be encoded.

Continuous scalar prediction ŷ ∈ ℝ for each input sample.

01. Linearity: The relationship between X and E[Y|X] is linear.
02. Independence: Each training sample is independently drawn.
03. Homoscedasticity: Residuals have constant variance (no heteroscedasticity).
04. Normality of residuals: Residuals are approximately normally distributed (important for inference, not prediction).
05. No perfect multicollinearity: Features are not exact linear combinations of each other (makes XᵀX invertible).
  • n < d (underdetermined system): XᵀX is not invertible. OLS has infinitely many solutions. Use Ridge regression.
  • Perfect multicollinearity: Two features are identical or one is a linear combination of others. XᵀX is singular. Use Ridge or remove redundant features.
  • All targets same value: Model learns w=0, b=constant. R² is mathematically undefined because the total variance of y is zero. Not a model failure, but a data issue.
  • Single feature with zero variance: Division by zero in normalization. Drop that feature.
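The perfect-multicollinearity case can be reproduced in a few lines. The sketch below builds a synthetic feature matrix whose second column duplicates the first, shows that XᵀX loses rank, and then applies the Ridge closed form (XᵀX + λI)⁻¹Xᵀy as one possible fix. The value of `lam` is an arbitrary illustrative choice, and no intercept is fitted to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
X = np.column_stack([x1, x1])          # second feature duplicates the first
y = 4.0 * x1 + rng.normal(scale=0.1, size=50)

gram = X.T @ X
print("rank of X^T X:", np.linalg.matrix_rank(gram))   # below 2 means singular

# Ridge closed form: adding lam * I makes the matrix invertible.
lam = 1.0  # illustrative regularization strength
w_ridge = np.linalg.solve(gram + lam * np.eye(2), X.T @ y)
print("ridge weights:", w_ridge)       # the effect is split across the twin columns
```

Ridge splits the true effect of about 4 evenly between the two identical columns, which is a stable answer even though the individual weights are no longer uniquely interpretable.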
04

Linear Regression typically sits at the end of the feature engineering pipeline, after data cleaning, encoding, scaling, and feature selection. It consumes numeric features and produces a real-valued prediction.

  • 01. Handle missing values: impute with mean/median or drop rows.
  • 02. Encode categoricals: one-hot encoding for nominal features, ordinal encoding when order matters.
  • 03. Feature scaling: StandardScaler or MinMaxScaler. Critical when using gradient descent or when comparing coefficients across features.
  • 04. Outlier treatment: Winsorize extreme values or use Huber loss variant. Linear regression is sensitive to outliers due to the squared loss.
  • 05. Check for multicollinearity: compute VIF (Variance Inflation Factor). Drop or combine features with VIF > 10.
  • 06. Feature engineering: Create polynomial features (x², x1·x2) if you suspect non-linear relationships.
  • 01. Split data: typically 80/20 or 70/30 train/test split, or use k-fold CV for small datasets.
  • 02. Fit model: call fit(X_train, y_train). scikit-learn's LinearRegression solves the least-squares problem with an SVD-based solver, which yields the same w* as the normal equations (XᵀX)⁻¹Xᵀy when XᵀX is invertible. It does not use gradient descent.
  • 03. Inspect coefficients: ensure they have expected signs and magnitudes.
  • 04. Evaluate on validation set: compute MSE, RMSE, MAE, R².
  • 05. Diagnose residuals: plot residuals vs. fitted values (look for patterns), Q-Q plot (check normality).
  • 06. Iterate: refine features based on coefficient analysis and residual diagnosis.
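The multicollinearity check in the preprocessing list can be sketched with NumPy alone. The helper `vif` below is a hypothetical function written for this example (it is not a library API): for each column it regresses that column on the others and returns 1 / (1 - R²). The data is synthetic, with column `c` built as a near copy of column `a`.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor per column: 1 / (1 - R²_j), where R²_j comes
    from regressing column j on all other columns plus an intercept."""
    n, d = X.shape
    out = np.empty(d)
    for j in range(d):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)   # nearly a copy of column a
X = np.column_stack([a, b, c])

vif_values = vif(X)
print(vif_values.round(1))   # a and c should exceed the VIF > 10 rule of thumb
```

Dropping either `a` or `c` (or combining them) would bring the remaining VIF values back near 1.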

fit_intercept

Whether to include a bias term b. Should almost always be True.

True

normalize / StandardScaler

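Not a hyperparameter of the model itself, but a preprocessing choice that affects coefficient interpretability and gradient descent convergence. The old normalize argument of scikit-learn's LinearRegression was deprecated in version 1.0 and removed in 1.2, so scale with StandardScaler instead.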

StandardScaler before fitting

  1. pip install scikit-learn numpy pandas
  2. Load and explore data (df.info(), df.describe(), correlation heatmap)
  3. Preprocess: handle NaN, encode categoricals, scale numerics
  4. Train/test split: train_test_split(X, y, test_size=0.2, random_state=42)
  5. Instantiate and fit: model = LinearRegression(); model.fit(X_train, y_train)
  6. Predict: y_pred = model.predict(X_test)
  7. Evaluate: mean_squared_error, r2_score, plot residuals
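The steps above can be sketched end to end with scikit-learn. Since no real dataset is specified, the example assumes a synthetic one (y = 3·x0 - 1.5·x1 + 2 plus noise) and skips the loading and preprocessing steps, which depend on the data at hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset: y = 3*x0 - 1.5*x1 + 2 + noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + 2.0 + rng.normal(scale=0.5, size=200)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("coefficients:", model.coef_.round(2), "intercept:", round(model.intercept_, 2))
print(f"test MSE: {mse:.3f}, RMSE: {np.sqrt(mse):.3f}, R²: {r2:.3f}")
```

The fitted coefficients should land close to the generating values (3, -1.5) and the intercept close to 2, which is a quick sanity check that the pipeline is wired correctly.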
05
06
python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


class LinearRegression:
    """Linear regression fitted by closed-form OLS or batch gradient descent."""

    def __init__(self, learning_rate=0.01, n_iterations=1000, method="ols"):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.method = method  # "ols" or "gradient_descent"
        self.weights = None
        self.bias = None
        self.loss_history = []

    def fit(self, X, y):
        n_samples, n_features = X.shape

        if self.method == "ols":
            # Closed-form: w* = (XᵀX)⁻¹ Xᵀy
            # Add bias column of ones to X
            X_b = np.c_[np.ones(n_samples), X]          # (n, d+1)
            w_full = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y
            self.bias = w_full[0]
            self.weights = w_full[1:]

        elif self.method == "gradient_descent":
            self.weights = np.zeros(n_features)
            self.bias = 0.0

            for _ in range(self.n_iter):
                y_pred = X @ self.weights + self.bias   # (n,)
                residuals = y_pred - y                  # (n,)

                # Gradients of MSE
                dw = (2 / n_samples) * X.T @ residuals  # (d,)
                db = (2 / n_samples) * residuals.sum()  # scalar

                self.weights -= self.lr * dw
                self.bias    -= self.lr * db

                # Track loss (computed from the pre-update predictions)
                mse = np.mean(residuals ** 2)
                self.loss_history.append(mse)

        else:
            raise ValueError(f"Unknown method: {self.method!r}")

        return self

    def predict(self, X):
        return X @ self.weights + self.bias

    def score(self, X, y):
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        return 1 - ss_res / ss_tot  # R²


# ── Demo ──────────────────────────────────────────────────────────────────────
np.random.seed(42)
X = np.random.randn(100, 2)                     # 100 samples, 2 features
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + 2.0 + np.random.randn(100) * 0.5

# Fix random_state so the split (and the printed numbers) are reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# OLS
model_ols = LinearRegression(method="ols").fit(X_train, y_train)
print(f"OLS weights:  {model_ols.weights.round(3)}")   # ≈ [3.0, -1.5]
print(f"OLS bias:     {model_ols.bias:.3f}")            # ≈ 2.0
print(f"OLS R²:       {model_ols.score(X_test, y_test):.4f}")

# Gradient Descent (fit the scaler on the training split only)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

model_gd = LinearRegression(method="gradient_descent", learning_rate=0.1, n_iterations=500)
model_gd.fit(X_train_s, y_train)
print(f"GD  R²:       {model_gd.score(X_test_s, y_test):.4f}")
The OLS path uses np.linalg.pinv (a pseudoinverse computed via SVD), which still returns a solution when XᵀX is singular or nearly singular. Gradient descent needs scaled features first: without scaling, features with very different ranges make the updates diverge or converge very slowly.
Example input:
X = [[1400, 3, 10], [900, 2, 25], [2100, 4, 5]]  # sqft, bedrooms, age
y = [280000, 175000, 420000]
Example output (illustrative):
Weights: [45000.2, 8200.5, -1200.3]  # per sqft, per bedroom, per year
Bias: 52000
Prediction for [1800, 3, 8]: $341,200
R² on test: 0.9312
  • np.linalg.pinv is safer than np.linalg.inv: it handles near-singular matrices via SVD.
  • Always scale features before gradient descent. Without scaling, features with large ranges dominate the gradient and a single learning rate that works for every feature becomes very hard to find.
  • model.coef_ gives you feature importance in standardized space (after StandardScaler). Larger absolute coefficient = stronger effect.
  • Scikit-learn's LinearRegression uses a LAPACK least-squares routine (dgelsd), which is numerically more stable than the raw normal equations.
  • Not scaling features before gradient descent: leads to divergence or extremely slow convergence.
  • Forgetting the intercept term (fit_intercept=False when you need it True).
  • Interpreting unscaled coefficients as feature importance: the scale of the feature distorts the coefficient.
  • Using R² alone for evaluation: a high R² can coexist with large absolute errors when y spans a wide range.
  • Not checking residual plots: if residuals form a fan shape, your model has heteroscedasticity.
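The scaling pitfall above is easy to reproduce. The sketch below runs the same plain batch gradient descent twice on synthetic housing-style data: once on raw features (square footage in the hundreds to thousands) and once on standardized features and target. The helper `gd_final_mse`, the learning rate, and the data-generating numbers are all assumptions made for this illustration.

```python
import numpy as np

def gd_final_mse(X, y, lr=0.01, n_iter=200):
    """Plain batch gradient descent on MSE; returns the last recorded loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    mse = np.inf
    with np.errstate(all="ignore"):          # silence overflow warnings on divergence
        for _ in range(n_iter):
            residuals = X @ w + b - y
            mse = np.mean(residuals ** 2)
            w -= lr * (2 / len(y)) * (X.T @ residuals)
            b -= lr * (2 / len(y)) * residuals.sum()
    return mse

rng = np.random.default_rng(0)
sqft = rng.uniform(500, 3000, size=100)      # large-range feature (illustrative)
rooms = rng.integers(1, 6, size=100).astype(float)
X = np.column_stack([sqft, rooms])
y = 150.0 * sqft + 10_000.0 * rooms + rng.normal(scale=5_000.0, size=100)

# Standardize features and target (zero mean, unit variance) by hand.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
y_scaled = (y - y.mean()) / y.std()

mse_unscaled = gd_final_mse(X, y)
mse_scaled = gd_final_mse(X_scaled, y_scaled)
print("unscaled final MSE:", mse_unscaled)   # not finite: the updates blew up
print("scaled final MSE:  ", mse_scaled)
```

A much smaller learning rate would stop the unscaled run from diverging, but it would then crawl along the low-range feature, which is the "extremely slow convergence" half of the pitfall.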
07
📊

Small Tabular Dataset (< 1K rows)

Excellent

Linear Regression shines here. OLS gives exact solution, training is instant, and the small sample size doesn't expose it to overfitting when d << n.

💡 Use cross-validation instead of a hold-out test set to maximize use of small data.
🗄️

Large Tabular Dataset (> 1M rows)

Excellent

OLS becomes expensive O(nd²) for large n — use mini-batch or stochastic gradient descent via SGDRegressor. Still fast in practice.

💡 sklearn.linear_model.SGDRegressor is the go-to for large-scale linear regression.
📉

Noisy Dataset

Context-Dependent

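Handles Gaussian noise well: under Gaussian errors OLS coincides with the maximum-likelihood estimate, and under the weaker Gauss-Markov conditions (zero-mean, uncorrelated, constant-variance errors) it is the Best Linear Unbiased Estimator (BLUE). But outliers (heavy-tailed noise) cause major distortions.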

💡 Use HuberRegressor or RANSAC for heavy-tailed noise. They're 'robust' variants of linear regression.
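A quick way to see the difference is to corrupt a few points and compare slopes. The sketch below assumes scikit-learn is installed and uses synthetic data with a true slope of 2; the ten highest-x points are shifted upward by an arbitrary amount to act as outliers.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=100)   # true slope is 2
y[-10:] += 40.0                                       # corrupt the 10 largest-x points

X = x.reshape(-1, 1)
ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)    # default epsilon; may warn about convergence

print(f"OLS slope:   {ols.coef_[0]:.2f}")
print(f"Huber slope: {huber.coef_[0]:.2f}")
```

The Huber loss grows linearly rather than quadratically for large residuals, so the corrupted points pull its slope far less than they pull the OLS slope.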
⚖️

Imbalanced Dataset

Good

Imbalance is a classification concern — linear regression targets continuous y, so this isn't directly applicable. However, outlier groups can dominate the loss.

💡 If certain y-value ranges are rare, consider weighted least squares (sample_weight parameter).
📐

High-Dimensional Dataset (d >> n)

Poor

XᵀX becomes singular, OLS solution is ill-defined. The model has infinite solutions and will overfit catastrophically.

💡 Always use Ridge (L2) or Lasso (L1) regularization when d approaches or exceeds n.
🌊

Highly Non-Linear Data

Poor

The model will underfit badly — it can only represent a hyperplane, not curves, clusters, or complex decision surfaces.

💡 Add polynomial features (PolynomialFeatures) for mild non-linearity. For strong non-linearity, switch to tree-based or neural models.
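As a sketch of the PolynomialFeatures tip, the example below fits a straight line and a degree-2 model to synthetic data generated from y = x² plus noise. It assumes scikit-learn is installed and reports training R² only, purely to show the underfit; a real comparison should use held-out data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = x ** 2 + rng.normal(scale=0.3, size=200)   # curved relationship (synthetic)
X = x.reshape(-1, 1)

linear = LinearRegression().fit(X, y)
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

r2_linear = linear.score(X, y)
r2_quadratic = quadratic.score(X, y)
print(f"straight-line R²: {r2_linear:.3f}")
print(f"degree-2 R²:      {r2_quadratic:.3f}")
```

The degree-2 model is still linear regression: it is linear in the expanded features [1, x, x²], which is why the same solver applies.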
08

Interactive: Fit Line, Residuals, MSE, and Outlier Impact

[Interactive figure. Displayed fit: ŷ = 2.45x + 1.01, MSE 4.96, MAE 1.88]

Regression Line vs. Scatter Data

Shows how the fitted line sits among training points. Residuals are the vertical distances from each point to the line.

Residuals vs. Fitted Values

A well-behaved model shows residuals randomly scattered around zero (no pattern). Fan shapes → heteroscedasticity. Curves → non-linearity.

Gradient Descent Loss Curve

MSE decreases over iterations as gradient descent converges. A healthy curve drops sharply then plateaus. Oscillation → learning rate too high.


09
  • Perfect interpretability

    Each coefficient directly tells you: 'One unit increase in feature j → wⱼ unit increase in y, holding all else constant.' No black box. Executives and regulators love this.

  • Blazing fast training

    OLS costs O(nd²) to form XᵀX plus O(d³) to solve, so training a model with 100K samples and 10 features takes milliseconds. Even gradient descent converges in seconds.

  • No hyperparameter tuning (OLS path)

    OLS has no learning rate, no epochs, no architecture decisions. Fit once, get the global optimum. Minimal engineering overhead.

  • Provably optimal under Gauss-Markov

    When assumptions hold, OLS is the Best Linear Unbiased Estimator (BLUE) — no other linear unbiased estimator has lower variance. It's the theoretical gold standard.

  • Excellent as a baseline

    Every ML project should start with linear regression. If your fancy model can't beat it, either the relationship really is close to linear or something is wrong with your pipeline.

  • Memory efficient

    The trained model stores only d+1 numbers (weights + bias). A 100-feature model weighs 808 bytes (101 float64 values × 8 bytes). Deploy anywhere.

  • Strictly linear fit

    Cannot model curves, thresholds, or XOR-style patterns without manual feature engineering. The prediction surface is always a hyperplane.

  • Sensitive to outliers

    Squared loss gives outliers quadratic influence. One extreme point can dramatically tilt the regression line. A single outlier in a small dataset can distort the fit enough to push test-set R² below zero.

  • Assumes additive effects (no interactions)

    y = w1·x1 + w2·x2 assumes x1 and x2 contribute additively. If the effect of x1 depends on x2 (interaction), vanilla linear regression misses it unless you add a product term such as x1·x2.

  • Fails with multicollinearity

    When features are correlated, coefficients become unstable and hard to interpret. Small changes in data produce huge swings in coefficient values.

  • Requires feature scaling for gradient descent

    Without StandardScaler, features on different scales make gradient descent painfully slow or divergent.
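The additive-effects limitation listed above can be demonstrated with synthetic data whose target depends only on the product x1·x2. The helper `r2_of_ols` is a hypothetical function written for this sketch; it fits OLS with an intercept via `np.linalg.lstsq` and returns training R².

```python
import numpy as np

def r2_of_ols(X, y):
    """Fit OLS with an intercept via least squares and return training R²."""
    X_b = np.column_stack([np.ones(len(y)), X])
    w, *_ = np.linalg.lstsq(X_b, y, rcond=None)
    resid = y - X_b @ w
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
y = 2.0 * x1 * x2 + rng.normal(scale=0.1, size=300)   # pure interaction effect

additive = np.column_stack([x1, x2])
with_interaction = np.column_stack([x1, x2, x1 * x2])  # hand-engineered product term

r2_additive = r2_of_ols(additive, y)
r2_interaction = r2_of_ols(with_interaction, y)
print(f"additive R²:        {r2_additive:.3f}")
print(f"with x1*x2 term R²: {r2_interaction:.3f}")
```

The model with the extra column is still linear in its inputs; the non-linearity lives entirely in the engineered feature.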

10
Real Estate

House price estimation

Features: square footage, bedrooms, location score, age. Target: price. Coefficients reveal $/sqft and $/bedroom — useful for appraisers and buyers alike.

E-Commerce

Sales forecasting from ad spend

Linear model between marketing spend and revenue. Simple, auditable, and fast to update weekly. Often outperforms complex models for short-term forecasting.

Finance

Stock return prediction (factor models)

The Fama-French 3-factor model is a linear regression of stock returns on market risk, size, and value factors. Standard in quantitative finance.

Healthcare

Drug dosage optimization

Model drug concentration as a function of dosage, weight, and age. Interpretability matters in clinical settings, where clinicians and regulators expect to see how a prediction was reached.

Manufacturing

Quality control and yield prediction

Predict product defect rate from process parameters (temperature, pressure, speed). Coefficients guide process engineers directly.

Energy

Power consumption forecasting

Utility companies model electricity demand as a linear function of temperature, time-of-day, and day-of-week. Simple models scale to national grids.

11

Linear regression is the simplest regression method. Here's how it stacks up against its common alternatives:

Ridge Regression (L2)

Same linear model, same OLS foundation

Adds an L2 penalty λ||w||² to the loss, shrinking coefficients toward zero. Stabilizes the fit under multicollinearity and when d > n.

When you have multicollinearity or many features. Always try Ridge before vanilla OLS on real datasets.

Lasso Regression (L1)

Same linear model

An L1 penalty λ||w||₁ produces sparse solutions: many weights go exactly to zero. Acts as built-in feature selection.

When you believe many features are irrelevant and want automatic feature selection.
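A small sketch of that sparsity, assuming scikit-learn is installed: the synthetic target depends on the first of five features only, and `alpha=0.1` is an arbitrary penalty strength picked for the illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)   # only feature 0 matters

lasso = Lasso(alpha=0.1).fit(X, y)   # alpha is the L1 penalty strength (illustrative)
print(lasso.coef_.round(3))
```

The four irrelevant coefficients come out exactly zero, while the relevant one is shrunk slightly below its true value of 3. That shrinkage is the price of the penalty.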

Polynomial Regression

Still uses linear regression under the hood

Adds polynomial feature terms (x², x³, x1·x2) as new columns before fitting. Allows modeling non-linear relationships.

When you see a clear polynomial trend in residuals and don't want to switch to a non-linear model.

Random Forest Regressor

Also solves regression (continuous target)

Non-linear, ensemble of trees. No assumptions about linearity. Handles interactions automatically. Not interpretable per-coefficient.

When relationships are non-linear, data is complex, and you don't need coefficient-level interpretability.

Property | Linear Reg. | Ridge | Lasso | Random Forest
Interpretable | ✓ Yes | ✓ Yes | ✓ Yes | ✗ Limited
Handles non-linearity | ✗ No | ✗ No | ✗ No | ✓ Yes
Feature selection | ✗ No | ✗ No | ✓ Yes | Partial
Handles multicollinearity | ✗ No | ✓ Yes | Partial | ✓ Yes
Training speed | ⚡ Instant | ⚡ Instant | ⚡ Fast | 🐢 Moderate
Outlier robust | ✗ No | ✗ No | ✗ No | ✓ Partial

Choose linear regression when the relationship is approximately linear (check with residual plots), you need interpretability, the dataset is clean, and you want a fast, reliable baseline.

12

R² (Coefficient of Determination)

Fraction of variance in y explained by the model. 0 = useless, 1 = perfect. Domain-dependent what's 'good' (0.7 is great for economics, poor for physics).

Target: > 0.85 for most engineering applications

RMSE (Root Mean Squared Error)

Square root of the average squared error, in the same units as y. Interpretable: if predicting house prices and RMSE=20000, a typical miss is about $20K, with large errors weighted more heavily than small ones.

Target: Domain dependent — must compare to baseline (mean predictor RMSE)

MAE (Mean Absolute Error)

Average absolute error. More robust to outliers than RMSE. Easier to explain to non-technical stakeholders.

Target: MAE is never larger than RMSE. When the two are close, there are few large outlier errors.

Adjusted R²

R² adjusted for the number of features. Adding a useless feature never lowers training R², but it can lower adjusted R². Use this when comparing models with different feature counts.

Target: Higher is better; penalizes model complexity
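Adjusted R² is commonly computed as 1 - (1 - R²)(n - 1)/(n - d - 1) for n samples and d features. The sketch below applies that formula to synthetic data, before and after padding the feature matrix with five pure-noise columns. The helper `r2_and_adjusted` is a hypothetical function written for this example.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Training R² and adjusted R² for OLS with an intercept."""
    n, d = X.shape
    X_b = np.column_stack([np.ones(n), X])
    w, *_ = np.linalg.lstsq(X_b, y, rcond=None)
    resid = y - X_b @ w
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - d - 1)
    return r2, adj

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=60)
X_padded = np.column_stack([X, rng.normal(size=(60, 5))])   # 5 pure-noise features

r2_base, adj_base = r2_and_adjusted(X, y)
r2_pad, adj_pad = r2_and_adjusted(X_padded, y)
print(f"2 features:         R²={r2_base:.3f}  adjusted={adj_base:.3f}")
print(f"2 + 5 noise feats:  R²={r2_pad:.3f}  adjusted={adj_pad:.3f}")
```

Training R² cannot go down when columns are added, so the padded model always looks at least as good on that metric; the adjusted version charges a penalty for each extra feature.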

  1. Train/test split or k-fold CV: never evaluate on training data.
  2. Compute R², RMSE, MAE on the test set.
  3. Compare RMSE to baseline: a dummy 'predict mean' model. If your RMSE isn't much better, reconsider features.
  4. Plot residuals vs. fitted values: look for random scatter (good) vs. patterns (bad).
  5. Plot Q-Q plot of residuals: check for normality if you need confidence intervals.
  6. Check for influential points: Cook's distance > 4/n flags potential outliers affecting the model heavily.
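The influential-point check in the last step can be sketched with NumPy. The helper `cooks_distance` below is a hypothetical function written for this example; it uses the standard formula D_i = (e_i² / (p·s²)) · h_ii / (1 - h_ii)², where h_ii is the leverage, p the number of parameters including the intercept, and s² the residual variance estimate. The data is synthetic, with one gross outlier planted at index 0.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for OLS with an intercept (X holds the raw features)."""
    n = len(y)
    X_b = np.column_stack([np.ones(n), X])
    p = X_b.shape[1]                          # parameters incl. intercept
    hat = X_b @ np.linalg.pinv(X_b)           # hat matrix: maps y to fitted values
    leverage = np.diag(hat)
    resid = y - hat @ y
    s2 = (resid @ resid) / (n - p)            # residual variance estimate
    return (resid ** 2 / (p * s2)) * leverage / (1.0 - leverage) ** 2

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)
x[0], y[0] = 10.0, 60.0                       # plant one gross outlier

d = cooks_distance(x.reshape(-1, 1), y)
flagged = np.where(d > 4 / len(y))[0]
print("flagged indices:", flagged)
print("largest Cook's distance at index:", int(np.argmax(d)))
```

The 4/n threshold is a rule of thumb, so flagged points deserve a look rather than automatic deletion.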
  • Never report only R² — RMSE and MAE give absolute error magnitude that R² hides.
  • R² can be high even when the model systematically misses (e.g., underestimates at high y values).
  • MSE/RMSE on imbalanced y-ranges can be misleading — consider MAPE (Mean Absolute Percentage Error) for proportional evaluation.
  • Evaluating on the training set gives an optimistic R², which can reach 1.0 for a badly overfit model.

House price model: RMSE=$18,500, MAE=$13,200, R²=0.89. Interpretation: The model explains 89% of price variance. On average, it's off by $13.2K (MAE); the higher RMSE ($18.5K) shows that some errors are considerably larger than that. For houses in the $200K–$500K range, this is solid (~4% relative error).

13
  • Confusing R² with 'accuracy': R² is not a percentage of samples correctly predicted.
  • Thinking higher R² always means better model: it can hide systematic bias.
  • Not checking assumptions: forgetting residual plots is the #1 student mistake.
  • Applying linear regression to binary targets: use Logistic Regression instead.
  • Fitting on the full dataset (including test) and reporting those metrics as 'test performance'.
  • Not scaling features before gradient descent, then wondering why training diverges.
  • Using a single train/test split on tiny datasets instead of cross-validation.
  • Ignoring multicollinearity: coefficients become garbage even if predictions are decent.
  • Saying 'linear regression is used for classification': it's regression (continuous output). Logistic regression is for classification.
  • Not knowing what R² means beyond 'how good the model is'.
  • Saying linear regression 'can't overfit': it can when d is close to n.
  • Confusing MSE and RMSE units: MSE is in y² units, RMSE is in y units.
  • Using vanilla LinearRegression when Ridge or Lasso would generalize better.
  • Not investigating why coefficients have unexpected signs (multicollinearity, confounding).
  • Assuming OLS is always better than gradient descent: for very large n (millions of rows) or data that does not fit in memory, SGDRegressor can be much faster.
  • Deploying a model without checking for data drift: linear coefficients are sensitive to distribution shifts.
14

What kind of bias does this model have?

Linear assumptions create bias when relationships are strongly non-linear.

What kind of variance does it have?

Usually lower variance than high-capacity non-linear models.

How does it overfit?

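Mainly when the number of features d approaches or exceeds the number of samples n, or after adding many polynomial or interaction terms. It shows up as strong train performance but weaker validation/test performance.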

How do we regularize it?

Use L1/L2 regularization, feature pruning, and stronger validation controls.

What kind of data does it like?

Works best with clean, informative features and stable train/serve distributions.

What kind of data breaks it?

Breaks under leakage, severe distribution drift, noisy labels, and poorly engineered features.

14

Quick Revision Reference

  • Predicts a continuous value as a weighted sum of features: ŷ = w·x + b
  • Training minimizes MSE — the average squared difference between predictions and actuals
  • OLS gives the exact analytical solution: w* = (XᵀX)⁻¹Xᵀy
  • Gradient descent is the iterative alternative — required for very large datasets
  • R² measures explained variance (1 = perfect, 0 = no better than predicting the mean)
  • Assumes linearity, independence, homoscedasticity, no perfect multicollinearity
  • Always scale features for gradient descent; check residual plots after fitting
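Prediction: ŷ = w·x + b
MSE Loss: L(w,b) = (1/n) Σᵢ (y⁽ⁱ⁾ - (w·x⁽ⁱ⁾ + b))²
OLS Solution: w* = (XᵀX)⁻¹Xᵀy
GD Update: w ← w - α·(2/n)Xᵀ(Xw - y)
R² Score: R² = 1 - SS_res / SS_tot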
  • Interpretable, auditable predictions
  • Clean numeric data with approximately linear relationships
  • Fast baselines
  • Coefficient/effect size analysis
  • Target is categorical
  • Relationship is clearly non-linear
  • d >> n (use Ridge/Lasso)
  • Severe outliers without robust preprocessing
Know OLS normal equations and their derivation
Explain R², RMSE, MAE differences
Know what happens when XᵀX is singular (and how to fix it)
Explain the Gauss-Markov theorem and BLUE
Compare OLS vs. gradient descent: when to use each
15
16

These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.