ML Atlas

Maximum Likelihood Estimation

Find the parameters that make your data most probable. The engine behind nearly every loss function in ML.

Advanced · Math Heavy
35 min read
  • Probability: PDFs, PMFs, joint distributions
  • Calculus: derivatives, chain rule, setting derivative = 0
  • Logarithm rules: log(ab) = log(a) + log(b), monotonicity
  • Basic statistics: mean, variance, Gaussian distribution
  • Familiarity with logistic regression and linear regression at an intuitive level

  • Every neural network trained with cross-entropy loss is implicitly performing MLE under a Bernoulli or Categorical distribution
  • Linear regression's mean squared error is exactly MLE under a Gaussian noise assumption — they are mathematically identical
  • Logistic regression: maximizing log-likelihood of Bernoulli labels is the training objective, solved via gradient descent
  • Language model training: GPT and all autoregressive models maximize the likelihood of observed token sequences
  • Scikit-learn's LogisticRegression, GaussianNB, and many other estimators use MLE under the hood
  • Expectation-Maximization (EM) algorithm for mixture models iteratively maximizes a lower bound on log-likelihood
01

In Plain English

MLE is a method for fitting a model to data. Given a family of probability distributions parameterized by θ, MLE finds the specific θ that assigns the highest probability to the data you actually observed. If you're fitting a Gaussian and you observed data clustered around 5, MLE will tell you to set the mean to 5 — because that's the parameter that makes the observed data most probable.

Why It Exists

We always need to fit models to data. The question is: what criterion should we use to decide which parameters are 'best'? MLE gives a principled probabilistic answer: the best parameters are those under which the observed data is most likely. This framework is elegant, general, and connects naturally to information theory and Bayesian inference. It's also what most ML loss functions secretly are.

Problem It Solves

Given n observed data points x₁, ..., xₙ assumed to be drawn i.i.d. from a distribution p(x|θ), find the parameter vector θ̂ that maximizes the probability of the observed data: θ̂_MLE = argmax_θ ∏ᵢ p(xᵢ|θ).

Real-Life Analogy

"Imagine you're a detective who finds a coin that came out heads 7 times in 10 flips. You want to estimate how biased the coin is. MLE says: try every possible bias p ∈ [0, 1] and calculate the probability of observing exactly 7 heads. The value of p that makes this probability highest is your estimate. Unsurprisingly, that's p = 0.7 — because no other value of p would make '7 heads in 10 flips' more likely than p = 0.7 does."

When To Use

  • You have a parametric model family and want to fit its parameters to data
  • Your data is plentiful — MLE is asymptotically optimal with large datasets
  • You want to derive or understand what a loss function actually represents
  • Building probabilistic models that output calibrated probabilities
  • Connecting model training to information-theoretic principles like KL divergence

When NOT To Use

  • Your dataset is very small — MLE has no built-in regularization and will overfit
  • Your model family is misspecified and doesn't match the true data-generating process
  • You have strong prior knowledge about plausible parameter values — use MAP or full Bayes instead
  • You need uncertainty estimates over θ itself — MLE gives a point estimate, not a posterior distribution
  • Class imbalance is severe — MLE will skew toward the majority class without correction
02

The fundamental idea of MLE is a reversal of perspective. Normally, if you know the parameters of a distribution, you can compute probabilities. MLE asks the inverse question: given data that you already observed, which parameters would have made this data most likely to appear? You treat the data as fixed and the parameters as the unknown, then search over parameter space to maximize the probability of what you saw.

The likelihood function L(θ) = ∏ᵢ p(xᵢ|θ) is the joint probability of all data points, viewed as a function of θ rather than a function of x. When data points are independent, the joint probability is the product of individual probabilities. This product can get astronomically small for large datasets, so we immediately convert to log-likelihood: ℓ(θ) = Σᵢ log p(xᵢ|θ). Because log is monotonically increasing, maximizing ℓ(θ) finds exactly the same θ as maximizing L(θ).
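
The underflow problem is easy to reproduce. The sketch below assumes 2000 standard-normal draws (an arbitrary illustrative choice) and compares the raw product of densities with the sum of log-densities.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=2000)   # illustrative dataset

# Raw likelihood: a product of 2000 numbers, each below 0.4
densities = stats.norm.pdf(x, loc=0.0, scale=1.0)
raw_likelihood = np.prod(densities)

# Log-likelihood: a sum of 2000 moderately sized negative numbers
log_likelihood = np.sum(stats.norm.logpdf(x, loc=0.0, scale=1.0))

print(f"Raw likelihood: {raw_likelihood}")       # underflows in float64
print(f"Log-likelihood: {log_likelihood:.2f}")   # stays finite
```

The raw product is smaller than the smallest representable float64, so it collapses to zero, while the log-likelihood remains a usable number for optimization.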

The connection to ML loss functions is not a coincidence — it is by design. Mean squared error emerges from assuming Gaussian noise in your targets. Cross-entropy loss emerges from assuming Bernoulli (binary) or Categorical (multi-class) output distributions. When you minimize a loss function in training, you are almost certainly maximizing a log-likelihood under some distributional assumption about your data. Understanding this connection lets you reason about what assumptions your model is implicitly making and whether they are appropriate.

MLE has powerful asymptotic guarantees: as n → ∞, the MLE estimator converges to the true parameter (consistency), achieves the lowest possible variance among unbiased estimators (efficiency, given by the Cramér-Rao bound), and becomes approximately Gaussian distributed around the true value (asymptotic normality). These properties make MLE the gold standard for large-data estimation. With small data, however, MLE can overfit dramatically — a coin that lands heads 3 out of 3 times gets MLE estimate p̂ = 1.0, which is almost certainly wrong.
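
A small simulation can make both halves of this paragraph concrete. The sketch below assumes a coin with true bias 0.7 and 5000 simulated datasets per sample size (both illustrative choices); it tracks the average estimation error and how often the MLE lands on the extreme value p̂ = 1.0.

```python
import numpy as np

rng = np.random.default_rng(1)
true_p = 0.7          # assumed true coin bias
n_trials = 5000       # simulated datasets per sample size

mean_abs_error = {}
frac_extreme = {}
for n in (3, 30, 300, 3000):
    heads = rng.binomial(n, true_p, size=n_trials)   # heads count in each simulated dataset
    p_hat = heads / n                                 # Bernoulli MLE for each dataset
    mean_abs_error[n] = np.mean(np.abs(p_hat - true_p))
    frac_extreme[n] = np.mean(p_hat == 1.0)
    print(f"n={n:5d}  mean |p_hat - p| = {mean_abs_error[n]:.4f}  "
          f"share with p_hat = 1.0: {frac_extreme[n]:.3f}")
```

The error shrinks as n grows (consistency), while at n = 3 roughly a third of the datasets produce the degenerate estimate p̂ = 1.0.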

The Metaphor

"MLE is like tuning a radio. The radio has a tuning knob (the parameter θ) and you're hearing a noisy signal (your data). You turn the knob across all stations and listen to which station setting makes the signal you're receiving most coherent — most likely to have produced those exact audio samples. You stop at the station that best explains what you're hearing. You're not asking what station is playing; you're asking which station setting best accounts for the sound you already heard."

Beginner Mental Model

Pick a distribution family. Imagine running the data-generating process under many different parameter values. For each θ, ask: 'If this were the true θ, how probable would my observed dataset be?' Compute that probability for every candidate θ. The winner is θ̂_MLE. In practice, you take the log, write out the sum, differentiate, set to zero, and solve — which for many distributions gives a clean closed-form formula.
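
As one concrete instance of the "take the log, differentiate, set to zero, solve" recipe, here is the derivation for a Bernoulli coin with k heads in n flips (the symbol k is introduced here for the head count):

```latex
\begin{aligned}
L(p) &= \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{k}(1-p)^{n-k},
  \qquad k = \sum_{i=1}^{n} x_i \\
\ell(p) &= k \log p + (n-k)\log(1-p) \\
\frac{d\ell}{dp} &= \frac{k}{p} - \frac{n-k}{1-p} = 0
  \;\Longrightarrow\; \hat{p} = \frac{k}{n} \\
\frac{d^{2}\ell}{dp^{2}} &= -\frac{k}{p^{2}} - \frac{n-k}{(1-p)^{2}} < 0
  \quad \text{for } 0 < p < 1
\end{aligned}
```

The negative second derivative confirms that the stationary point is a maximum, so the MLE is simply the observed frequency of heads.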

03

Given n i.i.d. observations x₁, ..., xₙ from a parametric family p(x|θ), the MLE is θ̂ = argmax_θ L(θ) = argmax_θ ∏ᵢ₌₁ⁿ p(xᵢ|θ) = argmax_θ Σᵢ₌₁ⁿ log p(xᵢ|θ). Equivalently, θ̂ = argmin_θ −(1/n) Σᵢ log p(xᵢ|θ), connecting to empirical risk minimization. The score function ∇_θ log p(xᵢ|θ) and Fisher information I(θ) = E[(∇_θ log p(x|θ))²] characterize the curvature of the likelihood.

Likelihood function L(θ)
L(θ) = ∏ᵢ p(xᵢ|θ). The joint probability of the observed data, treated as a function of the parameters θ with the data held fixed. It is NOT a probability distribution over θ — it doesn't integrate to 1 over θ space. It is a function that measures how compatible θ is with the observed data.
Log-likelihood ℓ(θ)
ℓ(θ) = log L(θ) = Σᵢ log p(xᵢ|θ). The natural log of the likelihood. Converts the product of probabilities (which underflows to zero for large n) into a sum of log-probabilities. Monotonic transformation preserves the argmax. Almost always easier to differentiate analytically.
Score function
s(θ) = ∇_θ log p(x|θ). The gradient of the log-likelihood with respect to parameters. At the MLE, the sum of scores equals zero: Σᵢ s(xᵢ; θ̂) = 0. This is the first-order optimality condition. Its expectation under the true distribution is zero: E[s(θ)] = 0.
Fisher Information I(θ)
I(θ) = E[(∇_θ log p(x|θ))²] = −E[∇²_θ log p(x|θ)]. Measures how much information a single observation carries about θ. Also equals the expected curvature of the log-likelihood. The Cramér-Rao bound states Var(θ̂) ≥ 1/(n·I(θ)), and MLE achieves this bound asymptotically.
Sufficient statistic
A function T(x) of the data is sufficient for θ if the likelihood factors as L(θ) = g(T(x), θ)·h(x) — the dependence on θ comes only through T(x). Knowing T(x) captures everything the data says about θ. For Gaussian data, (x̄, s²) are sufficient for (μ, σ²). For Bernoulli, Σxᵢ is sufficient for p.
MLE vs MAP
Maximum A Posteriori (MAP) estimation adds a prior: θ̂_MAP = argmax_θ [log p(θ) + Σᵢ log p(xᵢ|θ)]. MLE is MAP with a flat (uniform) prior. A Gaussian prior on θ corresponds to L2 regularization. A Laplace prior corresponds to L1 regularization. MAP shrinks estimates toward the prior mean.
Asymptotic normality
For large n, the MLE is approximately Gaussian: √n(θ̂ − θ*) →_d N(0, I(θ*)⁻¹), where θ* is the true parameter. This result enables confidence intervals and hypothesis tests based on MLE estimates. The approximation improves as n grows.
Consistency
θ̂_MLE → θ* in probability as n → ∞ (under regularity conditions). The estimator converges to the true parameter as data increases. This is a minimal sanity property — an estimator that doesn't converge to the truth with infinite data is useless.
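
The score and Fisher information definitions above can be checked numerically. This is a minimal sketch for a Bernoulli model with an assumed p = 0.3, where the exact Fisher information is 1/(p(1−p)); the sample size is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 0.3                                            # assumed true parameter
x = rng.binomial(1, p, size=200_000).astype(float)

# Per-observation score and second derivative of log p(x|p) = x log p + (1-x) log(1-p)
score = x / p - (1 - x) / (1 - p)
second_derivative = -x / p**2 - (1 - x) / (1 - p) ** 2

mean_score = np.mean(score)                        # should be close to 0
fisher_from_score = np.mean(score**2)              # E[(d/dp log p)^2]
fisher_from_curvature = -np.mean(second_derivative)  # -E[d^2/dp^2 log p]
fisher_exact = 1.0 / (p * (1 - p))

print(f"Mean score:              {mean_score:.4f}")
print(f"Fisher (score squared):  {fisher_from_score:.3f}")
print(f"Fisher (neg. curvature): {fisher_from_curvature:.3f}")
print(f"Fisher (exact):          {fisher_exact:.3f}")
```

Both Monte Carlo estimates should agree with the exact value, and the mean score should be near zero, matching E[s(θ)] = 0.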
  1. Step 1 — Specify a model: Choose a parametric family for your data, e.g., p(xᵢ|θ). This is the distributional assumption (Gaussian, Bernoulli, Poisson, etc.).
  2. Step 2 — Write the likelihood: Under i.i.d. assumption, L(θ) = ∏ᵢ₌₁ⁿ p(xᵢ|θ).
  3. Step 3 — Take the log: ℓ(θ) = Σᵢ log p(xᵢ|θ). Products become sums, making differentiation tractable.
  4. Step 4 — Differentiate: Compute ∂ℓ/∂θ. Set each component of the gradient to zero: ∇_θ ℓ(θ) = 0.
  5. Step 5 — Solve: If a closed-form solution exists (Gaussian, Bernoulli, Poisson), solve the resulting equations. If not (logistic regression, neural networks), use gradient ascent on ℓ(θ).
  6. Step 6 — Verify it's a maximum: Check that the second derivative (Hessian) is negative definite at the solution, confirming a maximum rather than a saddle point or minimum.
  7. Step 7 — Report and interpret: θ̂_MLE is your parameter estimate. For large n, standard errors can be computed as √(I(θ̂)⁻¹/n).
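
The seven steps can be sketched end to end for a Poisson model. The data below are simulated with an assumed rate of 4.0, and the numerical optimizer is included only to cross-check the closed-form answer λ̂ = x̄.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(3)
counts = rng.poisson(lam=4.0, size=500)   # Step 1: assume a Poisson model (simulated data)

def neg_log_likelihood(lam):
    """Steps 2-3: negative log-likelihood of the counts under Poisson(lam)."""
    return -np.sum(stats.poisson.logpmf(counts, lam))

# Steps 4-5 (closed form): d/dlam [sum(x) log(lam) - n lam] = 0  gives  lam = mean(x)
lam_closed_form = counts.mean()

# Steps 4-5 (numerical cross-check): minimize the negative log-likelihood
result = optimize.minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50.0), method="bounded")
lam_numeric = result.x

# Step 6: second derivative of the log-likelihood, -sum(x)/lam^2, must be negative
curvature = -counts.sum() / lam_closed_form**2

# Step 7: standard error sqrt(I^-1 / n) with Fisher information I(lam) = 1/lam
std_error = np.sqrt(lam_closed_form / len(counts))

print(f"Closed form: {lam_closed_form:.4f}, numerical: {lam_numeric:.4f}")
print(f"Curvature at MLE: {curvature:.2f}, standard error: {std_error:.4f}")
```

When a closed form exists, the optimizer is redundant, but the same `neg_log_likelihood` pattern carries over to models where it does not.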

A dataset of n observations {x₁, ..., xₙ}, and a choice of parametric probability distribution family p(x|θ) with parameter(s) θ.

θ̂_MLE: the parameter estimate(s) that maximize the log-likelihood. For Gaussian: (μ̂, σ̂²). For Bernoulli: p̂. For logistic regression: the weight vector ŵ.

01. Data points x₁, ..., xₙ are drawn i.i.d. from the true distribution p(x|θ*)
02. The model family p(x|θ) is correctly specified: the true distribution is actually in the family (misspecification leads to convergence to the KL-closest parameter, not the truth)
03. The likelihood is differentiable with respect to θ and the maximum is in the interior of the parameter space (otherwise you need constrained optimization)
04. Regularity conditions: the parameter space is open, the true θ* is identifiable (different θ values give different distributions), interchange of differentiation and integration is valid
  • Small n: MLE overfits. A coin with 3/3 heads gets p̂ = 1.0, ruling out tails forever.
  • Unbounded likelihood: for Gaussian mixture models, likelihood → ∞ as a component's variance → 0 and a single point sits at its mean. The MLE is degenerate.
  • Non-identifiability: multiple θ values give the same distribution. The MLE is not unique (e.g., overparameterized neural networks).
  • Misspecified model: MLE still converges, but to the θ that minimizes KL divergence from the true distribution to the model family — not to any 'true θ' per se.
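  • Numerical underflow: computing ∏ᵢ p(xᵢ|θ) directly underflows to 0 in floating point once n is large (often a few hundred to a few thousand points, depending on the per-point probabilities). Always work in log space.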
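
The unbounded-likelihood failure mode in the list above can be reproduced directly. This sketch assumes a fixed, evenly spaced toy dataset and a two-component mixture with equal weights; one component is a broad N(0, 1), the other is pinned to the first data point and its standard deviation is shrunk toward zero.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

x = np.linspace(-2.0, 2.0, 41)   # toy dataset, spacing 0.1 (illustrative)

def mixture_log_likelihood(x, spike_sigma):
    """Log-likelihood of 0.5*N(0,1) + 0.5*N(x[0], spike_sigma^2)."""
    log_broad = np.log(0.5) + stats.norm.logpdf(x, loc=0.0, scale=1.0)
    log_spike = np.log(0.5) + stats.norm.logpdf(x, loc=x[0], scale=spike_sigma)
    # logsumexp combines the two components per point without leaving log space
    return np.sum(logsumexp(np.stack([log_broad, log_spike]), axis=0))

spike_sigmas = [1e-2, 1e-4, 1e-8, 1e-16]
log_likelihoods = [mixture_log_likelihood(x, s) for s in spike_sigmas]

for s, ll in zip(spike_sigmas, log_likelihoods):
    print(f"spike sigma = {s:.0e}  log-likelihood = {ll:.2f}")
```

The log-likelihood keeps growing as the spike narrows, so there is no finite maximizer; in practice a variance floor or a prior on the variances is a common way to rule this out.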
04

MLE is the theoretical foundation of the training step. When you define a loss function and run gradient descent, you are almost always performing MLE (or MAP). Understanding MLE lets you design custom loss functions, diagnose training failures, and know what distributional assumptions your model makes.

  • 01. Ensure data is representative of the target distribution: MLE fits the distribution of training data, so biased training data yields biased estimates
  • 02. Handle missing data before computing likelihoods, as missing values require special treatment (EM algorithm or imputation)
  • 03. For continuous features, check whether the chosen parametric family (e.g., Gaussian) is approximately correct using histograms or QQ-plots
  • 04. Normalize features when using iterative MLE (gradient ascent) to ensure stable convergence
  • 05. For classification, check class balance: severe imbalance will push MLE estimates toward predicting the majority class

  • 01. For closed-form MLE (Gaussian, Bernoulli, Poisson): compute sufficient statistics from data and plug into the formula
  • 02. For iterative MLE (logistic regression, neural networks): initialize parameters, compute gradient ∇_θ ℓ(θ), update θ ← θ + α·∇_θ ℓ(θ), repeat until convergence
  • 03. Monitor log-likelihood during training: it should increase monotonically (for exact gradient ascent with a small enough step size) or on average (for stochastic methods)
  • 04. Check convergence: gradient norm ‖∇ℓ‖ < ε, or relative change in ℓ < ε
  • 05. For well-specified models with large n, the MLE should be very close to the true parameters
  1. Choose and justify your distributional assumption
  2. Write out log p(xᵢ|θ) symbolically for one data point
  3. Sum over i to get ℓ(θ)
  4. Differentiate ∂ℓ/∂θ and set to zero
  5. Solve analytically OR implement gradient ascent
  6. Validate: plug θ̂ back in, check ℓ(θ̂) is indeed high; verify on held-out data
05
06
python
import numpy as np
from scipy import stats

# ─── 1. MLE for Gaussian ─────────────────────────────────────────────────────
np.random.seed(42)
true_mu, true_sigma = 5.0, 2.0
data = np.random.normal(true_mu, true_sigma, size=200)

# Closed-form MLE
mu_hat = np.mean(data)                        # MLE mean
sigma2_hat = np.mean((data - mu_hat) ** 2)   # MLE variance (biased, divides by n)
sigma_hat = np.sqrt(sigma2_hat)

print("=== Gaussian MLE ===")
print(f"True μ: {true_mu:.3f}  |  MLE μ̂: {mu_hat:.3f}")
print(f"True σ: {true_sigma:.3f}  |  MLE σ̂: {sigma_hat:.3f}")

# Compare with scipy MLE (identical)
mu_scipy, sigma_scipy = stats.norm.fit(data)
print(f"scipy MLE μ̂: {mu_scipy:.3f}, σ̂: {sigma_scipy:.3f}")

# ─── 2. Log-likelihood as a function of μ ────────────────────────────────────
mu_grid = np.linspace(3, 7, 300)
log_likelihoods = [np.sum(stats.norm.logpdf(data, loc=m, scale=sigma_hat))
                   for m in mu_grid]

# The peak of this curve is the MLE (up to grid resolution)
peak_mu = mu_grid[np.argmax(log_likelihoods)]
print(f"\nPeak of log-likelihood curve at μ = {peak_mu:.3f}")

# ─── 3. MLE for Bernoulli ─────────────────────────────────────────────────────
true_p = 0.7
flips = np.random.binomial(1, true_p, size=50)

p_hat = np.mean(flips)   # MLE = observed frequency of 1s
print("\n=== Bernoulli MLE ===")
print(f"True p: {true_p:.2f}  |  MLE p̂: {p_hat:.3f}")

# Evaluate the log-likelihood on a grid of p values
p_grid = np.linspace(0.01, 0.99, 300)
n_heads = flips.sum()
n_total = len(flips)
log_lik_bern = n_heads * np.log(p_grid) + (n_total - n_heads) * np.log(1 - p_grid)
print(f"Log-likelihood maximized at p = {p_grid[np.argmax(log_lik_bern)]:.3f}")

# ─── 4. MLE for Logistic Regression (gradient ascent) ─────────────────────────
def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def log_likelihood(w, X, y):
    """Bernoulli log-likelihood under logistic model."""
    p = sigmoid(X @ w)
    # Clamp to avoid log(0)
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(w, X, y):
    """∇_w ℓ = Xᵀ(y − σ(Xw))"""
    p = sigmoid(X @ w)
    return X.T @ (y - p)

# Generate synthetic binary classification data
np.random.seed(0)
n_samples = 300
X_raw = np.random.randn(n_samples, 2)
true_w = np.array([2.0, -1.5])
# Sample labels from the Bernoulli model. Thresholding the probabilities at 0.5
# would make the classes perfectly separable, and then no finite MLE exists.
y = np.random.binomial(1, sigmoid(X_raw @ true_w)).astype(float)

# Add bias column
X = np.column_stack([np.ones(n_samples), X_raw])
true_w_full = np.array([0.0, 2.0, -1.5])

# Gradient ascent (maximizing log-likelihood)
w = np.zeros(3)
lr = 0.01
history = []

for iteration in range(500):
    ll = log_likelihood(w, X, y)
    history.append(ll)
    w += lr * gradient(w, X, y)

final_ll = log_likelihood(w, X, y)   # log-likelihood after the last update
print("\n=== Logistic Regression MLE ===")
print(f"Final log-likelihood: {final_ll:.3f}")
print(f"MLE weights: {w}")
print(f"True weights: {true_w_full}")

# ─── 5. Connection: MSE = Gaussian NLL ────────────────────────────────────────
# For regression with Gaussian noise y = wᵀx + ε, ε ~ N(0, σ²):
# NLL = const + (1 / (2σ²)) Σ(yᵢ - wᵀxᵢ)², and Σ(yᵢ - wᵀxᵢ)² = n × MSE
# So minimizing MSE over w is exactly MLE under the Gaussian noise assumption

X_reg = np.random.randn(100, 1)
y_reg = 3 * X_reg.squeeze() + np.random.randn(100)

# MLE via least squares
X_aug = np.column_stack([np.ones(100), X_reg])
w_mle = np.linalg.lstsq(X_aug, y_reg, rcond=None)[0]
print("\n=== Linear Regression MLE (= Least Squares) ===")
print(f"MLE weights: intercept={w_mle[0]:.3f}, slope={w_mle[1]:.3f} (true slope=3.0)")

# ─── 6. MLE vs MAP with Gaussian Prior (= L2 regularization) ──────────────────
def map_estimate(X, y, lam):
    """MAP with zero-mean Gaussian prior: (XᵀX + λI)⁻¹ Xᵀy (also shrinks the intercept)."""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_feat), X.T @ y)

w_mle_lr = map_estimate(X_aug, y_reg, lam=0.0)   # λ=0 → MLE
w_map_lr = map_estimate(X_aug, y_reg, lam=10.0)  # λ>0 → MAP (ridge)
print(f"MLE:  {w_mle_lr}")
print(f"MAP:  {w_map_lr}  (shrunk toward zero by prior)")

Shows MLE in four scenarios: closed-form Gaussian estimation (mean and variance), Bernoulli estimation (coin bias), iterative gradient ascent for logistic regression (no closed form), and the MSE = Gaussian-NLL connection. Also demonstrates L2 regularization as MAP estimation under a Gaussian prior.
data = [4.8, 5.2, 3.9, 6.1, 5.0, 4.5, 5.7, 4.3, 5.5, 5.1]  # n=10 observations
Gaussian MLE: μ̂ = 5.01, σ̂ = 0.632 (MLE, divides by n=10)
Sample std: 0.666 (unbiased variance estimator, divides by n-1=9)
Log-likelihood at MLE: -9.59
Log-likelihood at μ=4.0 (σ held at σ̂): -22.38  (worse)
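
The quantities in this worked example can be recomputed in a few lines. The sketch below uses the ten observations listed above and holds σ at its MLE value when evaluating the alternative mean μ = 4.0 (that choice is an assumption of this sketch).

```python
import numpy as np
from scipy import stats

data = np.array([4.8, 5.2, 3.9, 6.1, 5.0, 4.5, 5.7, 4.3, 5.5, 5.1])

mu_hat = data.mean()
sigma_mle = data.std(ddof=0)        # divides by n (the MLE)
sigma_unbiased = data.std(ddof=1)   # divides by n - 1

# Log-likelihood at the MLE and at a deliberately wrong mean
ll_at_mle = stats.norm.logpdf(data, loc=mu_hat, scale=sigma_mle).sum()
ll_at_mu4 = stats.norm.logpdf(data, loc=4.0, scale=sigma_mle).sum()

print(f"mu_hat = {mu_hat:.2f}, sigma (MLE) = {sigma_mle:.3f}, sigma (n-1) = {sigma_unbiased:.3f}")
print(f"log-likelihood at MLE: {ll_at_mle:.2f}, at mu = 4.0: {ll_at_mu4:.2f}")
```

Any mean other than the sample mean gives a strictly lower log-likelihood for the same σ, which is what the last two lines of output illustrate.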
  • MSE and cross-entropy are not arbitrary choices — they follow necessarily from assuming Gaussian and Bernoulli/Categorical noise respectively. Choosing a different noise model gives a different loss function.
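  • Always work in log-likelihood space. Multiplying raw probabilities underflows to exactly 0 once n is large (often a few hundred to a few thousand points, depending on how small each probability is), so sum log-probabilities instead.
  • For logistic regression, the log-likelihood is globally concave (the Hessian is negative semi-definite), so gradient ascent on ℓ (equivalently, gradient descent on the negative log-likelihood) with a suitable step size converges to the global optimum whenever a finite maximizer exists, i.e. when the classes are not perfectly separable.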
  • scikit-learn's LogisticRegression defaults to C=1 (L2 regularization), which is MAP, not pure MLE. Set C=1e9 to approximate MLE.
  • scipy.stats distribution .fit() methods perform MLE — they're a fast way to fit standard parametric distributions.
  • The MLE for the Gaussian variance divides by n, not n−1. The n−1 version (unbiased sample variance) is NOT the MLE — knowing this difference matters in interviews.
  • Computing the product of probabilities directly for large n — always use log-likelihood (sum of log-probabilities)
  • Confusing MLE variance (1/n) with unbiased sample variance (1/(n-1)) — they are different; MLE divides by n
  • Assuming MLE always has a closed form — logistic regression, neural networks, and many other models require iterative optimization
  • Forgetting that sklearn's LogisticRegression is regularized by default (C=1) — you must set C very large to get pure MLE
  • Treating the likelihood function as a probability distribution over θ — it is not; it doesn't integrate to 1 over θ
  • Applying MLE to tiny datasets without regularization — p̂ = 1.0 for a coin that landed heads 3/3 times is MLE but a terrible estimate
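
The last pitfall is easy to quantify. This is a minimal sketch comparing the MLE with a MAP estimate for the 3/3 coin, assuming a Beta(2, 2) prior (a hypothetical choice that mildly favors p near 0.5); the grid search is only a cross-check of the closed-form posterior mode.

```python
import numpy as np

n_flips, n_heads = 3, 3

# MLE: observed frequency
p_mle = n_heads / n_flips

# MAP under a Beta(alpha, beta) prior: mode of Beta(n_heads + alpha, n_flips - n_heads + beta)
alpha, beta = 2.0, 2.0   # hypothetical prior strength
p_map = (n_heads + alpha - 1) / (n_flips + alpha + beta - 2)

# Cross-check: maximize log prior + log likelihood on a grid
p_grid = np.linspace(0.001, 0.999, 999)
log_posterior = ((n_heads + alpha - 1) * np.log(p_grid)
                 + (n_flips - n_heads + beta - 1) * np.log(1 - p_grid))
p_map_grid = p_grid[np.argmax(log_posterior)]

print(f"MLE: {p_mle:.3f}  MAP: {p_map:.3f}  MAP (grid): {p_map_grid:.3f}")
```

The prior acts like one extra head and one extra tail, which pulls the estimate away from the boundary without overriding the data.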
07
📊

Large tabular datasets

Excellent

MLE is optimal for large samples. Asymptotic efficiency guarantees near-optimal estimates.

💡 Use logistic regression MLE for classification, Gaussian MLE for continuous targets.
🔬

Small datasets (n < 100)

Poor

MLE overfits without regularization. A coin that lands heads 3/3 gets p̂=1.0.

💡 Use MAP (regularized MLE) or full Bayesian inference with informative priors.
📝

Text / NLP

Good

Cross-entropy loss (= MLE under Categorical) is the standard training objective for language models.

💡 Laplace smoothing (add-1) is MAP regularization to avoid zero-probability unseen words.
⚖️

Imbalanced classification

Poor

MLE estimates reflect training class frequencies. On 99/1 imbalanced data, MLE skews strongly toward the majority class.

💡 Weight the loss by inverse class frequency, or use class_weight='balanced' in sklearn.
🔔

Continuous features (Gaussian model)

Excellent

MLE for Gaussian gives exact closed-form estimators (mean and variance) in O(n) time.

💡 Verify Gaussian assumption with a histogram or QQ-plot before trusting the fit.
⚠️

Misspecified model

Context-Dependent

MLE still converges, but to the parameter minimizing KL divergence from truth to model, not the 'true' parameter.

💡 Always validate your distributional assumption. Misspecification can cause systematic bias.
08

Log-Likelihood Surface for Gaussian Parameters

A 2D contour plot of ℓ(μ, σ²) over a grid of parameter values. The peak of the surface is at (x̄, σ̂²_MLE). The surface has a single stationary point, which is the global maximum.

The log-likelihood for a Gaussian is a parabola in μ for fixed σ² (always strictly concave) and a one-sided curve in σ² (maximum at the MLE variance). It is not jointly concave in (μ, σ²), but the joint maximum is still unique.

Likelihood vs Parameter: Bernoulli Coin

Plot of L(p) = p^k · (1−p)^(n−k) as a function of p ∈ (0,1), for k=7 heads in n=10 flips. The peak is at p=0.7, illustrating how MLE finds the most compatible parameter.
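
The curve described here can be computed without any plotting library. The sketch below evaluates the probability of exactly 7 heads in 10 flips on a grid of candidate biases; the grid resolution of 0.01 is an arbitrary choice.

```python
import numpy as np
from scipy import stats

n_flips, n_heads = 10, 7
p_grid = np.linspace(0.0, 1.0, 101)   # candidate biases 0.00, 0.01, ..., 1.00

# Probability of exactly 7 heads in 10 flips under each candidate bias
likelihood = stats.binom.pmf(n_heads, n_flips, p_grid)

p_hat = p_grid[np.argmax(likelihood)]
print(f"Most likely bias: {p_hat:.2f}")
print(f"P(7 heads | p = 0.7) = {likelihood.max():.4f}")
print(f"P(7 heads | p = 0.5) = {stats.binom.pmf(n_heads, n_flips, 0.5):.4f}")
```

The binomial coefficient included by `binom.pmf` does not depend on p, so it rescales the curve without moving its peak.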

Log-Likelihood Increasing During Training (Logistic Regression)

Training curve showing log-likelihood vs gradient ascent iterations for logistic regression. Starts at a large negative value, increases, and plateaus at the MLE.

MLE vs MAP: Effect of Prior Strength

Comparison of MLE and MAP estimates as a function of prior strength (λ). MLE is constant (no prior). MAP shrinks toward zero as λ increases. With large n, both converge to the truth.

As regularization strength (λ = 1/C in sklearn) increases, MAP estimates are pulled toward zero. MLE is the λ=0 special case.
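
The shrinkage effect can be sketched with ridge regression, where the MAP estimate under a zero-mean Gaussian prior has a closed form. The data, true weights, and λ values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])              # assumed ground truth
y = X @ true_w + rng.normal(scale=1.0, size=50)

def map_weights(X, y, lam):
    """MAP under a zero-mean Gaussian prior (ridge): (X^T X + lam*I)^-1 X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
weights = [map_weights(X, y, lam) for lam in lambdas]
norms = [np.linalg.norm(w) for w in weights]

for lam, w, norm in zip(lambdas, weights, norms):
    print(f"lambda = {lam:7.1f}  weights = {np.round(w, 3)}  norm = {norm:.3f}")
```

At λ = 0 the result coincides with ordinary least squares (the MLE); as λ grows, the weight norm decreases toward zero.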
09
  • Principled and general

    MLE works for any parametric distribution family — Gaussian, Bernoulli, Poisson, exponential, mixture models, etc. It provides a unified framework rather than an ad-hoc fitting rule.

  • Asymptotically optimal

    For large datasets, MLE achieves the Cramér-Rao lower bound: no unbiased estimator can have lower variance. You cannot do better asymptotically.

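  • Closed form for many exponential families

    For Gaussian, Bernoulli, Poisson, Exponential, and several other exponential family distributions, MLE yields exact formulas expressible as simple functions of sufficient statistics. O(n) computation. Some members (for example, the shape parameter of a Gamma distribution) still require a numerical solve.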

  • Reveals the meaning of loss functions

    Understanding MLE explains why MSE and cross-entropy are the right choices for regression and classification respectively. It removes the 'magic' from loss function selection.

  • Invariance property

    If θ̂ is the MLE of θ, then g(θ̂) is the MLE of g(θ) for any function g. This makes transformations natural — you don't need to re-derive the MLE for each parameterization.

  • Foundation for model comparison

    Log-likelihood enables likelihood ratio tests, AIC/BIC model selection criteria, and cross-validated predictive performance comparisons — all built on the MLE foundation.

  • Overfits on small datasets

    MLE has no built-in regularization. With small n, estimates can be extreme (p̂=1.0 for 3/3 heads). MAP (MLE + prior) or full Bayesian inference is preferable for small data.

  • Sensitive to model misspecification

    If the true distribution is not in your assumed model family, MLE converges to the 'wrong' parameter (the KL-minimizing one). Fitting a Gaussian to heavy-tailed data gives misleading estimates.

  • No uncertainty quantification over parameters

    MLE gives a point estimate θ̂, not a posterior distribution p(θ|data). You don't know how confident to be in the estimate. Confidence intervals require additional asymptotic approximations.

  • Degeneracies in mixture models

    Gaussian mixture model likelihood is unbounded — it blows up as a component's variance → 0 with one point at its center. MLE is not well-defined without constraints.

  • Ignores class imbalance

    Pure MLE reflects the training distribution. On 99/1 imbalanced data, predictions are dominated by the majority class. Requires explicit reweighting to correct.

  • Requires choosing a distributional family

    You must commit to a parametric family upfront. If the family is wrong, MLE cannot correct for it. Non-parametric methods avoid this commitment at the cost of higher variance.

10
Natural Language Processing

Language model training

GPT-style autoregressive models are trained by maximizing the likelihood of observed token sequences; masked models such as BERT maximize the likelihood of masked tokens given their context. In both cases the cross-entropy loss IS the negative log-likelihood under a Categorical distribution.

Finance

Risk model calibration

Fitting parametric distributions (log-normal, t-distribution) to asset return data using MLE to calibrate Value at Risk (VaR) and other risk metrics.

Healthcare

Survival analysis

MLE fits Weibull or exponential models to time-to-event data, and Cox proportional hazards models are fit by maximizing a partial likelihood. Censoring is handled naturally through the likelihood formulation.

Computer Vision

Generative models

Variational Autoencoders (VAEs) maximize a lower bound on log-likelihood (the ELBO). Normalizing flows maximize exact likelihood. Both training criteria are likelihood-based: exact MLE for flows, an approximation to it for VAEs.

Telecommunications

Signal parameter estimation

MLE estimates signal frequency, amplitude, and noise variance from received samples. The MLE is the standard estimator in radar, sonar, and communications systems.

Biology / Genetics

Phylogenetic tree inference

Maximum likelihood phylogenetics (e.g., RAxML, IQ-TREE) fits evolutionary models to DNA sequence alignment data using MLE, producing the most likely evolutionary tree.

11

MLE is a point estimation method. It sits in a spectrum from pure data-driven (MLE) through prior-informed (MAP) to fully probabilistic (Bayesian). It also contrasts with non-parametric methods that avoid distributional assumptions entirely.

MAP (Maximum A Posteriori)

Also maximizes an objective derived from Bayes theorem; uses the same data likelihood

Adds log p(θ) (log prior) to the objective. Equivalent to regularized MLE. Gives a point estimate like MLE but is pulled toward the prior.

When you have prior knowledge or small data and want regularization without full Bayesian inference.

Full Bayesian Inference

Uses the same likelihood function p(data|θ); MLE is the mode of the posterior with flat prior

Computes the full posterior p(θ|data) rather than a point estimate. Provides calibrated uncertainty. Requires specifying a prior and often intractable integrals (solved by MCMC or VI).

When you need uncertainty estimates, have small data, or want to propagate uncertainty through predictions.

Method of Moments

Also produces point estimates for parametric models

Matches sample moments (mean, variance) to theoretical moments rather than maximizing likelihood. Computationally simpler but asymptotically less efficient than MLE.

Quick baseline estimates or when the likelihood is intractable but moments are easily computed.
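
To see the two estimators side by side, here is a minimal sketch on simulated Gamma data (shape 3, scale 2, both assumed for illustration). The Gamma shape has no closed-form MLE, so `scipy.stats.gamma.fit` with the location fixed at 0 is used for the likelihood-based fit.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.gamma(shape=3.0, scale=2.0, size=5000)

# Method of moments: mean = k * theta, variance = k * theta^2
mean, var = x.mean(), x.var()
k_mom = mean**2 / var
theta_mom = var / mean

# Maximum likelihood, with the location parameter fixed at 0
k_mle, loc, theta_mle = stats.gamma.fit(x, floc=0)

# Compare the log-likelihood each estimate achieves on the same data
ll_mom = stats.gamma.logpdf(x, k_mom, loc=0, scale=theta_mom).sum()
ll_mle = stats.gamma.logpdf(x, k_mle, loc=0, scale=theta_mle).sum()

print(f"MoM: shape = {k_mom:.3f}, scale = {theta_mom:.3f}, log-lik = {ll_mom:.2f}")
print(f"MLE: shape = {k_mle:.3f}, scale = {theta_mle:.3f}, log-lik = {ll_mle:.2f}")
```

The moment estimates take two lines of arithmetic, while the MLE needs a numerical solve but by construction attains at least as high a log-likelihood.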

Least Squares (OLS)

Identical to MLE for linear regression under Gaussian noise — they produce the same estimates

OLS minimizes residual sum of squares without a probabilistic framing. MLE shows WHY minimizing squared errors is the right thing to do (Gaussian noise assumption).

OLS is fine for linear regression; MLE framing is needed when you want probabilities or non-Gaussian noise.

Criterion                   MLE                MAP                Bayesian                  Least Squares
Output                      Point estimate θ̂   Point estimate θ̂   Full posterior p(θ|data)  Point estimate θ̂
Regularization              None               Yes (via prior)    Yes (via prior)           Optional (ridge/lasso)
Small data behavior         Overfits           Better (shrinks)   Best (uncertainty)        Overfits
Uncertainty quantification  Approximate (CI)   Approximate (CI)   Exact (posterior)         Approximate (CI)
Computational cost          Low-Medium         Low-Medium         High (MCMC/VI)            Low
Distributional assumption   Required           Required + prior   Required + prior          Implicitly Gaussian

You have abundant data, a well-specified parametric model, no strong prior knowledge, and need a computationally efficient point estimate with theoretical optimality guarantees.

12

Log-Likelihood

Higher (less negative) is better. Measures how well the fitted model explains the training data. Can overfit — evaluate on held-out data.

Target: Compare across models on the same dataset — no absolute scale. AIC = −2ℓ + 2k penalizes complexity.

AIC (Akaike Information Criterion)

AIC = 2k − 2ℓ(θ̂). Model selection criterion balancing fit and complexity. k = number of parameters. Lower AIC is better. Discourages overfitting by penalizing extra parameters.

Target: Used comparatively. ΔAIC < 2 is negligible difference; ΔAIC > 10 is strong evidence for the better model.

BIC (Bayesian Information Criterion)

BIC = k·log(n) − 2ℓ(θ̂). Like AIC but with a stronger penalty per parameter (log(n) vs 2). More conservative: prefers simpler models, especially for large n.

Target: Prefer BIC when n is large and you want sparser models; prefer AIC when prediction accuracy matters most.

Likelihood Ratio Test Statistic

LR = 2(ℓ_full − ℓ_null), asymptotically χ² distributed with df degrees of freedom under the null. Tests whether a full model fits significantly better than a restricted (null) model. df = difference in number of parameters. Large LR → reject the null model.

Target: p < 0.05 or p < 0.01 depending on application. Requires the null model to be nested in the full model.
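
A likelihood ratio test can be sketched on a coin example. The numbers below (60 heads in 100 flips) are hypothetical; the null model fixes p = 0.5 and the full model lets p be free, so the models are nested with one parameter of difference.

```python
import numpy as np
from scipy import stats

n_flips, n_heads = 100, 60   # hypothetical data

def bernoulli_log_likelihood(p):
    """Log-likelihood of the observed flips under bias p."""
    return n_heads * np.log(p) + (n_flips - n_heads) * np.log(1 - p)

ll_null = bernoulli_log_likelihood(0.5)                 # restricted model: fair coin
ll_full = bernoulli_log_likelihood(n_heads / n_flips)   # full model at its MLE

lr_stat = 2 * (ll_full - ll_null)
p_value = stats.chi2.sf(lr_stat, df=1)                  # one extra free parameter

print(f"LR statistic: {lr_stat:.3f}, p-value: {p_value:.4f}")
```

With these numbers the statistic sits just above the 5% critical value of the χ²(1) distribution, so the fair-coin hypothesis would be rejected at 0.05 but not at 0.01.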

Perplexity (NLP)

Perplexity = exp(−(1/N) Σᵢ log p(tokenᵢ | context)), i.e. the geometric mean of inverse per-token probability. Lower perplexity = model is less 'surprised' by data. Standard metric for language model evaluation.

Target: Depends heavily on vocabulary size, tokenization, and domain, so compare perplexities only between models evaluated on the same data with the same tokenizer. Lower is better.
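
The link between perplexity and log-likelihood fits in a few lines. The token probabilities below are made-up values standing in for what a model assigned to each observed token.

```python
import numpy as np

# Hypothetical probabilities a model assigned to the four observed tokens
token_probs = np.array([0.25, 0.5, 0.125, 0.5])

# Perplexity as the exponential of the average negative log-likelihood
avg_nll = -np.mean(np.log(token_probs))
perplexity = np.exp(avg_nll)

# Equivalent form: geometric mean of the inverse probabilities
geometric_mean_inverse = np.prod(1.0 / token_probs) ** (1.0 / len(token_probs))

print(f"Average NLL per token: {avg_nll:.4f}")
print(f"Perplexity: {perplexity:.4f}")
print(f"Geometric mean of 1/p: {geometric_mean_inverse:.4f}")
```

Because perplexity is a monotone function of the average log-likelihood, minimizing cross-entropy and minimizing perplexity select the same model.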

  1. Fit the MLE parameters on training data
  2. Evaluate log-likelihood on a held-out test set (not training data!), because training log-likelihood typically overestimates generalization
  3. Compute AIC/BIC if comparing multiple model families of different complexity
  4. Perform likelihood ratio tests for nested model comparisons
  5. Check residuals or posterior predictive samples to validate the distributional assumption
  6. For classification models (logistic regression), additionally evaluate accuracy, AUC-ROC, and calibration (reliability diagrams)
  • Never compare log-likelihoods across different dataset sizes — they are not comparable (log-likelihood scales with n)
  • Evaluating on training data gives optimistically biased log-likelihood — always use held-out data or cross-validation
  • A higher log-likelihood doesn't mean the model assumption is correct — a misspecified model can have high likelihood by chance
  • Likelihood ratio tests are only valid for nested models — the LR statistic does not have a chi-squared distribution for non-nested models

You fit a Gaussian model to 1000 data points and get a maximized log-likelihood ℓ(θ̂) = −2103 (AIC uses the log-likelihood of the data the model was fit to, not a test set). A competing Student-t model gives ℓ = −2085. The t-model has 1 extra parameter (degrees of freedom), so k = 3 versus k = 2 for the Gaussian. AIC comparison, using AIC = 2k − 2ℓ: Gaussian AIC = 4 + 4206 = 4210, t-model AIC = 6 + 4170 = 4176. The t-model wins by ΔAIC = 34, which is overwhelming evidence for the heavier-tailed model. Practical interpretation: your data has more extreme outliers than a Gaussian predicts.

13
  • ×Computing ∏ p(xᵢ|θ) numerically for large n and getting 0 due to underflow — always sum log-probabilities instead
  • ×Thinking the likelihood L(θ) is a probability distribution over θ — it is not (it doesn't integrate to 1 over θ)
  • ×Confusing MLE variance (divides by n) with unbiased sample variance (divides by n-1) — they are different estimators with different properties
  • ×Assuming MLE always has a closed form — most interesting models (logistic regression, neural networks) require iterative optimization
  • ×Forgetting to check whether the stationary point of ℓ(θ) is a maximum, not a minimum or saddle point
  • ×Using sklearn's LogisticRegression without realizing it defaults to C=1 (L2-regularized MAP), not pure MLE
  • ×Not working in log-space when implementing likelihood functions — numerical underflow kills probabilities silently
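  • ×Failing to guard the log when implementing gradient ascent on likelihood: log(p) → −∞ as p → 0, so clip probabilities away from 0 and 1 or use a numerically stable log-sigmoid/log-softmax (gradient clipping alone does not fix this)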
  • ×Applying MLE without considering whether the distributional assumption fits the data (e.g., using Gaussian when data is heavily skewed)
  • ×Not checking for class imbalance before MLE training — the loss will be dominated by the majority class
  • ×Claiming cross-entropy 'comes from information theory' without connecting it to Bernoulli/Categorical MLE — both explanations are valid, but interviewers often want the probabilistic derivation
  • ×Saying MLE variance is unbiased — it's not (it divides by n, not n-1). The biased nature is a common interview trap
  • ×Not being able to state what distributional assumption underlies MSE (Gaussian noise) or cross-entropy (Bernoulli/Categorical)
  • ×Confusing consistency (convergence to true value) with unbiasedness (zero expected error for any n) — MLE variance is consistent but biased
  • ×Unable to explain why we take the log — the product underflows AND differentiating sums is much easier than products
  • ×Ignoring model misspecification: fitting a Gaussian to log-normal data and reporting mean/variance as if they're meaningful
  • ×Using pure MLE (e.g., C=1e9 in sklearn) on small or noisy datasets without regularization: overfitting can be severe, and on linearly separable data the logistic regression weights grow without bound
  • ×Not validating distributional assumptions with goodness-of-fit tests or visual checks before using MLE estimates for downstream decisions
  • ×Forgetting that likelihood ratio tests require NESTED models — comparing Gaussian vs Gamma with LRT gives invalid p-values
  • ×Treating MLE point estimates as certain in downstream computations without propagating estimation uncertainty
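The underflow mistake in the list above is easy to reproduce. The sketch below uses 200 hypothetical observations that each have probability 0.01 and compares the raw product with the sum of logs. The `safe_log` helper and its `eps` value are assumptions for illustration, showing one simple way to keep log(p) finite near p = 0.

```python
import math

probs = [0.01] * 200  # hypothetical per-observation probabilities

# Naive likelihood: the true value is 1e-400, far below the smallest float.
naive_likelihood = math.prod(probs)

# Log-likelihood: a sum of moderate numbers, no underflow.
log_likelihood = sum(math.log(p) for p in probs)

def safe_log(p: float, eps: float = 1e-12) -> float:
    """log(p) with p clipped into [eps, 1] so the result stays finite."""
    return math.log(min(max(p, eps), 1.0))

print(naive_likelihood)        # underflows to zero
print(log_likelihood)          # finite
print(safe_log(0.0))           # finite instead of a math domain error
```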
14

What kind of bias does this model have?

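For MLE, bias comes from two sources: a misspecified distributional family (e.g., assuming Gaussian noise on heavy-tailed data) and finite-sample bias in the estimator itself (e.g., σ̂² divides by n and underestimates the true variance). The finite-sample bias vanishes as n grows; misspecification bias does not.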

What kind of variance does it have?

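Estimator variance shrinks roughly as 1/n and is governed asymptotically by the inverse Fisher information. It grows with the number of free parameters relative to n, and when no prior or penalty constrains them.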

How does it overfit?

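MLE maximizes the likelihood of the training data only, so with few samples or many parameters it fits noise: training log-likelihood keeps rising while held-out log-likelihood stalls or falls. Extreme cases include a variance estimate collapsing toward zero and logistic regression weights diverging on separable data.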

How do we regularize it?

Add a prior and use MAP instead (Gaussian prior → L2, Laplace prior → L1), stop training early, or choose model complexity with held-out log-likelihood, AIC, or BIC.

What kind of data does it like?

Prefers large, independent, representative samples that actually follow the assumed distribution.

What kind of data breaks it?

Breaks under small samples, outliers combined with a light-tailed assumption, non-i.i.d. or drifting data, noisy labels, and any data-generating process far from the assumed family.

14

Quick Revision Reference

  • MLE finds θ that maximizes the probability of observed data: θ̂ = argmax_θ Σᵢ log p(xᵢ|θ)
  • Always work in log-space: products → sums, prevents underflow, simplifies differentiation
  • Gaussian MLE: μ̂ = x̄, σ̂² = (1/n)Σ(xᵢ-x̄)² — note 1/n not 1/(n-1), so σ̂² is biased
  • Bernoulli MLE: p̂ = (number of 1s) / n — the observed frequency
  • MSE = negative log-likelihood under Gaussian noise; cross-entropy = under Bernoulli/Categorical
  • MLE has no closed form for logistic regression — gradient ascent on the concave log-likelihood
  • MAP = MLE + log prior. Gaussian prior → L2 regularization. Laplace prior → L1 regularization
  • Asymptotic properties: consistent (θ̂ → θ*), efficient (achieves Cramér-Rao bound), asymptotically normal
  • Likelihood is a function of θ, NOT a probability distribution over θ
  • Small data: MLE tends to overfit. Large data: MLE is asymptotically efficient when the model is correctly specified. For small n, prefer MAP or another form of regularization
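MLE objective: θ̂ = argmax_θ Σᵢ log p(xᵢ|θ)
Gaussian MLE mean: μ̂ = x̄ = (1/n) Σᵢ xᵢ
Gaussian MLE variance (biased): σ̂² = (1/n) Σᵢ (xᵢ − x̄)²
Bernoulli MLE: p̂ = (1/n) Σᵢ xᵢ
NLL = Loss: −ℓ(θ) = −Σᵢ log p(xᵢ|θ)
Logistic regression gradient: ∇_w ℓ(w) = Σᵢ (yᵢ − σ(wᵀxᵢ)) xᵢ
MAP objective: θ̂_MAP = argmax_θ [Σᵢ log p(xᵢ|θ) + log p(θ)]
Cramér-Rao bound: Var(θ̂) ≥ 1 / (n·I(θ)) for unbiased θ̂, where I(θ) is the per-sample Fisher information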
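To make the "Gaussian MLE variance (biased)" entry concrete, here is a short sketch comparing the two divisors on a small made-up sample. It relies on Python's `statistics` module, where `pvariance` divides by n (the MLE) and `variance` divides by n − 1 (the unbiased estimator).

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample, mean = 5
n = len(data)

mle_var = statistics.pvariance(data)       # divides by n     -> Gaussian MLE
unbiased_var = statistics.variance(data)   # divides by n - 1 -> unbiased

# The two differ only by the factor (n - 1) / n, which tends to 1 as n grows.
ratio = mle_var / unbiased_var

print(mle_var, unbiased_var, ratio)
```

The MLE is always the smaller of the two, and the gap closes as n grows, which is why the estimator is biased but still consistent.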
Use MLE when:
  • Large datasets where asymptotic efficiency matters
  • Deriving and understanding loss functions from first principles
  • Fitting standard parametric distributions (Gaussian, Bernoulli, Poisson)
  • Any model where a distributional assumption is justified by domain knowledge
  • Foundation for gradient-based training of probabilistic models
Avoid plain MLE when:
  • Small datasets without regularization
  • Model family is clearly misspecified
  • You need full uncertainty over parameters (use Bayesian inference)
  • Severe class imbalance without reweighting
MSE = Gaussian MLE; Cross-entropy = Bernoulli/Categorical MLE
Why log-likelihood: products underflow, sums are easier to differentiate, monotone so same argmax
MLE Gaussian variance divides by n (biased), unbiased sample variance divides by n-1
MAP = MLE + prior; L2 regularization = Gaussian prior; L1 = Laplace prior
Logistic regression log-likelihood is globally concave → gradient ascent finds global optimum
MLE is consistent, asymptotically efficient, asymptotically normal
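The concavity point can be illustrated with a bare-bones gradient ascent for one-feature logistic regression. This is a sketch under stated assumptions, not a production implementation: the toy data, learning rate, and step count are arbitrary choices, and the update uses the gradient Σᵢ (yᵢ − σ(w·xᵢ + b))·xᵢ averaged over the sample.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_lik(w: float, b: float, xs: list[float], ys: list[int]) -> float:
    """Bernoulli log-likelihood: sum of y*z - log(1 + e^z), computed stably."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = w * x + b
        softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        total += y * z - softplus
    return total

def fit(xs: list[float], ys: list[int], lr: float = 0.1, steps: int = 5000):
    """Plain gradient ascent on the mean log-likelihood, starting from zero."""
    w, b, n = 0.0, 0.0, len(xs)
    for _ in range(steps):
        errors = [y - sigmoid(w * x + b) for x, y in zip(xs, ys)]
        w += lr * sum(e * x for e, x in zip(errors, xs)) / n
        b += lr * sum(errors) / n
    return w, b

# Toy data (not linearly separable, so the MLE is finite).
xs = [0.5, 1.5, 2.0, 3.0, 3.5, 4.5]
ys = [0, 0, 1, 0, 1, 1]

ll_start = log_lik(0.0, 0.0, xs, ys)
w_hat, b_hat = fit(xs, ys)
ll_final = log_lik(w_hat, b_hat, xs, ys)
print(f"w={w_hat:.3f} b={b_hat:.3f} ll: {ll_start:.3f} -> {ll_final:.3f}")
```

Because the objective is concave, any sufficiently small step size climbs toward the same optimum regardless of the starting point; only the speed of convergence changes.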
15
16

These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.