Maximum Likelihood Estimation

Concept Overview

In Plain English

MLE is a method for fitting a model to data. Given a family of probability distributions parameterized by θ, MLE finds the specific θ that assigns the highest probability to the data you actually observed. If you're fitting a Gaussian and you observed data clustered around 5, MLE will tell you to set the mean to 5 — because that's the parameter that makes the observed data most probable.

Why It Exists

We always need to fit models to data. The question is: what criterion should we use to decide which parameters are 'best'? MLE gives a principled probabilistic answer: the best parameters are those under which the observed data is most likely. This framework is elegant, general, and connects naturally to information theory and Bayesian inference. It's also what most ML loss functions secretly are.

Problem It Solves

Given n observed data points x₁, ..., xₙ assumed to be drawn i.i.d. from a distribution p(x|θ), find the parameter vector θ̂ that maximizes the probability of the observed data: θ̂_MLE = argmax_θ ∏ᵢ p(xᵢ|θ).

Real-Life Analogy

"Imagine you're a detective who finds a coin that came out heads 7 times in 10 flips. You want to estimate how biased the coin is. MLE says: try every possible bias p ∈ [0, 1] and calculate the probability of observing exactly 7 heads. The value of p that makes this probability highest is your estimate. Unsurprisingly, that's p = 0.7 — because no other value of p would make '7 heads in 10 flips' more likely than p = 0.7 does."

When To Use

You have a parametric model family and want to fit its parameters to data
Your data is plentiful — MLE is asymptotically optimal with large datasets
You want to derive or understand what a loss function actually represents
Building probabilistic models that output calibrated probabilities
Connecting model training to information-theoretic principles like KL divergence

When NOT To Use

Your dataset is very small — MLE has no built-in regularization and will overfit
Your model family is misspecified and doesn't match the true data-generating process
You have strong prior knowledge about plausible parameter values — use MAP or full Bayes instead
You need uncertainty estimates over θ itself — MLE gives a point estimate, not a posterior distribution
Class imbalance is severe — MLE will skew toward the majority class without correction

Core Intuition

The fundamental idea of MLE is a reversal of perspective. Normally, if you know the parameters of a distribution, you can compute probabilities. MLE asks the inverse question: given data that you already observed, which parameters would have made this data most likely to appear? You treat the data as fixed and the parameters as the unknown, then search over parameter space to maximize the probability of what you saw.

The likelihood function L(θ) = ∏ᵢ p(xᵢ|θ) is the joint probability of all data points, viewed as a function of θ rather than a function of x. When data points are independent, the joint probability is the product of individual probabilities. This product can get astronomically small for large datasets, so we immediately convert to log-likelihood: ℓ(θ) = Σᵢ log p(xᵢ|θ). Because log is monotonically increasing, maximizing ℓ(θ) finds exactly the same θ as maximizing L(θ).
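The underflow problem described above is easy to reproduce. The following is a minimal sketch, assuming 2000 draws from a standard normal; the helper `normal_pdf` and the sample size are illustrative choices, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=2000)

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

densities = normal_pdf(x, 0.0, 1.0)

# Product of 2000 numbers that are each below 0.4: far smaller than float64 can represent.
raw_likelihood = np.prod(densities)

# Sum of logs: an ordinary negative number.
log_likelihood = np.sum(np.log(densities))

print("Raw likelihood:", raw_likelihood)
print("Log-likelihood:", log_likelihood)
```

Because the log is monotonic, comparing `log_likelihood` across candidate parameters ranks them exactly as the (unrepresentable) raw likelihood would.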

The connection to ML loss functions is not a coincidence — it is by design. Mean squared error emerges from assuming Gaussian noise in your targets. Cross-entropy loss emerges from assuming Bernoulli (binary) or Categorical (multi-class) output distributions. When you minimize a loss function in training, you are almost certainly maximizing a log-likelihood under some distributional assumption about your data. Understanding this connection lets you reason about what assumptions your model is implicitly making and whether they are appropriate.
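The MSE claim can be checked in three lines. This is a sketch under one stated assumption: targets follow y_i = f_θ(x_i) + ε_i with ε_i ~ N(0, σ²) and σ held fixed. The symbol f_θ (the model's prediction function) is introduced here only for illustration.

```latex
\begin{aligned}
\log p(y_i \mid x_i, \theta)
  &= -\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{\bigl(y_i - f_\theta(x_i)\bigr)^2}{2\sigma^2} \\
-\ell(\theta)
  &= \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - f_\theta(x_i)\bigr)^2 \\
\hat\theta_{\mathrm{MLE}}
  &= \arg\min_\theta \sum_{i=1}^{n}\bigl(y_i - f_\theta(x_i)\bigr)^2
   = \arg\min_\theta \mathrm{MSE}(\theta)
\end{aligned}
```

The first term of the negative log-likelihood does not depend on θ and the factor 1/(2σ²) is a positive constant, so only the sum of squared errors matters for the argmin. Replacing the Gaussian with a Bernoulli output yields the binary cross-entropy by the same steps.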

MLE has powerful asymptotic guarantees: as n → ∞, the MLE estimator converges to the true parameter (consistency), achieves the lowest possible variance among unbiased estimators (efficiency, given by the Cramér-Rao bound), and becomes approximately Gaussian distributed around the true value (asymptotic normality). These properties make MLE the gold standard for large-data estimation. With small data, however, MLE can overfit dramatically — a coin that lands heads 3 out of 3 times gets MLE estimate p̂ = 1.0, which is almost certainly wrong.

The Metaphor

"MLE is like tuning a radio. The radio has a tuning knob (the parameter θ) and you're hearing a noisy signal (your data). You turn the knob across all stations and listen to which station setting makes the signal you're receiving most coherent — most likely to have produced those exact audio samples. You stop at the station that best explains what you're hearing. You're not asking what station is playing; you're asking which station setting best accounts for the sound you already heard."

Beginner Mental Model

Pick a distribution family. Imagine running the data-generating process under many different parameter values. For each θ, ask: 'If this were the true θ, how probable would my observed dataset be?' Compute that probability for every candidate θ. The winner is θ̂_MLE. In practice, you take the log, write out the sum, differentiate, set to zero, and solve — which for many distributions gives a clean closed-form formula.
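The "try every candidate θ" picture can be run literally for the coin from the analogy (7 heads in 10 flips). This is a minimal sketch; the grid resolution is an arbitrary choice, and the binomial coefficient is dropped because it does not depend on p.

```python
import numpy as np

# Observed data: 7 heads in 10 flips.
n_flips, n_heads = 10, 7

# Candidate biases on a fine grid over [0, 1].
p_grid = np.linspace(0.0, 1.0, 1001)

# Likelihood of the observed flips under each candidate bias
# (up to a constant factor that does not depend on p).
likelihood = p_grid ** n_heads * (1.0 - p_grid) ** (n_flips - n_heads)

# The winner is the grid point with the highest likelihood.
p_best = p_grid[np.argmax(likelihood)]
print(f"Grid-search MLE: p = {p_best:.3f}")
```

A grid search like this only scales to one or two parameters, which is why the calculus route (log, differentiate, solve) is used in practice.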

Technical Theory

Formal Definition

Given n i.i.d. observations x₁, ..., xₙ from a parametric family p(x|θ), the MLE is θ̂ = argmax_θ L(θ) = argmax_θ ∏ᵢ₌₁ⁿ p(xᵢ|θ) = argmax_θ Σᵢ₌₁ⁿ log p(xᵢ|θ). Equivalently, θ̂ = argmin_θ −(1/n) Σᵢ log p(xᵢ|θ), connecting to empirical risk minimization. The score function ∇_θ log p(xᵢ|θ) gives the slope of the log-likelihood, and the Fisher information I(θ) = E[(∇_θ log p(x|θ))²] gives its expected curvature.

Key Terms

Likelihood function L(θ): L(θ) = ∏ᵢ p(xᵢ|θ). The joint probability of the observed data, treated as a function of the parameters θ with the data held fixed. It is NOT a probability distribution over θ — it doesn't integrate to 1 over θ space. It is a function that measures how compatible θ is with the observed data.
Log-likelihood ℓ(θ): ℓ(θ) = log L(θ) = Σᵢ log p(xᵢ|θ). The natural log of the likelihood. Converts the product of probabilities (which underflows to zero for large n) into a sum of log-probabilities. Monotonic transformation preserves the argmax. Almost always easier to differentiate analytically.
Score function: s(θ) = ∇_θ log p(x|θ). The gradient of the log-likelihood with respect to parameters. At the MLE, the sum of scores equals zero: Σᵢ s(xᵢ; θ̂) = 0. This is the first-order optimality condition. At the true parameter θ*, its expectation under the true distribution is zero: E[s(θ*)] = 0.
Fisher Information I(θ): I(θ) = E[(∇_θ log p(x|θ))²] = −E[∇²_θ log p(x|θ)]. Measures how much information a single observation carries about θ. Also equals the expected curvature of the log-likelihood. The Cramér-Rao bound states that any unbiased estimator satisfies Var(θ̂) ≥ 1/(n·I(θ)), and MLE achieves this bound asymptotically.
Sufficient statistic: A function T(x) of the data is sufficient for θ if the likelihood factors as L(θ) = g(T(x), θ)·h(x) — the dependence on θ comes only through T(x). Knowing T(x) captures everything the data says about θ. For Gaussian data, (x̄, s²) are sufficient for (μ, σ²). For Bernoulli, Σxᵢ is sufficient for p.
MLE vs MAP: Maximum A Posteriori (MAP) estimation adds a prior: θ̂_MAP = argmax_θ [log p(θ) + Σᵢ log p(xᵢ|θ)]. MLE is MAP with a flat (uniform) prior. A Gaussian prior on θ corresponds to L2 regularization. A Laplace prior corresponds to L1 regularization. MAP shrinks estimates toward the prior mean.
Asymptotic normality: For large n, the MLE is approximately Gaussian: √n(θ̂ − θ*) →_d N(0, I(θ*)⁻¹), where θ* is the true parameter. This result enables confidence intervals and hypothesis tests based on MLE estimates. The approximation improves as n grows.
Consistency: θ̂_MLE → θ* in probability as n → ∞ (under regularity conditions). The estimator converges to the true parameter as data increases. This is a minimal sanity property — an estimator that doesn't converge to the truth with infinite data is useless.
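Several of these terms can be checked together by simulation. The sketch below assumes a Bernoulli model with p = 0.3 and n = 500 (illustrative values), repeats the experiment many times, and compares the empirical variance of the MLE with the Cramér-Rao bound 1/(n·I(p)).

```python
import numpy as np

rng = np.random.default_rng(1)
p_true, n, n_trials = 0.3, 500, 20000

# Each trial: n Bernoulli(p_true) draws, summarized by the number of ones
# (the sufficient statistic). The MLE is the observed frequency.
heads = rng.binomial(n, p_true, size=n_trials)
p_hat = heads / n

# Fisher information of a single Bernoulli observation: I(p) = 1 / (p (1 - p)).
fisher_info = 1.0 / (p_true * (1.0 - p_true))
cramer_rao = 1.0 / (n * fisher_info)

empirical_var = p_hat.var()

print(f"Mean of MLE:        {p_hat.mean():.4f} (true p = {p_true})")
print(f"Empirical variance: {empirical_var:.3e}")
print(f"Cramér-Rao bound:   {cramer_rao:.3e}")
```

For the Bernoulli model the MLE is unbiased and attains the bound at every n, so the two variance figures should agree up to simulation noise. A histogram of `p_hat` would show the approximately Gaussian shape predicted by asymptotic normality.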

Step-by-Step Working

Step 1 — Specify a model: Choose a parametric family for your data, e.g., p(xᵢ|θ). This is the distributional assumption (Gaussian, Bernoulli, Poisson, etc.).
Step 2 — Write the likelihood: Under i.i.d. assumption, L(θ) = ∏ᵢ₌₁ⁿ p(xᵢ|θ).
Step 3 — Take the log: ℓ(θ) = Σᵢ log p(xᵢ|θ). Products become sums, making differentiation tractable.
Step 4 — Differentiate: Compute ∂ℓ/∂θ. Set each component of the gradient to zero: ∇_θ ℓ(θ) = 0.
Step 5 — Solve: If a closed-form solution exists (Gaussian, Bernoulli, Poisson), solve the resulting equations. If not (logistic regression, neural networks), use gradient ascent on ℓ(θ).
Step 6 — Verify it's a maximum: Check that the second derivative (Hessian) is negative definite at the solution, confirming a maximum rather than a saddle point or minimum.
Step 7 — Report and interpret: θ̂_MLE is your parameter estimate. For large n, standard errors can be computed as √(I(θ̂)⁻¹/n).
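As a worked instance of the seven steps, here is the Bernoulli case written out, with k = Σᵢ xᵢ denoting the number of ones (a shorthand introduced here).

```latex
\begin{aligned}
\text{Steps 1--3:}\quad
  \ell(p) &= \sum_{i=1}^{n}\bigl[x_i \log p + (1 - x_i)\log(1 - p)\bigr]
           = k \log p + (n - k)\log(1 - p) \\
\text{Step 4:}\quad
  \frac{d\ell}{dp} &= \frac{k}{p} - \frac{n - k}{1 - p} = 0 \\
\text{Step 5:}\quad
  \hat{p} &= \frac{k}{n} \\
\text{Step 6:}\quad
  \frac{d^2\ell}{dp^2} &= -\frac{k}{p^2} - \frac{n - k}{(1 - p)^2} < 0 \\
\text{Step 7:}\quad
  \mathrm{SE}(\hat{p}) &\approx \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n}}
\end{aligned}
```

For 7 heads in 10 flips this gives p̂ = 0.7 with a standard error of about 0.145, a reminder of how loose the estimate is at n = 10.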

Inputs

A dataset of n observations {x₁, ..., xₙ}, and a choice of parametric probability distribution family p(x|θ) with parameter(s) θ.

Outputs

θ̂_MLE: the parameter estimate(s) that maximize the log-likelihood. For Gaussian: (μ̂, σ̂²). For Bernoulli: p̂. For logistic regression: the weight vector ŵ.

Model Assumptions

01. Data points x₁, ..., xₙ are drawn i.i.d. from the true distribution p(x|θ*)

02. The model family p(x|θ) is correctly specified: the true distribution is actually in the family (misspecification leads to convergence to the KL-closest parameter, not the truth)

03. The likelihood is differentiable with respect to θ and the maximum is in the interior of the parameter space (otherwise you need constrained optimization)

04. Regularity conditions: the parameter space is open, the true θ* is identifiable (different θ values give different distributions), interchange of differentiation and integration is valid

Important Edge Cases

▸Small n: MLE overfits. A coin with 3/3 heads gets p̂ = 1.0, ruling out tails forever.
▸Unbounded likelihood: for Gaussian mixture models, likelihood → ∞ as a component's variance → 0 and a single point sits at its mean. The MLE is degenerate.
▸Non-identifiability: multiple θ values give the same distribution. The MLE is not unique (e.g., overparameterized neural networks).
▸Misspecified model: MLE still converges, but to the θ that minimizes KL divergence from the true distribution to the model family — not to any 'true θ' per se.
▸Numerical underflow: computing ∏ᵢ p(xᵢ|θ) directly underflows to 0 in floating point once n is large (in float64, a few hundred to a few thousand points is typically enough). Always work in log space.
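The unbounded-likelihood case is worth seeing numerically. The sketch below is a hypothetical two-component mixture with equal weights: one broad N(0, 1) component and one "spike" component centered exactly on the first data point. The five data values are made up for illustration.

```python
import numpy as np

# Small fixed dataset; the spike component will sit on x[0].
x = np.array([-1.2, -0.4, 0.1, 0.7, 1.5])

def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2

def mixture_loglik(x, sigma_spike):
    # Component A: broad N(0, 1). Component B: N(x[0], sigma_spike^2).
    log_a = np.log(0.5) + log_normal_pdf(x, 0.0, 1.0)
    log_b = np.log(0.5) + log_normal_pdf(x, x[0], sigma_spike)
    # logaddexp computes log(exp(log_a) + exp(log_b)) without underflow.
    return np.sum(np.logaddexp(log_a, log_b))

sigmas = [1e-1, 1e-2, 1e-4, 1e-8]
logliks = [mixture_loglik(x, s) for s in sigmas]

for s, ll in zip(sigmas, logliks):
    print(f"sigma_spike = {s:.0e}  ->  log-likelihood = {ll:.3f}")
```

The log-likelihood keeps climbing as the spike narrows, so no finite maximizer exists. Common remedies are a lower bound on component variances or a prior on them (MAP).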

Methodology / Workflow

Role in the ML Pipeline

MLE is the theoretical foundation of the training step. When you define a loss function and run gradient descent, you are almost always performing MLE (or MAP). Understanding MLE lets you design custom loss functions, diagnose training failures, and know what distributional assumptions your model makes.

Data Preprocessing

01.Ensure data is representative of the target distribution — MLE fits the distribution of training data, so biased training data yields biased estimates
02.Handle missing data before computing likelihoods, as missing values require special treatment (EM algorithm or imputation)
03.For continuous features, check whether the chosen parametric family (e.g., Gaussian) is approximately correct using histograms or QQ-plots
04.Normalize features when using iterative MLE (gradient ascent) to ensure stable convergence
05.For classification, check class balance — severe imbalance will push MLE estimates toward predicting the majority class

Training Process

01.For closed-form MLE (Gaussian, Bernoulli, Poisson): compute sufficient statistics from data and plug into the formula
02.For iterative MLE (logistic regression, neural networks): initialize parameters, compute gradient ∇_θ ℓ(θ), update θ ← θ + α·∇_θ ℓ(θ), repeat until convergence
03.Monitor log-likelihood during training: with full-batch gradient ascent and a sufficiently small step size it should increase at every iteration; with stochastic methods it should increase on average
04.Check convergence: gradient norm ‖∇ℓ‖ < ε, or relative change in ℓ < ε
05.For well-specified models with large n, the MLE should be very close to the true parameters
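When no closed form exists, an alternative to hand-written gradient ascent is to pass the negative log-likelihood to a general-purpose optimizer. The sketch below fits a Gamma distribution this way; the choice of Gamma, the method-of-moments starting point, and the Nelder-Mead method are all illustrative assumptions rather than recommendations from the text.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(3)
true_shape, true_scale = 3.0, 2.0
x = rng.gamma(true_shape, true_scale, size=2000)

def neg_log_likelihood(log_params, x):
    # Optimize over log(shape) and log(scale) so both stay positive.
    shape, scale = np.exp(log_params)
    return -np.sum(stats.gamma.logpdf(x, a=shape, scale=scale))

# Method-of-moments starting point: mean = shape * scale, var = shape * scale^2.
mean, var = x.mean(), x.var()
x0 = np.log([mean ** 2 / var, var / mean])

result = optimize.minimize(neg_log_likelihood, x0, args=(x,), method="Nelder-Mead")
shape_hat, scale_hat = np.exp(result.x)

# Reference fit from scipy with the location fixed at 0.
shape_ref, _, scale_ref = stats.gamma.fit(x, floc=0)

print(f"Optimizer MLE: shape = {shape_hat:.3f}, scale = {scale_hat:.3f}")
print(f"scipy fit:     shape = {shape_ref:.3f}, scale = {scale_ref:.3f}")
```

Minimizing the negative log-likelihood is the same problem as maximizing ℓ(θ); most optimization libraries are written as minimizers, which is why the sign is flipped.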

Implementation Checklist

1. Choose and justify your distributional assumption
2. Write out log p(xᵢ|θ) symbolically for one data point
3. Sum over i to get ℓ(θ)
4. Differentiate ∂ℓ/∂θ and set to zero
5. Solve analytically OR implement gradient ascent
6. Validate: plug θ̂ back in, check ℓ(θ̂) is indeed high; verify on held-out data
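The checklist can be walked through end to end for a Poisson model. This is a sketch on simulated counts; the rate 4.0, the 800/200 split, and the grid are illustrative assumptions.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(4)
counts = rng.poisson(lam=4.0, size=1000)
train, held_out = counts[:800], counts[800:]

def poisson_loglik(x, lam):
    # Items 2 and 3: log p(x_i | lam) = x_i log(lam) - lam - log(x_i!), summed over i.
    return np.sum(x * np.log(lam) - lam - gammaln(x + 1))

# Items 4 and 5: d/d(lam) = sum(x) / lam - n = 0, so lam_hat is the sample mean.
lam_hat = train.mean()

# Item 6: confirm the closed form against a brute-force grid search.
lam_grid = np.linspace(1.0, 8.0, 701)
grid_ll = np.array([poisson_loglik(train, lam) for lam in lam_grid])
lam_grid_best = lam_grid[np.argmax(grid_ll)]

# Item 6, continued: average log-likelihood per held-out observation.
held_out_avg_ll = poisson_loglik(held_out, lam_hat) / len(held_out)

print(f"Closed-form MLE: {lam_hat:.3f}, grid-search peak: {lam_grid_best:.3f}")
print(f"Held-out average log-likelihood: {held_out_avg_ll:.3f}")
```

Reporting the held-out figure per observation makes it comparable across splits of different sizes.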

Mathematical Chamber

Implementation

python

import numpy as np
from scipy import stats

# ─── 1. MLE for Gaussian ─────────────────────────────────────────────────────
np.random.seed(42)
true_mu, true_sigma = 5.0, 2.0
data = np.random.normal(true_mu, true_sigma, size=200)

# Closed-form MLE
mu_hat = np.mean(data)                        # MLE mean
sigma2_hat = np.mean((data - mu_hat) ** 2)   # MLE variance (biased, divides by n)
sigma_hat = np.sqrt(sigma2_hat)

print("=== Gaussian MLE ===")
print(f"True μ: {true_mu:.3f}  |  MLE μ̂: {mu_hat:.3f}")
print(f"True σ: {true_sigma:.3f}  |  MLE σ̂: {sigma_hat:.3f}")

# Compare with scipy MLE (norm.fit returns the same closed-form estimates)
mu_scipy, sigma_scipy = stats.norm.fit(data)
print(f"scipy MLE μ̂: {mu_scipy:.3f}, σ̂: {sigma_scipy:.3f}")

# ─── 2. Log-likelihood as a function of μ ────────────────────────────────────
mu_grid = np.linspace(3, 7, 300)
log_likelihoods = [np.sum(stats.norm.logpdf(data, loc=m, scale=sigma_hat))
                   for m in mu_grid]

# The peak of this curve is the MLE (up to grid resolution)
peak_mu = mu_grid[np.argmax(log_likelihoods)]
print(f"\nPeak of log-likelihood curve at μ = {peak_mu:.3f}")

# ─── 3. MLE for Bernoulli ─────────────────────────────────────────────────────
true_p = 0.7
flips = np.random.binomial(1, true_p, size=50)

p_hat = np.mean(flips)   # MLE = observed frequency of 1s
print("\n=== Bernoulli MLE ===")
print(f"True p: {true_p:.2f}  |  MLE p̂: {p_hat:.3f}")

# Evaluate the log-likelihood on a grid of p values
p_grid = np.linspace(0.01, 0.99, 300)
n_heads = flips.sum()
n_total = len(flips)
log_lik_bern = n_heads * np.log(p_grid) + (n_total - n_heads) * np.log(1 - p_grid)
print(f"Log-likelihood maximized at p = {p_grid[np.argmax(log_lik_bern)]:.3f}")

# ─── 4. MLE for Logistic Regression (gradient ascent) ─────────────────────────
def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def log_likelihood(w, X, y):
    """Bernoulli log-likelihood under logistic model."""
    p = sigmoid(X @ w)
    # Clamp to avoid log(0)
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(w, X, y):
    """∇_w ℓ = Xᵀ(y − σ(Xw))"""
    p = sigmoid(X @ w)
    return X.T @ (y - p)

# Generate synthetic binary classification data
np.random.seed(0)
n_samples = 300
X_raw = np.random.randn(n_samples, 2)
true_w = np.array([2.0, -1.5])
# Sample labels from the Bernoulli model. Thresholding the probabilities at 0.5
# would make the classes perfectly separable, and the MLE would then diverge.
y = np.random.binomial(1, sigmoid(X_raw @ true_w)).astype(float)

# Add bias column
X = np.column_stack([np.ones(n_samples), X_raw])
true_w_full = np.array([0.0, 2.0, -1.5])

# Gradient ascent (maximizing log-likelihood)
w = np.zeros(3)
lr = 0.01
history = []

for iteration in range(500):
    ll = log_likelihood(w, X, y)
    history.append(ll)
    w += lr * gradient(w, X, y)

print("\n=== Logistic Regression MLE ===")
print(f"Final log-likelihood: {history[-1]:.3f}")
print(f"MLE weights: {w}")
print(f"True weights: {true_w_full}")

# ─── 5. Connection: MSE = Gaussian NLL ────────────────────────────────────────
# For regression with Gaussian noise y = wᵀx + ε, ε ~ N(0, σ²):
# NLL = Σ(yᵢ - wᵀxᵢ)² / (2σ²) + const, and Σ(yᵢ - wᵀxᵢ)² = MSE × n
# For fixed σ, minimizing MSE is exactly MLE under the Gaussian noise assumption

X_reg = np.random.randn(100, 1)
y_reg = 3 * X_reg.squeeze() + np.random.randn(100)

# MLE via least squares
X_aug = np.column_stack([np.ones(100), X_reg])
w_mle = np.linalg.lstsq(X_aug, y_reg, rcond=None)[0]
print("\n=== Linear Regression MLE (= Least Squares) ===")
print(f"MLE weights: intercept={w_mle[0]:.3f}, slope={w_mle[1]:.3f} (true slope=3.0)")

# ─── 6. MLE vs MAP with Gaussian Prior (= L2 regularization) ──────────────────
def map_estimate(X, y, lam):
    """MAP with Gaussian prior: (XᵀX + λI)⁻¹ Xᵀy (the intercept is penalized too, for simplicity)"""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_feat), X.T @ y)

w_mle_lr = map_estimate(X_aug, y_reg, lam=0.0)   # λ=0 → MLE
w_map_lr = map_estimate(X_aug, y_reg, lam=10.0)  # λ>0 → MAP (ridge)
print(f"MLE:  {w_mle_lr}")
print(f"MAP:  {w_map_lr}  (shrunk toward zero by prior)")

Shows MLE in four scenarios: closed-form Gaussian estimation (mean and variance), Bernoulli estimation (coin bias), iterative gradient ascent for logistic regression (no closed form), and the MSE = Gaussian-NLL connection. Also demonstrates that L2 regularization is MAP estimation under a Gaussian prior.

Sample Input

data = [4.8, 5.2, 3.9, 6.1, 5.0, 4.5, 5.7, 4.3, 5.5, 5.1]  # n=10 observations

Sample Output

Gaussian MLE: μ̂ = 5.01, σ̂ = 0.632 (MLE, divides by n=10)
Sample std: 0.666 (from the unbiased variance, divides by n-1=9)
Log-likelihood at MLE: -9.59
Log-likelihood at μ=4.0 (σ fixed at σ̂): -22.38  (worse)
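The quantities in this sample can be recomputed directly from the ten observations. The sketch below assumes the second log-likelihood is evaluated with σ held at the MLE value while μ is moved to 4.0; the helper `gaussian_loglik` is introduced for illustration.

```python
import numpy as np

data = np.array([4.8, 5.2, 3.9, 6.1, 5.0, 4.5, 5.7, 4.3, 5.5, 5.1])

mu_hat = data.mean()
sigma_mle = np.sqrt(np.mean((data - mu_hat) ** 2))   # divides by n
sigma_unbiased = data.std(ddof=1)                    # divides by n - 1

def gaussian_loglik(x, mu, sigma):
    """Sum of Gaussian log densities over the sample."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (x - mu) ** 2 / (2 * sigma ** 2))

ll_at_mle = gaussian_loglik(data, mu_hat, sigma_mle)
ll_at_mu4 = gaussian_loglik(data, 4.0, sigma_mle)

print(f"mu_hat = {mu_hat:.2f}, sigma_mle = {sigma_mle:.3f}, sample std = {sigma_unbiased:.3f}")
print(f"log-likelihood at MLE:  {ll_at_mle:.2f}")
print(f"log-likelihood at mu=4: {ll_at_mu4:.2f}")
```

Moving μ away from the sample mean can only lower the log-likelihood, since the sum of squared deviations is minimized at x̄.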

Key Implementation Insights

→MSE and cross-entropy are not arbitrary choices — they follow necessarily from assuming Gaussian and Bernoulli/Categorical noise respectively. Choosing a different noise model gives a different loss function.
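→Always work in log-likelihood space. Multiplying raw probabilities underflows to exactly 0 in float64 once the product drops below roughly 1e-308, which can happen with only a few hundred observations.
→For logistic regression, the log-likelihood is globally concave (the Hessian is negative semi-definite), so any local maximum found by gradient ascent (equivalently, gradient descent on the negative log-likelihood) is a global maximum. If the classes are perfectly separable, no finite maximizer exists and the weights grow without bound.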
→scikit-learn's LogisticRegression defaults to C=1 (L2 regularization), which is MAP, not pure MLE. Set C=1e9 to approximate MLE.
→scipy.stats distribution .fit() methods perform MLE — they're a fast way to fit standard parametric distributions.
→The MLE for the Gaussian variance divides by n, not n−1. The n−1 version (unbiased sample variance) is NOT the MLE — knowing this difference matters in interviews.
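The n versus n−1 distinction in the last point can be verified by simulation. The sketch below assumes small samples of size 5 from N(0, 4); in NumPy, `ddof=0` gives the MLE and `ddof=1` gives the unbiased estimator.

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_trials, true_var = 5, 200_000, 4.0

# Each row is one independent sample of size n from N(0, true_var).
samples = rng.normal(0.0, np.sqrt(true_var), size=(n_trials, n))

var_mle = samples.var(axis=1, ddof=0)        # divides by n (the MLE)
var_unbiased = samples.var(axis=1, ddof=1)   # divides by n - 1

mean_mle = var_mle.mean()
mean_unbiased = var_unbiased.mean()

print(f"Average MLE variance:      {mean_mle:.3f}")
print(f"Average unbiased variance: {mean_unbiased:.3f}")
print(f"Theory for the MLE:        {(n - 1) / n * true_var:.3f}")
```

The MLE underestimates the variance by the factor (n−1)/n on average. The gap vanishes as n grows, which is why the MLE is biased but still consistent.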

Common Implementation Mistakes

✗Computing the product of probabilities directly for large n — always use log-likelihood (sum of log-probabilities)
✗Confusing MLE variance (1/n) with unbiased sample variance (1/(n-1)) — they are different; MLE divides by n
✗Assuming MLE always has a closed form — logistic regression, neural networks, and many other models require iterative optimization
✗Forgetting that sklearn's LogisticRegression is regularized by default (C=1) — you must set C very large to get pure MLE
✗Treating the likelihood function as a probability distribution over θ — it is not; it doesn't integrate to 1 over θ
✗Applying MLE to tiny datasets without regularization — p̂ = 1.0 for a coin that landed heads 3/3 times is MLE but a terrible estimate

Dataset Applicability

📊

Large tabular datasets

Excellent

MLE is optimal for large samples. Asymptotic efficiency guarantees near-optimal estimates.

💡 Use logistic regression MLE for classification, Gaussian MLE for continuous targets.

🔬

Small datasets (n < 100)

Poor

MLE overfits without regularization. A coin that lands heads 3/3 gets p̂=1.0.

💡 Use MAP (regularized MLE) or full Bayesian inference with informative priors.

📝

Text / NLP

Good

Cross-entropy loss (= MLE under Categorical) is the standard training objective for language models.

💡 Laplace smoothing (add-1) avoids zero probabilities for unseen words; it is the MAP estimate under a symmetric Dirichlet(2) prior on the word probabilities.
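The zero-probability problem and the add-1 fix fit in a few lines. The toy corpus and vocabulary below are made up for illustration; "dog" is in the vocabulary but never observed.

```python
from collections import Counter

tokens = "the cat sat on the mat the end".split()
vocab = ["the", "cat", "sat", "on", "mat", "end", "dog"]

counts = Counter(tokens)
n_tokens, vocab_size = len(tokens), len(vocab)

# Pure MLE: relative frequency. Unseen words get probability exactly 0.
p_mle = {w: counts[w] / n_tokens for w in vocab}

# Add-1 (Laplace) smoothing: pretend every vocabulary word was seen once more.
p_add1 = {w: (counts[w] + 1) / (n_tokens + vocab_size) for w in vocab}

print("MLE   p(dog) =", p_mle["dog"])
print("Add-1 p(dog) =", p_add1["dog"])
```

Under the MLE, any sentence containing "dog" has likelihood 0 and log-likelihood −∞, which breaks evaluation on new text. The smoothed estimate keeps every probability positive while still summing to 1.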

⚖️

Imbalanced classification

Poor

MLE estimates reflect training class frequencies. On 99/1 imbalanced data, MLE skews strongly toward the majority class.

💡 Weight the loss by inverse class frequency, or use class_weight='balanced' in sklearn.

🔔

Continuous features (Gaussian model)

Excellent

MLE for Gaussian gives exact closed-form estimators (mean and variance) in O(n) time.

💡 Verify Gaussian assumption with a histogram or QQ-plot before trusting the fit.

⚠️

Misspecified model

Context-Dependent

MLE still converges, but to the parameter minimizing KL divergence from truth to model, not the 'true' parameter.

💡 Always validate your distributional assumption. Misspecification can cause systematic bias.

Visualizations

Log-Likelihood Surface for Gaussian Parameters

A 2D contour plot of ℓ(μ, σ²) over a grid of parameter values. The peak of the surface is at (x̄, σ̂²_MLE). The surface has a single peak, which is the global maximum.

For fixed σ², the Gaussian log-likelihood is a downward parabola in μ (strictly concave). For fixed μ, it rises to a single peak in σ² and then falls off slowly. The surface is not jointly concave in (μ, σ²), but the joint maximum is still unique.

Likelihood vs Parameter: Bernoulli Coin

Plot of L(p) = p^k · (1−p)^(n−k) as a function of p ∈ (0,1), for k=7 heads in n=10 flips. The peak is at p=0.7, illustrating how MLE finds the most compatible parameter.

Log-Likelihood Increasing During Training (Logistic Regression)

Training curve showing log-likelihood vs gradient ascent iterations for logistic regression. Starts negative and large, increases and plateaus at the MLE.

MLE vs MAP: Effect of Prior Strength

Comparison of MLE and MAP estimates as a function of prior strength (λ). MLE is constant (no prior). MAP shrinks toward zero as λ increases. With large n, both converge to the truth.

As regularization strength (λ = 1/C in sklearn) increases, MAP estimates are pulled toward zero. MLE is the λ=0 special case.

Advantages & Limitations

Advantages

Principled and general
MLE works for any parametric distribution family — Gaussian, Bernoulli, Poisson, exponential, mixture models, etc. It provides a unified framework rather than an ad-hoc fitting rule.
Asymptotically optimal
For large datasets, MLE achieves the Cramér-Rao lower bound: no unbiased estimator can have lower variance. You cannot do better asymptotically.
Closed form for exponential families
For Gaussian, Bernoulli, Poisson, Exponential, and many other exponential family distributions, MLE yields exact formulas expressible as simple functions of sufficient statistics, computable in O(n). Some members (for example, the Gamma shape parameter) still require a numerical solve.
Reveals the meaning of loss functions
Understanding MLE explains why MSE and cross-entropy are the right choices for regression and classification respectively. It removes the 'magic' from loss function selection.
Invariance property
If θ̂ is the MLE of θ, then g(θ̂) is the MLE of g(θ) for any function g. This makes transformations natural — you don't need to re-derive the MLE for each parameterization.
Foundation for model comparison
Log-likelihood enables likelihood ratio tests, AIC/BIC model selection criteria, and cross-validated predictive performance comparisons — all built on the MLE foundation.
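The invariance property can be checked numerically. The sketch below uses the coin with 7 heads in 10 flips and the transformation g(p) = p/(1−p) (the odds), then compares the plug-in value g(p̂) with a direct maximization over the odds. The grid bounds are an arbitrary choice.

```python
import numpy as np

flips = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])   # 7 heads in 10 flips
k, n = flips.sum(), len(flips)

# MLE of p, then plug into g(p) = p / (1 - p).
p_hat = k / n
odds_plugin = p_hat / (1 - p_hat)

# Direct route: reparameterize p = odds / (1 + odds) and maximize over odds.
# The log-likelihood becomes k * log(odds) - n * log(1 + odds).
odds_grid = np.linspace(0.01, 10.0, 9991)           # spacing 0.001
loglik_odds = k * np.log(odds_grid) - n * np.log(1 + odds_grid)
odds_direct = odds_grid[np.argmax(loglik_odds)]

print(f"Plug-in odds: {odds_plugin:.4f}")
print(f"Direct MLE:   {odds_direct:.4f}")
```

Both routes land on 7/3 up to grid resolution, so there is no need to re-derive the estimator after a reparameterization.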

Limitations

Overfits on small datasets
MLE has no built-in regularization. With small n, estimates can be extreme (p̂=1.0 for 3/3 heads). MAP (MLE + prior) or full Bayesian inference is preferable for small data.
Sensitive to model misspecification
If the true distribution is not in your assumed model family, MLE converges to the 'wrong' parameter (the KL-minimizing one). Fitting a Gaussian to heavy-tailed data gives misleading estimates.
No uncertainty quantification over parameters
MLE gives a point estimate θ̂, not a posterior distribution p(θ|data). You don't know how confident to be in the estimate. Confidence intervals require additional asymptotic approximations.
Degeneracies in mixture models
Gaussian mixture model likelihood is unbounded — it blows up as a component's variance → 0 with one point at its center. MLE is not well-defined without constraints.
Ignores class imbalance
Pure MLE reflects the training distribution. On 99/1 imbalanced data, predictions are dominated by the majority class. Requires explicit reweighting to correct.
Requires choosing a distributional family
You must commit to a parametric family upfront. If the family is wrong, MLE cannot correct for it. Non-parametric methods avoid this commitment at the cost of higher variance.

Practical Use Cases

Natural Language Processing

Language model training

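GPT-style autoregressive language models are trained by maximizing the likelihood of observed token sequences, and BERT-style masked models maximize the likelihood of masked tokens given their context. In both cases the cross-entropy loss is the negative log-likelihood under a Categorical distribution.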

Finance

Risk model calibration

Fitting parametric distributions (log-normal, t-distribution) to asset return data using MLE to calibrate Value at Risk (VaR) and other risk metrics.

Healthcare

Survival analysis

MLE fits Weibull or exponential models to time-to-event data, and the Cox proportional hazards model is fit by maximizing a partial likelihood. Censoring is handled naturally through the likelihood formulation.

Computer Vision

Generative models

Variational Autoencoders (VAEs) maximize a lower bound on log-likelihood. Normalizing flows maximize exact likelihood. Both use MLE as the training criterion.

Telecommunications

Signal parameter estimation

MLE estimates signal frequency, amplitude, and noise variance from received samples. The MLE is the standard estimator in radar, sonar, and communications systems.

Biology / Genetics

Phylogenetic tree inference

Maximum likelihood phylogenetics (e.g., RAxML, IQ-TREE) fits evolutionary models to DNA sequence alignment data using MLE, producing the most likely evolutionary tree.

Comparison

MLE is a point estimation method. It sits in a spectrum from pure data-driven (MLE) through prior-informed (MAP) to fully probabilistic (Bayesian). It also contrasts with non-parametric methods that avoid distributional assumptions entirely.

MAP (Maximum A Posteriori)

Similarity

Also maximizes an objective derived from Bayes theorem; uses the same data likelihood

Key Difference

Adds log p(θ) (log prior) to the objective. Equivalent to regularized MLE. Gives a point estimate like MLE but is pulled toward the prior.

Choose When

When you have prior knowledge or small data and want regularization without full Bayesian inference.

Full Bayesian Inference

Similarity

Uses the same likelihood function p(data|θ); MLE is the mode of the posterior with flat prior

Key Difference

Computes the full posterior p(θ|data) rather than a point estimate. Provides calibrated uncertainty. Requires specifying a prior and often intractable integrals (solved by MCMC or VI).

Choose When

When you need uncertainty estimates, have small data, or want to propagate uncertainty through predictions.

Method of Moments

Similarity

Also produces point estimates for parametric models

Key Difference

Matches sample moments (mean, variance) to theoretical moments rather than maximizing likelihood. Computationally simpler but asymptotically less efficient than MLE.

Choose When

Quick baseline estimates or when the likelihood is intractable but moments are easily computed.

Least Squares (OLS)

Similarity

Identical to MLE for linear regression under Gaussian noise — they produce the same estimates

Key Difference

OLS minimizes residual sum of squares without a probabilistic framing. MLE shows WHY minimizing squared errors is the right thing to do (Gaussian noise assumption).

Choose When

OLS is fine for linear regression; MLE framing is needed when you want probabilities or non-Gaussian noise.

Criterion	MLE	MAP	Bayesian	Least Squares
Output	Point estimate θ̂	Point estimate θ̂	Full posterior p(θ|data)	Point estimate θ̂
Regularization	None	Yes (via prior)	Yes (via prior)	Optional (ridge/lasso)
Small data behavior	Overfits	Better (shrinks)	Best (uncertainty)	Overfits
Uncertainty quantification	Approximate (CI)	Approximate (CI)	Exact (posterior)	Approximate (CI)
Computational cost	Low-Medium	Low-Medium	High (MCMC/VI)	Low
Distributional assumption	Required	Required + prior	Required + prior	Implicitly Gaussian

Choose Maximum Likelihood Estimation when:

You have abundant data, a well-specified parametric model, no strong prior knowledge, and need a computationally efficient point estimate with theoretical optimality guarantees.

Evaluation

Log-Likelihood

Higher (less negative) is better. Measures how well the fitted model explains the training data. Can overfit — evaluate on held-out data.

Target: Compare across models on the same dataset; there is no absolute scale.

AIC (Akaike Information Criterion)

AIC = 2k − 2ℓ(θ̂), where k = number of parameters. Model selection criterion balancing fit and complexity. Lower AIC is better. Discourages overfitting by penalizing extra parameters.

Target: Used comparatively. ΔAIC < 2 is negligible difference; ΔAIC > 10 is strong evidence for the better model.

BIC (Bayesian Information Criterion)

BIC = k·log(n) − 2ℓ(θ̂). Like AIC but with a stronger penalty per parameter once n ≥ 8 (log(n) vs 2). More conservative: prefers simpler models, especially for large n.

Target: Prefer BIC when n is large and you want sparser models; prefer AIC when prediction accuracy matters most.

Likelihood Ratio Test Statistic

LR = 2(ℓ_full − ℓ_null). Tests whether a full model fits significantly better than a restricted (null) model. Under the null, LR is approximately χ² distributed with df = difference in number of parameters. Large LR → reject the null model.

Target: p < 0.05 or p < 0.01 depending on application. Requires the null model to be nested in the full model.
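A likelihood ratio test for a simple nested pair can be sketched as follows. The setup is hypothetical: the null model is a Gaussian with the mean fixed at 0 (only σ free), the full model frees both μ and σ, and the data are simulated with a true mean of 0.5.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(loc=0.5, scale=1.0, size=100)
n = len(x)

# Null (restricted) model: mu fixed at 0, sigma estimated by MLE.
sigma_null = np.sqrt(np.mean(x ** 2))
ll_null = np.sum(stats.norm.logpdf(x, loc=0.0, scale=sigma_null))

# Full model: mu and sigma both estimated by MLE.
mu_full = x.mean()
sigma_full = np.sqrt(np.mean((x - mu_full) ** 2))
ll_full = np.sum(stats.norm.logpdf(x, loc=mu_full, scale=sigma_full))

# Test statistic: twice the log-likelihood gain; df = 2 - 1 = 1 extra parameter.
lr_stat = 2.0 * (ll_full - ll_null)
p_value = stats.chi2.sf(lr_stat, df=1)

print(f"LR statistic = {lr_stat:.2f}, p-value = {p_value:.2e}")
```

The statistic is never negative for nested models, because the full model can always reproduce the null fit by setting its extra parameter to the restricted value.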

Perplexity (NLP)

Perplexity = exp(−(1/N) Σᵢ log p(tokenᵢ)), the inverse of the geometric mean per-token probability. Lower perplexity = model is less 'surprised' by data. Standard metric for language model evaluation.

Target: Depends heavily on vocabulary, tokenization, and domain, so compare only models evaluated with the same tokenizer and test set. Lower is better.
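Perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch, using four made-up per-token probabilities that a hypothetical model assigned to the tokens that actually occurred:

```python
import numpy as np

# Probability the model gave to each observed token (illustrative values).
token_probs = np.array([0.25, 0.5, 0.125, 0.25])

# Average negative log-likelihood per token (in nats).
avg_nll = -np.mean(np.log(token_probs))

# Perplexity: exp of the average NLL, i.e. 1 / geometric mean probability.
perplexity = np.exp(avg_nll)

print(f"Average NLL: {avg_nll:.4f} nats")
print(f"Perplexity:  {perplexity:.4f}")
```

A perplexity of 4 here means the model was, on average, as uncertain as if it were choosing uniformly among 4 tokens at each position.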

Evaluation Process

01.Fit the MLE parameters on training data
02.Evaluate log-likelihood on a held-out test set (not training data!), because training log-likelihood is an optimistically biased estimate of generalization
03.Compute AIC/BIC if comparing multiple model families of different complexity
04.Perform likelihood ratio tests for nested model comparisons
05.Check residuals or posterior predictive samples to validate the distributional assumption
06.For classification models (logistic regression), additionally evaluate accuracy, AUC-ROC, and calibration (reliability diagrams)

Evaluation Traps

▸Never compare log-likelihoods across different dataset sizes — they are not comparable (log-likelihood scales with n)
▸Evaluating on training data gives optimistically biased log-likelihood — always use held-out data or cross-validation
▸A higher log-likelihood doesn't mean the model assumption is correct — a misspecified model can have high likelihood by chance
▸Likelihood ratio tests are only valid for nested models — the LR statistic does not have a chi-squared distribution for non-nested models

Real-World Interpretation Example

You fit a Gaussian model to 1000 data points and get a maximized log-likelihood ℓ(θ̂) = −2103. A competing Student-t model gives ℓ = −2085. The Gaussian has 2 parameters (mean and scale) and the t-model has 1 extra (degrees of freedom), so k = 2 and k = 3. AIC comparison: Gaussian AIC = 2·2 + 4206 = 4210, t-model AIC = 2·3 + 4170 = 4176. The t-model wins by ΔAIC = 34, which is overwhelming evidence for the heavier-tailed model. Practical interpretation: your data has more extreme outliers than a Gaussian predicts.
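The comparison can be scripted with two small helpers. This sketch assumes the log-likelihoods are the maximized values on the fitted data, and counts 2 parameters for the Gaussian (mean, scale) and 3 for the Student-t (location, scale, degrees of freedom); those counts are an assumption about the parameterization.

```python
import math

def aic(loglik, k):
    """Akaike Information Criterion: 2k - 2 * log-likelihood."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian Information Criterion: k * log(n) - 2 * log-likelihood."""
    return k * math.log(n) - 2 * loglik

n = 1000
gauss_ll, gauss_k = -2103.0, 2   # assumed: mean and scale
t_ll, t_k = -2085.0, 3           # assumed: location, scale, degrees of freedom

aic_gauss, aic_t = aic(gauss_ll, gauss_k), aic(t_ll, t_k)
bic_gauss, bic_t = bic(gauss_ll, gauss_k, n), bic(t_ll, t_k, n)

print(f"AIC: Gaussian = {aic_gauss:.0f}, Student-t = {aic_t:.0f}, delta = {aic_gauss - aic_t:.0f}")
print(f"BIC: Gaussian = {bic_gauss:.1f}, Student-t = {bic_t:.1f}, delta = {bic_gauss - bic_t:.1f}")
```

BIC charges log(1000) ≈ 6.9 per parameter instead of 2, so its margin is smaller than the AIC margin, but both criteria favor the Student-t model here.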

Common Mistakes

Students

×Computing ∏ p(xᵢ|θ) numerically for large n and getting 0 due to underflow — always sum log-probabilities instead
×Thinking the likelihood L(θ) is a probability distribution over θ — it is not (it doesn't integrate to 1 over θ)
×Confusing MLE variance (divides by n) with unbiased sample variance (divides by n-1) — they are different estimators with different properties
×Assuming MLE always has a closed form — most interesting models (logistic regression, neural networks) require iterative optimization
×Forgetting to check whether the stationary point of ℓ(θ) is a maximum, not a minimum or saddle point
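The underflow mistake is easy to reproduce. In this sketch each of 1000 observations is assumed to have probability 0.01 under the model (an arbitrary illustrative value).

```python
import math

probs = [0.01] * 1000  # hypothetical per-observation probabilities

# Naive likelihood: the true value is 1e-2000, far below the smallest
# representable double, so the product silently becomes 0.0.
naive_likelihood = math.prod(probs)

# Log-likelihood: a sum of moderate negative numbers, no underflow.
log_likelihood = sum(math.log(p) for p in probs)

print(naive_likelihood)  # 0.0
print(log_likelihood)    # about -4605.17
```

Once the product has collapsed to 0.0, every candidate θ looks equally bad, so the optimizer has nothing to compare.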

Developers

×Using sklearn's LogisticRegression without realizing it defaults to C=1 (L2-regularized MAP), not pure MLE
×Not working in log-space when implementing likelihood functions — numerical underflow kills probabilities silently
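×Failing to guard against log(0) when implementing gradient ascent on likelihood: log p blows up as p → 0, so clip probabilities away from 0 and 1 or compute log-probabilities directly from logits (gradient clipping can additionally stabilize the updates)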
×Applying MLE without considering whether the distributional assumption fits the data (e.g., using Gaussian when data is heavily skewed)
×Not checking for class imbalance before MLE training — the loss will be dominated by the majority class
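One way to avoid the log-space failures listed above is to compute the Bernoulli log-likelihood directly from the logit z instead of from the probability. The sketch below is one common formulation, not the only one; the function names are hypothetical.

```python
import math

def log_sigmoid(z):
    """log(sigmoid(z)) without overflow in exp or log(0)."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def bernoulli_loglik(y, z):
    """Log-likelihood of label y in {0, 1} given logit z."""
    # log p(y=1) = log sigmoid(z); log p(y=0) = log sigmoid(-z).
    return log_sigmoid(z) if y == 1 else log_sigmoid(-z)

def naive_loglik(y, z):
    """Same quantity via the probability; fails for extreme logits."""
    p = 1.0 / (1.0 + math.exp(-z))
    return math.log(p) if y == 1 else math.log(1.0 - p)

print(bernoulli_loglik(0, 40.0))    # about -40.0
print(bernoulli_loglik(1, -800.0))  # -800.0

try:
    naive_loglik(0, 40.0)  # p rounds to exactly 1.0, so log(1 - p) = log(0)
except ValueError as err:
    print("naive version failed:", err)
```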

In Interviews

×Claiming cross-entropy 'comes from information theory' without connecting it to Bernoulli/Categorical MLE — both explanations are valid, but interviewers often want the probabilistic derivation
×Saying the MLE of the Gaussian variance is unbiased. It is not: it divides by n, not n-1. This bias is a common interview trap
×Not being able to state what distributional assumption underlies MSE (Gaussian noise) or cross-entropy (Bernoulli/Categorical)
×Confusing consistency (convergence to the true value as n grows) with unbiasedness (zero expected error for any n): the MLE of the Gaussian variance is consistent but biased
×Being unable to explain why we take the log: the product underflows, and differentiating sums is much easier than differentiating products
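The n versus n-1 distinction can be made concrete with Python's standard library, where `statistics.pvariance` divides by n (matching the Gaussian MLE) and `statistics.variance` divides by n-1. The sample values are arbitrary.

```python
import statistics

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # arbitrary sample, mean 5
n = len(xs)

mle_var = statistics.pvariance(xs)      # sum of squares / n      = 4.0
unbiased_var = statistics.variance(xs)  # sum of squares / (n-1)  = 32/7

# The two differ by exactly the factor (n - 1) / n, which tends to 1
# as n grows: the MLE is biased for every finite n but consistent.
ratio = mle_var / unbiased_var
print(mle_var, unbiased_var, ratio)
```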

Real Projects

×Ignoring model misspecification: fitting a Gaussian to log-normal data and reporting mean/variance as if they're meaningful
×Using pure MLE (C=1e9 in sklearn) on small or noisy datasets without regularization — overfitting is severe
×Not validating distributional assumptions with goodness-of-fit tests or visual checks before using MLE estimates for downstream decisions
×Forgetting that likelihood ratio tests require NESTED models — comparing Gaussian vs Gamma with LRT gives invalid p-values
×Treating MLE point estimates as certain in downstream computations without propagating estimation uncertainty
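A minimal sketch of carrying uncertainty along with a point estimate, using the Bernoulli MLE and a Wald interval. The interval relies on the asymptotic normality of the MLE, so it is only a rough guide for small n; the function name and the z = 1.96 default (roughly 95% coverage) are illustrative choices.

```python
import math

def bernoulli_mle_with_interval(successes, n, z=1.96):
    """Return (p_hat, standard error, approximate Wald interval)."""
    p_hat = successes / n
    # Inverse Fisher information for a Bernoulli: p(1 - p) / n.
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    lower = max(0.0, p_hat - z * se)
    upper = min(1.0, p_hat + z * se)
    return p_hat, se, (lower, upper)

small = bernoulli_mle_with_interval(7, 10)
large = bernoulli_mle_with_interval(700, 1000)

print(small)  # p_hat 0.7, interval roughly (0.42, 0.98)
print(large)  # p_hat 0.7, interval roughly (0.67, 0.73)
```

Both samples give p̂ = 0.7, but only the second supports treating 0.7 as close to certain in a downstream calculation.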

Core ML Thinking Lens

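What kind of bias does this model have?

MLE is consistent but can be biased in finite samples: the Gaussian variance estimate σ̂² divides by n and underestimates the true variance by a factor of (n-1)/n on average. A misspecified model family adds bias that more data does not remove.

What kind of variance does it have?

Asymptotically as low as any unbiased estimator can achieve (it attains the Cramér-Rao bound), but high for small n or for flexible models with many parameters and no regularization.

How does it overfit?

With no prior or penalty, MLE fits the training sample as closely as the model family allows. Training log-likelihood then exceeds held-out log-likelihood, and on small datasets estimates can become extreme (for example, p̂ = 1 after three heads in three flips).

How do we regularize it?

Switch to MAP: a Gaussian prior gives L2 regularization and a Laplace prior gives L1. Compare model families with AIC/BIC or held-out log-likelihood.

What kind of data does it like?

Plentiful i.i.d. data drawn from a distribution the chosen model family can actually represent.

What kind of data breaks it?

Small samples, outliers under a light-tailed assumption such as Gaussian, severe class imbalance, and data that violates the distributional or i.i.d. assumption.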

Summary Cheat Sheet

Quick Revision Reference

Key Takeaways

MLE finds θ that maximizes the probability of observed data: θ̂ = argmax_θ Σᵢ log p(xᵢ|θ)
Always work in log-space: products → sums, prevents underflow, simplifies differentiation
Gaussian MLE: μ̂ = x̄, σ̂² = (1/n)Σ(xᵢ-x̄)² — note 1/n not 1/(n-1), so σ̂² is biased
Bernoulli MLE: p̂ = (number of 1s) / n — the observed frequency
MSE = negative log-likelihood under Gaussian noise; cross-entropy = under Bernoulli/Categorical
MLE has no closed form for logistic regression: use gradient ascent on the concave log-likelihood
MAP objective = log-likelihood + log prior. Gaussian prior → L2 regularization. Laplace prior → L1 regularization
Asymptotic properties: consistent (θ̂ → θ*), asymptotically efficient (attains the Cramér-Rao bound as n → ∞), asymptotically normal
Likelihood is a function of θ, NOT a probability distribution over θ
Small data: MLE overfits. Large data: MLE is asymptotically optimal. For small n, regularize with MAP
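The small-data takeaway can be illustrated with a coin. The sketch assumes a Beta(2, 2) prior on the bias p, chosen here only because it gives a closed-form MAP estimate; it is not the Gaussian or Laplace prior mentioned above, and the function names are hypothetical.

```python
def bernoulli_mle(heads, n):
    """MLE of the coin bias: the observed frequency."""
    return heads / n

def bernoulli_map(heads, n, a=2.0, b=2.0):
    """MAP estimate under a Beta(a, b) prior (assumes a, b > 1).

    This is the mode of the Beta(a + heads, b + n - heads) posterior.
    """
    return (heads + a - 1) / (n + a + b - 2)

# Three flips, all heads: MLE claims the coin can never land tails.
mle_small = bernoulli_mle(3, 3)  # 1.0
map_small = bernoulli_map(3, 3)  # 0.8

# With plenty of data the prior barely matters.
mle_large = bernoulli_mle(700, 1000)  # 0.7
map_large = bernoulli_map(700, 1000)  # 701/1002, about 0.6996

print(mle_small, map_small, mle_large, map_large)
```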

Critical Formulas

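MLE objective

θ̂_MLE = argmax_θ Σᵢ log p(xᵢ|θ)

Gaussian MLE mean

μ̂ = (1/n) Σᵢ xᵢ = x̄

Gaussian MLE variance (biased)

σ̂² = (1/n) Σᵢ (xᵢ − x̄)²

Bernoulli MLE

p̂ = (1/n) Σᵢ xᵢ

NLL = Loss

Loss(θ) = −ℓ(θ) = −Σᵢ log p(xᵢ|θ)

Logistic regression gradient

∇_w ℓ(w) = Σᵢ (yᵢ − σ(wᵀxᵢ)) xᵢ, where σ is the sigmoid and w the weight vector

MAP objective

θ̂_MAP = argmax_θ [Σᵢ log p(xᵢ|θ) + log p(θ)]

Cramér-Rao bound

Var(θ̂) ≥ 1 / (n·I(θ)) for any unbiased θ̂, where I(θ) is the Fisher information of a single observation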
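To see the logistic regression gradient in action, the sketch below runs plain gradient ascent with the gradient Σᵢ (yᵢ − σ(w·xᵢ + b)) xᵢ on a tiny one-feature dataset. The data, learning rate, and iteration count are illustrative assumptions; the classes overlap on purpose so that the MLE is finite.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_likelihood(w, b, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def gradient(w, b, xs, ys):
    """Gradient of the log-likelihood with respect to (w, b)."""
    gw = sum((y - sigmoid(w * x + b)) * x for x, y in zip(xs, ys))
    gb = sum(y - sigmoid(w * x + b) for x, y in zip(xs, ys))
    return gw, gb

# Toy data with overlapping classes (not linearly separable).
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

w, b = 0.0, 0.0
learning_rate = 0.1
for _ in range(5000):
    gw, gb = gradient(w, b, xs, ys)
    w += learning_rate * gw  # ascent: move along the gradient
    b += learning_rate * gb

final_gw, final_gb = gradient(w, b, xs, ys)
print(w, b, log_likelihood(w, b, xs, ys))
```

Because the log-likelihood is concave, the point where the gradient vanishes is the global maximum, whatever the starting point.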

Best For

✓Large datasets where asymptotic efficiency matters
✓Deriving and understanding loss functions from first principles
✓Fitting standard parametric distributions (Gaussian, Bernoulli, Poisson)
✓Any model where a distributional assumption is justified by domain knowledge
✓Foundation for gradient-based training of probabilistic models

Avoid When

✗Small datasets without regularization
✗Model family is clearly misspecified
✗You need full uncertainty over parameters (use Bayesian inference)
✗Severe class imbalance without reweighting

Interview Must-Know

★MSE = Gaussian MLE; Cross-entropy = Bernoulli/Categorical MLE

★Why log-likelihood: products underflow, sums are easier to differentiate, monotone so same argmax

★MLE Gaussian variance divides by n (biased), unbiased sample variance divides by n-1

★MAP objective = log-likelihood + log prior; L2 regularization = Gaussian prior; L1 = Laplace prior

★Logistic regression log-likelihood is globally concave → gradient ascent finds global optimum

★MLE is consistent, asymptotically efficient, asymptotically normal
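The first point, MSE = Gaussian MLE, can be checked numerically. With a fixed noise scale σ, the Gaussian negative log-likelihood equals a constant plus n·MSE/(2σ²), so both criteria rank candidate predictions identically. The targets and candidate predictions below are made up for illustration.

```python
import math

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def gaussian_nll(ys, preds, sigma=1.0):
    """Negative log-likelihood of ys under N(pred, sigma^2) noise."""
    n = len(ys)
    sse = sum((y - p) ** 2 for y, p in zip(ys, preds))
    return 0.5 * n * math.log(2 * math.pi * sigma ** 2) + sse / (2 * sigma ** 2)

ys = [1.0, 2.0, 3.0, 4.0]
candidates = {  # hypothetical predictions from three models
    "a": [1.1, 1.9, 3.2, 3.8],
    "b": [0.5, 2.5, 2.0, 5.0],
    "c": [1.0, 2.0, 3.0, 4.5],
}

best_by_mse = min(candidates, key=lambda name: mse(ys, candidates[name]))
best_by_nll = min(candidates, key=lambda name: gaussian_nll(ys, candidates[name]))
print(best_by_mse, best_by_nll)  # a a
```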

Interview Questions

Tricky Questions

These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.