In Plain English
MLE is a method for fitting a model to data. Given a family of probability distributions parameterized by θ, MLE finds the specific θ that assigns the highest probability to the data you actually observed. If you're fitting a Gaussian and you observed data clustered around 5, MLE will tell you to set the mean to 5 — because that's the parameter that makes the observed data most probable.
Why It Exists
We always need to fit models to data. The question is: what criterion should we use to decide which parameters are 'best'? MLE gives a principled probabilistic answer: the best parameters are those under which the observed data is most likely. This framework is elegant, general, and connects naturally to information theory and Bayesian inference. It's also what most ML loss functions secretly are.
Problem It Solves
Given n observed data points x₁, ..., xₙ assumed to be drawn i.i.d. from a distribution p(x|θ), find the parameter vector θ̂ that maximizes the probability of the observed data: θ̂_MLE = argmax_θ ∏ᵢ p(xᵢ|θ).
Real-Life Analogy
"Imagine you're a detective who finds a coin that came out heads 7 times in 10 flips. You want to estimate how biased the coin is. MLE says: try every possible bias p ∈ [0, 1] and calculate the probability of observing exactly 7 heads. The value of p that makes this probability highest is your estimate. Unsurprisingly, that's p = 0.7 — because no other value of p would make '7 heads in 10 flips' more likely than p = 0.7 does."
When To Use
- You have a parametric model family and want to fit its parameters to data
- Your data is plentiful — MLE is asymptotically optimal with large datasets
- You want to derive or understand what a loss function actually represents
- Building probabilistic models that output calibrated probabilities
- Connecting model training to information-theoretic principles like KL divergence
When NOT To Use
- Your dataset is very small — MLE has no built-in regularization and will overfit
- Your model family is misspecified and doesn't match the true data-generating process
- You have strong prior knowledge about plausible parameter values — use MAP or full Bayes instead
- You need uncertainty estimates over θ itself — MLE gives a point estimate, not a posterior distribution
- Class imbalance is severe — MLE will skew toward the majority class without correction
The fundamental idea of MLE is a reversal of perspective. Normally, if you know the parameters of a distribution, you can compute probabilities. MLE asks the inverse question: given data that you already observed, which parameters would have made this data most likely to appear? You treat the data as fixed and the parameters as the unknown, then search over parameter space to maximize the probability of what you saw.
The likelihood function L(θ) = ∏ᵢ p(xᵢ|θ) is the joint probability of all data points, viewed as a function of θ rather than a function of x. When data points are independent, the joint probability is the product of individual probabilities. This product can get astronomically small for large datasets, so we immediately convert to log-likelihood: ℓ(θ) = Σᵢ log p(xᵢ|θ). Because log is monotonically increasing, maximizing ℓ(θ) finds exactly the same θ as maximizing L(θ).
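The underflow problem is easy to reproduce. The following is a minimal sketch, assuming 2000 observations that each have probability 0.5 under the model (illustrative numbers only, not taken from any dataset in this page):

```python
import math

# 2000 observations, each with probability 0.5 under the model
# (hypothetical values; any probabilities below 1 behave similarly).
probs = [0.5] * 2000

# Raw likelihood: the running product drops below the smallest float64 and becomes 0.0.
likelihood = 1.0
for p in probs:
    likelihood *= p

# Log-likelihood: a sum of moderate-sized negative numbers, no underflow.
log_likelihood = sum(math.log(p) for p in probs)

print(f"raw likelihood: {likelihood}")
print(f"log-likelihood: {log_likelihood:.2f}")
```

The raw product carries no usable information once it hits zero, while the log-likelihood can still be compared across candidate values of θ.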
The connection to ML loss functions is not a coincidence — it is by design. Mean squared error emerges from assuming Gaussian noise in your targets. Cross-entropy loss emerges from assuming Bernoulli (binary) or Categorical (multi-class) output distributions. When you minimize a loss function in training, you are almost certainly maximizing a log-likelihood under some distributional assumption about your data. Understanding this connection lets you reason about what assumptions your model is implicitly making and whether they are appropriate.
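The Gaussian case can be written out in a few lines. This is a sketch under stated assumptions: the targets follow y_i = f_w(x_i) + ε_i with ε_i ~ N(0, σ²), σ is treated as fixed, and f_w is a placeholder name for whatever model produces the prediction.

```latex
\begin{aligned}
-\log p(y_i \mid x_i, w)
  &= \tfrac{1}{2}\log(2\pi\sigma^2) + \frac{\bigl(y_i - f_w(x_i)\bigr)^2}{2\sigma^2}, \\
-\sum_{i=1}^{n} \log p(y_i \mid x_i, w)
  &= \frac{n}{2}\log(2\pi\sigma^2)
   + \frac{1}{2\sigma^2}\sum_{i=1}^{n} \bigl(y_i - f_w(x_i)\bigr)^2 .
\end{aligned}
```

The first term does not depend on w, so minimizing the negative log-likelihood over w is the same as minimizing the sum of squared errors, which is n times the MSE. The Bernoulli case works the same way: with predicted probability ŷ_i, the per-example negative log-likelihood is −[y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)], which is the binary cross-entropy.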
MLE has powerful asymptotic guarantees. Under standard regularity conditions, as n → ∞ the MLE converges to the true parameter (consistency), its variance approaches the Cramér-Rao lower bound, the smallest variance any unbiased estimator can have (efficiency), and its distribution becomes approximately Gaussian around the true value (asymptotic normality). These properties make MLE the gold standard for large-data estimation. With small data, however, MLE can overfit dramatically: a coin that lands heads 3 out of 3 times gets the MLE estimate p̂ = 1.0, which is almost certainly wrong.
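These guarantees can be checked empirically. The simulation below is a rough sketch, assuming a Bernoulli model with true p = 0.3, samples of size n = 200, and 2000 repeated experiments (all three values are arbitrary choices for illustration). It compares the spread of the MLE across experiments with the Cramér-Rao bound p(1 − p)/n.

```python
import random
import statistics

random.seed(0)
true_p, n, trials = 0.3, 200, 2000

estimates = []
for _ in range(trials):
    heads = sum(1 for _ in range(n) if random.random() < true_p)
    estimates.append(heads / n)           # Bernoulli MLE: observed frequency

empirical_mean = statistics.fmean(estimates)
empirical_var = statistics.pvariance(estimates)
cramer_rao = true_p * (1 - true_p) / n    # 1 / (n * I(p)) for a Bernoulli

print(f"mean of p-hat:     {empirical_mean:.4f} (true p = {true_p})")
print(f"variance of p-hat: {empirical_var:.6f}")
print(f"Cramér-Rao bound:  {cramer_rao:.6f}")
```

The estimates should centre on the true p, with a variance close to the bound. A histogram of `estimates` would look approximately Gaussian, which is the asymptotic normality property.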
The Metaphor
"MLE is like tuning a radio. The radio has a tuning knob (the parameter θ) and you're hearing a noisy signal (your data). You turn the knob across all stations and listen to which station setting makes the signal you're receiving most coherent — most likely to have produced those exact audio samples. You stop at the station that best explains what you're hearing. You're not asking what station is playing; you're asking which station setting best accounts for the sound you already heard."
Beginner Mental Model
Pick a distribution family. Imagine running the data-generating process under many different parameter values. For each θ, ask: 'If this were the true θ, how probable would my observed dataset be?' Compute that probability for every candidate θ. The winner is θ̂_MLE. In practice, you take the log, write out the sum, differentiate, set to zero, and solve — which for many distributions gives a clean closed-form formula.
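The "compute it for every candidate θ" recipe can be sketched as a plain grid search. This example assumes the coin from the analogy (7 heads in 10 flips) and a grid of 999 candidate biases; the helper name `coin_log_likelihood` is illustrative.

```python
import math

def coin_log_likelihood(p, heads, flips):
    """Log-probability of one sequence with `heads` heads in `flips` flips.

    The binomial coefficient is omitted because it does not depend on p
    and therefore does not change which p wins.
    """
    return heads * math.log(p) + (flips - heads) * math.log(1 - p)

heads, flips = 7, 10

# Candidate biases strictly inside (0, 1) so that log() stays finite.
grid = [i / 1000 for i in range(1, 1000)]
best_p = max(grid, key=lambda p: coin_log_likelihood(p, heads, flips))

print(f"Grid-search MLE: p = {best_p:.3f}")
print(f"Closed form:     p = {heads / flips:.3f}")
```

Grid search only scales to one or two parameters, which is why the calculus route (differentiate, set to zero, solve) or gradient ascent is used in practice.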
Formal Definition
Given n i.i.d. observations x₁, ..., xₙ from a parametric family p(x|θ), the MLE is θ̂ = argmax_θ L(θ) = argmax_θ ∏ᵢ₌₁ⁿ p(xᵢ|θ) = argmax_θ Σᵢ₌₁ⁿ log p(xᵢ|θ). Equivalently, θ̂ = argmin_θ −(1/n) Σᵢ log p(xᵢ|θ), connecting to empirical risk minimization. The score function ∇_θ log p(xᵢ|θ) and Fisher information I(θ) = E[(∇_θ log p(x|θ))²] characterize the curvature of the likelihood.
Key Terms
- Likelihood function L(θ)
- L(θ) = ∏ᵢ p(xᵢ|θ). The joint probability of the observed data, treated as a function of the parameters θ with the data held fixed. It is NOT a probability distribution over θ — it doesn't integrate to 1 over θ space. It is a function that measures how compatible θ is with the observed data.
- Log-likelihood ℓ(θ)
- ℓ(θ) = log L(θ) = Σᵢ log p(xᵢ|θ). The natural log of the likelihood. Converts the product of probabilities (which underflows to zero for large n) into a sum of log-probabilities. Monotonic transformation preserves the argmax. Almost always easier to differentiate analytically.
- Score function
- s(θ) = ∇_θ log p(x|θ). The gradient of the log-likelihood with respect to parameters. At the MLE, the sum of scores equals zero: Σᵢ s(xᵢ; θ̂) = 0. This is the first-order optimality condition. Its expectation under the true distribution is zero: E[s(θ)] = 0.
- Fisher Information I(θ)
- I(θ) = E[(∇_θ log p(x|θ))²] = −E[∇²_θ log p(x|θ)]. Measures how much information a single observation carries about θ. Also equals the expected curvature of the log-likelihood. The Cramér-Rao bound states Var(θ̂) ≥ 1/(n·I(θ)), and MLE achieves this bound asymptotically.
- Sufficient statistic
- A function T(x) of the data is sufficient for θ if the likelihood factors as L(θ) = g(T(x), θ)·h(x) — the dependence on θ comes only through T(x). Knowing T(x) captures everything the data says about θ. For Gaussian data, (x̄, s²) are sufficient for (μ, σ²). For Bernoulli, Σxᵢ is sufficient for p.
- MLE vs MAP
- Maximum A Posteriori (MAP) estimation adds a prior: θ̂_MAP = argmax_θ [log p(θ) + Σᵢ log p(xᵢ|θ)]. MLE is MAP with a flat (uniform) prior. A Gaussian prior on θ corresponds to L2 regularization. A Laplace prior corresponds to L1 regularization. MAP shrinks estimates toward the prior mode (which equals the prior mean for Gaussian and Laplace priors).
- Asymptotic normality
- For large n, the MLE is approximately Gaussian: √n(θ̂ − θ*) →_d N(0, I(θ*)⁻¹), where θ* is the true parameter. This result enables confidence intervals and hypothesis tests based on MLE estimates. The approximation improves as n grows.
- Consistency
- θ̂_MLE → θ* in probability as n → ∞ (under regularity conditions). The estimator converges to the true parameter as data increases. This is a minimal sanity property — an estimator that doesn't converge to the truth with infinite data is useless.
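The score and Fisher information definitions above can be verified by hand for a Bernoulli model. The sketch below assumes p = 0.3 and computes the expectations exactly by summing over the two outcomes; the function names are illustrative.

```python
def score(x, p):
    """d/dp log p(x|p) for a Bernoulli outcome x in {0, 1}."""
    return x / p - (1 - x) / (1 - p)

def score_derivative(x, p):
    """d^2/dp^2 log p(x|p)."""
    return -x / p**2 - (1 - x) / (1 - p) ** 2

def expectation(f, p):
    """Exact expectation of f(x, p) over x ~ Bernoulli(p)."""
    return p * f(1, p) + (1 - p) * f(0, p)

p = 0.3
mean_score = expectation(score, p)                            # E[s] = 0
fisher_sq = expectation(lambda x, q: score(x, q) ** 2, p)     # E[s^2]
fisher_curv = -expectation(score_derivative, p)               # -E[s']
closed_form = 1 / (p * (1 - p))                               # known result for Bernoulli

print(mean_score, fisher_sq, fisher_curv, closed_form)
```

Both forms of the Fisher information agree with 1/(p(1 − p)), so the Cramér-Rao bound for n coin flips is p(1 − p)/n.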
Step-by-Step Working
- Step 1 — Specify a model: Choose a parametric family for your data, e.g., p(xᵢ|θ). This is the distributional assumption (Gaussian, Bernoulli, Poisson, etc.).
- Step 2 — Write the likelihood: Under i.i.d. assumption, L(θ) = ∏ᵢ₌₁ⁿ p(xᵢ|θ).
- Step 3 — Take the log: ℓ(θ) = Σᵢ log p(xᵢ|θ). Products become sums, making differentiation tractable.
- Step 4 — Differentiate: Compute ∂ℓ/∂θ. Set each component of the gradient to zero: ∇_θ ℓ(θ) = 0.
- Step 5 — Solve: If a closed-form solution exists (Gaussian, Bernoulli, Poisson), solve the resulting equations. If not (logistic regression, neural networks), use gradient ascent on ℓ(θ).
- Step 6 — Verify it's a maximum: Check that the second derivative (Hessian) is negative definite at the solution, confirming a maximum rather than a saddle point or minimum.
- Step 7 — Report and interpret: θ̂_MLE is your parameter estimate. For large n, standard errors can be computed as √(I(θ̂)⁻¹/n).
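As a worked instance of the seven steps, here is a sketch for a Poisson model with rate λ. The count data are hypothetical, and the derivation in the comments follows the steps listed above.

```python
import math

counts = [3, 1, 4, 2, 0, 5, 2, 3]   # hypothetical count data (Step 1: assume Poisson)

def log_likelihood(lam, xs):
    # Steps 2-3: log p(x|lam) = x*log(lam) - lam - log(x!), summed over the data.
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in xs)

n = len(counts)

# Steps 4-5: d(ell)/d(lam) = sum(x)/lam - n = 0  =>  lam_hat = mean(x)
lam_hat = sum(counts) / n

# Step 6: the second derivative -sum(x)/lam^2 is negative, so this is a maximum.
second_derivative = -sum(counts) / lam_hat**2

# Step 7: Fisher information per observation is 1/lam, so SE = sqrt(lam_hat / n).
std_error = math.sqrt(lam_hat / n)

print(f"lam_hat = {lam_hat:.3f}, standard error = {std_error:.3f}")
print(f"log-likelihood at lam_hat: {log_likelihood(lam_hat, counts):.3f}")
```

Evaluating `log_likelihood` slightly to either side of `lam_hat` gives lower values, which is a quick numerical sanity check on the closed form.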
Inputs
A dataset of n observations {x₁, ..., xₙ}, and a choice of parametric probability distribution family p(x|θ) with parameter(s) θ.
Outputs
θ̂_MLE: the parameter estimate(s) that maximize the log-likelihood. For Gaussian: (μ̂, σ̂²). For Bernoulli: p̂. For logistic regression: the weight vector ŵ.
Model Assumptions
Important Edge Cases
- Small n: MLE overfits. A coin with 3/3 heads gets p̂ = 1.0, ruling out tails forever.
- Unbounded likelihood: for Gaussian mixture models, likelihood → ∞ as a component's variance → 0 and a single point sits at its mean. The MLE is degenerate.
- Non-identifiability: multiple θ values give the same distribution. The MLE is not unique (e.g., overparameterized neural networks).
- Misspecified model: MLE still converges, but to the θ that minimizes KL divergence from the true distribution to the model family, not to any 'true θ' per se.
- Numerical underflow: computing ∏ᵢ p(xᵢ|θ) directly underflows to 0 in floating point once n is large enough. Always work in log space.
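The unbounded-likelihood case is worth seeing numerically. The sketch below assumes an equal-weight two-component Gaussian mixture and five hypothetical observations. One component is centred exactly on the first data point and its standard deviation is shrunk toward zero.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_log_likelihood(xs, mu1, sigma1, mu2, sigma2):
    """Log-likelihood of an equal-weight two-component Gaussian mixture."""
    return sum(
        math.log(0.5 * normal_pdf(x, mu1, sigma1) + 0.5 * normal_pdf(x, mu2, sigma2))
        for x in xs
    )

data = [4.8, 5.2, 3.9, 6.1, 5.0]   # hypothetical observations

# Component 1 sits on data[0]; component 2 stays broad and covers the rest.
lls = []
for sigma1 in [1.0, 1e-2, 1e-4, 1e-8]:
    ll = mixture_log_likelihood(data, mu1=data[0], sigma1=sigma1, mu2=5.0, sigma2=1.0)
    lls.append(ll)
    print(f"sigma1 = {sigma1:g}: log-likelihood = {ll:.2f}")
```

The log-likelihood keeps growing as `sigma1` shrinks, so there is no finite maximizer. Practical implementations avoid this with a variance floor or a prior on the variances.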
Role in the ML Pipeline
MLE is the theoretical foundation of the training step. When you define a loss function and run gradient descent, you are almost always performing MLE (or MAP). Understanding MLE lets you design custom loss functions, diagnose training failures, and know what distributional assumptions your model makes.
Data Preprocessing
1. Ensure data is representative of the target distribution. MLE fits the distribution of the training data, so biased training data yields biased estimates
2. Handle missing data before computing likelihoods, as missing values require special treatment (EM algorithm or imputation)
3. For continuous features, check whether the chosen parametric family (e.g., Gaussian) is approximately correct using histograms or QQ-plots
4. Normalize features when using iterative MLE (gradient ascent) to ensure stable convergence
5. For classification, check class balance: severe imbalance will push MLE estimates toward predicting the majority class
Training Process
1. For closed-form MLE (Gaussian, Bernoulli, Poisson): compute sufficient statistics from data and plug into the formula
2. For iterative MLE (logistic regression, neural networks): initialize parameters, compute gradient ∇_θ ℓ(θ), update θ ← θ + α·∇_θ ℓ(θ), repeat until convergence
3. Monitor log-likelihood during training. It should increase monotonically (for exact gradient ascent with a sufficiently small step size α) or on average (for stochastic methods)
4. Check convergence: gradient norm ‖∇ℓ‖ < ε, or relative change in ℓ < ε
5. For well-specified models with large n, the MLE should be very close to the true parameters
Implementation Checklist
1. Choose and justify your distributional assumption
2. Write out log p(xᵢ|θ) symbolically for one data point
3. Sum over i to get ℓ(θ)
4. Differentiate ∂ℓ/∂θ and set to zero
5. Solve analytically OR implement gradient ascent
6. Validate: plug θ̂ back in, check ℓ(θ̂) is indeed high; verify on held-out data
```python
import numpy as np
from scipy import stats

# ─── 1. MLE for Gaussian ─────────────────────────────────────────────────────
np.random.seed(42)
true_mu, true_sigma = 5.0, 2.0
data = np.random.normal(true_mu, true_sigma, size=200)

# Closed-form MLE
mu_hat = np.mean(data)                       # MLE mean
sigma2_hat = np.mean((data - mu_hat) ** 2)   # MLE variance (biased, divides by n)
sigma_hat = np.sqrt(sigma2_hat)

print("=== Gaussian MLE ===")
print(f"True μ: {true_mu:.3f} | MLE μ̂: {mu_hat:.3f}")
print(f"True σ: {true_sigma:.3f} | MLE σ̂: {sigma_hat:.3f}")

# Compare with scipy MLE (identical)
mu_scipy, sigma_scipy = stats.norm.fit(data)
print(f"scipy MLE μ̂: {mu_scipy:.3f}, σ̂: {sigma_scipy:.3f}")

# ─── 2. Log-likelihood as a function of μ ────────────────────────────────────
mu_grid = np.linspace(3, 7, 300)
log_likelihoods = [np.sum(stats.norm.logpdf(data, loc=m, scale=sigma_hat))
                   for m in mu_grid]

# The peak of this curve is the MLE (up to grid resolution)
peak_mu = mu_grid[np.argmax(log_likelihoods)]
print(f"\nPeak of log-likelihood curve at μ = {peak_mu:.3f}")

# ─── 3. MLE for Bernoulli ─────────────────────────────────────────────────────
true_p = 0.7
flips = np.random.binomial(1, true_p, size=50)

p_hat = np.mean(flips)  # MLE = observed frequency of 1s
print("\n=== Bernoulli MLE ===")
print(f"True p: {true_p:.2f} | MLE p̂: {p_hat:.3f}")

# Evaluate the log-likelihood on a grid of p values
p_grid = np.linspace(0.01, 0.99, 300)
n_heads = flips.sum()
n_total = len(flips)
log_lik_bern = n_heads * np.log(p_grid) + (n_total - n_heads) * np.log(1 - p_grid)
print(f"Log-likelihood maximized at p = {p_grid[np.argmax(log_lik_bern)]:.3f}")

# ─── 4. MLE for Logistic Regression (gradient ascent) ─────────────────────────
def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def log_likelihood(w, X, y):
    """Bernoulli log-likelihood under logistic model."""
    p = sigmoid(X @ w)
    # Clamp to avoid log(0)
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(w, X, y):
    """∇_w ℓ = Xᵀ(y − σ(Xw))"""
    p = sigmoid(X @ w)
    return X.T @ (y - p)

# Generate synthetic binary classification data
np.random.seed(0)
n_samples = 300
X_raw = np.random.randn(n_samples, 2)
true_w = np.array([2.0, -1.5])
# Sample labels from Bernoulli(σ(Xw)). Thresholding at 0.5 instead would make the
# data perfectly separable, and the MLE weights would then diverge to infinity.
y = np.random.binomial(1, sigmoid(X_raw @ true_w)).astype(float)

# Add bias column
X = np.column_stack([np.ones(n_samples), X_raw])
true_w_full = np.array([0.0, 2.0, -1.5])

# Gradient ascent (maximizing log-likelihood)
w = np.zeros(3)
lr = 0.01
history = []

for iteration in range(500):
    ll = log_likelihood(w, X, y)
    history.append(ll)
    w += lr * gradient(w, X, y)

print("\n=== Logistic Regression MLE ===")
print(f"Final log-likelihood: {history[-1]:.3f}")
print(f"MLE weights: {w}")
print(f"True weights: {true_w_full}")

# ─── 5. Connection: MSE = Gaussian NLL ────────────────────────────────────────
# For regression with Gaussian noise y = wᵀx + ε, ε ~ N(0, σ²):
# NLL = const + (1 / (2σ²)) · Σ(yᵢ - wᵀxᵢ)², and Σ(yᵢ - wᵀxᵢ)² = n × MSE
# Minimizing MSE is exactly MLE under Gaussian noise assumption

X_reg = np.random.randn(100, 1)
y_reg = 3 * X_reg.squeeze() + np.random.randn(100)

# MLE via least squares
X_aug = np.column_stack([np.ones(100), X_reg])
w_mle = np.linalg.lstsq(X_aug, y_reg, rcond=None)[0]
print("\n=== Linear Regression MLE (= Least Squares) ===")
print(f"MLE weights: intercept={w_mle[0]:.3f}, slope={w_mle[1]:.3f} (true slope=3.0)")

# ─── 6. MLE vs MAP with Gaussian Prior (= L2 regularization) ──────────────────
def map_estimate(X, y, lam):
    """MAP with Gaussian prior: (XᵀX + λI)⁻¹ Xᵀy"""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_feat), X.T @ y)

w_mle_lr = map_estimate(X_aug, y_reg, lam=0.0)   # λ=0 → MLE
w_map_lr = map_estimate(X_aug, y_reg, lam=10.0)  # λ>0 → MAP (ridge)
print(f"MLE: {w_mle_lr}")
print(f"MAP: {w_map_lr} (shrunk toward zero by prior)")
```
Sample Input
data = [4.8, 5.2, 3.9, 6.1, 5.0, 4.5, 5.7, 4.3, 5.5, 5.1] # n=10 observations
Sample Output
Gaussian MLE: μ̂ = 5.01, σ̂ = 0.632 (MLE, divides by n=10)
Sample std: 0.666 (divides by n-1=9)
Log-likelihood at MLE: -9.59
Log-likelihood at μ=4.0 (σ fixed at σ̂): -22.38 (worse)
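The quantities reported for this sample can be recomputed with the standard library alone. This is a small sketch; the helper name `gaussian_log_likelihood` is illustrative, and the second log-likelihood keeps σ fixed at the MLE value while moving μ to 4.0.

```python
import math

data = [4.8, 5.2, 3.9, 6.1, 5.0, 4.5, 5.7, 4.3, 5.5, 5.1]
n = len(data)

mu_hat = sum(data) / n
ss = sum((x - mu_hat) ** 2 for x in data)   # sum of squared deviations
sigma_mle = math.sqrt(ss / n)               # MLE: divides by n
sigma_sample = math.sqrt(ss / (n - 1))      # sample std: divides by n - 1

def gaussian_log_likelihood(xs, mu, sigma):
    return sum(
        -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
        for x in xs
    )

ll_at_mle = gaussian_log_likelihood(data, mu_hat, sigma_mle)
ll_at_4 = gaussian_log_likelihood(data, 4.0, sigma_mle)

print(f"mu_hat = {mu_hat:.2f}, sigma_mle = {sigma_mle:.3f}, sample std = {sigma_sample:.3f}")
print(f"log-likelihood at MLE: {ll_at_mle:.2f}")
print(f"log-likelihood at mu = 4.0: {ll_at_4:.2f}")
```

Moving μ away from the sample mean always lowers the log-likelihood, which is the point of the comparison.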
Key Implementation Insights
- MSE and cross-entropy are not arbitrary choices: they follow from assuming Gaussian and Bernoulli/Categorical noise respectively. Choosing a different noise model gives a different loss function.
- Always work in log-likelihood space. Multiplying raw probabilities across a few hundred to a few thousand data points can underflow to exactly 0 in floating point.
- For logistic regression, the log-likelihood is globally concave (the Hessian is negative semi-definite), so any stationary point that gradient ascent reaches is a global maximum. With perfectly separable data no finite maximizer exists and the weights grow without bound.
- scikit-learn's LogisticRegression defaults to C=1 (L2 regularization), which is MAP, not pure MLE. Set C=1e9 to approximate MLE.
- scipy.stats distribution .fit() methods perform MLE; they're a fast way to fit standard parametric distributions.
- The MLE for the Gaussian variance divides by n, not n−1. The n−1 version (unbiased sample variance) is NOT the MLE; knowing this difference matters in interviews.
Common Implementation Mistakes
- Computing the product of probabilities directly for large n; always use log-likelihood (sum of log-probabilities)
- Confusing MLE variance (1/n) with unbiased sample variance (1/(n-1)); they are different, and MLE divides by n
- Assuming MLE always has a closed form; logistic regression, neural networks, and many other models require iterative optimization
- Forgetting that sklearn's LogisticRegression is regularized by default (C=1); you must set C very large to get pure MLE
- Treating the likelihood function as a probability distribution over θ; it is not, and it doesn't integrate to 1 over θ
- Applying MLE to tiny datasets without regularization; p̂ = 1.0 for a coin that landed heads 3/3 times is the MLE but a terrible estimate
Large tabular datasets
MLE is optimal for large samples. Asymptotic efficiency guarantees near-optimal estimates.
Small datasets (n < 100)
MLE overfits without regularization. A coin that lands heads 3/3 gets p̂=1.0.
Text / NLP
Cross-entropy loss (= MLE under Categorical) is the standard training objective for language models.
Imbalanced classification
MLE estimates reflect training class frequencies. On 99/1 imbalanced data, MLE skews strongly toward the majority class.
Continuous features (Gaussian model)
MLE for Gaussian gives exact closed-form estimators (mean and variance) in O(n) time.
Misspecified model
MLE still converges, but to the parameter minimizing KL divergence from truth to model, not the 'true' parameter.
Log-Likelihood Surface for Gaussian Parameters
A 2D contour plot of ℓ(μ, σ²) over a grid of parameter values. The peak of the surface is at (x̄, σ̂²_MLE). The surface has a single peak, so this stationary point is the global maximum (note that ℓ is not concave in σ² everywhere).
Likelihood vs Parameter: Bernoulli Coin
Plot of L(p) = p^k · (1−p)^(n−k) as a function of p ∈ (0,1), for k=7 heads in n=10 flips. The peak is at p=0.7, illustrating how MLE finds the most compatible parameter.
Log-Likelihood Increasing During Training (Logistic Regression)
Training curve showing log-likelihood vs gradient ascent iterations for logistic regression. Starts negative and large, increases and plateaus at the MLE.
MLE vs MAP: Effect of Prior Strength
Comparison of MLE and MAP estimates as a function of prior strength (λ). MLE is constant (no prior). MAP shrinks toward zero as λ increases. With large n, both converge to the truth.
Advantages
Principled and general
MLE works for any parametric distribution family — Gaussian, Bernoulli, Poisson, exponential, mixture models, etc. It provides a unified framework rather than an ad-hoc fitting rule.
Asymptotically optimal
For large datasets, MLE achieves the Cramér-Rao lower bound: no unbiased estimator can have lower variance. You cannot do better asymptotically.
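Closed form for many exponential families
For Gaussian, Bernoulli, Poisson, Exponential, and many other exponential family distributions, MLE yields exact formulas expressible as simple functions of sufficient statistics, computable in O(n). Some members (for example, the shape parameter of a Gamma distribution) still require numerical optimization.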
Reveals the meaning of loss functions
Understanding MLE explains why MSE and cross-entropy are the right choices for regression and classification respectively. It removes the 'magic' from loss function selection.
Invariance property
If θ̂ is the MLE of θ, then g(θ̂) is the MLE of g(θ) for any function g. This makes transformations natural — you don't need to re-derive the MLE for each parameterization.
Foundation for model comparison
Log-likelihood enables likelihood ratio tests, AIC/BIC model selection criteria, and cross-validated predictive performance comparisons — all built on the MLE foundation.
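The invariance property listed above can be checked numerically. The sketch below assumes the 7-heads-in-10-flips coin and the transformation g(p) = p / (1 − p) (the odds). It compares g(p̂) with a direct grid maximization of the likelihood written in terms of the odds; the grid range and step are arbitrary choices.

```python
import math

heads, flips = 7, 10

def log_lik_from_odds(odds):
    """Coin log-likelihood reparameterized by odds = p / (1 - p)."""
    p = odds / (1 + odds)
    return heads * math.log(p) + (flips - heads) * math.log(1 - p)

p_hat = heads / flips                        # MLE of p
odds_via_invariance = p_hat / (1 - p_hat)    # g(p_hat)

# Direct maximization over the odds parameter on a grid from 0.001 to 10.
odds_grid = [i / 1000 for i in range(1, 10001)]
odds_direct = max(odds_grid, key=log_lik_from_odds)

print(f"g(p_hat)            = {odds_via_invariance:.4f}")
print(f"direct MLE of odds  = {odds_direct:.4f}")
```

The two values agree up to the grid resolution, so there is no need to re-derive the estimator for the new parameterization.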
Limitations
Overfits on small datasets
MLE has no built-in regularization. With small n, estimates can be extreme (p̂=1.0 for 3/3 heads). MAP (MLE + prior) or full Bayesian inference is preferable for small data.
Sensitive to model misspecification
If the true distribution is not in your assumed model family, MLE converges to the 'wrong' parameter (the KL-minimizing one). Fitting a Gaussian to heavy-tailed data gives misleading estimates.
No uncertainty quantification over parameters
MLE gives a point estimate θ̂, not a posterior distribution p(θ|data). You don't know how confident to be in the estimate. Confidence intervals require additional asymptotic approximations.
Degeneracies in mixture models
Gaussian mixture model likelihood is unbounded — it blows up as a component's variance → 0 with one point at its center. MLE is not well-defined without constraints.
Ignores class imbalance
Pure MLE reflects the training distribution. On 99/1 imbalanced data, predictions are dominated by the majority class. Requires explicit reweighting to correct.
Requires choosing a distributional family
You must commit to a parametric family upfront. If the family is wrong, MLE cannot correct for it. Non-parametric methods avoid this commitment at the cost of higher variance.
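The small-data limitation, and the MAP remedy mentioned above, can be illustrated with the 3/3 heads coin. This sketch assumes a Beta(2, 2) prior on p, which is a hypothetical choice expressing a mild belief that the coin is roughly fair. With a Beta(a, b) prior the posterior is Beta(heads + a, tails + b), and its mode is the MAP estimate.

```python
def coin_mle(heads, flips):
    return heads / flips

def coin_map(heads, flips, a=2.0, b=2.0):
    """Mode of the Beta(heads + a, flips - heads + b) posterior (assumes a, b > 1)."""
    return (heads + a - 1) / (flips + a + b - 2)

for heads, flips in [(3, 3), (30, 30), (700, 1000)]:
    print(f"{heads}/{flips} heads: MLE = {coin_mle(heads, flips):.3f}, "
          f"MAP = {coin_map(heads, flips):.3f}")
```

With 3 flips the prior pulls the estimate well away from 1.0. With 1000 flips the data dominate and MAP and MLE nearly coincide.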
Language model training
GPT, BERT, and most modern language models are pretrained by maximizing the likelihood of observed (or masked) tokens. The cross-entropy loss IS the negative log-likelihood under a Categorical distribution.
Risk model calibration
Fitting parametric distributions (log-normal, t-distribution) to asset return data using MLE to calibrate Value at Risk (VaR) and other risk metrics.
Survival analysis
MLE fits Weibull and exponential models to time-to-event data, and the Cox proportional hazards model is fit by maximizing a partial likelihood. Censoring is handled naturally through the likelihood formulation.
Generative models
Variational Autoencoders (VAEs) maximize a lower bound on log-likelihood (the ELBO). Normalizing flows maximize exact likelihood. Both are grounded in the maximum likelihood principle.
Signal parameter estimation
MLE estimates signal frequency, amplitude, and noise variance from received samples. The MLE is the standard estimator in radar, sonar, and communications systems.
Phylogenetic tree inference
Maximum likelihood phylogenetics (e.g., RAxML, IQ-TREE) fits evolutionary models to DNA sequence alignment data using MLE, producing the most likely evolutionary tree.
MLE is a point estimation method. It sits in a spectrum from pure data-driven (MLE) through prior-informed (MAP) to fully probabilistic (Bayesian). It also contrasts with non-parametric methods that avoid distributional assumptions entirely.
MAP (Maximum A Posteriori)
Similarity
Also maximizes an objective derived from Bayes theorem; uses the same data likelihood
Key Difference
Adds log p(θ) (log prior) to the objective. Equivalent to regularized MLE. Gives a point estimate like MLE but is pulled toward the prior.
Choose When
When you have prior knowledge or small data and want regularization without full Bayesian inference.
Full Bayesian Inference
Similarity
Uses the same likelihood function p(data|θ); MLE is the mode of the posterior with flat prior
Key Difference
Computes the full posterior p(θ|data) rather than a point estimate. Provides calibrated uncertainty. Requires specifying a prior and often intractable integrals (solved by MCMC or VI).
Choose When
When you need uncertainty estimates, have small data, or want to propagate uncertainty through predictions.
Method of Moments
Similarity
Also produces point estimates for parametric models
Key Difference
Matches sample moments (mean, variance) to theoretical moments rather than maximizing likelihood. Computationally simpler but asymptotically less efficient than MLE.
Choose When
Quick baseline estimates or when the likelihood is intractable but moments are easily computed.
Least Squares (OLS)
Similarity
Identical to MLE for linear regression under Gaussian noise — they produce the same estimates
Key Difference
OLS minimizes residual sum of squares without a probabilistic framing. MLE shows WHY minimizing squared errors is the right thing to do (Gaussian noise assumption).
Choose When
OLS is fine for linear regression; MLE framing is needed when you want probabilities or non-Gaussian noise.
| Criterion | MLE | MAP | Bayesian | Least Squares |
|---|---|---|---|---|
| Output | Point estimate θ̂ | Point estimate θ̂ | Full posterior p(θ \| data) | Point estimate θ̂ |
| Regularization | None | Yes (via prior) | Yes (via prior) | Optional (ridge/lasso) |
| Small data behavior | Overfits | Better (shrinks) | Best (uncertainty) | Overfits |
| Uncertainty quantification | Approximate (CI) | Approximate (CI) | Exact (posterior) | Approximate (CI) |
| Computational cost | Low-Medium | Low-Medium | High (MCMC/VI) | Low |
| Distributional assumption | Required | Required + prior | Required + prior | Implicitly Gaussian |
Choose Maximum Likelihood Estimation when:
You have abundant data, a well-specified parametric model, no strong prior knowledge, and need a computationally efficient point estimate with theoretical optimality guarantees.
Log-Likelihood
Higher (less negative) is better. Measures how well the fitted model explains the data it is evaluated on. Training log-likelihood can reflect overfitting, so evaluate on held-out data.
Target: Compare across models on the same dataset; there is no absolute scale.
AIC (Akaike Information Criterion)
AIC = 2k − 2ℓ(θ̂), where k = number of parameters. Model selection criterion balancing fit and complexity. Lower AIC is better. Discourages overfitting by penalizing extra parameters.
Target: Used comparatively. ΔAIC < 2 is negligible difference; ΔAIC > 10 is strong evidence for the better model.
BIC (Bayesian Information Criterion)
BIC = k·log(n) − 2ℓ(θ̂). Like AIC but with a stronger penalty per parameter (log(n) vs 2) once n ≥ 8. More conservative: prefers simpler models, especially for large n.
Target: Prefer BIC when n is large and you want sparser models; prefer AIC when prediction accuracy matters most.
Likelihood Ratio Test Statistic
LR = 2(ℓ_full − ℓ_null). Tests whether a full model fits significantly better than a restricted (null) model. Under the null, LR is asymptotically χ² distributed with df = difference in number of parameters. Large LR → reject the null model.
Target: p < 0.05 or p < 0.01 depending on application. Requires the null model to be nested in the full model.
Perplexity (NLP)
Perplexity = exp(−(1/N) Σᵢ log p(tokenᵢ)), the geometric mean of inverse per-token probability. Lower perplexity = model is less 'surprised' by data. Standard metric for language model evaluation.
Target: Depends heavily on vocabulary size, tokenization, and domain, so compare only models evaluated with the same tokenizer and test set. Lower is better.
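The metrics above are all simple functions of log-likelihoods. The helpers below are a minimal sketch using only the standard library. The function names and the example log-likelihood values are hypothetical, and the p-value helper covers only the df = 1 case, where the χ² survival function reduces to erfc(√(LR/2)).

```python
import math

def aic(log_lik, k):
    """Akaike Information Criterion: 2k - 2*ell."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian Information Criterion: k*log(n) - 2*ell."""
    return k * math.log(n) - 2 * log_lik

def lr_test_df1(log_lik_null, log_lik_full):
    """LR statistic and p-value for nested models differing by one parameter."""
    stat = 2 * (log_lik_full - log_lik_null)
    p_value = math.erfc(math.sqrt(stat / 2))   # chi-squared survival function, df = 1
    return stat, p_value

def perplexity(token_log_probs):
    """exp of the average negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical maximized log-likelihoods for two nested models on n = 50 points.
ll_null, ll_full, n = -105.0, -102.0, 50
stat, p_value = lr_test_df1(ll_null, ll_full)

print(f"AIC null = {aic(ll_null, 1):.1f}, AIC full = {aic(ll_full, 2):.1f}")
print(f"BIC null = {bic(ll_null, 1, n):.1f}, BIC full = {bic(ll_full, 2, n):.1f}")
print(f"LR = {stat:.2f}, p = {p_value:.4f}")
print(f"perplexity = {perplexity([math.log(0.25)] * 4):.2f}")
```

For more than one degree of freedom a full χ² survival function is needed (for example from a statistics library), since the erfc shortcut no longer applies.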
Evaluation Process
1. Fit the MLE parameters on training data
2. Evaluate log-likelihood on a held-out test set (not training data!), because training log-likelihood is optimistically biased as an estimate of generalization
3. Compute AIC/BIC if comparing multiple model families of different complexity
4. Perform likelihood ratio tests for nested model comparisons
5. Check residuals or posterior predictive samples to validate the distributional assumption
6. For classification models (logistic regression), additionally evaluate accuracy, AUC-ROC, and calibration (reliability diagrams)
Evaluation Traps
- Never compare total log-likelihoods across different dataset sizes; they are not comparable because log-likelihood scales with n (compare per-observation averages instead)
- Evaluating on training data gives optimistically biased log-likelihood; always use held-out data or cross-validation
- A higher log-likelihood doesn't mean the model assumption is correct; a misspecified model can have high likelihood by chance
- Likelihood ratio tests are only valid for nested models; the LR statistic does not have a chi-squared distribution for non-nested models
Real-World Interpretation Example
You fit a Gaussian model to 1000 data points and get a maximized log-likelihood ℓ(θ̂) = −2103. A competing Student-t model gives ℓ = −2085. The t-model has 1 extra parameter (degrees of freedom), so k = 2 for the Gaussian and k = 3 for the t-model. AIC comparison: Gaussian AIC = 2·2 + 4206 = 4210, t-model AIC = 2·3 + 4170 = 4176. The t-model wins by ΔAIC = 34, overwhelming evidence for the heavier-tailed model. Practical interpretation: your data has more extreme outliers than a Gaussian predicts.
Students
- Computing ∏ p(xᵢ|θ) numerically for large n and getting 0 due to underflow; always sum log-probabilities instead
- Thinking the likelihood L(θ) is a probability distribution over θ; it is not (it doesn't integrate to 1 over θ)
- Confusing MLE variance (divides by n) with unbiased sample variance (divides by n-1); they are different estimators with different properties
- Assuming MLE always has a closed form; most interesting models (logistic regression, neural networks) require iterative optimization
- Forgetting to check whether the stationary point of ℓ(θ) is a maximum, not a minimum or saddle point
Developers
- Using sklearn's LogisticRegression without realizing it defaults to C=1 (L2-regularized MAP), not pure MLE
- Not working in log-space when implementing likelihood functions; numerical underflow kills probabilities silently
- Failing to guard against log(0) when implementing gradient ascent on the likelihood; clamp probabilities away from 0 and 1, or use numerically stable log-sigmoid/log-softmax formulations
- Applying MLE without considering whether the distributional assumption fits the data (e.g., using Gaussian when data is heavily skewed)
- Not checking for class imbalance before MLE training; the loss will be dominated by the majority class
In Interviews
- Claiming cross-entropy 'comes from information theory' without connecting it to Bernoulli/Categorical MLE; both explanations are valid, but interviewers often want the probabilistic derivation
- Saying MLE variance is unbiased; it's not (it divides by n, not n-1). The biased nature is a common interview trap
- Not being able to state what distributional assumption underlies MSE (Gaussian noise) or cross-entropy (Bernoulli/Categorical)
- Confusing consistency (convergence to true value) with unbiasedness (zero expected error for any n); the MLE variance estimate is consistent but biased
- Being unable to explain why we take the log: the product underflows AND differentiating sums is much easier than differentiating products
Real Projects
- Ignoring model misspecification: fitting a Gaussian to log-normal data and reporting mean/variance as if they're meaningful
- Using pure MLE (C=1e9 in sklearn) on small or noisy datasets without regularization; overfitting can be severe
- Not validating distributional assumptions with goodness-of-fit tests or visual checks before using MLE estimates for downstream decisions
- Forgetting that likelihood ratio tests require NESTED models; comparing Gaussian vs Gamma with LRT gives invalid p-values
- Treating MLE point estimates as certain in downstream computations without propagating estimation uncertainty
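What kind of bias does this model have?
MLE can be biased in finite samples (the Gaussian variance estimate divides by n, not n−1), but the bias vanishes as n grows. A misspecified model family adds bias that more data does not remove.
What kind of variance does it have?
For large n the variance is approximately I(θ)⁻¹/n, the Cramér-Rao bound. With small n or a flat likelihood, estimates vary widely from sample to sample.
How does it overfit?
With little data the estimate matches the sample too literally (p̂ = 1.0 after 3/3 heads), and training log-likelihood is noticeably higher than held-out log-likelihood.
How do we regularize it?
Add a prior and use MAP (Gaussian prior → L2, Laplace prior → L1), or move to full Bayesian inference.
What kind of data does it like?
Plentiful, i.i.d., representative data that is well described by the assumed parametric family.
What kind of data breaks it?
Tiny samples, data from outside the assumed family (e.g., heavy tails under a Gaussian model), severe class imbalance, and configurations with unbounded likelihood such as mixture components collapsing onto a single point.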
Quick Revision Reference
Key Takeaways
- MLE finds θ that maximizes the probability of observed data: θ̂ = argmax_θ Σᵢ log p(xᵢ|θ)
- Always work in log-space: products → sums, prevents underflow, simplifies differentiation
- Gaussian MLE: μ̂ = x̄, σ̂² = (1/n)Σ(xᵢ-x̄)² — note 1/n not 1/(n-1), so σ̂² is biased
- Bernoulli MLE: p̂ = (number of 1s) / n — the observed frequency
- MSE = negative log-likelihood under Gaussian noise; cross-entropy = under Bernoulli/Categorical
- MLE has no closed form for logistic regression — gradient ascent on the concave log-likelihood
- MAP = MLE + log prior. Gaussian prior → L2 regularization. Laplace prior → L1 regularization
- Asymptotic properties: consistent (θ̂ → θ*), efficient (achieves Cramér-Rao bound), asymptotically normal
- Likelihood is a function of θ, NOT a probability distribution over θ
- Small data: MLE overfits. Large data: MLE is optimal. Always regularize with MAP for small n
Critical Formulas
Best For
- Large datasets where asymptotic efficiency matters
- Deriving and understanding loss functions from first principles
- Fitting standard parametric distributions (Gaussian, Bernoulli, Poisson)
- Any model where a distributional assumption is justified by domain knowledge
- Foundation for gradient-based training of probabilistic models
Avoid When
- Small datasets without regularization
- Model family is clearly misspecified
- You need full uncertainty over parameters (use Bayesian inference)
- Severe class imbalance without reweighting
Interview Must-Know
These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.