ML Atlas

Evaluation Metrics

Choose the wrong metric and your model is optimizing the wrong problem.

Beginner · Evaluation
30 min read
Understanding of classification and regression tasks; basic probability and statistics (distributions, expected value); familiarity with confusion matrices
  • Medical diagnosis models optimized for recall — missing cancer is worse than a false alarm
  • Spam filters using precision to avoid blocking legitimate emails
  • Credit card fraud detection using PR-AUC because positive class (fraud) is rare
  • Recommendation systems evaluated with NDCG and MAP rather than accuracy
  • Object detection models scored with mean Average Precision (mAP) across IoU thresholds
01

In Plain English

Evaluation metrics are the yardsticks that tell you how well your model performs. Different problems have fundamentally different notions of 'good' — a medical test that misses 50% of cancer cases is catastrophic even if it's 99% accurate on healthy patients. Choosing the right metric is as important as choosing the right model.

Why It Exists

Accuracy is misleading for imbalanced classes, MSE is distorted by outliers, and ROC-AUC ignores class prevalence. Each metric reveals a different aspect of model behavior, and every real problem has a cost structure that determines which failures matter most.

Problem It Solves

Summarizing complex model behavior — a distribution of errors across thousands of predictions — into a single number (or a few numbers) that guides model selection, hyperparameter tuning, and business decision-making.

Real-Life Analogy

"Evaluating a student with only one test score is like using accuracy on an imbalanced dataset — it misses crucial nuance. A student might ace easy questions (98% of the test) while completely failing the hard ones that actually matter. Similarly, a fraud detector might score 99.8% accuracy by labeling everything as 'not fraud' — perfect at easy cases, catastrophic at the important ones."

When To Use

  • Accuracy: balanced classes, equal cost of false positives and false negatives
  • Precision: cost of false positives is high (spam filter, content moderation)
  • Recall: cost of false negatives is high (disease detection, security systems)
  • F1: when precision and recall both matter, especially with imbalanced classes
  • ROC-AUC: comparing model discrimination ability across all thresholds, threshold-invariant
  • PR-AUC: imbalanced datasets where positive class performance matters most
  • MSE/RMSE: regression, when large errors should be penalized more
  • MAE: regression, when you want interpretable average error robust to outliers

When NOT To Use

  • Accuracy for imbalanced classification — a 99% negative class gives 99% accuracy by predicting all negatives
  • ROC-AUC when class imbalance is severe — PR-AUC is more informative
  • RMSE when outliers are expected and you care about median error — use MAE
  • R² alone for regression evaluation — it hides scale and systematic bias
  • F1 when the costs of FP and FN are very different: use an F-beta score with β chosen to match the cost ratio
02

Every classification model outputs a number (probability or score) for each sample. Before computing any metric, you apply a threshold to convert scores into binary predictions. Most metrics (accuracy, precision, recall, F1) depend on where you set this threshold. ROC-AUC and PR-AUC are threshold-agnostic — they summarize performance across all possible thresholds.
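
The thresholding step can be sketched as follows. The labels and scores are invented for illustration, and `counts_at` is a hypothetical helper rather than a library function; the point is only that the same scores produce different confusion-matrix counts as the threshold moves.

```python
# Toy labels and scores, invented for illustration only.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_score = [0.9, 0.6, 0.55, 0.4, 0.35, 0.3, 0.2, 0.1]

def counts_at(threshold, y_true, y_score):
    """Return (TP, FP, FN, TN) after predicting positive when score >= threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, y_score):
        pred = 1 if score >= threshold else 0
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

for t in (0.25, 0.5, 0.75):
    tp, fp, fn, tn = counts_at(t, y_true, y_score)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(f"threshold={t:.2f}  TP={tp} FP={fp} FN={fn} TN={tn}  "
          f"precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, raising the threshold from 0.25 to 0.75 trades recall for precision: fewer samples are flagged, so fewer positives are found but fewer false alarms are raised.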

The confusion matrix is the foundation of all classification metrics. It's a 2×2 table: True Positives (correctly predicted positive), True Negatives (correctly predicted negative), False Positives (predicted positive, actually negative), and False Negatives (predicted negative, actually positive). Every classification metric is a function of these four numbers.

For regression, the fundamental trade-off is between MSE (which averages squared errors, heavily penalizing large misses) and MAE (which averages absolute errors, so each miss counts in proportion to its size). Your choice should reflect whether large errors are disproportionately costly in your application.
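
A small sketch of that trade-off, using made-up numbers: both prediction sets below have the same MAE, but the one that concentrates its error in a single large miss has a much higher RMSE.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Invented targets and two invented sets of predictions.
y_true = [10.0, 12.0, 11.0, 13.0, 12.0]
small_misses = [11.0, 11.0, 12.0, 12.0, 13.0]   # every prediction off by 1
one_big_miss = [10.0, 12.0, 11.0, 13.0, 7.0]    # four exact, one off by 5

for name, y_pred in [("small misses", small_misses), ("one big miss", one_big_miss)]:
    print(f"{name}: MAE={mae(y_true, y_pred):.2f}  RMSE={rmse(y_true, y_pred):.2f}")
```

If a single 5-unit miss is much worse for your application than five 1-unit misses, RMSE reflects that; if the two are equally bad, MAE is the more faithful summary.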

The Metaphor

"Think of a security guard checking bags. TP = correctly flagged a suspicious bag. TN = correctly passed a safe bag. FP = falsely flagged a safe bag (annoying, delays). FN = missed a truly dangerous bag (catastrophic). Precision measures: of bags you flagged, how many were actually dangerous? Recall measures: of all dangerous bags, how many did you catch? A strict guard (low threshold, flag everything) has high recall but low precision. A lenient guard has high precision but low recall. F1 finds the balance."

Beginner Mental Model

For classification: start with the confusion matrix (four cells: TP, TN, FP, FN). Every metric flows from these. Precision = TP/(TP+FP) = 'when I predict positive, am I right?'. Recall = TP/(TP+FN) = 'of all actual positives, did I find them?'. For regression: RMSE is the 'average error in original units, with extra penalty for big mistakes.' MAE is just the plain average absolute mistake.

03

For binary classification with predictions ŷᵢ ∈ {0,1} and ground truth yᵢ ∈ {0,1}: the confusion matrix defines TP = Σ𝟏[ŷᵢ=1, yᵢ=1], FP = Σ𝟏[ŷᵢ=1, yᵢ=0], FN = Σ𝟏[ŷᵢ=0, yᵢ=1], TN = Σ𝟏[ŷᵢ=0, yᵢ=0]. All classification metrics are derived from these counts. For regression with predictions ŷᵢ ∈ ℝ, metrics measure deviations between ŷᵢ and yᵢ under different loss functions.

True Positive (TP)
A positive sample correctly predicted as positive. In disease detection: sick patient correctly identified as sick.
True Negative (TN)
A negative sample correctly predicted as negative. In disease detection: healthy patient correctly identified as healthy.
False Positive (FP)
A negative sample incorrectly predicted as positive. Type I error. In disease detection: healthy patient falsely flagged as sick.
False Negative (FN)
A positive sample incorrectly predicted as negative. Type II error. In disease detection: sick patient falsely cleared as healthy.
Precision
Of all samples predicted positive, what fraction truly are positive? TP/(TP+FP). High precision = few false alarms.
Recall (Sensitivity, TPR)
Of all truly positive samples, what fraction did the model find? TP/(TP+FN). High recall = few misses.
Specificity (TNR)
Of all truly negative samples, what fraction did the model correctly identify as negative? TN/(TN+FP). The 'recall for the negative class.'
ROC Curve
Receiver Operating Characteristic. A curve plotting TPR (recall) on Y-axis vs FPR (= 1 - specificity) on X-axis at every possible threshold. AUC is the area under this curve.
PR Curve
Precision-Recall curve. Plots precision on Y-axis vs. recall on X-axis at every possible threshold. More informative than ROC for severely imbalanced datasets.
AUC (Area Under Curve)
Area under the ROC curve. Equals the probability that the model ranks a random positive sample higher than a random negative sample. Perfect model: AUC=1. Random classifier: AUC=0.5.
  1. Identify the problem type: binary classification, multiclass, regression, ranking.
  2. Understand the business cost structure: what's the relative cost of FP vs. FN?
  3. Check class balance: if positive class < 10% of data, avoid accuracy and ROC-AUC as primary metrics.
  4. Choose primary metric aligned with costs: precision (FP costly), recall (FN costly), F1 (balanced), PR-AUC (imbalanced).
  5. Choose secondary metrics to give additional perspective (e.g., RMSE + MAE for regression).
  6. Apply threshold tuning: find the decision threshold that optimizes your primary metric on the validation set.
  7. Report the metric on the held-out test set; never tune the threshold on the test set.
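
The last two steps can be sketched like this. The validation and test data are invented, and `f1_at` and `pick_threshold` are hypothetical helpers; F1 stands in for whatever primary metric you chose.

```python
def f1_at(threshold, y_true, y_score):
    """F1 when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, y_score) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def pick_threshold(y_true, y_score):
    """Try each distinct score as a threshold; return the one with the best F1."""
    candidates = sorted(set(y_score))
    return max(candidates, key=lambda t: f1_at(t, y_true, y_score))

# Invented validation and test splits.
y_val = [0, 0, 1, 0, 1, 1, 0, 1]
s_val = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.9]
y_test = [0, 1, 0, 1, 0, 1]
s_test = [0.2, 0.5, 0.45, 0.8, 0.1, 0.3]

threshold = pick_threshold(y_val, s_val)        # tuned on validation data only
val_f1 = f1_at(threshold, y_val, s_val)
test_f1 = f1_at(threshold, y_test, s_test)      # computed once, with the frozen threshold
print(f"threshold={threshold}  validation F1={val_f1:.3f}  test F1={test_f1:.3f}")
```

The test F1 is lower than the validation F1 here, which is the expected direction: the threshold was fitted to the validation set, so the validation number is optimistic.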

For classification: predicted class labels or probability scores + ground truth labels. For regression: predicted continuous values + ground truth values.

Scalar metric value(s) summarizing model performance. For curve-based metrics (ROC, PR): a list of (threshold, metric) pairs forming a curve, plus the area under it.

01 Binary metrics (precision, recall) assume a fixed decision threshold applied to model scores.
02 AUC assumes only that the model produces orderable scores; it depends on their ranking, not on calibration.
03 R² compares the model against a baseline that always predicts the mean; negative R² means the model is worse than this baseline.
04 Macro-averaged multiclass metrics weight all classes equally; micro-averaged metrics pool counts over all samples, so frequent classes dominate.
  • Precision is undefined when TP+FP=0 (model never predicts positive). Set to 0 or handle separately.
  • Recall is undefined when TP+FN=0 (no actual positives in the dataset). Indicates wrong data split.
  • F1 = 0 when precision = 0 or recall = 0 — the model has completely failed on one end.
  • R² can be negative when the model is worse than always predicting the mean. It has no lower bound: R² can be arbitrarily negative, while its maximum is 1.
  • AUC = 0.5 is the expected value for a random classifier, for example when the positive and negative score distributions are identical.
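
To make the R² edge case concrete, here is a minimal sketch with invented targets. Predicting the mean gives R² = 0, and predictions that are worse than the mean push R² below zero with no floor.

```python
def r2(y_true, y_pred):
    """R² = 1 - RSS/TSS; assumes y_true is not constant (TSS > 0)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0, 5.0]

r2_mean = r2(y_true, [3.0] * 5)                          # always predict the mean
r2_reversed = r2(y_true, [5.0, 4.0, 3.0, 2.0, 1.0])      # predictions in reverse order
r2_far_off = r2(y_true, [50.0, 40.0, 30.0, 20.0, 10.0])  # reversed and badly scaled

print(r2_mean, r2_reversed, r2_far_off)
```

The worse the predictions, the more negative the value becomes, so a negative R² should be read as "worse than the mean baseline" rather than as a bounded score.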
04

Evaluation metrics are applied after model training and prediction. In a proper ML pipeline: train on training set → predict on validation set → compute metrics → tune hyperparameters → final evaluation on held-out test set. Metrics guide every iteration of model development.

  1. Ensure labels are correctly encoded: binary (0/1), multiclass (0,1,2,...), or continuous for regression.
  2. For imbalanced datasets: stratify train/test splits to maintain class proportions in each split.
  3. Check for label noise: mislabeled samples inflate FP/FN counts and distort metrics.
  4. For regression: check target distribution. Highly skewed y may make RMSE misleading (dominated by tail).

  1. Train model on training set. Tune hyperparameters using validation metrics (not test metrics).
  2. For classification: use predict_proba() to get probability scores, then sweep thresholds to plot ROC/PR curves.
  3. Apply threshold selection based on the business cost function (e.g., maximize F1 or recall @ precision > 0.8).
  4. For multiclass: decide averaging strategy (macro, micro, weighted) before evaluation.
  5. Report final metrics on the held-out test set exactly once, with no further tuning after seeing test metrics.

Decision Threshold

The probability cutoff that converts model scores into binary predictions.

0.5 by default; often tuned to 0.3–0.7 depending on FP/FN cost ratio

Averaging method (multiclass)

How to aggregate per-class metrics in multiclass settings: macro (unweighted mean of per-class metrics), micro (pool TP, FP, and FN across classes, then compute the metric once), weighted (mean of per-class metrics weighted by class support).

Weighted F1 for imbalanced multiclass; macro for equal-class treatment
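
The three averaging strategies can be compared on a tiny invented example. This is a from-scratch sketch for single-label multiclass data; `per_class_counts` and `averaged_f1` are hypothetical helper names.

```python
from collections import Counter

def per_class_counts(y_true, y_pred):
    """Return {class: (TP, FP, FN)} using one-vs-rest counting."""
    classes = sorted(set(y_true) | set(y_pred))
    out = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        out[c] = (tp, fp, fn)
    return out

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def averaged_f1(y_true, y_pred):
    """Return (macro, micro, weighted) F1."""
    counts = per_class_counts(y_true, y_pred)
    support = Counter(y_true)
    per_class = {c: f1(*v) for c, v in counts.items()}
    macro = sum(per_class.values()) / len(per_class)
    weighted = sum(per_class[c] * support[c] for c in per_class) / len(y_true)
    micro = f1(*(sum(col) for col in zip(*counts.values())))  # pooled counts
    return macro, micro, weighted

# Invented data: 8 samples of class "a", 2 of class "b", one "b" misclassified.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]

macro, micro, weighted = averaged_f1(y_true, y_pred)
print(f"macro={macro:.3f}  micro={micro:.3f}  weighted={weighted:.3f}")
```

Macro F1 is pulled down the most by the weak rare class, while micro F1 equals plain accuracy (9 of 10 correct) on this single-label data.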

  1. Import the metric functions: from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, average_precision_score, roc_curve, precision_recall_curve
  2. Generate predictions: y_prob = model.predict_proba(X_test)[:, 1]; y_pred = (y_prob >= threshold).astype(int)
  3. Compute confusion matrix: cm = confusion_matrix(y_test, y_pred)
  4. Print full report: print(classification_report(y_test, y_pred)), which gives precision, recall, F1 per class
  5. Plot ROC curve: fpr, tpr, _ = roc_curve(y_test, y_prob); plt.plot(fpr, tpr) (requires import matplotlib.pyplot as plt)
  6. Plot PR curve: prec, rec, _ = precision_recall_curve(y_test, y_prob); plt.plot(rec, prec)
  7. Tune threshold: find optimal threshold from validation PR curve before applying to test set
05
06
python
import numpy as np

# ── Classification Metrics from Scratch ───────────────────────────────────────
def confusion_matrix_counts(y_true, y_pred):
    """Return TP, FP, FN, TN for binary classification."""
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    TP = int(((y_pred == 1) & (y_true == 1)).sum())
    FP = int(((y_pred == 1) & (y_true == 0)).sum())
    FN = int(((y_pred == 0) & (y_true == 1)).sum())
    TN = int(((y_pred == 0) & (y_true == 0)).sum())
    return TP, FP, FN, TN

def accuracy(y_true, y_pred):
    return np.mean(np.array(y_true) == np.array(y_pred))

def precision(y_true, y_pred):
    TP, FP, FN, TN = confusion_matrix_counts(y_true, y_pred)
    # Undefined when nothing is predicted positive; return 0.0 by convention
    return TP / (TP + FP) if (TP + FP) > 0 else 0.0

def recall(y_true, y_pred):
    TP, FP, FN, TN = confusion_matrix_counts(y_true, y_pred)
    # Undefined when there are no actual positives; return 0.0 by convention
    return TP / (TP + FN) if (TP + FN) > 0 else 0.0

def f1_score(y_true, y_pred):
    P = precision(y_true, y_pred)
    R = recall(y_true, y_pred)
    return 2 * P * R / (P + R) if (P + R) > 0 else 0.0

def fbeta_score(y_true, y_pred, beta):
    """F-beta: beta > 1 favors recall, beta < 1 favors precision."""
    P = precision(y_true, y_pred)
    R = recall(y_true, y_pred)
    denom = beta**2 * P + R
    return (1 + beta**2) * P * R / denom if denom > 0 else 0.0

def roc_auc(y_true, y_scores):
    """Compute AUC via the Mann-Whitney U statistic (exact, no sorting trick)."""
    y_true, y_scores = np.array(y_true), np.array(y_scores)
    pos = y_scores[y_true == 1]
    neg = y_scores[y_true == 0]
    n_pos, n_neg = len(pos), len(neg)
    if n_pos == 0 or n_neg == 0:
        return float('nan')  # AUC is undefined with only one class present
    # Broadcasting: (n_pos, 1) vs (1, n_neg) compares every positive/negative pair
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (n_pos * n_neg)

# ── Regression Metrics from Scratch ───────────────────────────────────────────
def mse(y_true, y_pred):
    return np.mean((np.array(y_true) - np.array(y_pred)) ** 2)

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    return np.mean(np.abs(np.array(y_true) - np.array(y_pred)))

def r2(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    # Constant target: R² is undefined; return 0.0 by convention
    return 1 - ss_res / ss_tot if ss_tot > 0 else 0.0

# ── Demo ───────────────────────────────────────────────────────────────────────
np.random.seed(42)
n = 1000

# Imbalanced binary classification (10% positive)
y_true_cls = (np.random.rand(n) < 0.10).astype(int)
# Positives score in [0.3, 0.9) and negatives in [0, 0.6), so the classes overlap.
# (With the earlier 0.7 / 0.4 coefficients the classes were perfectly separable
# and every metric came out as 1.0, which hid the effect being demonstrated.)
y_scores   = np.clip(y_true_cls * 0.3 + np.random.rand(n) * 0.6, 0, 1)
y_pred_cls = (y_scores >= 0.5).astype(int)

print("=== Classification ===")
print(f"Confusion: {confusion_matrix_counts(y_true_cls, y_pred_cls)}")
print(f"Accuracy:  {accuracy(y_true_cls, y_pred_cls):.4f}")   # propped up by the 90% negative class
print(f"Precision: {precision(y_true_cls, y_pred_cls):.4f}")
print(f"Recall:    {recall(y_true_cls, y_pred_cls):.4f}")
print(f"F1:        {f1_score(y_true_cls, y_pred_cls):.4f}")
print(f"ROC-AUC:   {roc_auc(y_true_cls, y_scores):.4f}")

# Regression: predictions = truth + Gaussian noise + a small constant bias
y_true_reg = np.random.randn(n) * 10 + 50
y_pred_reg = y_true_reg + np.random.randn(n) * 3 + 0.5

print("\n=== Regression ===")
print(f"MSE:  {mse(y_true_reg, y_pred_reg):.4f}")
print(f"RMSE: {rmse(y_true_reg, y_pred_reg):.4f}")
print(f"MAE:  {mae(y_true_reg, y_pred_reg):.4f}")
print(f"R²:   {r2(y_true_reg, y_pred_reg):.4f}")
The AUC from-scratch implementation uses the Mann-Whitney U statistic — equivalent to the trapezoidal area under the ROC curve but without needing to sort and plot. Broadcasting creates an n_pos × n_neg matrix of all pairwise comparisons. This is O(n_pos × n_neg) — for large datasets, use sklearn's efficient O(n log n) sorting-based implementation.
y_test = [1,0,0,1,1,0,1,0,0,1] (10 samples, 50% positive)
y_prob = [0.82, 0.31, 0.15, 0.91, 0.72, 0.43, 0.68, 0.22, 0.09, 0.77]
Confusion (threshold=0.5): TP=5, FP=0, FN=0, TN=5
Accuracy: 1.00, Precision: 1.00, Recall: 1.00, F1: 1.00
ROC-AUC: 1.00, PR-AUC: 1.00
(Perfect model on this toy example)
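
The PR-AUC on this toy example can be reproduced with a short from-scratch sketch. It uses the step-wise average-precision definition (mean of precision at the rank of each positive) and assumes there are no tied scores; `average_precision` is a hypothetical helper, and `y_prob_worse` is an invented variant added only to show a non-perfect ranking.

```python
def average_precision(y_true, y_score):
    """Step-wise PR-AUC: mean of precision@rank over the ranks of the positives.

    Assumes no tied scores; ties would need to be grouped before counting.
    """
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    if n_pos == 0:
        return float("nan")  # undefined without positives
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos

y_test = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
y_prob = [0.82, 0.31, 0.15, 0.91, 0.72, 0.43, 0.68, 0.22, 0.09, 0.77]
ap_perfect = average_precision(y_test, y_prob)

# Invented variant: the negative that scored 0.43 now scores 0.80,
# so it outranks three of the five positives.
y_prob_worse = [0.82, 0.31, 0.15, 0.91, 0.72, 0.80, 0.68, 0.22, 0.09, 0.77]
ap_worse = average_precision(y_test, y_prob_worse)

print(f"AP (toy example): {ap_perfect:.2f}")
print(f"AP (one negative ranked high): {ap_worse:.4f}")
```

A single highly ranked negative lowers the precision at every positive ranked below it, which is why PR-AUC reacts strongly to false positives near the top of the ranking.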
  • For imbalanced datasets (< 20% positive), use PR-AUC as primary metric, not ROC-AUC. ROC-AUC can be deceptively high even when the model barely finds positives.
  • Always plot both ROC and PR curves — they reveal different aspects. ROC shows overall ranking quality; PR shows performance specifically on the minority class.
  • Threshold selection should happen on the validation set, never the test set. Tune to maximize your business objective (e.g., maximize recall subject to precision ≥ 0.8).
  • classification_report gives per-class precision, recall, F1, and support — always check per-class performance, not just macro averages.
  • A large gap between RMSE and MAE in regression means a few extreme outliers are dominating RMSE. Investigate these outliers before reporting either metric.
  • Reporting accuracy on an imbalanced dataset — 95% accuracy can mean the model just predicts the majority class for everything.
  • Using predict() instead of predict_proba() for AUC — AUC requires continuous scores, not binary predictions.
  • Tuning the decision threshold on the test set — this leaks test information and gives optimistic threshold performance.
  • Forgetting that macro-averaged F1 weights all classes equally, including tiny classes that may have unstable estimates.
  • Confusing ROC-AUC and PR-AUC — reporting one and claiming the other.
07
⚖️

Balanced Binary Classification

Excellent

Accuracy, precision, recall, F1, and ROC-AUC are all interpretable and meaningful when classes are roughly balanced. No single metric is misleading.

💡 Use F1 as the primary metric; ROC-AUC as secondary. Accuracy is fine here — rarely use it otherwise.
📉

Imbalanced Binary Classification (< 10% positive)

Context-Dependent

Accuracy is catastrophically misleading. ROC-AUC can be inflated. PR-AUC and F1 at optimal threshold are the most informative metrics.

💡 Prioritize PR-AUC and Recall@Precision=X. Never report accuracy alone on imbalanced data.
🏷️

Multiclass Classification

Good

Macro-averaged F1 and per-class classification_report are standard. In single-label multiclass problems, micro-averaged precision, recall, and F1 all equal accuracy.

💡 Always report per-class metrics — macro averages hide which specific classes are failing.
📊

Regression with Outliers

Context-Dependent

MSE/RMSE are dominated by outliers. MAE is more robust. R² can be high even with systematic bias in certain value ranges.

💡 Report both RMSE and MAE. A large RMSE/MAE gap signals outlier influence. Consider MAPE for proportional evaluation.
🏥

Medical / High-Stakes Binary

Excellent

Recall (sensitivity) and specificity are the primary clinical metrics. PPV (precision) and NPV are reported for screening vs. confirmatory tests.

💡 Decision threshold is set by clinical protocol, not model optimization. Always report sensitivity and specificity at the clinical operating threshold.
🎯

Ranking / Recommendation

Good

Standard metrics (F1, accuracy) are inappropriate for ranking tasks. Use MAP (Mean Average Precision), NDCG (Normalized Discounted Cumulative Gain), or MRR (Mean Reciprocal Rank).

💡 These topics deserve their own file. sklearn's label_ranking_average_precision_score is a starting point.
08

Metric Comparison: Same Model, Different Metrics

Shows how the same model evaluated on an imbalanced dataset (5% positive class) produces dramatically different metric values. Accuracy is deceptively high while F1 and recall reveal poor positive class performance.
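
The effect can be reproduced with plain arithmetic. The counts below are invented (1,000 samples, 5% positive), and `summarize` is a hypothetical helper; the sketch compares a model that never predicts positive with a weak model that finds only a fifth of the positives.

```python
N_TOTAL, N_POS = 1000, 50            # 5% positive class (invented counts)
N_NEG = N_TOTAL - N_POS

def summarize(tp, fp):
    """Metrics for a model with `tp` true positives and `fp` false positives."""
    fn, tn = N_POS - tp, N_NEG - fp
    return {
        "accuracy": (tp + tn) / N_TOTAL,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / N_POS,
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

always_negative = summarize(tp=0, fp=0)    # predicts "negative" for everything
weak_model = summarize(tp=10, fp=10)       # finds 10 of 50 positives, 10 false alarms

for name, m in [("always negative", always_negative), ("weak model", weak_model)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Both models reach 95% accuracy, yet one has zero recall and the other only 20%. Accuracy alone cannot distinguish them from a useful detector.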

ROC Curve — Three Models Compared

ROC curves for a strong model (AUC=0.93), a weak model (AUC=0.72), and random classifier (AUC=0.50). Each point on a curve corresponds to a different decision threshold. The further the curve bows toward the top-left, the better the model's discrimination ability.

Precision-Recall Trade-off by Threshold

As the decision threshold increases (stricter about predicting positive), precision generally rises while recall falls. The F1 score peaks at the optimal threshold (~0.35 in this example). The intersection of the precision and recall curves marks the balanced operating point, where precision equals recall.

09
  • Business-aligned evaluation

    Metrics can be chosen to directly reflect the cost structure of the problem. F-beta with β=2 treats recall as twice as important as precision; in count form, each missed positive (FN) is weighted β² = 4 times as heavily as a false alarm (FP). This encodes a business priority directly and makes the metric interpretable to non-technical stakeholders.

  • Threshold-independent analysis with AUC

    ROC-AUC and PR-AUC evaluate the model's entire operating range at once. This is essential for comparing models before deployment, since the threshold is often a business decision made separately from model training.

  • Multiclass support via averaging

    All binary metrics generalize to multiclass via macro, micro, or weighted averaging. Macro averaging is class-imbalance-aware; it ensures rare classes don't get ignored in overall performance summaries.

  • Complementary regression metrics expose different failure modes

    MSE/RMSE reveals catastrophic outlier errors; MAE reveals the typical error size; R² reveals relative improvement over a naive baseline. Reporting all three together tells a complete story about regression model quality.

  • Confusion matrix enables detailed error analysis

    The confusion matrix reveals the exact nature of errors — not just how many but what type. In multiclass settings, the full confusion matrix shows which classes are being confused with which, guiding targeted improvement.

  • Threshold tuning enables operating point optimization

    By sweeping thresholds and plotting precision-recall trade-off curves, you can select the exact operating point that satisfies business constraints — e.g., 'maximize recall while keeping precision ≥ 80%'. This is powerful, systematic decision-making.

  • Aggregation hides per-sample behavior

    All standard metrics aggregate across samples. A model with average F1=0.85 might have F1=0.99 on easy cases and F1=0.20 on hard, critical cases. Always segment metrics by input slice (e.g., by demographic group, by feature value range) to catch hidden failures.

  • Accuracy paradox for imbalanced data

    A model that predicts the majority class for every sample achieves accuracy equal to the majority class prevalence. At 95% prevalence, this gives 95% accuracy with zero ability to detect the minority class. Accuracy is actively misleading without class balance verification.

  • ROC-AUC optimistic on severely imbalanced data

    FPR = FP/(FP+TN) has the abundant negatives in its denominator, so even a large absolute number of false positives produces a small FPR. A model whose positive predictions are mostly false alarms can therefore still achieve ROC-AUC > 0.85. Use PR-AUC for imbalanced evaluation.

  • Discrimination metrics do not measure confidence calibration

    F1 and AUC measure discrimination ability — ranking positives above negatives — but say nothing about whether predicted probabilities are reliable. A model with AUC=0.95 might predict P(positive)=0.80 for samples where the true rate is 0.30. Use calibration curves (reliability diagrams) and Brier score for calibration evaluation.

  • Metric gaming is possible without improvement

    Optimizing a metric directly (especially on training data) can game it without improving real-world performance. Threshold selection on the test set artificially inflates reported metrics. The 'Goodhart's Law' of ML: once a metric becomes a target, it ceases to be a good measure.

10
Healthcare

Cancer screening classifier

Primary metric: Recall at Precision ≥ 0.50 (catch as many cancers as possible; tolerate some false positives that get resolved by further tests). Secondary: PR-AUC for model selection. Never optimize accuracy: at a disease prevalence of 1%, predicting 'healthy' for everyone already reaches 99% accuracy.

Finance

Credit card fraud detection

PR-AUC is the standard metric. Fraud rate is ~0.1%, so ROC-AUC can look deceptively high even when most flagged transactions are legitimate. Also report F1 at the operational threshold, with attention to the cost of false positives (blocking legitimate transactions) vs. false negatives (missing fraud).

E-Commerce

Email spam classification

Optimize Precision at Recall ≥ 0.70: blocking legitimate email (FP) is very costly; some spam getting through (FN) is tolerable. Set a high threshold (high precision operating point on the PR curve).

Manufacturing

Defect detection (visual inspection)

High recall mandatory: missing a defective product going to market is catastrophic. False positives (flagging good products) trigger human review — expensive but not catastrophic. F-beta with β=2 is appropriate.
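
A minimal sketch of why β=2 suits this case, using the count form of F-beta, (1+β²)·TP / ((1+β²)·TP + β²·FN + FP). The defect counts are invented and `fbeta_from_counts` is a hypothetical helper.

```python
def fbeta_from_counts(tp, fp, fn, beta):
    """F-beta in count form: each FN is weighted beta**2 times as heavily as an FP."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

# Two invented inspection runs with the same number of errors (20), split differently.
missed_defects = dict(tp=80, fp=0, fn=20)    # 20 defective parts shipped
false_alarms = dict(tp=80, fp=20, fn=0)      # 20 good parts sent to human review

for beta in (1, 2):
    f_miss = fbeta_from_counts(beta=beta, **missed_defects)
    f_alarm = fbeta_from_counts(beta=beta, **false_alarms)
    print(f"beta={beta}: missed defects -> {f_miss:.4f}, false alarms -> {f_alarm:.4f}")
```

With β=1 the two runs score the same, so F1 cannot tell them apart. With β=2 the run that ships defective parts scores clearly lower, which matches the stated cost structure.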

Real Estate / Finance

House price prediction (regression)

Report RMSE (error in dollars, sensitive to large misses), MAE (typical absolute error in dollars), and R² (explained variance). Compare to the baseline RMSE of always predicting the mean. Agents care about MAE; risk managers care about RMSE (outlier sensitivity).

Search / Recommendation

Document retrieval system

Standard classification metrics don't apply. Use Mean Average Precision (MAP) at k=10 — averages precision at each rank position where a relevant document appears. NDCG@k weights by logarithmic rank decay.
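
Both ranking metrics can be sketched from scratch. Conventions vary between libraries, so treat the choices here as assumptions: AP@k is normalized by the number of relevant documents found in the top k, and NDCG uses linear gains with a log2 discount. The relevance lists are invented.

```python
import math

def average_precision_at_k(relevance, k):
    """relevance: 0/1 flags in ranked order. Averages precision at each relevant rank."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def ndcg_at_k(gains, k):
    """gains: graded relevance in ranked order. DCG divided by the ideal DCG."""
    def dcg(values):
        return sum(g / math.log2(i + 1) for i, g in enumerate(values[:k], start=1))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Two invented queries; 1 marks a relevant document at that rank.
query_1 = [1, 0, 1, 0, 0]
query_2 = [0, 1, 0, 0, 1]

ap_1 = average_precision_at_k(query_1, k=10)
ap_2 = average_precision_at_k(query_2, k=10)
map_at_10 = (ap_1 + ap_2) / 2
ndcg_1 = ndcg_at_k(query_1, k=5)

print(f"AP@10: {ap_1:.4f}, {ap_2:.4f}   MAP@10: {map_at_10:.4f}   NDCG@5 (query 1): {ndcg_1:.4f}")
```

Query 2 scores lower because its relevant documents sit further down the list, which is exactly the position sensitivity that accuracy and F1 lack.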

11

Evaluation metrics are not interchangeable. Here's a systematic comparison of the most important ones:

Accuracy vs. F1

Both are classification performance metrics, both scale 0 to 1

Accuracy counts all correct predictions (including TN); F1 ignores TN and focuses on the positive class. For imbalanced data, accuracy is misleading; F1 is far less so because it cannot be inflated by true negatives.

F1 for imbalanced classes; accuracy only when classes are roughly balanced and FP/FN costs are symmetric.

ROC-AUC vs. PR-AUC

Both are threshold-independent summary metrics

ROC-AUC uses FPR = FP/(FP+TN), whose denominator includes TN, so abundant negatives dilute the impact of false positives. PR-AUC uses precision (TP/(TP+FP)), which is directly affected by imbalance. PR-AUC is more sensitive to positive class performance.

PR-AUC when positive class is rare (< 15% prevalence). ROC-AUC for balanced classes or when comparing across datasets with different prevalences.

MSE vs. MAE

Both regression loss functions measuring prediction error

MSE squares errors — outliers have quadratic influence. MAE uses absolute values — outliers have linear influence. RMSE is in the same units as y; MSE is in y² units.

MAE when you expect outliers and care about median error. MSE/RMSE when large errors are disproportionately costly and you want to penalize them more.

Precision vs. Recall

Both derived from TP; both binary classification metrics

Precision denominator includes FP (false alarms); recall denominator includes FN (misses). Raising the threshold usually increases precision and never increases recall; lowering it does the opposite.

Precision: false alarm cost is high (spam filter). Recall: miss cost is high (disease detection). F1 when both matter equally.

Metric    | Use case                        | Suits imbalanced data? | Threshold-dependent? | Uses TN?
Accuracy  | Balanced classification         | ✗ No                   | ✓ Yes                | ✓ Yes
Precision | High FP cost                    | ✓ Yes                  | ✓ Yes                | ✗ No
Recall    | High FN cost                    | ✓ Yes                  | ✓ Yes                | ✗ No
F1        | Balancing precision and recall  | ✓ Yes                  | ✓ Yes                | ✗ No
ROC-AUC   | Model comparison                | Partial                | ✗ No                 | ✓ Yes
PR-AUC    | Rare positive class             | ✓ Yes                  | ✗ No                 | ✗ No
RMSE      | Regression                      | N/A                    | N/A                  | N/A
MAE       | Robust regression               | N/A                    | N/A                  | N/A

When you need to evaluate a binary classifier across all thresholds on imbalanced data, use PR-AUC, and report class prevalence alongside it because PR-AUC depends on it. For regression with potential outliers, use both RMSE and MAE together.

12

Brier Score

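Measures the accuracy of probability estimates: the mean squared difference between the predicted probability and the 0/1 outcome, which reflects both calibration and discrimination. Lower is better. BS=0 is perfect; BS=0.25 is the score of a classifier that always predicts 0.5. Unlike AUC, Brier Score penalizes poor probability estimates even if ranking is good.

Target: < 0.10 is a common rule of thumb, but always compare against the baseline p(1−p) obtained by predicting the prevalence p for every sample; on low-prevalence datasets that baseline is already small.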
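
A from-scratch sketch of the Brier score on invented data with 10% prevalence. The two "model" probability lists rank the single positive first in both cases, so their ROC-AUC is identical, but their Brier scores differ sharply.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]      # invented labels, 10% prevalence

always_half = [0.5] * 10                      # uninformative 50/50 guess
base_rate = [0.1] * 10                        # always predict the prevalence
well_scaled = [0.9] + [0.1] * 9               # positive ranked first, sensible probabilities
inflated = [0.95] + [0.85] * 9                # same ranking, probabilities far too high

for name, probs in [("always 0.5", always_half), ("base rate", base_rate),
                    ("well scaled", well_scaled), ("inflated", inflated)]:
    print(f"{name}: Brier = {brier_score(y_true, probs):.4f}")
```

The base-rate predictor already scores p(1−p) = 0.09 here without using any features, so an absolute target only means something relative to that baseline.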

Matthews Correlation Coefficient (MCC)

A balanced metric that accounts for all four confusion matrix cells. Ranges from -1 to +1. MCC=+1 is perfect; MCC=0 is random. More robust than F1 for highly imbalanced datasets because it includes TN.

Target: > 0.5 considered good; > 0.7 strong
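
MCC can be computed directly from the four confusion-matrix counts. In this sketch the counts are invented (1,000 samples, 5% positive), and returning 0.0 when the denominator is zero is a convention chosen here, not a mathematical necessity.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # The ratio is undefined when any marginal total is zero; 0.0 is used by convention.
    return (tp * tn - fp * fn) / denom if denom else 0.0

perfect = mcc(tp=50, fp=0, fn=0, tn=950)
always_negative = mcc(tp=0, fp=0, fn=50, tn=950)     # 95% accuracy, no detection
weak_model = mcc(tp=10, fp=10, fn=40, tn=940)        # also 95% accuracy

print(f"perfect={perfect:.3f}  always negative={always_negative:.3f}  weak model={weak_model:.3f}")
```

Both imperfect models have 95% accuracy, but MCC places them near the bottom of its range, well below the "good" level quoted above.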

MAPE (Mean Absolute Percentage Error)

Expresses regression error as a percentage of actual value. Scale-independent — useful for comparing performance across datasets with different y-scales. Undefined when any yᵢ = 0.

Target: < 10% is excellent; < 20% is good in most business forecasting contexts
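
A short sketch of MAPE's two defining properties, with invented values: it does not change when the target is rescaled, and it becomes unstable when true values approach zero. Raising an error on exact zeros is a design choice made for this example.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when any true value is zero")
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Same errors expressed in two units: MAPE is unchanged by the rescaling.
in_units = mape([100.0, 200.0, 400.0], [110.0, 180.0, 400.0])
in_thousandths = mape([0.1, 0.2, 0.4], [0.11, 0.18, 0.4])

# A tiny true value turns a 0.1 absolute miss into a 1000% error.
near_zero = mape([0.01, 100.0], [0.11, 100.0])

print(f"{in_units:.2f}%  {in_thousandths:.2f}%  {near_zero:.2f}%")
```

If the target can be zero or close to it, the percentage error is dominated by those samples, and MAE or RMSE is usually the safer summary.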

  1. Before choosing metrics: understand class distribution (value_counts()) and business cost structure (which error type is more costly).
  2. Choose primary metric first (the one you'll optimize). Choose secondary metrics to diagnose failure modes.
  3. For classification: compute confusion matrix, classification_report, and plot ROC + PR curves.
  4. For imbalanced classification: report PR-AUC as primary; note class prevalence explicitly in reports.
  5. For regression: report RMSE, MAE, and R². Plot residuals vs. predicted values to check for systematic bias.
  6. Perform error analysis: examine the worst predictions (highest errors or misclassified samples) and ask what they have in common.
  • Accuracy paradox: 99% accuracy on a 99/1 split means the model might predict all negatives. Always check confusion matrix.
  • AUC does not imply good precision at practical thresholds — a model with AUC=0.90 might have precision=0.05 at 80% recall for a 1% positive class.
  • Optimizing threshold on the test set produces metrics that cannot be reproduced in deployment — always tune on validation set.
  • Macro F1 can be high even when the most common class is misclassified, if rare classes happen to be classified well.

Fraud detection model: 0.2% fraud prevalence in 1M transactions. Accuracy = 99.80% (by predicting no fraud). ROC-AUC = 0.92 (looks great). PR-AUC = 0.43 (more honest — performance on actual fraud detection is mediocre). At threshold=0.7: Precision=0.62, Recall=0.51, F1=0.56. Business decision: lower threshold to 0.5 → Recall=0.72, Precision=0.38, F1=0.50 — more frauds caught but more false flags for the review team.
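
The figures in this example can be turned into approximate transaction counts, which is often what the review team needs. The sketch below derives the counts from the stated prevalence, precision, and recall; `operating_point` is a hypothetical helper, and the counts are rounded, so they are approximations rather than reported data.

```python
N_TOTAL = 1_000_000
PREVALENCE = 0.002
N_FRAUD = round(N_TOTAL * PREVALENCE)          # 2,000 fraudulent transactions

def operating_point(precision, recall):
    """Approximate counts implied by a precision/recall pair at this prevalence."""
    tp = round(recall * N_FRAUD)               # frauds caught
    flagged = round(tp / precision)            # transactions sent to review
    fp = flagged - tp                          # legitimate transactions flagged
    fn = N_FRAUD - tp                          # frauds missed
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {"tp": tp, "fp": fp, "fn": fn, "flagged": flagged, "f1": round(f1, 2)}

all_negative_accuracy = 1 - PREVALENCE         # accuracy of predicting "no fraud" always
at_07 = operating_point(precision=0.62, recall=0.51)
at_05 = operating_point(precision=0.38, recall=0.72)

print(f"accuracy of predicting no fraud: {all_negative_accuracy:.2%}")
print("threshold 0.7:", at_07)
print("threshold 0.5:", at_05)
```

Lowering the threshold catches roughly 420 more frauds but sends more than twice as many transactions to review, which is the trade-off the business decision has to price.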

13
  • ×Reporting accuracy on imbalanced datasets as evidence of good model performance.
  • ×Not knowing that F1 does not include True Negatives — thinking F1 captures all four confusion matrix cells.
  • ×Confusing ROC-AUC with accuracy — AUC of 0.85 does NOT mean '85% of predictions are correct'.
  • ×Thinking higher R² is always better: for least-squares models, training R² never decreases when you add a feature, even pure noise.
  • ×Using predict() for AUC computation instead of predict_proba() — AUC requires probability scores, not binary labels.
  • ×Tuning the decision threshold by evaluating on the test set — the threshold becomes test-set-specific and won't generalize.
  • ×Reporting only the primary metric without diagnosing failure cases — missing systematic errors on specific data slices.
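  • ×Reporting a cross-validated AUC without saying how it was computed: the mean of per-fold AUCs and the single AUC of concatenated out-of-fold predictions can differ, because scores from separately trained fold models are not always comparable.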
  • ×Saying 'ROC-AUC is always better than PR-AUC' — PR-AUC is more informative for imbalanced datasets.
  • ×Not being able to derive F1 as a harmonic mean — just memorizing the formula without understanding why harmonic mean is used.
  • ×Confusing precision and recall definitions — have them memorized cold: precision = TP/(TP+FP), recall = TP/(TP+FN).
  • ×Not knowing what AUC = 0.5 means (random classifier) or AUC < 0.5 (worse than random — predictions are systematically inverted).
  • ×Not stratifying train/test splits on imbalanced datasets — the test set may have few or zero positive examples.
  • ×Failing to check for calibration — a model with great AUC may give wildly miscalibrated probabilities, making downstream probability-based decisions unreliable.
  • ×Ignoring per-class performance in multiclass settings — a high macro F1 can hide a specific class the model never predicts.
  • ×Using MAPE when y values can be near zero — division by near-zero causes MAPE to blow up.
14

What kind of bias does this model have?

Bias depends on model assumptions and feature expressiveness.

What kind of variance does it have?

Variance grows with model flexibility and weak regularization.

How does it overfit?

Overfitting usually appears as strong train performance but weaker validation/test behavior.

How do we regularize it?

Use complexity constraints, robust validation, and data-centric cleanup.

What kind of data does it like?

Prefers representative, low-leakage data with stable feature definitions.

What kind of data breaks it?

Breaks under leakage, severe distribution drift, noisy labels, and poorly engineered features.

14

Quick Revision Reference

  • All classification metrics derive from the confusion matrix: TP, FP, FN, TN
  • Precision = TP/(TP+FP) — 'when I predict positive, am I correct?'
  • Recall = TP/(TP+FN) — 'of all actual positives, did I find them?'
  • F1 = harmonic mean of precision and recall = 2TP/(2TP+FP+FN)
  • ROC-AUC = probability a random positive outscores a random negative (threshold-independent)
  • PR-AUC is more informative than ROC-AUC for imbalanced datasets
  • RMSE penalizes large errors more; MAE gives equal weight to all errors
  • R² = 1 - RSS/TSS measures variance explained; negative R² means worse than predicting the mean
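Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 · Precision · Recall / (Precision + Recall) = 2TP / (2TP + FP + FN)
ROC-AUC (probabilistic) = P(score of a random positive > score of a random negative)
RMSE = sqrt((1/n) Σ (yᵢ − ŷᵢ)²)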
  • F1 or PR-AUC: imbalanced binary classification
  • ROC-AUC: balanced binary, model selection independent of threshold
  • Recall: medical diagnosis, security (FN cost is high)
  • Precision: spam, content moderation (FP cost is high)
  • RMSE: regression where large errors are especially costly
  • MAE: regression where you want robust average error
  • Accuracy on imbalanced data
  • ROC-AUC on severely imbalanced data (< 5% positive)
  • RMSE when outliers dominate and you care about median error
  • Single metric for all deployment decisions — always examine the full metric landscape
Derive precision, recall, F1 from the confusion matrix definition
Explain the probabilistic interpretation of AUC
Explain why accuracy fails for imbalanced datasets with a concrete example
Explain the precision-recall trade-off as threshold changes
Know when to use ROC-AUC vs. PR-AUC
15
16

These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.