Evaluation Metrics

Concept Overview

In Plain English

Evaluation metrics are the yardsticks that tell you how well your model performs. Different problems have fundamentally different notions of 'good' — a medical test that misses 50% of cancer cases is catastrophic even if it's 99% accurate on healthy patients. Choosing the right metric is as important as choosing the right model.

Why It Exists

Accuracy is misleading for imbalanced classes, MSE is distorted by outliers, and ROC-AUC ignores class prevalence. Each metric reveals a different aspect of model behavior, and every real problem has a cost structure that determines which failures matter most.

Problem It Solves

Summarizing complex model behavior — a distribution of errors across thousands of predictions — into a single number (or a few numbers) that guides model selection, hyperparameter tuning, and business decision-making.

Real-Life Analogy

"Evaluating a student with only one test score is like using accuracy on an imbalanced dataset — it misses crucial nuance. A student might ace easy questions (98% of the test) while completely failing the hard ones that actually matter. Similarly, a fraud detector might score 99.8% accuracy by labeling everything as 'not fraud' — perfect at easy cases, catastrophic at the important ones."

When To Use

Accuracy: balanced classes, equal cost of false positives and false negatives
Precision: cost of false positives is high (spam filter, content moderation)
Recall: cost of false negatives is high (disease detection, security systems)
F1: when you need balance and classes are imbalanced
ROC-AUC: comparing model discrimination ability across all thresholds, threshold-invariant
PR-AUC: imbalanced datasets where positive class performance matters most
MSE/RMSE: regression, when large errors should be penalized more
MAE: regression, when you want interpretable average error robust to outliers

When NOT To Use

Accuracy for imbalanced classification — a 99% negative class gives 99% accuracy by predicting all negatives
ROC-AUC when class imbalance is severe — PR-AUC is more informative
RMSE when outliers are expected and you care about median error — use MAE
R² alone for regression evaluation — it hides scale and systematic bias
F1 when the costs of FP and FN are very different: use an F-beta score instead (β > 1 favors recall, β < 1 favors precision)

Core Intuition

Every classification model outputs a number (probability or score) for each sample. Before computing any metric, you apply a threshold to convert scores into binary predictions. Most metrics (accuracy, precision, recall, F1) depend on where you set this threshold. ROC-AUC and PR-AUC are threshold-agnostic — they summarize performance across all possible thresholds.

The confusion matrix is the foundation of all classification metrics. It's a 2×2 table: True Positives (correctly predicted positive), True Negatives (correctly predicted negative), False Positives (predicted positive, actually negative), and False Negatives (predicted negative, actually positive). Every classification metric is a function of these four numbers.
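The thresholding step and the four counts can be sketched in plain Python. The labels and scores below are made-up illustration data, and `confusion_counts` is a hypothetical helper name, not a library function.

```python
def confusion_counts(y_true, y_scores, threshold):
    """Threshold the scores, then count the four confusion-matrix cells."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, y_scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Made-up labels and scores, for illustration only.
y_true   = [1, 0, 1, 0, 0, 1, 0, 0]
y_scores = [0.9, 0.6, 0.55, 0.4, 0.3, 0.35, 0.2, 0.1]

for threshold in (0.3, 0.5, 0.7):
    tp, fp, fn, tn = confusion_counts(y_true, y_scores, threshold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn} TN={tn} "
          f"precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, raising the threshold trades recall for precision, which is the dependence on threshold described above.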

For regression, the fundamental trade-off is between MSE (which squares errors, heavily penalizing large misses) and MAE (which averages absolute errors, so each miss counts in proportion to its size). Your choice should reflect whether large errors are disproportionately costly in your application.
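A small sketch of how one large miss moves the two metrics differently. The residuals are made-up numbers chosen for illustration, and RMSE is used instead of MSE so both values are in the same units.

```python
import math

def mae(errors):
    """Mean absolute error over a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error over a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Made-up residuals (prediction minus truth), for illustration only.
typical = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
with_outlier = typical[:-1] + [-20.0]  # one large miss replaces a small one

print(f"typical:      MAE={mae(typical):.2f}  RMSE={rmse(typical):.2f}")
print(f"with outlier: MAE={mae(with_outlier):.2f}  RMSE={rmse(with_outlier):.2f}")
```

With uniform errors the two metrics agree. After a single large miss, MAE grows roughly threefold here while RMSE grows more than sixfold.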

The Metaphor

"Think of a security guard checking bags. TP = correctly flagged a suspicious bag. TN = correctly passed a safe bag. FP = falsely flagged a safe bag (annoying, delays). FN = missed a truly dangerous bag (catastrophic). Precision measures: of bags you flagged, how many were actually dangerous? Recall measures: of all dangerous bags, how many did you catch? A strict guard (low threshold, flag everything) has high recall but low precision. A lenient guard has high precision but low recall. F1 finds the balance."

Beginner Mental Model

For classification: start with the confusion matrix (four cells: TP, TN, FP, FN). Every metric flows from these. Precision = TP/(TP+FP) = 'when I predict positive, am I right?'. Recall = TP/(TP+FN) = 'of all actual positives, did I find them?'. For regression: RMSE is the 'average error in original units, with extra penalty for big mistakes.' MAE is just the plain average absolute mistake.

Technical Theory

Formal Definition

For binary classification with predictions ŷᵢ ∈ {0,1} and ground truth yᵢ ∈ {0,1}: the confusion matrix defines TP = Σ𝟏[ŷᵢ=1, yᵢ=1], FP = Σ𝟏[ŷᵢ=1, yᵢ=0], FN = Σ𝟏[ŷᵢ=0, yᵢ=1], TN = Σ𝟏[ŷᵢ=0, yᵢ=0]. All classification metrics are derived from these counts. For regression with predictions ŷᵢ ∈ ℝ, metrics measure deviations between ŷᵢ and yᵢ under different loss functions.

Key Terms

True Positive (TP): A positive sample correctly predicted as positive. In disease detection: sick patient correctly identified as sick.
True Negative (TN): A negative sample correctly predicted as negative. In disease detection: healthy patient correctly identified as healthy.
False Positive (FP): A negative sample incorrectly predicted as positive. Type I error. In disease detection: healthy patient falsely flagged as sick.
False Negative (FN): A positive sample incorrectly predicted as negative. Type II error. In disease detection: sick patient falsely cleared as healthy.
Precision: Of all samples predicted positive, what fraction truly are positive? TP/(TP+FP). High precision = few false alarms.
Recall (Sensitivity, TPR): Of all truly positive samples, what fraction did the model find? TP/(TP+FN). High recall = few misses.
Specificity (TNR): Of all truly negative samples, what fraction did the model correctly identify as negative? TN/(TN+FP). The 'recall for the negative class.'
ROC Curve: Receiver Operating Characteristic. A curve plotting TPR (recall) on Y-axis vs FPR (= 1 - specificity) on X-axis at every possible threshold. AUC is the area under this curve.
PR Curve: Precision-Recall curve. Plots precision on Y-axis vs. recall on X-axis at every possible threshold. More informative than ROC for severely imbalanced datasets.
AUC (Area Under Curve): Area under the ROC curve. Equals the probability that the model ranks a random positive sample higher than a random negative sample. Perfect model: AUC=1. Random classifier: AUC=0.5.

Step-by-Step Working

1. Identify the problem type: binary classification, multiclass, regression, ranking.
2. Understand the business cost structure: what's the relative cost of FP vs. FN?
3. Check class balance: if positive class < 10% of data, avoid accuracy and ROC-AUC as primary metrics.
4. Choose primary metric aligned with costs: precision (FP costly), recall (FN costly), F1 (balanced), PR-AUC (imbalanced).
5. Choose secondary metrics to give additional perspective (e.g., RMSE + MAE for regression).
6. Apply threshold tuning: find the decision threshold that optimizes your primary metric on the validation set.
7. Report metric on held-out test set — never tune threshold on the test set.

Inputs

For classification: predicted class labels or probability scores + ground truth labels. For regression: predicted continuous values + ground truth values.

Outputs

Scalar metric value(s) summarizing model performance. For curve-based metrics (ROC, PR): a list of (threshold, metric) pairs forming a curve, plus the area under it.

Model Assumptions

01. Binary metrics (precision, recall) assume a fixed decision threshold applied to model scores.

02. AUC assumes only that the model's scores can be ordered; it depends on their ranking, not on whether they are calibrated probabilities.

03. R² assumes you are comparing to a baseline of predicting the mean; negative R² means worse than this baseline.

04. Macro-averaged multiclass metrics weight all classes equally; micro-averaged metrics pool counts across classes, so frequent classes dominate.
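The averaging strategies can be compared on a toy 3-class problem. This is a minimal sketch with made-up labels; `per_class_f1` is a hypothetical helper, and the example assumes single-label multiclass data (each sample has exactly one class).

```python
from collections import Counter

def per_class_f1(y_true, y_pred, cls):
    """One-vs-rest F1 for a single class, plus its TP/FP/FN counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return (2 * tp / denom if denom else 0.0), tp, fp, fn

# Made-up labels: class 0 is common, class 2 is rare and never predicted.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

classes = sorted(set(y_true))
support = Counter(y_true)
stats = {c: per_class_f1(y_true, y_pred, c) for c in classes}

# Macro: unweighted mean of per-class F1.
macro = sum(stats[c][0] for c in classes) / len(classes)
# Weighted: mean of per-class F1 weighted by class support.
weighted = sum(stats[c][0] * support[c] for c in classes) / len(y_true)
# Micro: pool TP/FP/FN over all classes, then compute F1 once.
tp = sum(stats[c][1] for c in classes)
fp = sum(stats[c][2] for c in classes)
fn = sum(stats[c][3] for c in classes)
micro = 2 * tp / (2 * tp + fp + fn)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"macro={macro:.3f} weighted={weighted:.3f} "
      f"micro={micro:.3f} accuracy={accuracy:.3f}")
```

The rare class that is never predicted pulls macro F1 down sharply, while micro F1 matches accuracy in this single-label setting.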

Important Edge Cases

▸Precision is undefined when TP+FP=0 (model never predicts positive). Set to 0 or handle separately.
▸Recall is undefined when TP+FN=0 (no actual positives in the dataset). Indicates wrong data split.
▸F1 = 0 when precision = 0 or recall = 0 — the model has completely failed on one end.
▸R² can be negative when the model is worse than always predicting the mean. It is bounded above by 1 but unbounded below (it can approach -∞).
▸AUC = 0.5 for a random classifier: when positive and negative scores come from the same distribution, a random positive outranks a random negative half the time.
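Two of these edge cases can be sketched in a few lines. The data is made up, and the choices here (returning `None` for undefined precision, `nan` for R² on a constant target) are one possible convention, not the only reasonable one.

```python
def r2(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot; undefined (nan) for a constant target."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        return float("nan")
    return 1 - ss_res / ss_tot

def safe_precision(tp, fp):
    """Return None instead of dividing by zero when nothing is predicted positive."""
    return tp / (tp + fp) if (tp + fp) > 0 else None

# Made-up regression targets, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
mean_baseline = [3.0] * 5              # always predict the mean
bad_model = [5.0, 4.0, 3.0, 2.0, 1.0]  # predictions in reverse order

print("R² of mean baseline:", r2(y_true, mean_baseline))
print("R² of reversed model:", r2(y_true, bad_model))
print("precision with no positive predictions:", safe_precision(0, 0))
```

The mean baseline scores exactly 0, and the reversed model scores below 0, showing that R² has no lower floor.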

Methodology / Workflow

Role in the ML Pipeline

Evaluation metrics are applied after model training and prediction. In a proper ML pipeline: train on training set → predict on validation set → compute metrics → tune hyperparameters → final evaluation on held-out test set. Metrics guide every iteration of model development.

Data Preprocessing

01.Ensure labels are correctly encoded: binary (0/1), multiclass (0,1,2,...), or continuous for regression.
02.For imbalanced datasets: stratify train/test splits to maintain class proportions in each split.
03.Check for label noise — mislabeled samples inflate FP/FN counts and distort metrics.
04.For regression: check target distribution. Highly skewed y may make RMSE misleading (dominated by tail).

Training Process

01.Train model on training set. Tune hyperparameters using validation metrics (not test metrics).
02.For classification: use predict_proba() to get probability scores, then sweep thresholds to plot ROC/PR curves.
03.Apply threshold selection based on the business cost function (e.g., maximize F1 or recall @ precision > 0.8).
04.For multiclass: decide averaging strategy (macro, micro, weighted) before evaluation.
05.Report final metrics on the held-out test set exactly once — no further tuning after seeing test metrics.
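Threshold selection under a business constraint (step 03 above) might look like the following sketch. The validation labels and scores are made up, and `pick_threshold` is a hypothetical helper that scans every observed score as a candidate cutoff.

```python
def precision_recall_at(y_true, y_scores, threshold):
    """Precision and recall after thresholding scores at `threshold`."""
    tp = sum(1 for t, s in zip(y_true, y_scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, y_scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, y_scores) if s < threshold and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def pick_threshold(y_true, y_scores, min_precision=0.8):
    """Highest-recall threshold whose precision meets the floor, or None."""
    best = None  # (recall, threshold)
    for threshold in sorted(set(y_scores)):
        precision, recall = precision_recall_at(y_true, y_scores, threshold)
        if precision >= min_precision and (best is None or recall > best[0]):
            best = (recall, threshold)
    return None if best is None else best[1]

# Made-up VALIDATION labels and scores; the test set is never touched here.
val_true   = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
val_scores = [0.95, 0.9, 0.8, 0.7, 0.65, 0.6, 0.5, 0.3, 0.2, 0.1]

threshold = pick_threshold(val_true, val_scores, min_precision=0.8)
precision, recall = precision_recall_at(val_true, val_scores, threshold)
print(f"chosen threshold={threshold} precision={precision:.3f} recall={recall:.3f}")
```

The chosen threshold would then be frozen and applied once to the held-out test set.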

Hyperparameters

Name: Decision Threshold
Description: The probability cutoff that converts model scores into binary predictions.
Typical: 0.5 by default; often tuned to 0.3–0.7 depending on FP/FN cost ratio

Name: Averaging method (multiclass)
Description: How to aggregate per-class metrics in multiclass settings: macro (unweighted mean of per-class metrics), micro (pool TP/FP/FN counts across classes, then compute the metric once), weighted (mean of per-class metrics weighted by support).
Typical: Weighted F1 for imbalanced multiclass; macro for equal-class treatment

Implementation Checklist

1. Import: from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, average_precision_score, roc_curve, precision_recall_curve; import matplotlib.pyplot as plt
2. Generate predictions: y_prob = model.predict_proba(X_test)[:, 1]; y_pred = (y_prob >= threshold).astype(int)
3. Compute confusion matrix: cm = confusion_matrix(y_test, y_pred)
4. Print full report: print(classification_report(y_test, y_pred)), which gives precision, recall, F1 per class
5. Plot ROC curve: fpr, tpr, _ = roc_curve(y_test, y_prob); plt.plot(fpr, tpr)
6. Plot PR curve: prec, rec, _ = precision_recall_curve(y_test, y_prob); plt.plot(rec, prec)
7. Tune threshold: find optimal threshold from validation PR curve before applying to test set

Mathematical Chamber

Implementation

python

import numpy as np

# ── Classification Metrics from Scratch ───────────────────────────────────────
def confusion_matrix_counts(y_true, y_pred):
    """Return TP, FP, FN, TN for binary classification."""
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    TP = int(((y_pred == 1) & (y_true == 1)).sum())
    FP = int(((y_pred == 1) & (y_true == 0)).sum())
    FN = int(((y_pred == 0) & (y_true == 1)).sum())
    TN = int(((y_pred == 0) & (y_true == 0)).sum())
    return TP, FP, FN, TN

def accuracy(y_true, y_pred):
    return np.mean(np.array(y_true) == np.array(y_pred))

def precision(y_true, y_pred):
    TP, FP, FN, TN = confusion_matrix_counts(y_true, y_pred)
    # Undefined when nothing is predicted positive; return 0.0 by convention.
    return TP / (TP + FP) if (TP + FP) > 0 else 0.0

def recall(y_true, y_pred):
    TP, FP, FN, TN = confusion_matrix_counts(y_true, y_pred)
    # Undefined when there are no actual positives; return 0.0 by convention.
    return TP / (TP + FN) if (TP + FN) > 0 else 0.0

def f1_score(y_true, y_pred):
    P = precision(y_true, y_pred)
    R = recall(y_true, y_pred)
    return 2 * P * R / (P + R) if (P + R) > 0 else 0.0

def fbeta_score(y_true, y_pred, beta):
    """F-beta: beta > 1 favors recall, beta < 1 favors precision."""
    P = precision(y_true, y_pred)
    R = recall(y_true, y_pred)
    denom = beta**2 * P + R
    return (1 + beta**2) * P * R / denom if denom > 0 else 0.0

def roc_auc(y_true, y_scores):
    """Compute AUC via the Mann-Whitney U statistic (exact, no sorting trick)."""
    y_true, y_scores = np.array(y_true), np.array(y_scores)
    pos = y_scores[y_true == 1]
    neg = y_scores[y_true == 0]
    # Count pairs where positive score > negative score
    n_pos, n_neg = len(pos), len(neg)
    if n_pos == 0 or n_neg == 0:
        return float('nan')
    # Broadcasting: (n_pos, 1) vs (1, n_neg)
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (n_pos * n_neg)

# ── Regression Metrics from Scratch ───────────────────────────────────────────
def mse(y_true, y_pred):
    return np.mean((np.array(y_true) - np.array(y_pred)) ** 2)

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    return np.mean(np.abs(np.array(y_true) - np.array(y_pred)))

def r2(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    # Constant target: R² is undefined; return 0.0 by convention.
    return 1 - ss_res / ss_tot if ss_tot > 0 else 0.0

# ── Demo ───────────────────────────────────────────────────────────────────────
np.random.seed(42)
n = 1000

# Imbalanced binary classification (10% positive).
# Positive scores fall in [0.2, 0.7) and negative scores in [0, 0.5), so the
# classes overlap and a 0.5 threshold misses many positives.
y_true_cls = (np.random.rand(n) < 0.10).astype(int)
y_scores   = np.clip(y_true_cls * 0.2 + np.random.rand(n) * 0.5, 0, 1)
y_pred_cls = (y_scores >= 0.5).astype(int)

print("=== Classification ===")
print(f"Confusion: {confusion_matrix_counts(y_true_cls, y_pred_cls)}")
print(f"Accuracy:  {accuracy(y_true_cls, y_pred_cls):.4f}")   # looks high despite many missed positives
print(f"Precision: {precision(y_true_cls, y_pred_cls):.4f}")
print(f"Recall:    {recall(y_true_cls, y_pred_cls):.4f}")
print(f"F1:        {f1_score(y_true_cls, y_pred_cls):.4f}")
print(f"ROC-AUC:   {roc_auc(y_true_cls, y_scores):.4f}")

# Regression: noisy predictions with a small constant bias of +0.5
y_true_reg = np.random.randn(n) * 10 + 50
y_pred_reg = y_true_reg + np.random.randn(n) * 3 + 0.5

print("\n=== Regression ===")
print(f"MSE:  {mse(y_true_reg, y_pred_reg):.4f}")
print(f"RMSE: {rmse(y_true_reg, y_pred_reg):.4f}")
print(f"MAE:  {mae(y_true_reg, y_pred_reg):.4f}")
print(f"R²:   {r2(y_true_reg, y_pred_reg):.4f}")

The AUC from-scratch implementation uses the Mann-Whitney U statistic — equivalent to the trapezoidal area under the ROC curve but without needing to sort and plot. Broadcasting creates an n_pos × n_neg matrix of all pairwise comparisons. This is O(n_pos × n_neg) — for large datasets, use sklearn's efficient O(n log n) sorting-based implementation.
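For reference, the sorting-based route can be sketched with the rank-sum form of the same Mann-Whitney U statistic. This is an illustrative pure-Python version, not sklearn's implementation; `roc_auc_ranked` is a hypothetical name, and labels are assumed to be 0/1 integers.

```python
def roc_auc_ranked(y_true, y_scores):
    """AUC from the rank-sum form of Mann-Whitney U; O(n log n) from the sort."""
    n = len(y_scores)
    order = sorted(range(n), key=lambda i: y_scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < n and y_scores[order[j + 1]] == y_scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        return float("nan")
    rank_sum_pos = sum(r for r, t in zip(ranks, y_true) if t == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

# Made-up data with one tie between a positive and a negative.
print(roc_auc_ranked([1, 0, 1, 0], [0.5, 0.5, 0.8, 0.2]))
```

Tied scores receive their average rank, which gives each positive-negative tie half credit, matching the pairwise version's handling of ties.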

Sample Input

y_test = [1,0,0,1,1,0,1,0,0,1] (10 samples, 50% positive)
y_prob = [0.82, 0.31, 0.15, 0.91, 0.72, 0.43, 0.68, 0.22, 0.09, 0.77]

Sample Output

Confusion (threshold=0.5): TP=5, FP=0, FN=0, TN=5
Accuracy: 1.00, Precision: 1.00, Recall: 1.00, F1: 1.00
ROC-AUC: 1.00, PR-AUC: 1.00
(Perfect model on this toy example)
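The sample output can be checked with a short sketch in plain Python. `average_precision` is a hypothetical helper that computes the step-wise area under the PR curve and assumes no tied scores; library implementations may treat ties differently.

```python
def average_precision(y_true, y_scores):
    """Mean of the precision values at each rank where a positive appears."""
    order = sorted(range(len(y_scores)), key=lambda i: y_scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else float("nan")

y_test = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
y_prob = [0.82, 0.31, 0.15, 0.91, 0.72, 0.43, 0.68, 0.22, 0.09, 0.77]

# Confusion counts at threshold 0.5.
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]
tp = sum(1 for t, p in zip(y_test, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_test, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_test, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_test, y_pred) if t == 0 and p == 0)

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"PR-AUC (average precision): {average_precision(y_test, y_prob):.2f}")
```

Every positive outscores every negative in this toy input, which is why all the metrics come out perfect.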

Key Implementation Insights

→For imbalanced datasets (< 20% positive), use PR-AUC as primary metric, not ROC-AUC. ROC-AUC can be deceptively high even when the model barely finds positives.
→Always plot both ROC and PR curves — they reveal different aspects. ROC shows overall ranking quality; PR shows performance specifically on the minority class.
→Threshold selection should happen on the validation set, never the test set. Tune to maximize your business objective (e.g., maximize recall subject to precision ≥ 0.8).
→classification_report gives per-class precision, recall, F1, and support — always check per-class performance, not just macro averages.
→A large gap between RMSE and MAE in regression means a few extreme outliers are dominating RMSE. Investigate these outliers before reporting either metric.

Common Implementation Mistakes

✗Reporting accuracy on an imbalanced dataset — 95% accuracy can mean the model just predicts the majority class for everything.
✗Using predict() instead of predict_proba() for AUC — AUC requires continuous scores, not binary predictions.
✗Tuning the decision threshold on the test set — this leaks test information and gives optimistic threshold performance.
✗Forgetting that macro-averaged F1 weights all classes equally, including tiny classes that may have unstable estimates.
✗Confusing ROC-AUC and PR-AUC — reporting one and claiming the other.

Dataset Applicability

⚖️

Balanced Binary Classification

Excellent

Accuracy, precision, recall, F1, and ROC-AUC are all interpretable and meaningful when classes are roughly balanced. No single metric is misleading.

💡 Use F1 as the primary metric; ROC-AUC as secondary. Accuracy is fine here — rarely use it otherwise.

📉

Imbalanced Binary Classification (< 10% positive)

Context-Dependent

Accuracy is catastrophically misleading. ROC-AUC can be inflated. PR-AUC and F1 at optimal threshold are the most informative metrics.

💡 Prioritize PR-AUC and Recall@Precision=X. Never report accuracy alone on imbalanced data.

🏷️

Multiclass Classification

Good

Macro-averaged F1 and per-class classification_report are standard. Micro-averaged precision, recall, and F1 all collapse to accuracy in single-label multiclass problems, whether or not the classes are balanced.

💡 Always report per-class metrics — macro averages hide which specific classes are failing.

📊

Regression with Outliers

Context-Dependent

MSE/RMSE are dominated by outliers. MAE is more robust. R² can be high even with systematic bias in certain value ranges.

💡 Report both RMSE and MAE. A large RMSE/MAE gap signals outlier influence. Consider MAPE for proportional evaluation.

🏥

Medical / High-Stakes Binary

Excellent

Recall (sensitivity) and specificity are the primary clinical metrics. PPV (precision) and NPV are reported for screening vs. confirmatory tests.

💡 Decision threshold is set by clinical protocol, not model optimization. Always report sensitivity and specificity at the clinical operating threshold.

🎯

Ranking / Recommendation

Good

Standard metrics (F1, accuracy) are inappropriate for ranking tasks. Use MAP (Mean Average Precision), NDCG (Normalized Discounted Cumulative Gain), or MRR (Mean Reciprocal Rank).

💡 These topics deserve their own file. sklearn's label_ranking_average_precision_score is a starting point.

Visualizations

Metric Comparison: Same Model, Different Metrics

Shows how the same model evaluated on an imbalanced dataset (5% positive class) produces dramatically different metric values. Accuracy is deceptively high while F1 and recall reveal poor positive class performance.

ROC Curve — Three Models Compared

ROC curves for a strong model (AUC=0.93), a weak model (AUC=0.72), and random classifier (AUC=0.50). Each point on a curve corresponds to a different decision threshold. The further the curve bows toward the top-left, the better the model's discrimination ability.

Precision-Recall Trade-off by Threshold

As the decision threshold increases (stricter about predicting positive), precision rises but recall falls. The F1 score peaks at the optimal threshold (~0.35 in this example). The intersection of precision and recall curves marks the balanced operating point.

Advantages & Limitations

Advantages

Business-aligned evaluation
Metrics can be chosen to directly reflect the cost structure of the problem. F-beta with β=2 treats recall as twice as important as precision (in count form, each missed positive is weighted β² = 4 times as heavily as each false alarm), a direct encoding of business priority. This makes metrics interpretable to non-technical stakeholders.
Threshold-independent analysis with AUC
ROC-AUC and PR-AUC evaluate the model's entire operating range at once. This is essential for comparing models before deployment, since the threshold is often a business decision made separately from model training.
Multiclass support via averaging
All binary metrics generalize to multiclass via macro, micro, or weighted averaging. Macro averaging is class-imbalance-aware; it ensures rare classes don't get ignored in overall performance summaries.
Complementary regression metrics expose different failure modes
MSE/RMSE reveals catastrophic outlier errors; MAE reveals the typical error size; R² reveals relative improvement over a naive baseline. Reporting all three together tells a complete story about regression model quality.
Confusion matrix enables detailed error analysis
The confusion matrix reveals the exact nature of errors — not just how many but what type. In multiclass settings, the full confusion matrix shows which classes are being confused with which, guiding targeted improvement.
Threshold tuning enables operating point optimization
By sweeping thresholds and plotting precision-recall trade-off curves, you can select the exact operating point that satisfies business constraints — e.g., 'maximize recall while keeping precision ≥ 80%'. This is powerful, systematic decision-making.
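The F-beta point from the first advantage can be made concrete in count form, F_β = (1+β²)·TP / ((1+β²)·TP + β²·FN + FP). The two error profiles below are made-up numbers, and `fbeta_from_counts` is a hypothetical helper.

```python
def fbeta_from_counts(tp, fp, fn, beta):
    """F-beta written directly in terms of confusion-matrix counts."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

# Two made-up error profiles with mirrored FP and FN counts.
many_misses = dict(tp=60, fp=10, fn=40)        # low recall
many_false_alarms = dict(tp=60, fp=40, fn=10)  # low precision

for beta in (0.5, 1, 2):
    a = fbeta_from_counts(beta=beta, **many_misses)
    b = fbeta_from_counts(beta=beta, **many_false_alarms)
    print(f"beta={beta}: many_misses={a:.3f} many_false_alarms={b:.3f}")
```

F1 cannot tell the two profiles apart, β=2 prefers the profile with fewer misses, and β=0.5 prefers the one with fewer false alarms.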

Limitations

Aggregation hides per-sample behavior
All standard metrics aggregate across samples. A model with average F1=0.85 might have F1=0.99 on easy cases and F1=0.20 on hard, critical cases. Always segment metrics by input slice (e.g., by demographic group, by feature value range) to catch hidden failures.
Accuracy paradox for imbalanced data
A model that predicts the majority class for every sample achieves accuracy equal to the majority class prevalence. At 95% prevalence, this gives 95% accuracy with zero ability to detect the minority class. Accuracy is actively misleading without class balance verification.
ROC-AUC optimistic on severely imbalanced data
ROC-AUC is built on FPR = FP/(FP+TN). When negatives are abundant, even a large absolute number of false positives yields a small FPR, so the ROC curve barely moves while precision collapses. A model with poor precision at any usable recall can still achieve ROC-AUC > 0.85. Use PR-AUC for imbalanced evaluation.
Discrimination metrics ignore confidence calibration
F1 and AUC measure how well the model separates positives from negatives, but say nothing about whether predicted probabilities are reliable. A model with AUC=0.95 might predict P(positive)=0.80 for samples where the true rate is 0.30. Use calibration curves (reliability diagrams) and Brier score for calibration evaluation.
Metric gaming is possible without improvement
Optimizing a metric directly (especially on training data) can game it without improving real-world performance. Threshold selection on the test set artificially inflates reported metrics. The 'Goodhart's Law' of ML: once a metric becomes a target, it ceases to be a good measure.

Practical Use Cases

Healthcare

Cancer screening classifier

Primary metric: Recall at Precision ≥ 0.50 (catch all cancers; tolerate some false positives that get confirmed with further tests). Secondary: PR-AUC for model selection. Never optimize accuracy — disease prevalence of 1% would make 99% accuracy trivial.

Finance

Credit card fraud detection

PR-AUC is the standard metric. Fraud rate is ~0.1%, so ROC-AUC can look deceptively high even for a model with poor precision. Also report F1 at the operational threshold, with attention to the cost of false positives (blocking legitimate transactions) vs. false negatives (missing fraud).

E-Commerce

Email spam classification

Optimize Precision at Recall ≥ 0.70: blocking legitimate email (FP) is very costly; some spam getting through (FN) is tolerable. Set a high threshold (high precision operating point on the PR curve).

Manufacturing

Defect detection (visual inspection)

High recall mandatory: missing a defective product going to market is catastrophic. False positives (flagging good products) trigger human review — expensive but not catastrophic. F-beta with β=2 is appropriate.

Real Estate / Finance

House price prediction (regression)

Report RMSE (error in dollars, sensitive to large misses), MAE (typical absolute error in dollars), and R² (fraction of variance explained). Compare to baseline RMSE of always predicting the mean. Agents care about MAE; risk managers care about RMSE (outlier sensitivity).

Search / Recommendation

Document retrieval system

Standard classification metrics don't apply. Use Mean Average Precision (MAP) at k=10 — averages precision at each rank position where a relevant document appears. NDCG@k weights by logarithmic rank decay.
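A minimal sketch of the two ranking metrics named above. Normalization conventions for AP@k vary between libraries; this version divides by the number of relevant documents found in the top k, which is an assumption, and the relevance lists are made up.

```python
import math

def average_precision_at_k(relevance, k):
    """relevance: 0/1 flags in ranked order; averages precision at each hit."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def ndcg_at_k(gains, k):
    """gains: graded relevance in ranked order; discount is 1 / log2(rank + 1)."""
    def dcg(values):
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(values, start=1))
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0

# Made-up results for two queries (1 = relevant document at that rank).
queries = [[1, 0, 1, 0, 0], [0, 1, 0, 0, 1]]
map_at_5 = sum(average_precision_at_k(q, 5) for q in queries) / len(queries)
print(f"MAP@5:  {map_at_5:.4f}")
print(f"NDCG@4: {ndcg_at_k([3, 2, 0, 1], 4):.4f}")
```

MAP is the mean of the per-query average precision, so each query contributes equally regardless of how many relevant documents it has.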

Comparison

Evaluation metrics are not interchangeable. Here's a systematic comparison of the most important ones:

Accuracy vs. F1

Similarity

Both are classification performance metrics, both scale 0 to 1

Key Difference

Accuracy counts all correct predictions (including TN); F1 ignores TN and focuses on the positive class. For imbalanced data, accuracy is misleading; F1 is not.

Choose When

F1 for imbalanced classes; accuracy only when classes are roughly balanced and FP/FN costs are symmetric.

ROC-AUC vs. PR-AUC

Similarity

Both are threshold-independent summary metrics

Key Difference

ROC-AUC uses FPR = FP/(FP+TN); with abundant negatives, the large TN count dilutes the impact of false positives. PR-AUC uses precision (TP/(TP+FP)), which is directly affected by imbalance. PR-AUC is more sensitive to positive class performance.

Choose When

PR-AUC when positive class is rare (< 15% prevalence). ROC-AUC for balanced classes or when comparing across datasets with different prevalences.

MSE vs. MAE

Similarity

Both regression loss functions measuring prediction error

Key Difference

MSE squares errors — outliers have quadratic influence. MAE uses absolute values — outliers have linear influence. RMSE is in the same units as y; MSE is in y² units.

Choose When

MAE when you expect outliers and care about median error. MSE/RMSE when large errors are disproportionately costly and you want to penalize them more.

Precision vs. Recall

Similarity

Both derived from TP; both binary classification metrics

Key Difference

Precision denominator includes FP (false alarms); recall denominator includes FN (misses). Tuning threshold up → precision increases, recall decreases and vice versa.

Choose When

Precision: false alarm cost is high (spam filter). Recall: miss cost is high (disease detection). F1 when both matter equally.

Metric	Primary use	Suitable for imbalanced data?	Threshold-dependent?	Uses TN?
Accuracy	Balanced classification	✗ No	✓ Yes	✓ Yes
Precision	High FP cost	✓ Yes	✓ Yes	✗ No
Recall	High FN cost	✓ Yes	✓ Yes	✗ No
F1	Imbalanced classes, FP and FN both matter	✓ Yes	✓ Yes	✗ No
ROC-AUC	Model comparison	Partial	✗ No	✓ Yes
PR-AUC	Rare positive class	✓ Yes	✗ No	✗ No
RMSE	Regression, large errors costly	N/A	N/A	N/A
MAE	Regression, robust to outliers	N/A	N/A	N/A

Choose Evaluation Metrics when:

You need to evaluate a binary classifier across all thresholds on imbalanced data: use PR-AUC. For regression with potential outliers: use both RMSE and MAE together.

Evaluation

Brier Score

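The mean squared difference between predicted probabilities and the 0/1 outcomes, so it reflects calibration as well as discrimination. Lower is better. BS=0 is perfect; BS=0.25 is what a classifier that always predicts 0.5 scores. Unlike AUC, Brier Score penalizes poor probability estimates even if ranking is good.

Target: depends on prevalence. Always predicting the base rate p scores p(1−p) (0.09 at 10% prevalence, about 0.01 at 1%), so a useful model should score clearly below that baseline.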
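A short sketch of the Brier score and of why prevalence matters when reading it. The labels are a made-up 10%-positive dataset, and the two "models" are constant predictors with no ability to rank samples.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)

# Made-up labels: 10 positives, 90 negatives (10% prevalence).
y_true = [1] * 10 + [0] * 90

always_half = [0.5] * 100  # uninformative 50/50 prediction
base_rate = [0.1] * 100    # always predict the prevalence

print(f"always 0.5:       {brier_score(y_true, always_half):.4f}")
print(f"always base rate: {brier_score(y_true, base_rate):.4f}")
```

The base-rate predictor scores p(1−p) = 0.09 with zero discrimination, so a fixed absolute target says little on low-prevalence data; comparing against this baseline is more informative.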

Matthews Correlation Coefficient (MCC)

A balanced metric that accounts for all four confusion matrix cells. Ranges from -1 to +1. MCC=+1 is perfect; MCC=0 is random. More robust than F1 for highly imbalanced datasets because it includes TN.

Target: > 0.5 considered good; > 0.7 strong
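MCC can be computed directly from the four counts. In this sketch the counts are made up, and returning 0.0 when the denominator is zero is a common convention rather than part of the definition.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Made-up 1%-positive dataset with 1,000 samples.
all_negative = dict(tp=0, fp=0, fn=10, tn=990)  # accuracy 0.99
half_right = dict(tp=5, fp=5, fn=5, tn=985)     # accuracy 0.99
perfect = dict(tp=10, fp=0, fn=0, tn=990)

for name, counts in [("all negative", all_negative),
                     ("half right", half_right),
                     ("perfect", perfect)]:
    print(f"{name}: MCC={mcc(**counts):.3f}")
```

The first two profiles share the same accuracy, yet MCC separates a model that finds half the positives from one that finds none.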

MAPE (Mean Absolute Percentage Error)

Expresses regression error as a percentage of actual value. Scale-independent — useful for comparing performance across datasets with different y-scales. Undefined when any yᵢ = 0.

Target: < 10% is excellent; < 20% is good in most business forecasting contexts
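A minimal MAPE sketch that also shows the near-zero failure mode. The values are made up, and raising an error on zero targets is one possible way to handle the undefined case.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is 0")
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up values: errors of 10%, 10%, and 0%.
print(f"{mape([100, 200, 400], [110, 180, 400]):.2f}%")

# An absolute error of 1.0 on a target of 0.01 dominates the average.
print(f"{mape([0.01, 100], [1.01, 100]):.0f}%")
```

The second call shows how a single near-zero target can push MAPE into the thousands of percent even when absolute errors are small.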

Evaluation Process

01. Before choosing metrics: understand class distribution (value_counts()) and business cost structure (which error type is more costly).
02. Choose primary metric first (the one you'll optimize). Choose secondary metrics to diagnose failure modes.
03. For classification: compute confusion matrix, classification_report, and plot ROC + PR curves.
04. For imbalanced classification: report PR-AUC as primary; note class prevalence explicitly in reports.
05. For regression: report RMSE, MAE, and R². Plot residuals vs. predicted values to check for systematic bias.
06. Perform error analysis: examine the worst predictions (highest errors or misclassified samples) and ask what they have in common.

Evaluation Traps

▸Accuracy paradox: 99% accuracy on a 99/1 split means the model might predict all negatives. Always check confusion matrix.
▸AUC does not imply good precision at practical thresholds — a model with AUC=0.90 might have precision=0.05 at 80% recall for a 1% positive class.
▸Optimizing threshold on the test set produces metrics that cannot be reproduced in deployment — always tune on validation set.
▸Macro F1 can be high even when the most common class is misclassified, if rare classes happen to be classified well.

Real-World Interpretation Example

Fraud detection model: 0.2% fraud prevalence in 1M transactions. Accuracy = 99.80% (by predicting no fraud). ROC-AUC = 0.92 (looks great). PR-AUC = 0.43 (more honest — performance on actual fraud detection is mediocre). At threshold=0.7: Precision=0.62, Recall=0.51, F1=0.56. Business decision: lower threshold to 0.5 → Recall=0.72, Precision=0.38, F1=0.50 — more frauds caught but more false flags for the review team.
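The numbers in this example can be turned into approximate transaction counts. This is a back-of-the-envelope sketch: the counts are implied by the stated precision and recall, rounded to whole transactions, and `implied_counts` is a hypothetical helper.

```python
def implied_counts(n_positive, precision, recall):
    """Approximate TP, FP, FN implied by precision and recall."""
    tp = recall * n_positive
    flagged = tp / precision  # everything predicted positive
    fp = flagged - tp
    fn = n_positive - tp
    return round(tp), round(fp), round(fn)

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

n_fraud = round(1_000_000 * 0.002)  # 2,000 fraudulent transactions

for threshold, precision, recall in [(0.7, 0.62, 0.51), (0.5, 0.38, 0.72)]:
    tp, fp, fn = implied_counts(n_fraud, precision, recall)
    print(f"threshold={threshold}: caught={tp} false_flags={fp} missed={fn} "
          f"F1={f1(precision, recall):.2f}")
```

Lowering the threshold catches roughly 420 more frauds at the price of roughly 1,700 more false flags for the review team, which is the trade-off the business has to weigh.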

Common Mistakes

Students

×Reporting accuracy on imbalanced datasets as evidence of good model performance.
×Not knowing that F1 does not include True Negatives — thinking F1 captures all four confusion matrix cells.
×Confusing ROC-AUC with accuracy — AUC of 0.85 does NOT mean '85% of predictions are correct'.
×Thinking higher R² is always better: in-sample R² never decreases when you add a feature to a linear model, even a pure-noise feature.

Developers

×Using predict() for AUC computation instead of predict_proba() — AUC requires probability scores, not binary labels.
×Tuning the decision threshold by evaluating on the test set — the threshold becomes test-set-specific and won't generalize.
×Reporting only the primary metric without diagnosing failure cases — missing systematic errors on specific data slices.
×Treating fold-averaged AUC and pooled out-of-fold AUC as interchangeable. They can differ, especially with small or imbalanced folds, so state which one you report.

In Interviews

×Saying 'ROC-AUC is always better than PR-AUC' — PR-AUC is more informative for imbalanced datasets.
×Not being able to derive F1 as a harmonic mean — just memorizing the formula without understanding why harmonic mean is used.
×Confusing precision and recall definitions — have them memorized cold: precision = TP/(TP+FP), recall = TP/(TP+FN).
×Not knowing what AUC = 0.5 means (random classifier) or AUC < 0.5 (worse than random — predictions are systematically inverted).

Real Projects

×Not stratifying train/test splits on imbalanced datasets — the test set may have few or zero positive examples.
×Failing to check for calibration — a model with great AUC may give wildly miscalibrated probabilities, making downstream probability-based decisions unreliable.
×Ignoring per-class performance in multiclass settings — a high macro F1 can hide a specific class the model never predicts.
×Using MAPE when y values can be near zero — division by near-zero causes MAPE to blow up.

Core ML Thinking Lens

What kind of bias does this model have?

Bias depends on model assumptions and feature expressiveness.

What kind of variance does it have?

Variance grows with model flexibility and weak regularization.

How does it overfit?

Overfitting usually appears as strong train performance but weaker validation/test behavior.

How do we regularize it?

Use complexity constraints, robust validation, and data-centric cleanup.

What kind of data does it like?

Prefers representative, low-leakage data with stable feature definitions.

What kind of data breaks it?

Breaks under leakage, severe distribution drift, noisy labels, and poorly engineered features.

Summary Cheat Sheet

Quick Revision Reference

Key Takeaways

All classification metrics derive from the confusion matrix: TP, FP, FN, TN
Precision = TP/(TP+FP) — 'when I predict positive, am I correct?'
Recall = TP/(TP+FN) — 'of all actual positives, did I find them?'
F1 = harmonic mean of precision and recall = 2TP/(2TP+FP+FN)
ROC-AUC = probability a random positive outscores a random negative (threshold-independent)
PR-AUC is more informative than ROC-AUC for imbalanced datasets
RMSE penalizes large errors more; MAE weights each error in proportion to its size
R² = 1 - RSS/TSS measures variance explained; negative R² means worse than predicting the mean

Critical Formulas

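Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

ROC-AUC (probabilistic) = P(score of a random positive > score of a random negative)

RMSE = √( (1/n) Σ (yᵢ − ŷᵢ)² )

R² = 1 − RSS/TSS = 1 − Σ (yᵢ − ŷᵢ)² / Σ (yᵢ − ȳ)²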

Best For

✓F1 or PR-AUC: imbalanced binary classification
✓ROC-AUC: balanced binary, model selection independent of threshold
✓Recall: medical diagnosis, security (FN cost is high)
✓Precision: spam, content moderation (FP cost is high)
✓RMSE: regression where large errors are especially costly
✓MAE: regression where you want robust average error

Avoid When

✗Accuracy on imbalanced data
✗ROC-AUC on severely imbalanced data (< 5% positive)
✗RMSE when outliers dominate and you care about median error
✗Single metric for all deployment decisions — always examine the full metric landscape

Interview Must-Know

★Derive precision, recall, F1 from the confusion matrix definition

★Explain the probabilistic interpretation of AUC

★Explain why accuracy fails for imbalanced datasets with a concrete example

★Explain the precision-recall trade-off as threshold changes

★Know when to use ROC-AUC vs. PR-AUC

Interview Questions

Tricky Questions

These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.