In Plain English
Evaluation metrics are the yardsticks that tell you how well your model performs. Different problems have fundamentally different notions of 'good' — a medical test that misses 50% of cancer cases is catastrophic even if it's 99% accurate on healthy patients. Choosing the right metric is as important as choosing the right model.
Why It Exists
Accuracy is misleading for imbalanced classes, MSE is distorted by outliers, and ROC-AUC ignores class prevalence. Each metric reveals a different aspect of model behavior, and every real problem has a cost structure that determines which failures matter most.
Problem It Solves
Summarizing complex model behavior — a distribution of errors across thousands of predictions — into a single number (or a few numbers) that guides model selection, hyperparameter tuning, and business decision-making.
Real-Life Analogy
"Evaluating a student with only one test score is like using accuracy on an imbalanced dataset — it misses crucial nuance. A student might ace easy questions (98% of the test) while completely failing the hard ones that actually matter. Similarly, a fraud detector might score 99.8% accuracy by labeling everything as 'not fraud' — perfect at easy cases, catastrophic at the important ones."
When To Use
- Accuracy: balanced classes, equal cost of false positives and false negatives
- Precision: cost of false positives is high (spam filter, content moderation)
- Recall: cost of false negatives is high (disease detection, security systems)
- F1: when you need to balance precision and recall on imbalanced classes
- ROC-AUC: comparing model discrimination ability across all thresholds, threshold-invariant
- PR-AUC: imbalanced datasets where positive class performance matters most
- MSE/RMSE: regression, when large errors should be penalized more
- MAE: regression, when you want interpretable average error robust to outliers
When NOT To Use
- Accuracy for imbalanced classification — a 99% negative class gives 99% accuracy by predicting all negatives
- ROC-AUC when class imbalance is severe — PR-AUC is more informative
- RMSE when outliers are expected and you care about median error — use MAE
- R² alone for regression evaluation — it hides scale and systematic bias
- F1 when the costs of FP and FN are very different — use weighted F-beta score
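A minimal sketch of the accuracy trap described above. The 1,000-sample dataset with 1% positives and the always-negative "model" are illustrative assumptions, not data from this page:

```python
# Hypothetical data: 990 negatives and 10 positives (1% positive class).
y_true = [0] * 990 + [1] * 10
# A stand-in "model" that always predicts the negative class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the labels.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: fraction of actual positives that were found.
true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
actual_pos = sum(y_true)
recall = true_pos / actual_pos

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Accuracy comes out at 0.99 while recall is 0.00: the classifier never finds a single positive.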
Every classification model outputs a number (probability or score) for each sample. Before computing any metric, you apply a threshold to convert scores into binary predictions. Most metrics (accuracy, precision, recall, F1) depend on where you set this threshold. ROC-AUC and PR-AUC are threshold-agnostic — they summarize performance across all possible thresholds.
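To make the threshold dependence concrete, here is a small sketch with made-up scores (the eight samples and the helper name `confusion_counts` are assumptions for illustration). The same scores produce different confusion counts at each cutoff:

```python
def confusion_counts(y_true, y_score, threshold):
    """Return (TP, FP, FN, TN) after thresholding scores at `threshold`."""
    tp = fp = fn = tn = 0
    for t, s in zip(y_true, y_score):
        pred = 1 if s >= threshold else 0
        if pred == 1 and t == 1:
            tp += 1
        elif pred == 1 and t == 0:
            fp += 1
        elif pred == 0 and t == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Hypothetical labels and model scores.
y_true = [1, 1, 0, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.2]

results = {thr: confusion_counts(y_true, y_score, thr) for thr in (0.3, 0.5, 0.75)}
for thr, (tp, fp, fn, tn) in results.items():
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    print(f"threshold={thr}: TP={tp} FP={fp} FN={fn} TN={tn} "
          f"precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, raising the cutoff from 0.3 to 0.75 trades recall for precision, which is why a threshold-dependent metric should always be reported together with its threshold.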
The confusion matrix is the foundation of all classification metrics. It's a 2×2 table: True Positives (correctly predicted positive), True Negatives (correctly predicted negative), False Positives (predicted positive, actually negative), and False Negatives (predicted negative, actually positive). Every classification metric is a function of these four numbers.
For regression, the fundamental trade-off is between MSE (which averages squared errors, heavily penalizing large misses) and MAE (which averages absolute errors, so each miss counts in proportion to its size). Your choice should reflect whether large errors are disproportionately costly in your application.
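A short sketch of that trade-off, using hypothetical residuals (prediction minus truth) and one added outlier. The values are invented for illustration:

```python
import math

def mae(residuals):
    """Mean absolute error from a list of residuals."""
    return sum(abs(r) for r in residuals) / len(residuals)

def mse(residuals):
    """Mean squared error from a list of residuals."""
    return sum(r * r for r in residuals) / len(residuals)

def rmse(residuals):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mse(residuals))

# Hypothetical residuals: five ordinary misses, then the same plus one large miss.
typical = [1, -1, 2, -2, 1]
with_outlier = typical + [20]

for name, res in (("typical", typical), ("with outlier", with_outlier)):
    print(f"{name}: MAE={mae(res):.2f} MSE={mse(res):.2f} RMSE={rmse(res):.2f}")
```

The single large miss roughly triples MAE (1.4 to 4.5) but multiplies MSE by about thirty (2.2 to 68.5), which is the quadratic penalty in action.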
The Metaphor
"Think of a security guard checking bags. TP = correctly flagged a suspicious bag. TN = correctly passed a safe bag. FP = falsely flagged a safe bag (annoying, delays). FN = missed a truly dangerous bag (catastrophic). Precision measures: of bags you flagged, how many were actually dangerous? Recall measures: of all dangerous bags, how many did you catch? A strict guard (low threshold, flag everything) has high recall but low precision. A lenient guard has high precision but low recall. F1 finds the balance."
Beginner Mental Model
For classification: start with the confusion matrix (four cells: TP, TN, FP, FN). Every metric flows from these. Precision = TP/(TP+FP) = 'when I predict positive, am I right?'. Recall = TP/(TP+FN) = 'of all actual positives, did I find them?'. For regression: RMSE is the 'average error in original units, with extra penalty for big mistakes.' MAE is just the plain average absolute mistake.
Formal Definition
For binary classification with predictions ŷᵢ ∈ {0,1} and ground truth yᵢ ∈ {0,1}: the confusion matrix defines TP = Σ𝟏[ŷᵢ=1, yᵢ=1], FP = Σ𝟏[ŷᵢ=1, yᵢ=0], FN = Σ𝟏[ŷᵢ=0, yᵢ=1], TN = Σ𝟏[ŷᵢ=0, yᵢ=0]. All classification metrics are derived from these counts. For regression with predictions ŷᵢ ∈ ℝ, metrics measure deviations between ŷᵢ and yᵢ under different loss functions.
Key Terms
- True Positive (TP)
- A positive sample correctly predicted as positive. In disease detection: sick patient correctly identified as sick.
- True Negative (TN)
- A negative sample correctly predicted as negative. In disease detection: healthy patient correctly identified as healthy.
- False Positive (FP)
- A negative sample incorrectly predicted as positive. Type I error. In disease detection: healthy patient falsely flagged as sick.
- False Negative (FN)
- A positive sample incorrectly predicted as negative. Type II error. In disease detection: sick patient falsely cleared as healthy.
- Precision
- Of all samples predicted positive, what fraction truly are positive? TP/(TP+FP). High precision = few false alarms.
- Recall (Sensitivity, TPR)
- Of all truly positive samples, what fraction did the model find? TP/(TP+FN). High recall = few misses.
- Specificity (TNR)
- Of all truly negative samples, what fraction did the model correctly identify as negative? TN/(TN+FP). The 'recall for the negative class.'
- ROC Curve
- Receiver Operating Characteristic. A curve plotting TPR (recall) on Y-axis vs FPR (= 1 - specificity) on X-axis at every possible threshold. AUC is the area under this curve.
- PR Curve
- Precision-Recall curve. Plots precision on Y-axis vs. recall on X-axis at every possible threshold. More informative than ROC for severely imbalanced datasets.
- AUC (Area Under Curve)
- Area under the ROC curve. Equals the probability that the model ranks a random positive sample higher than a random negative sample. Perfect model: AUC=1. Random classifier: AUC=0.5.
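The two views of AUC in the terms above (area under the ROC curve, and probability of ranking a positive above a negative) can be checked against each other. This is a sketch on invented scores; the helper names are hypothetical, and it assumes both classes are present:

```python
def roc_points(y_true, y_score):
    """(FPR, TPR) at every distinct threshold, starting from (0, 0)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(y_score), reverse=True):
        tp = sum(1 for t, s in zip(y_true, y_score) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, y_score) if s >= thr and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and scores.
y_true = [1, 1, 0, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.2]

area_auc = trapezoid_area(roc_points(y_true, y_score))
rank_auc = pairwise_auc(y_true, y_score)
print(area_auc, rank_auc)
```

Both computations give 0.8125 here: 13 of the 16 positive-negative pairs are ordered correctly.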
Step-by-Step Working
- 1. Identify the problem type: binary classification, multiclass, regression, ranking.
- 2. Understand the business cost structure: what's the relative cost of FP vs. FN?
- 3. Check class balance: if positive class < 10% of data, avoid accuracy and ROC-AUC as primary metrics.
- 4. Choose primary metric aligned with costs: precision (FP costly), recall (FN costly), F1 (balanced), PR-AUC (imbalanced).
- 5. Choose secondary metrics to give additional perspective (e.g., RMSE + MAE for regression).
- 6. Apply threshold tuning: find the decision threshold that optimizes your primary metric on the validation set.
- 7. Report metric on held-out test set — never tune threshold on the test set.
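Steps 6 and 7 can be sketched as follows. The validation and test arrays are invented, the helper names are hypothetical, and maximizing F1 is just one possible choice of primary metric:

```python
def f1_at(y_true, y_score, thr):
    """F1 after thresholding scores at `thr` (0.0 if there are no positives at all)."""
    tp = sum(1 for t, s in zip(y_true, y_score) if s >= thr and t == 1)
    fp = sum(1 for t, s in zip(y_true, y_score) if s >= thr and t == 0)
    fn = sum(1 for t, s in zip(y_true, y_score) if s < thr and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_f1_threshold(y_true, y_score):
    """Pick the candidate threshold with the highest F1 (lowest one wins ties)."""
    candidates = sorted(set(y_score))
    return max(candidates, key=lambda thr: f1_at(y_true, y_score, thr))

# Hypothetical validation split: used ONLY to choose the threshold.
y_val = [1, 1, 0, 1, 0, 1, 0, 0]
s_val = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.2]
chosen_thr = best_f1_threshold(y_val, s_val)

# Hypothetical test split: the threshold is frozen and applied once.
y_test = [1, 0, 1, 0, 0, 1]
s_test = [0.85, 0.5, 0.45, 0.3, 0.1, 0.2]
test_f1 = f1_at(y_test, s_test, chosen_thr)

print(f"chosen threshold={chosen_thr}, validation F1={f1_at(y_val, s_val, chosen_thr):.2f}, "
      f"test F1={test_f1:.2f}")
```

The test F1 is lower than the validation F1 on this toy data, which is the expected direction: the threshold was fitted to the validation set, so only the test number is an honest estimate.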
Inputs
For classification: predicted class labels or probability scores + ground truth labels. For regression: predicted continuous values + ground truth values.
Outputs
Scalar metric value(s) summarizing model performance. For curve-based metrics (ROC, PR): a list of (threshold, metric) pairs forming a curve, plus the area under it.
Model Assumptions
Important Edge Cases
- ▸Precision is undefined when TP+FP=0 (model never predicts positive). Set to 0 or handle separately.
- ▸Recall is undefined when TP+FN=0 (no actual positives in the dataset). Indicates wrong data split.
- ▸F1 = 0 when precision = 0 or recall = 0 — the model has completely failed on one end.
- ▸R² can be negative when the model is worse than always predicting the mean. R² is bounded above by 1 but has no lower bound.
- ▸AUC = 0.5 means the scores carry no ranking information, for example when the positive and negative score distributions are identical.
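Two of these edge cases in code, as a hedged sketch: `safe_divide` and `r_squared` are hypothetical helper names, returning 0.0 for an undefined ratio is one convention among several, and `r_squared` assumes the true values are not all identical:

```python
def safe_divide(numerator, denominator, default=0.0):
    """Return numerator/denominator, or `default` when the ratio is undefined."""
    return numerator / denominator if denominator else default

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (assumes y_true is not constant)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Model never predicts positive: TP = 0 and FP = 0, so precision is 0/0.
precision_never_positive = safe_divide(0, 0 + 0)

# Predictions in reverse order of the truth: worse than predicting the mean.
r2_reversed = r_squared([1, 2, 3], [3, 2, 1])

print(precision_never_positive, r2_reversed)
```

Here R² is -3.0, because the residual sum of squares (8) is four times the total sum of squares (2).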
Role in the ML Pipeline
Evaluation metrics are applied after model training and prediction. In a proper ML pipeline: train on training set → predict on validation set → compute metrics → tune hyperparameters → final evaluation on held-out test set. Metrics guide every iteration of model development.
Data Preprocessing
- 01.Ensure labels are correctly encoded: binary (0/1), multiclass (0,1,2,...), or continuous for regression.
- 02.For imbalanced datasets: stratify train/test splits to maintain class proportions in each split.
- 03.Check for label noise — mislabeled samples inflate FP/FN counts and distort metrics.
- 04.For regression: check target distribution. Highly skewed y may make RMSE misleading (dominated by tail).
Training Process
- 01.Train model on training set. Tune hyperparameters using validation metrics (not test metrics).
- 02.For classification: use predict_proba() to get probability scores, then sweep thresholds to plot ROC/PR curves.
- 03.Apply threshold selection based on the business cost function (e.g., maximize F1 or recall @ precision > 0.8).
- 04.For multiclass: decide averaging strategy (macro, micro, weighted) before evaluation.
- 05.Report final metrics on the held-out test set exactly once — no further tuning after seeing test metrics.
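Step 03 mentions selecting a threshold such as "recall @ precision > 0.8". One way to sketch that constrained search is below. The data and function names are assumptions, and the constraint is implemented as precision >= the floor:

```python
def precision_recall_at(y_true, y_score, thr):
    """Precision and recall after thresholding scores at `thr`."""
    tp = sum(1 for t, s in zip(y_true, y_score) if s >= thr and t == 1)
    fp = sum(1 for t, s in zip(y_true, y_score) if s >= thr and t == 0)
    fn = sum(1 for t, s in zip(y_true, y_score) if s < thr and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def max_recall_threshold(y_true, y_score, min_precision):
    """Return (recall, threshold) with the highest recall among thresholds
    whose precision meets the floor, or None if no threshold qualifies."""
    best = None
    for thr in sorted(set(y_score)):
        precision, recall = precision_recall_at(y_true, y_score, thr)
        if precision >= min_precision and (best is None or recall > best[0]):
            best = (recall, thr)
    return best

# Hypothetical validation labels and scores.
y_val = [1, 1, 0, 1, 0, 1, 0, 0]
s_val = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.2]

strict = max_recall_threshold(y_val, s_val, min_precision=0.8)
relaxed = max_recall_threshold(y_val, s_val, min_precision=0.75)
print(strict, relaxed)
```

Relaxing the precision floor from 0.8 to 0.75 moves the operating point from threshold 0.8 (recall 0.5) to threshold 0.6 (recall 0.75) on this toy data. Returning None when nothing qualifies keeps the "constraint cannot be met" case explicit.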
Hyperparameters
Name
Decision Threshold
Description
The probability cutoff that converts model scores into binary predictions.
Typical
0.5 by default; often tuned to 0.3–0.7 depending on FP/FN cost ratio
Name
Averaging method (multiclass)
Description
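How to aggregate per-class metrics in multiclass settings: macro (unweighted mean of per-class metrics), micro (pool TP, FP, and FN across all classes, then compute the metric once), weighted (mean of per-class metrics weighted by each class's support).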
Typical
Weighted F1 for imbalanced multiclass; macro for equal-class treatment
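The three averaging strategies can give quite different numbers on the same predictions. A small from-scratch sketch on an invented 3-class example (labels, predictions, and helper names are all illustrative assumptions):

```python
def per_class_counts(y_true, y_pred, label):
    """One-vs-rest TP, FP, FN for a single class label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical imbalanced 3-class problem: class 2 has a single sample.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

labels = sorted(set(y_true) | set(y_pred))
counts = {c: per_class_counts(y_true, y_pred, c) for c in labels}
f1 = {c: f1_from_counts(*counts[c]) for c in labels}
support = {c: y_true.count(c) for c in labels}

macro_f1 = sum(f1.values()) / len(labels)
weighted_f1 = sum(support[c] * f1[c] for c in labels) / len(y_true)
micro_f1 = f1_from_counts(
    sum(tp for tp, _, _ in counts.values()),
    sum(fp for _, fp, _ in counts.values()),
    sum(fn for _, _, fn in counts.values()),
)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"macro={macro_f1:.3f} weighted={weighted_f1:.3f} micro={micro_f1:.3f} accuracy={accuracy:.3f}")
```

Macro F1 (about 0.48) is pulled down by the rare class the model never predicts, weighted F1 (about 0.66) mostly reflects the large class, and micro F1 equals accuracy (0.70) because every sample has exactly one label.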
Implementation Checklist
- 1. Import the tools: from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, average_precision_score, roc_curve, precision_recall_curve; import matplotlib.pyplot as plt
- 2. Generate predictions: y_prob = model.predict_proba(X_test)[:, 1]; y_pred = (y_prob >= threshold).astype(int)
- 3. Compute confusion matrix: cm = confusion_matrix(y_test, y_pred)
- 4. Print full report: print(classification_report(y_test, y_pred)), which gives precision, recall, F1 per class
- 5. Plot ROC curve: fpr, tpr, _ = roc_curve(y_test, y_prob); plt.plot(fpr, tpr)
- 6. Plot PR curve: prec, rec, _ = precision_recall_curve(y_test, y_prob); plt.plot(rec, prec)
- 7. Tune threshold: find optimal threshold from validation PR curve before applying to test set
```python
import numpy as np

# ── Classification Metrics from Scratch ───────────────────────────────────────
def confusion_matrix_counts(y_true, y_pred):
    """Return TP, FP, FN, TN for binary classification."""
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    TP = int(((y_pred == 1) & (y_true == 1)).sum())
    FP = int(((y_pred == 1) & (y_true == 0)).sum())
    FN = int(((y_pred == 0) & (y_true == 1)).sum())
    TN = int(((y_pred == 0) & (y_true == 0)).sum())
    return TP, FP, FN, TN

def accuracy(y_true, y_pred):
    return np.mean(np.array(y_true) == np.array(y_pred))

def precision(y_true, y_pred):
    TP, FP, FN, TN = confusion_matrix_counts(y_true, y_pred)
    # Undefined when nothing is predicted positive; return 0.0 by convention
    return TP / (TP + FP) if (TP + FP) > 0 else 0.0

def recall(y_true, y_pred):
    TP, FP, FN, TN = confusion_matrix_counts(y_true, y_pred)
    # Undefined when there are no actual positives; return 0.0 by convention
    return TP / (TP + FN) if (TP + FN) > 0 else 0.0

def f1_score(y_true, y_pred):
    P = precision(y_true, y_pred)
    R = recall(y_true, y_pred)
    return 2 * P * R / (P + R) if (P + R) > 0 else 0.0

def fbeta_score(y_true, y_pred, beta):
    P = precision(y_true, y_pred)
    R = recall(y_true, y_pred)
    denom = beta**2 * P + R
    return (1 + beta**2) * P * R / denom if denom > 0 else 0.0

def roc_auc(y_true, y_scores):
    """Compute AUC via the Mann-Whitney U statistic (exact, no sorting trick)."""
    y_true, y_scores = np.array(y_true), np.array(y_scores)
    pos = y_scores[y_true == 1]
    neg = y_scores[y_true == 0]
    n_pos, n_neg = len(pos), len(neg)
    if n_pos == 0 or n_neg == 0:
        return float('nan')
    # Broadcasting: (n_pos, 1) vs (1, n_neg) compares every positive/negative pair
    # (O(n_pos * n_neg) memory, fine for small demos)
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (n_pos * n_neg)

# ── Regression Metrics from Scratch ───────────────────────────────────────────
def mse(y_true, y_pred):
    return np.mean((np.array(y_true) - np.array(y_pred)) ** 2)

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    return np.mean(np.abs(np.array(y_true) - np.array(y_pred)))

def r2(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    # R² is undefined for a constant target; return 0.0 by convention
    return 1 - ss_res / ss_tot if ss_tot > 0 else 0.0

# ── Demo ───────────────────────────────────────────────────────────────────────
np.random.seed(42)
n = 1000

# Imbalanced binary classification (10% positive)
y_true_cls = (np.random.rand(n) < 0.10).astype(int)
# Positives score in [0.7, 1.0], negatives in [0.0, 0.4): cleanly separable
y_scores = np.clip(y_true_cls * 0.7 + np.random.rand(n) * 0.4, 0, 1)
y_pred_cls = (y_scores >= 0.5).astype(int)

print("=== Classification ===")
print(f"Confusion: {confusion_matrix_counts(y_true_cls, y_pred_cls)}")
# High accuracy alone says little on imbalanced data; here it is high because
# the synthetic scores separate the two classes cleanly at threshold 0.5
print(f"Accuracy: {accuracy(y_true_cls, y_pred_cls):.4f}")
print(f"Precision: {precision(y_true_cls, y_pred_cls):.4f}")
print(f"Recall: {recall(y_true_cls, y_pred_cls):.4f}")
print(f"F1: {f1_score(y_true_cls, y_pred_cls):.4f}")
print(f"ROC-AUC: {roc_auc(y_true_cls, y_scores):.4f}")

# Regression: predictions = truth + Gaussian noise + a small constant bias
y_true_reg = np.random.randn(n) * 10 + 50
y_pred_reg = y_true_reg + np.random.randn(n) * 3 + 0.5

print("\n=== Regression ===")
print(f"MSE: {mse(y_true_reg, y_pred_reg):.4f}")
print(f"RMSE: {rmse(y_true_reg, y_pred_reg):.4f}")
print(f"MAE: {mae(y_true_reg, y_pred_reg):.4f}")
print(f"R²: {r2(y_true_reg, y_pred_reg):.4f}")
```
Sample Input
y_test = [1,0,0,1,1,0,1,0,0,1] (10 samples, 50% positive)
y_prob = [0.82, 0.31, 0.15, 0.91, 0.72, 0.43, 0.68, 0.22, 0.09, 0.77]
Sample Output
Confusion (threshold=0.5): TP=5, FP=0, FN=0, TN=5
Accuracy: 1.00, Precision: 1.00, Recall: 1.00, F1: 1.00
ROC-AUC: 1.00, PR-AUC: 1.00
(Perfect model on this toy example)
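The PR-AUC value in the sample output can be reproduced with a from-scratch average precision, which is one common way to summarize the PR curve. This sketch assumes there are no tied scores and at least one positive; the function name and the second, imperfect example are illustrative:

```python
def average_precision(y_true, y_score):
    """Step-wise area under the PR curve: mean of precision at each positive's rank."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    n_pos = sum(y_true)
    tp = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            # Recall rises by 1/n_pos here; weight it by precision at this rank.
            ap += (tp / rank) / n_pos
    return ap

# Sample input from above: every positive outranks every negative.
y_test = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
y_prob = [0.82, 0.31, 0.15, 0.91, 0.72, 0.43, 0.68, 0.22, 0.09, 0.77]
ap_sample = average_precision(y_test, y_prob)

# A hypothetical imperfect ranking: one negative sits between two positives.
ap_imperfect = average_precision([1, 0, 1], [0.9, 0.8, 0.7])

print(f"sample AP={ap_sample:.2f}, imperfect AP={ap_imperfect:.4f}")
```

The sample input gives 1.00, matching the output above, while the imperfect ranking gives (1/1 + 2/3) / 2, roughly 0.83.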
Key Implementation Insights
- →For imbalanced datasets (< 20% positive), use PR-AUC as primary metric, not ROC-AUC. ROC-AUC can be deceptively high even when the model barely finds positives.
- →Always plot both ROC and PR curves — they reveal different aspects. ROC shows overall ranking quality; PR shows performance specifically on the minority class.
- →Threshold selection should happen on the validation set, never the test set. Tune to maximize your business objective (e.g., maximize recall subject to precision ≥ 0.8).
- →classification_report gives per-class precision, recall, F1, and support — always check per-class performance, not just macro averages.
- →A large gap between RMSE and MAE in regression usually means a few large errors are dominating RMSE. Investigate these cases before reporting either metric.
Common Implementation Mistakes
- ✗Reporting accuracy on an imbalanced dataset — 95% accuracy can mean the model just predicts the majority class for everything.
- ✗Using predict() instead of predict_proba() for AUC — AUC requires continuous scores, not binary predictions.
- ✗Tuning the decision threshold on the test set — this leaks test information and gives optimistic threshold performance.
- ✗Forgetting that macro-averaged F1 weights all classes equally, including tiny classes that may have unstable estimates.
- ✗Confusing ROC-AUC and PR-AUC — reporting one and claiming the other.
Balanced Binary Classification
Accuracy, precision, recall, F1, and ROC-AUC are all interpretable and meaningful when classes are roughly balanced. No single metric is misleading.
Imbalanced Binary Classification (< 10% positive)
Accuracy is catastrophically misleading. ROC-AUC can be inflated. PR-AUC and F1 at optimal threshold are the most informative metrics.
Multiclass Classification
Macro-averaged F1 and per-class classification_report are standard. For single-label multiclass problems, micro-averaged precision, recall, and F1 all equal accuracy.
Regression with Outliers
MSE/RMSE are dominated by outliers. MAE is more robust. R² can be high even with systematic bias in certain value ranges.
Medical / High-Stakes Binary
Recall (sensitivity) and specificity are the primary clinical metrics. PPV (precision) and NPV are reported for screening vs. confirmatory tests.
Ranking / Recommendation
Standard metrics (F1, accuracy) are inappropriate for ranking tasks. Use MAP (Mean Average Precision), NDCG (Normalized Discounted Cumulative Gain), or MRR (Mean Reciprocal Rank).
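Two of the ranking metrics named above, sketched from scratch. The assumptions are stated in the comments: relevance is given per ranked position, DCG uses the linear-gain form rel / log2(rank + 1) (another common variant uses 2^rel - 1), and the ideal ordering is taken from the same list. All data and function names are illustrative:

```python
import math

def reciprocal_rank(relevances):
    """1/rank of the first relevant item (relevance > 0), or 0.0 if none."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical results for three queries: 1 = relevant, 0 = not relevant,
# listed in the order the system ranked them.
queries = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
mrr = sum(reciprocal_rank(q) for q in queries) / len(queries)

ndcg_good = ndcg_at_k([1, 0], k=2)  # relevant item ranked first
ndcg_bad = ndcg_at_k([0, 1], k=2)   # relevant item ranked second

print(f"MRR={mrr:.2f} NDCG(good)={ndcg_good:.3f} NDCG(bad)={ndcg_bad:.3f}")
```

Unlike accuracy or F1, both metrics reward placing relevant items near the top: moving the single relevant item from rank 1 to rank 2 drops NDCG from 1.0 to about 0.63.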
Metric Comparison: Same Model, Different Metrics
Shows how the same model evaluated on an imbalanced dataset (5% positive class) produces dramatically different metric values. Accuracy is deceptively high while F1 and recall reveal poor positive class performance.
ROC Curve — Three Models Compared
ROC curves for a strong model (AUC=0.93), a weak model (AUC=0.72), and random classifier (AUC=0.50). Each point on a curve corresponds to a different decision threshold. The further the curve bows toward the top-left, the better the model's discrimination ability.
Precision-Recall Trade-off by Threshold
As the decision threshold increases (stricter about predicting positive), precision rises but recall falls. The F1 score peaks at the optimal threshold (~0.35 in this example). The intersection of precision and recall curves marks the balanced operating point.
Advantages
Business-aligned evaluation
Metrics can be chosen to directly reflect the cost structure of the problem. F-beta with β=2 treats recall as twice as important as precision, so missed positives cost more than false alarms: a direct encoding of business priority. This makes metrics interpretable to non-technical stakeholders.
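A short sketch of how β shifts the ranking of two operating points. The precision/recall pairs are hypothetical, and the formula is the standard F-beta definition, (1 + β²)·P·R / (β²·P + R):

```python
def fbeta(precision, recall, beta):
    """F-beta from precision and recall; beta > 1 favors recall."""
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom > 0 else 0.0

# Two hypothetical operating points with mirrored precision and recall.
cautious = {"precision": 0.9, "recall": 0.5}    # few false alarms, many misses
aggressive = {"precision": 0.5, "recall": 0.9}  # many false alarms, few misses

f1_cautious = fbeta(cautious["precision"], cautious["recall"], beta=1)
f1_aggressive = fbeta(aggressive["precision"], aggressive["recall"], beta=1)
f2_cautious = fbeta(cautious["precision"], cautious["recall"], beta=2)
f2_aggressive = fbeta(aggressive["precision"], aggressive["recall"], beta=2)

print(f"F1: cautious={f1_cautious:.3f} aggressive={f1_aggressive:.3f}")
print(f"F2: cautious={f2_cautious:.3f} aggressive={f2_aggressive:.3f}")
```

F1 cannot tell the two points apart (both about 0.643), while F2 clearly prefers the high-recall point (about 0.776 vs. 0.549).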
Threshold-independent analysis with AUC
ROC-AUC and PR-AUC evaluate the model's entire operating range at once. This is essential for comparing models before deployment, since the threshold is often a business decision made separately from model training.
Multiclass support via averaging
All binary metrics generalize to multiclass via macro, micro, or weighted averaging. Macro averaging is class-imbalance-aware; it ensures rare classes don't get ignored in overall performance summaries.
Complementary regression metrics expose different failure modes
MSE/RMSE reveals catastrophic outlier errors; MAE reveals typical daily errors; R² reveals relative improvement over a naive baseline. Reporting all three together tells a complete story about regression model quality.
Confusion matrix enables detailed error analysis
The confusion matrix reveals the exact nature of errors — not just how many but what type. In multiclass settings, the full confusion matrix shows which classes are being confused with which, guiding targeted improvement.
Threshold tuning enables operating point optimization
By sweeping thresholds and plotting precision-recall trade-off curves, you can select the exact operating point that satisfies business constraints — e.g., 'maximize recall while keeping precision ≥ 80%'. This is powerful, systematic decision-making.
Limitations
Aggregation hides per-sample behavior
All standard metrics aggregate across samples. A model with average F1=0.85 might have F1=0.99 on easy cases and F1=0.20 on hard, critical cases. Always segment metrics by input slice (e.g., by demographic group, by feature value range) to catch hidden failures.
Accuracy paradox for imbalanced data
A model that predicts the majority class for every sample achieves accuracy equal to the majority class prevalence. At 95% prevalence, this gives 95% accuracy with zero ability to detect the minority class. Accuracy is actively misleading without class balance verification.
ROC-AUC optimistic on severely imbalanced data
ROC-AUC is built from TPR and FPR. Because FPR = FP/(FP+TN), an abundant negative class keeps FPR small even when false positives far outnumber true positives. A model whose positive predictions are mostly wrong can therefore still achieve ROC-AUC > 0.85. Use PR-AUC for imbalanced evaluation.
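The effect can be made concrete by fixing one ROC operating point and varying only the class prevalence. The TPR, FPR, and prevalence values below are illustrative assumptions; the identity precision = TPR·π / (TPR·π + FPR·(1 − π)) follows from the confusion-matrix definitions:

```python
def precision_from_rates(tpr, fpr, prevalence):
    """Precision implied by a (TPR, FPR) operating point at a given prevalence."""
    true_pos = tpr * prevalence            # expected TP per sample
    false_pos = fpr * (1.0 - prevalence)   # expected FP per sample
    return true_pos / (true_pos + false_pos)

# Same hypothetical ROC point, two different class balances.
tpr, fpr = 0.80, 0.05
precision_balanced = precision_from_rates(tpr, fpr, prevalence=0.50)
precision_rare = precision_from_rates(tpr, fpr, prevalence=0.01)

print(f"balanced: precision={precision_balanced:.3f}")
print(f"1% positive: precision={precision_rare:.3f}")
```

The ROC curve is identical in both cases, yet precision falls from about 0.94 to about 0.14 when positives are rare, which is exactly what the PR curve exposes and the ROC curve hides.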
Discrimination metrics ignore confidence calibration
F1 and AUC measure discrimination ability — ranking positives above negatives — but say nothing about whether predicted probabilities are reliable. A model with AUC=0.95 might predict P(positive)=0.80 for samples where the true rate is 0.30. Use calibration curves (reliability diagrams) and Brier score for calibration evaluation.
Metric gaming is possible without improvement
Optimizing a metric directly (especially on training data) can game it without improving real-world performance. Threshold selection on the test set artificially inflates reported metrics. The 'Goodhart's Law' of ML: once a metric becomes a target, it ceases to be a good measure.
Cancer screening classifier
Primary metric: Recall at Precision ≥ 0.50 (catch all cancers; tolerate some false positives that get confirmed with further tests). Secondary: PR-AUC for model selection. Never optimize accuracy — disease prevalence of 1% would make 99% accuracy trivial.
Credit card fraud detection
PR-AUC is the standard metric. Fraud rate is ~0.1%, so ROC-AUC can look deceptively high even for weak models. Also report F1 at the operational threshold, with attention to the cost of false positives (blocking legitimate transactions) vs. false negatives (missing fraud).
Email spam classification
Optimize Precision at Recall ≥ 0.70: blocking legitimate email (FP) is very costly; some spam getting through (FN) is tolerable. Set a high threshold (high precision operating point on the PR curve).
Defect detection (visual inspection)
High recall mandatory: missing a defective product going to market is catastrophic. False positives (flagging good products) trigger human review — expensive but not catastrophic. F-beta with β=2 is appropriate.
House price prediction (regression)
Report RMSE (error in dollars, sensitive to large misses), MAE (typical absolute error in dollars), and R² (explained variance). Compare to baseline RMSE of always predicting the mean. Agents care about MAE; risk managers care about RMSE (outlier sensitivity).
Document retrieval system
Standard classification metrics don't apply. Use Mean Average Precision (MAP) at k=10 — averages precision at each rank position where a relevant document appears. NDCG@k weights by logarithmic rank decay.
Evaluation metrics are not interchangeable. Here's a systematic comparison of the most important ones:
Accuracy vs. F1
Similarity
Both are classification performance metrics, both scale 0 to 1
Key Difference
Accuracy counts all correct predictions (including TN); F1 ignores TN and focuses on the positive class. For imbalanced data, accuracy is misleading; F1 is much less so because it ignores the abundant true negatives.
Choose When
F1 for imbalanced classes; accuracy only when classes are roughly balanced and FP/FN costs are symmetric.
ROC-AUC vs. PR-AUC
Similarity
Both are threshold-independent summary metrics
Key Difference
ROC-AUC uses FPR, whose denominator (FP+TN) includes TN, so a large negative class dilutes the impact of false positives. PR-AUC uses precision (TP/(TP+FP)), which is directly affected by imbalance. PR-AUC is more sensitive to positive class performance.
Choose When
PR-AUC when positive class is rare (< 15% prevalence). ROC-AUC for balanced classes or when comparing across datasets with different prevalences.
MSE vs. MAE
Similarity
Both regression loss functions measuring prediction error
Key Difference
MSE squares errors — outliers have quadratic influence. MAE uses absolute values — outliers have linear influence. RMSE is in the same units as y; MSE is in y² units.
Choose When
MAE when you expect outliers and care about median error. MSE/RMSE when large errors are disproportionately costly and you want to penalize them more.
Precision vs. Recall
Similarity
Both derived from TP; both binary classification metrics
Key Difference
Precision denominator includes FP (false alarms); recall denominator includes FN (misses). Tuning threshold up → precision increases, recall decreases and vice versa.
Choose When
Precision: false alarm cost is high (spam filter). Recall: miss cost is high (disease detection). F1 when both matter equally.
| Metric | Use case | Suitable for imbalanced data? | Threshold-dependent? | Uses TN? |
|---|---|---|---|---|
| Accuracy | Balanced classification | ✗ No | ✓ Yes | ✓ Yes |
| Precision | High FP cost | ✓ Yes | ✓ Yes | ✗ No |
| Recall | High FN cost | ✓ Yes | ✓ Yes | ✗ No |
| F1 | Precision-recall balance | ✓ Yes | ✓ Yes | ✗ No |
| ROC-AUC | Model comparison | Partial | ✗ No | ✓ Yes |
| PR-AUC | Rare positive class | ✓ Yes | ✗ No | ✗ No |
| RMSE | Regression, large errors costly | N/A | N/A | N/A |
| MAE | Regression, robust to outliers | N/A | N/A | N/A |
Choose Evaluation Metrics when:
You need to evaluate a binary classifier independent of threshold choice on imbalanced data: use PR-AUC. For regression with potential outliers, use both RMSE and MAE together.
Brier Score
Measures the accuracy of probability estimates (calibration). Lower is better. BS=0 is perfect; BS=0.25 is a random 50/50 classifier. Unlike AUC, Brier Score penalizes poor probability estimates even if ranking is good.
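Target: < 0.10 as a starting point, but always compare against the baseline p(1−p) obtained by predicting the prevalence p for every sample; on low-prevalence datasets that baseline is already far below 0.10.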
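A minimal sketch of why the Brier score complements AUC. The two sets of predicted probabilities are invented: both rank the samples identically (so their AUC is the same), but one is badly miscalibrated on the negatives:

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)

y_true = [0, 0, 1, 1]
# Hypothetical model A: confident and correct.
prob_a = [0.1, 0.2, 0.8, 0.9]
# Hypothetical model B: same ordering, but negatives get probabilities above 0.5.
prob_b = [0.6, 0.7, 0.8, 0.9]

brier_a = brier_score(y_true, prob_a)
brier_b = brier_score(y_true, prob_b)
# Constant 0.5 prediction, the "random 50/50" reference point.
brier_coin = brier_score(y_true, [0.5] * len(y_true))

print(f"A={brier_a:.3f} B={brier_b:.3f} coin={brier_coin:.3f}")
```

Model B scores 0.225, barely better than the 0.25 of a constant 0.5 prediction, even though its ranking is perfect; model A scores 0.025.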
Matthews Correlation Coefficient (MCC)
A balanced metric that accounts for all four confusion matrix cells. Ranges from -1 to +1. MCC=+1 is perfect; MCC=0 is random. More robust than F1 for highly imbalanced datasets because it includes TN.
Target: > 0.5 considered good; > 0.7 strong
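MCC can be computed directly from the four confusion-matrix cells. In this sketch, returning 0.0 when the denominator is zero is a common convention rather than part of the definition, and the counts are hypothetical:

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; 0.0 when any marginal total is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical 1,000-sample dataset with 10 positives.
mcc_all_negative = mcc(tp=0, fp=0, fn=10, tn=990)  # never predicts positive
mcc_partial = mcc(tp=5, fp=5, fn=5, tn=985)        # finds half the positives
mcc_perfect = mcc(tp=10, fp=0, fn=0, tn=990)

accuracy_partial = (5 + 985) / 1000

print(f"all-negative={mcc_all_negative:.3f} partial={mcc_partial:.3f} "
      f"perfect={mcc_perfect:.3f} (partial accuracy={accuracy_partial:.2f})")
```

The "partial" model has 99% accuracy but an MCC of only about 0.49, and the all-negative model scores 0.0, so MCC does not reward simply predicting the majority class.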
MAPE (Mean Absolute Percentage Error)
Expresses regression error as a percentage of actual value. Scale-independent — useful for comparing performance across datasets with different y-scales. Undefined when any yᵢ = 0.
Target: < 10% is excellent; < 20% is good in most business forecasting contexts
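A sketch of MAPE that makes the zero-target problem explicit. Raising an error on zeros is one possible policy (an assumption here, not a standard), and the example values are invented:

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Undefined if any true value is 0."""
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when any true value is 0")
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Every prediction is off by 10% of the true value, at very different scales.
mape_normal = mape([100, 200, 50], [110, 180, 55])

# An absolute error of 1.0 on a near-zero target explodes the percentage.
mape_near_zero = mape([0.01], [1.01])

print(f"normal={mape_normal:.1f}% near-zero={mape_near_zero:.0f}%")
```

The first case gives 10% regardless of scale, which is the appeal of MAPE; the second gives 10,000% from a single small target, which is why it is a poor fit when y can be near zero.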
Evaluation Process
- 01.Before choosing metrics: understand class distribution (value_counts()) and business cost structure (which error type is more costly).
- 02.Choose primary metric first (the one you'll optimize). Choose secondary metrics to diagnose failure modes.
- 03.For classification: compute confusion matrix, classification_report, and plot ROC + PR curves.
- 04.For imbalanced classification: report PR-AUC as primary; note class prevalence explicitly in reports.
- 05.For regression: report RMSE, MAE, and R². Plot residuals vs. predicted values to check for systematic bias.
- 06.Perform error analysis: examine the worst predictions (highest errors or misclassified samples) and ask what they have in common.
Evaluation Traps
- ▸Accuracy paradox: 99% accuracy on a 99/1 split means the model might predict all negatives. Always check confusion matrix.
- ▸AUC does not imply good precision at practical thresholds — a model with AUC=0.90 might have precision=0.05 at 80% recall for a 1% positive class.
- ▸Optimizing threshold on the test set produces metrics that cannot be reproduced in deployment — always tune on validation set.
- ▸Macro F1 can be high even when the most common class is misclassified, if rare classes happen to be classified well.
Real-World Interpretation Example
Fraud detection model: 0.2% fraud prevalence in 1M transactions. Accuracy = 99.80% (by predicting no fraud). ROC-AUC = 0.92 (looks great). PR-AUC = 0.43 (more honest — performance on actual fraud detection is mediocre). At threshold=0.7: Precision=0.62, Recall=0.51, F1=0.56. Business decision: lower threshold to 0.5 → Recall=0.72, Precision=0.38, F1=0.50 — more frauds caught but more false flags for the review team.
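The numbers in this example can be turned into approximate transaction counts, which is often what a review team cares about. The sketch below only does arithmetic on the precision, recall, and prevalence figures quoted above; the helper names are hypothetical and the counts are implied estimates, not reported data:

```python
n_transactions = 1_000_000
prevalence = 0.002
n_fraud = round(n_transactions * prevalence)  # 2,000 actual fraud cases

def implied_counts(precision, recall, n_pos):
    """Approximate TP, FP, FN implied by a precision/recall pair."""
    tp = recall * n_pos
    flagged = tp / precision   # everything the model flags
    fp = flagged - tp
    fn = n_pos - tp
    return tp, fp, fn

def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Operating points quoted in the example above.
tp_07, fp_07, fn_07 = implied_counts(precision=0.62, recall=0.51, n_pos=n_fraud)
tp_05, fp_05, fn_05 = implied_counts(precision=0.38, recall=0.72, n_pos=n_fraud)

print(f"threshold 0.7: caught~{tp_07:.0f} false flags~{fp_07:.0f} missed~{fn_07:.0f} "
      f"F1={f1(0.62, 0.51):.2f}")
print(f"threshold 0.5: caught~{tp_05:.0f} false flags~{fp_05:.0f} missed~{fn_05:.0f} "
      f"F1={f1(0.38, 0.72):.2f}")
```

Lowering the threshold catches roughly 420 more frauds (about 1,020 to 1,440) at the price of roughly 1,700 more false flags (about 625 to 2,350), which frames the decision in terms of review-team workload rather than F1 alone.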
Students
- ×Reporting accuracy on imbalanced datasets as evidence of good model performance.
- ×Not knowing that F1 does not include True Negatives — thinking F1 captures all four confusion matrix cells.
- ×Confusing ROC-AUC with accuracy — AUC of 0.85 does NOT mean '85% of predictions are correct'.
- ×Thinking higher R² is always better: training R² never decreases when you add a feature to a linear regression, even a pure-noise feature.
Developers
- ×Using predict() for AUC computation instead of predict_proba() — AUC requires probability scores, not binary labels.
- ×Tuning the decision threshold by evaluating on the test set — the threshold becomes test-set-specific and won't generalize.
- ×Reporting only the primary metric without diagnosing failure cases — missing systematic errors on specific data slices.
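- ×Mixing up per-fold and pooled AUC in cross-validation: averaging per-fold AUCs and computing a single AUC on concatenated out-of-fold predictions can give different numbers, and pooling assumes scores are comparable across folds. State which one you report.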
In Interviews
- ×Saying 'ROC-AUC is always better than PR-AUC' — PR-AUC is more informative for imbalanced datasets.
- ×Not being able to derive F1 as a harmonic mean — just memorizing the formula without understanding why harmonic mean is used.
- ×Confusing precision and recall definitions — have them memorized cold: precision = TP/(TP+FP), recall = TP/(TP+FN).
- ×Not knowing what AUC = 0.5 means (random classifier) or AUC < 0.5 (worse than random — predictions are systematically inverted).
Real Projects
- ×Not stratifying train/test splits on imbalanced datasets — the test set may have few or zero positive examples.
- ×Failing to check for calibration — a model with great AUC may give wildly miscalibrated probabilities, making downstream probability-based decisions unreliable.
- ×Ignoring per-class performance in multiclass settings — a high macro F1 can hide a specific class the model never predicts.
- ×Using MAPE when y values can be near zero — division by near-zero causes MAPE to blow up.
What kind of bias does this model have?
Bias depends on model assumptions and feature expressiveness.
What kind of variance does it have?
Variance grows with model flexibility and weak regularization.
How does it overfit?
Overfitting usually appears as strong train performance but weaker validation/test behavior.
How do we regularize it?
Use complexity constraints, robust validation, and data-centric cleanup.
What kind of data does it like?
Prefers representative, low-leakage data with stable feature definitions.
What kind of data breaks it?
Breaks under leakage, severe distribution drift, noisy labels, and poorly engineered features.
Quick Revision Reference
Key Takeaways
- All classification metrics derive from the confusion matrix: TP, FP, FN, TN
- Precision = TP/(TP+FP) — 'when I predict positive, am I correct?'
- Recall = TP/(TP+FN) — 'of all actual positives, did I find them?'
- F1 = harmonic mean of precision and recall = 2TP/(2TP+FP+FN)
- ROC-AUC = probability a random positive outscores a random negative (threshold-independent)
- PR-AUC is more informative than ROC-AUC for imbalanced datasets
- RMSE penalizes large errors more; MAE weights each error in proportion to its size
- R² = 1 - RSS/TSS measures variance explained; negative R² means worse than predicting the mean
Critical Formulas
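The formulas used throughout this page, collected in one place. They are reconstructed from the definitions given in the sections above; P and R denote precision and recall, and the F-beta form is the standard one that the from-scratch code also implements:

```latex
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \qquad
\text{Recall} = \frac{TP}{TP + FN} \qquad
\text{Specificity} = \frac{TN}{TN + FP} \\
F_1 &= \frac{2\,TP}{2\,TP + FP + FN} \qquad
F_\beta = \frac{(1 + \beta^2)\,P\,R}{\beta^2 P + R} \\
\text{MSE} &= \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad
\text{RMSE} = \sqrt{\text{MSE}} \qquad
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \\
R^2 &= 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}
\end{aligned}
```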
Best For
- ✓F1 or PR-AUC: imbalanced binary classification
- ✓ROC-AUC: balanced binary, model selection independent of threshold
- ✓Recall: medical diagnosis, security (FN cost is high)
- ✓Precision: spam, content moderation (FP cost is high)
- ✓RMSE: regression where large errors are especially costly
- ✓MAE: regression where you want robust average error
Avoid When
- ✗Accuracy on imbalanced data
- ✗ROC-AUC on severely imbalanced data (< 5% positive)
- ✗RMSE when outliers dominate and you care about median error
- ✗Single metric for all deployment decisions — always examine the full metric landscape
Interview Must-Know
These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.