ML Atlas

Cross-Validation

Your model's grade on unseen data — not the test it studied for.

Beginner · Evaluation
26 min read
Prerequisites: understanding of train/test splits; basic understanding of overfitting and underfitting; familiarity with at least one ML model (e.g., linear regression, decision tree)
  • Kaggle competition model selection: 5-fold CV is the de facto standard for leaderboard-robust evaluation
  • Pharmaceutical clinical trials: nested CV ensures drug efficacy estimates aren't optimistic from model selection
  • Time-series forecasting at retail companies: walk-forward validation prevents future data leakage into training
  • AutoML systems (H2O, Auto-sklearn): nested CV is the backbone of automatic model selection and tuning
  • Scikit-learn's GridSearchCV and RandomizedSearchCV: k-fold CV performed internally on the training set
01

In Plain English

Cross-validation is a technique for estimating how well a model will perform on data it has never seen. Instead of using a single train/test split, you split the data multiple times, train and evaluate on each split, and average the results. This gives a more reliable estimate of true performance than any single split.

Why It Exists

A single train/test split gives a noisy estimate of model performance — by chance, the test set might be unusually easy or hard. With small datasets especially, a few hard samples landing in the test set can make an excellent model look poor. CV reduces this noise by averaging over many splits.

Problem It Solves

Reliable model evaluation when data is limited, unbiased hyperparameter tuning without touching test data, and principled model selection between competing architectures — all without sacrificing too much data to a held-out test set.

Real-Life Analogy

"A single exam measures how well a student performs on one set of questions, which might be easy or hard by chance. Giving the student ten different exams on the same material and averaging the scores produces a much more reliable measure of their true ability. Cross-validation is exactly that: multiple different exams on the same data, averaged for reliability."

When To Use

  • Dataset is small (< 5,000 samples) — a held-out test set wastes precious training data
  • Comparing multiple models or hyperparameter configurations to select the best
  • Getting a reliable estimate of generalization performance before final deployment
  • Detecting if a model is overfitting (large gap between training and CV scores)
  • Any time you want to report a single performance estimate with a confidence interval

When NOT To Use

  • Training data is very large (> 1M samples) — a simple 80/20 split is faster and equally reliable
  • Data has temporal structure — use time-series CV (walk-forward validation) instead of k-fold
  • Groups of samples are correlated (e.g., multiple measurements per patient) — use GroupKFold
  • Training is extremely expensive (e.g., large neural network) — k full training runs may be infeasible
02

The fundamental problem with a single train/test split: you take one gamble on what samples end up in the test set. If the test set is unrepresentative (too easy or too hard), your performance estimate is wrong — and you have no way to know. Cross-validation fixes this by taking k different 'gambles' and averaging, so any single unlucky split is diluted.

In k-fold cross-validation, you divide the data into k equal parts (folds). In each of k rounds, you train on k-1 folds and evaluate on the remaining fold. Every sample appears in the test set exactly once across all k rounds. The average metric across k rounds is the CV score — a much more stable estimate than any single split.

Cross-validation does not replace a final held-out test set. It's used for model development decisions (which algorithm, which hyperparameters). The test set is reserved for a single final evaluation of the chosen model. If you use CV scores to make decisions and then report CV as 'test performance,' you've leaked information — the CV score now reflects your optimization choices.

The Metaphor

"Imagine you're a chef developing a new recipe. You test it on 10 different dinner guests across 10 separate evenings (cross-validation). Each guest represents a different 'test set.' You average their feedback to decide the recipe is good. Then you serve it at the final gala event (test set) — a fresh audience that hasn't influenced any of your recipe decisions. CV is the iterative tasting; the gala is the one-time final evaluation."

Beginner Mental Model

Split your data into 5 equal groups. Round 1: train on groups 2,3,4,5 and test on group 1. Round 2: train on groups 1,3,4,5 and test on group 2. Repeat for all 5 groups. You get 5 performance scores. Average them. This is your 5-fold CV score. Every data point is used for testing exactly once and for training in the other four rounds.
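As a rough illustration of the rounds described above, the sketch below runs scikit-learn's KFold over ten toy samples and counts how often each sample lands in the validation set and in the training set. The toy array and the seed are assumptions made only for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)   # 10 toy samples (illustrative only)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

val_counts = np.zeros(len(X), dtype=int)
train_counts = np.zeros(len(X), dtype=int)
for round_no, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    val_counts[val_idx] += 1       # sample was validated in this round
    train_counts[train_idx] += 1   # sample was trained on in this round
    print(f"Round {round_no}: train={sorted(train_idx.tolist())} "
          f"val={sorted(val_idx.tolist())}")

print("times validated: ", val_counts)
print("times trained on:", train_counts)
```

With five folds, each sample should be validated once and trained on in the remaining four rounds.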

03

Given dataset D = {(xᵢ, yᵢ)}ᵢ₌₁ⁿ and a learning algorithm A, k-fold cross-validation partitions D into k disjoint subsets D₁, D₂, ..., Dₖ of roughly equal size (|Dⱼ| ≈ n/k). For each fold j: train model Aⱼ on D\Dⱼ (all data except fold j), evaluate metric m(Aⱼ, Dⱼ). The CV estimate is CV_k = (1/k)Σⱼm(Aⱼ, Dⱼ).

Fold
One of k equal partitions of the dataset. In each CV round, one fold is the validation set; the rest form the training set.
CV Score
The average of k metric values computed across all folds. More stable than a single train/test split estimate.
Stratified K-Fold
K-fold where each fold is created to preserve the class distribution of the original dataset. Critical for imbalanced classification problems.
Leave-One-Out Cross-Validation (LOOCV)
A special case where k = n: each sample is the test set exactly once. Maximizes training data per fold but is computationally expensive and has high variance.
Nested Cross-Validation
An outer CV loop for unbiased performance estimation, with an inner CV loop for hyperparameter tuning. Prevents the 'double-dipping' problem where you both tune and evaluate on the same CV.
Walk-Forward Validation
Time-series-specific CV where the training set always precedes the validation set in time. Prevents temporal data leakage. Also called rolling-origin cross-validation.
Data Leakage
When information from the validation/test set influences the training process. The most common cause of unrealistically optimistic CV scores.
Optimism Bias
The tendency for training metrics to be better than validation metrics. CV corrects for this by always evaluating on unseen data.
  1. Choose k (commonly k=5 or k=10). For very small datasets, use LOOCV (k=n).
  2. Randomly shuffle the dataset (unless data has temporal structure).
  3. Split data into k equal folds. For classification: use stratified splitting to preserve class ratios.
  4. For j = 1 to k: train model on folds 1..k except fold j, evaluate metric on fold j.
  5. Compute CV score = mean of k validation scores. Also compute standard deviation to assess stability.
  6. Repeat for each candidate model or hyperparameter configuration.
  7. Select the configuration with the best CV score.
  8. Retrain the selected configuration on ALL training data (all k folds combined).
  9. Evaluate exactly once on the held-out test set to report final performance.
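The nine steps can be sketched end to end as follows. This is a minimal illustration, not a reference implementation: the synthetic dataset, the candidate values of C, and the helper name make_pipeline_for are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data stands in for a real dataset.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)
# The test set is split off first and not touched until the last step.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Steps 1-3: choose k, shuffle, and stratify the folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

def make_pipeline_for(C):
    """Hypothetical helper: scaler + logistic regression for one candidate C."""
    return Pipeline([("scaler", StandardScaler()),
                     ("model", LogisticRegression(C=C, max_iter=1000))])

# Steps 4-6: cross-validate every candidate configuration.
cv_results = {}
for C in (0.01, 0.1, 1.0, 10.0):
    scores = cross_val_score(make_pipeline_for(C), X_train, y_train,
                             cv=cv, scoring="roc_auc")
    cv_results[C] = (scores.mean(), scores.std())
    print(f"C={C:<5} CV AUC = {scores.mean():.4f} ± {scores.std():.4f}")

# Step 7: pick the configuration with the best mean CV score.
best_C = max(cv_results, key=lambda c: cv_results[c][0])

# Step 8: refit the winner on all training data.
final_model = make_pipeline_for(best_C).fit(X_train, y_train)

# Step 9: evaluate once on the held-out test set.
test_auc = roc_auc_score(y_test, final_model.predict_proba(X_test)[:, 1])
print(f"best C = {best_C}, test AUC = {test_auc:.4f}")
```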

Full training dataset (n samples, d features). Test set held out and untouched until final evaluation. Number of folds k. Learning algorithm and hyperparameter grid.

CV score (mean metric across folds), CV standard deviation (stability measure), and optionally per-fold scores and out-of-fold predictions.

01. Data is i.i.d. (independent and identically distributed): samples are exchangeable. Violated by time series, spatial data, or grouped data.
02. The data distribution doesn't change between folds: train and validation distributions are similar. Violated by temporal or geographic drift.
03. The final model is retrained on all training data after CV: not on any single fold's subset.
  • Tiny dataset (n < 50): LOOCV (k = n) is recommended. K-fold with k=5 may create validation folds with fewer than 10 samples, which gives unreliable metric estimates.
  • Severe class imbalance: some folds may have 0 positive samples. Use StratifiedKFold — it guarantees proportional class representation in each fold.
  • Grouped data: multiple samples from the same entity (patient, user). Use GroupKFold to ensure all samples from an entity are in the same fold, preventing entity-level leakage.
  • k = n-1: near-LOOCV. Computationally expensive but maximizes training data. Rarely necessary.
04

Cross-validation is not a model training step — it's a model evaluation and selection framework. It wraps the entire model fitting process (including preprocessing within each fold) and produces unbiased performance estimates. Every preprocessing step (scaling, imputation, feature selection) must happen inside the CV loop, not before it.

  • 01. CRITICAL: All data preprocessing must be fitted on the training folds only, then applied to the validation fold. Fitting StandardScaler on all k folds together leaks validation statistics into training.
  • 02. Use sklearn Pipelines to enforce this: Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())]) inside cross_val_score is automatically correct.
  • 03. Handle class imbalance (SMOTE, class_weight) inside the CV loop: applying SMOTE before CV lets synthetic samples derived from validation rows leak into training.
  • 04. Feature selection (SelectKBest, RFE) inside a Pipeline ensures selected features are chosen without seeing the validation fold.

  • 01. Wrap model and preprocessing in an sklearn Pipeline for clean, leakage-proof CV.
  • 02. Call cross_val_score(pipeline, X_train, y_train, cv=StratifiedKFold(n_splits=5), scoring='roc_auc') for classification.
  • 03. For hyperparameter tuning: use GridSearchCV or RandomizedSearchCV, which internally perform k-fold CV on the training set.
  • 04. Examine both mean and standard deviation of CV scores: high std = unstable model, low mean + low std = consistently poor model.
  • 05. After selecting the best configuration: refit on ALL training data (GridSearchCV and RandomizedSearchCV do this automatically with refit=True, their default; after cross_val_score or cross_validate, refit manually).
  • 06. Report final performance on test set once.
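The rule that fitted preprocessing belongs inside the CV loop can be checked empirically. The sketch below uses feature selection on pure noise, where the honest accuracy should sit near chance: selecting features on all rows before CV makes the model look skilled, while the same selection inside a Pipeline does not. The data sizes, k=20, and the seed are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))   # pure noise features
y = rng.integers(0, 2, size=100)   # labels unrelated to X

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# WRONG: select features using every row (validation folds included), then run CV.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=cv)

# RIGHT: selection lives in the Pipeline and is refit on each training split.
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)),
                 ("model", LogisticRegression(max_iter=1000))])
honest = cross_val_score(pipe, X, y, cv=cv)

print(f"selection before CV: accuracy = {leaky.mean():.3f}")
print(f"selection inside CV: accuracy = {honest.mean():.3f}")
```

Because the labels are random, any accuracy well above 0.5 in the first variant comes from leakage rather than signal.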

k (number of folds)

Primary parameter of k-fold CV. Controls the trade-off between bias and variance of the CV estimate.

5 or 10. k=5 is faster; k=10 gives slightly better estimates. k=n (LOOCV) for very small datasets.

shuffle

Whether to shuffle the data before creating folds. Should be True for i.i.d. data, False for time series.

True with random_state for reproducibility

  1. Import: from sklearn.model_selection import cross_val_score, StratifiedKFold, KFold, GridSearchCV (plus Pipeline from sklearn.pipeline and StandardScaler from sklearn.preprocessing)
  2. Create a Pipeline: pipe = Pipeline([('scaler', StandardScaler()), ('model', model)])
  3. Define CV strategy: cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
  4. Run CV: scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring='roc_auc')
  5. Inspect: print(f'CV AUC: {scores.mean():.4f} ± {scores.std():.4f}')
  6. For hyperparameter tuning: search = GridSearchCV(pipe, param_grid, cv=cv, scoring='roc_auc').fit(X_train, y_train)
  7. Final evaluation: search.score(X_test, y_test), which applies the refit best model with the search's scoring metric. Call this exactly once.
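A runnable version of the tuning and final-evaluation steps might look like the sketch below. The synthetic dataset, the logistic regression model, and the grid over model__C are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic, roughly 20% positive data stands in for a real dataset.
X, y = make_classification(n_samples=500, n_features=20, weights=[0.8, 0.2],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("model", LogisticRegression(max_iter=1000))])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Pipeline step parameters are addressed as <step name>__<parameter>.
param_grid = {"model__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="roc_auc", refit=True)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print(f"best mean CV AUC: {search.best_score_:.4f}")

# search.score uses the same scorer as the search (ROC AUC here)
# on the best pipeline refit on all of X_train.
test_auc = search.score(X_test, y_test)
print(f"test AUC: {test_auc:.4f}")
```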
05
06
```python
import numpy as np
from collections import defaultdict


class KFold:
    """Standard k-fold cross-validation splitter."""
    def __init__(self, n_splits=5, shuffle=True, random_state=None):
        self.n_splits = n_splits
        self.shuffle  = shuffle
        self.rng      = np.random.RandomState(random_state)

    def split(self, X):
        n = len(X)
        indices = np.arange(n)
        if self.shuffle:
            self.rng.shuffle(indices)
        fold_sizes = np.full(self.n_splits, n // self.n_splits)
        fold_sizes[:n % self.n_splits] += 1   # distribute remainder
        current = 0
        for fold_size in fold_sizes:
            val_idx   = indices[current : current + fold_size]
            train_idx = np.concatenate([indices[:current],
                                         indices[current + fold_size:]])
            yield train_idx, val_idx
            current += fold_size


class StratifiedKFold:
    """K-fold that preserves class proportion in each fold."""
    def __init__(self, n_splits=5, shuffle=True, random_state=None):
        self.n_splits = n_splits
        self.shuffle  = shuffle
        self.rng      = np.random.RandomState(random_state)

    def split(self, X, y):
        y = np.asarray(y)

        # Group sample indices by class label
        class_indices = defaultdict(list)
        for i, cls in enumerate(y):
            class_indices[cls].append(i)

        if self.shuffle:
            for cls in class_indices:
                self.rng.shuffle(class_indices[cls])

        # Deal each class's indices round-robin across the folds
        fold_indices = [[] for _ in range(self.n_splits)]
        for cls, idxs in class_indices.items():
            for pos, i in enumerate(idxs):
                fold_indices[pos % self.n_splits].append(i)

        # Yield exactly one (train, validation) pair per fold
        for fold_num in range(self.n_splits):
            val_idx   = np.array(fold_indices[fold_num], dtype=int)
            train_idx = np.concatenate([fold_indices[f]
                                         for f in range(self.n_splits)
                                         if f != fold_num]).astype(int)
            yield train_idx, val_idx


def cross_val_score_scratch(model_class, X, y, n_splits=5,
                             shuffle=True, random_state=42):
    """Cross-validate any model with fit(X,y) and score(X,y) interface."""
    X, y = np.array(X), np.array(y)
    kf = KFold(n_splits=n_splits, shuffle=shuffle, random_state=random_state)
    scores = []
    for fold_i, (train_idx, val_idx) in enumerate(kf.split(X)):
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_val, y_val = X[val_idx], y[val_idx]
        model = model_class()          # fresh, unfitted model for every fold
        model.fit(X_tr, y_tr)
        score = model.score(X_val, y_val)
        scores.append(score)
        print(f"  Fold {fold_i+1}/{n_splits}: score = {score:.4f}")
    return np.array(scores)


# ── Demo: Linear Regression 5-fold CV ─────────────────────────────────────────
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
X = np.random.randn(200, 5)
y = X @ np.array([2, -1, 0.5, 3, -2]) + np.random.randn(200) * 0.5

# Wrap so it has a no-arg constructor for our cross_val_score_scratch
class ScaledLinearRegression:
    def __init__(self):
        self.scaler = StandardScaler()
        self.model  = LinearRegression()
    def fit(self, X, y):
        X_s = self.scaler.fit_transform(X)   # scaler sees training rows only
        self.model.fit(X_s, y)
        return self
    def score(self, X, y):
        X_s = self.scaler.transform(X)       # reuse training-fold statistics
        return self.model.score(X_s, y)      # R²

print("5-fold CV from scratch (R²):")
scores = cross_val_score_scratch(ScaledLinearRegression, X, y, n_splits=5)
print(f"Mean R²: {scores.mean():.4f} ± {scores.std():.4f}")
```
The from-scratch KFold correctly distributes remainder samples across the first folds (fold_sizes[:n % k] += 1) to handle datasets where n is not exactly divisible by k. Note that all preprocessing (StandardScaler) lives inside the model wrapper — it must be fit on training indices only, never on the validation fold.
X_train: shape (800, 20). y_train: binary, 20% positive. StratifiedKFold(n_splits=5, shuffle=True, random_state=42).
5-fold CV AUC: 0.8823 ± 0.0142
Out-of-fold AUC: 0.8819
Nested CV AUC: 0.8751 ± 0.0211 (lower than non-nested — reflects true generalization)
Final Test AUC: 0.8791
  • Always use sklearn Pipelines inside cross-validation. If you apply StandardScaler before calling cross_val_score, statistics from each validation fold have already leaked into the corresponding training folds, a systematic data leakage bug.
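  • cross_val_predict concatenates out-of-fold predictions into a single array. One AUC computed on this pooled array gives a single summary number, but it can differ from the mean of per-fold AUCs because predictions from k separately fitted models are ranked together. scikit-learn's documentation cautions that pooled out-of-fold metrics may not be a valid generalization estimate, so report the per-fold mean alongside it.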
  • Nested CV provides the most unbiased performance estimate when hyperparameter tuning is involved. The outer loop estimates performance; the inner loop selects hyperparameters. Never use the outer CV AUC for hyperparameter tuning decisions.
  • TimeSeriesSplit respects temporal ordering: training set always comes before validation set in time. Never use standard KFold on time series — it causes future data to appear in the training set.
  • After selecting the best model/hyperparameters via CV: refit on ALL training data. Don't use the model from any single fold — it was trained on (k-1)/k of the training data.
  • Fitting StandardScaler (or any preprocessor) on all data before cross_val_score — this leaks validation set statistics into every training fold.
  • Using cross_val_score's score as the 'test performance' to report — it's a training-phase metric used for model selection, not a final generalization claim.
  • Tuning hyperparameters based on CV score, then reporting the same CV score as unbiased performance — this is non-nested CV and produces optimistic estimates.
  • Using standard KFold on time series data — folds will contain future data in training sets, making the model look better than it will perform in production.
07
📋

Small Dataset (< 1,000 rows)

Excellent

Cross-validation is essential for small datasets. An 80/20 split of 1,000 rows locks away 200 precious samples for testing. CV uses all samples for both training and validation across folds, maximizing data utilization.

💡 Use k=10 or LOOCV for very small datasets. StratifiedKFold for classification.
📊

Medium Dataset (1K–100K rows)

Excellent

The sweet spot for cross-validation. Fast enough to run 5 or 10 folds, large enough for stable estimates. Standard k=5 with stratification is the default recommendation.

💡 5-fold CV is standard. Use 10-fold if compute is cheap.
🗄️

Large Dataset (> 1M rows)

Good

CV is computationally expensive for large datasets — k full training runs, each on (k-1)/k of the data. A simple 80/20 split often gives equally reliable estimates with far less compute.

💡 Consider 3-fold CV or a single large validation split. Use mini-batch training (SGD) to make each fold faster.
📈

Time Series Data

Context-Dependent

Standard KFold is inappropriate — it ignores temporal ordering and leaks future information. Walk-forward validation (TimeSeriesSplit) is required. Performance estimates from walk-forward CV accurately reflect deployment conditions.

💡 Always use TimeSeriesSplit for temporal data. Consider a 'gap' period between training and validation to simulate deployment latency.
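To see what walk-forward splits with a gap look like, here is a small sketch using scikit-learn's TimeSeriesSplit on 24 time-ordered toy observations. The series length, test_size=4, and gap=1 are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(24).reshape(-1, 1)   # 24 time-ordered observations (toy data)

# gap=1 drops one observation between the end of training and the start of validation.
tscv = TimeSeriesSplit(n_splits=4, test_size=4, gap=1)
splits = list(tscv.split(X))

for i, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"Split {i}: train = 0..{train_idx.max()} "
          f"({len(train_idx)} rows), val = {val_idx.min()}..{val_idx.max()}")
```

The training window grows with every split, and each validation block starts after the training block ends plus the skipped gap row.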
⚖️

Imbalanced Dataset

Good

Standard KFold risks creating folds with no positive class samples. StratifiedKFold guarantees class proportions are maintained in each fold. Critical for meaningful metric computation.

💡 Always use StratifiedKFold for imbalanced classification. Also stratify train/test splits.
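The difference is easy to see on a toy label vector. In this sketch (an assumed worst case where the 5% positives are stored at the end of the array), unshuffled KFold leaves four folds with no positives at all, while StratifiedKFold spreads them evenly.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array([0] * 95 + [1] * 5)   # 5% positives, stored in class order
X = np.zeros((len(y), 1))          # features are irrelevant for the split itself

# Number of positive samples in each validation fold.
plain_counts = [int(y[val].sum()) for _, val in KFold(n_splits=5).split(X)]
strat_counts = [int(y[val].sum())
                for _, val in StratifiedKFold(n_splits=5, shuffle=True,
                                              random_state=0).split(X, y)]

print("KFold positives per fold:          ", plain_counts)
print("StratifiedKFold positives per fold:", strat_counts)
```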
🔗

Grouped / Clustered Data

Context-Dependent

When samples are not i.i.d. (multiple images from the same patient, multiple rows from the same user), standard KFold leaks group-level information between folds. Use GroupKFold to ensure whole groups stay together.

💡 GroupKFold is mandatory for grouped data. Failure to use it produces dramatically optimistic CV estimates that don't reflect true generalization.
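A quick way to check for group leakage is to count how many groups appear on both sides of each split. The sketch below assumes six hypothetical patients with four rows each; shared_groups is an illustrative helper, not a library function.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

groups = np.repeat(np.arange(6), 4)      # 6 hypothetical patients, 4 rows each
X = np.zeros((len(groups), 2))
y = np.tile([0, 1], len(groups) // 2)

def shared_groups(splitter, **split_kwargs):
    """Number of groups present in both train and validation, per split."""
    return [len(set(groups[tr]) & set(groups[va]))
            for tr, va in splitter.split(X, y, **split_kwargs)]

grouped_overlap = shared_groups(GroupKFold(n_splits=3), groups=groups)
plain_overlap = shared_groups(KFold(n_splits=3, shuffle=True, random_state=0))

print("GroupKFold shared groups per split:", grouped_overlap)
print("KFold shared groups per split:     ", plain_overlap)
```

GroupKFold should report zero shared groups in every split, whereas shuffled KFold almost always places rows from the same patient on both sides.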
08

K-Fold CV — Data Partitioning Across 5 Folds

Visualization of how data is split across 5 rounds of k-fold cross-validation. In each round, one fold (dark) is used for validation and the remaining four folds (light) are used for training. Every sample appears in the validation set exactly once across all 5 rounds.

Walk-Forward Validation — Time Series CV

In time-series cross-validation, the training window expands forward in time. Each validation set immediately follows its training set, simulating real deployment conditions where the model always predicts the future from historical data. Standard k-fold would mix past and future samples, producing leaky (optimistic) estimates.

CV Score Distribution Across Folds — 3 Models Compared

Box plot of per-fold AUC scores for three models. Model A has high mean AUC but high variance across folds (unstable). Model B has lower mean but tight variance (consistent). Model C is consistently strong. In production, Model C is preferred — its reliability is more important than marginally higher average performance.

09
  • Maximum data utilization

    Every sample is used for both training and validation across k rounds. A 5-fold CV trains on 80% of data each round and validates on 20%, but over all 5 rounds, every sample has been in both sets. No data is permanently wasted on a single held-out partition.

  • Stable, reliable performance estimates

    By averaging over k splits, CV smooths out the variance from unlucky train/test splits. The CV estimate is far less sensitive to a single outlier fold than a single split. Reporting mean ± std gives honest uncertainty quantification.

  • Built-in overfitting detection

    Comparing training score to validation score across folds reveals overfitting: if training AUC = 0.98 and validation AUC = 0.72 consistently, the model is memorizing training data. No separate overfitting check needed.

  • Principled hyperparameter tuning without test set leakage

    CV provides an objective criterion for selecting hyperparameters that doesn't touch the test set. GridSearchCV internally runs k-fold CV for each hyperparameter configuration, selecting the best without ever using test data.

  • Works with small datasets where a large hold-out set is costly

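    On 200-sample datasets, permanently reserving 40 samples for validation is wasteful. With 5-fold CV, each round trains on 160 samples and validates on the other 40, so all 200 samples contribute to both training and validation across the five rounds, and every score still comes from data the model did not see in that round.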

  • Computationally expensive

    Training k models instead of 1 multiplies compute by k. For deep learning with hour-long training runs, 5-fold CV takes 5 hours. Not always feasible. Approximations: k=3 instead of 10, or a single large hold-out split for large datasets.

  • Invalid for time series (standard k-fold)

    Standard k-fold creates validation sets that precede their training sets in time — an information leak. A model trained on data including t=100 cannot be honestly evaluated on data from t=50. Walk-forward validation must be used instead.

  • Invalid for grouped data (standard k-fold)

    When multiple samples come from the same entity (patient, user, device), standard k-fold may place samples from the same entity in both training and validation. The model sees the entity's pattern during training and exploits it in validation — inflating CV scores.

  • CV score is still an estimate with uncertainty

    High variance in fold scores (large std) means the CV estimate itself is unreliable. With k=5, you have only 5 data points for averaging. The true 95% confidence interval for a CV estimate is often much wider than practitioners expect.

  • Non-nested CV with hyperparameter tuning is optimistically biased

    If you tune hyperparameters using CV scores and then report those same CV scores as performance estimates, the reported performance is optimistic. Nested CV corrects this but adds another k× compute overhead.

10
Healthcare / Clinical ML

Nested CV for unbiased diagnostic model development

Predicting ICU readmission from EHR features requires nested CV: inner loop selects the best regularization strength for logistic regression, outer loop estimates AUC. The outer CV estimate represents the model's true expected performance when deployed on new patients.
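A nested setup along these lines can be sketched by passing a GridSearchCV object to cross_val_score: the search is the inner loop and cross_val_score supplies the outer loop. The synthetic data, fold counts, and grid of C values below are assumptions, not details from the use case.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data stands in for EHR-derived features.
X, y = make_classification(n_samples=400, n_features=15, n_informative=5,
                           random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("model", LogisticRegression(max_iter=1000))])

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)  # tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)  # estimation

# Inner loop: choose the regularization strength C on each outer training split.
search = GridSearchCV(pipe, {"model__C": [0.01, 0.1, 1.0, 10.0]},
                      cv=inner_cv, scoring="roc_auc")

# Outer loop: each outer validation fold is scored by a model whose C was
# selected without ever seeing that fold.
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print(f"Nested CV AUC: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")
```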

Finance

Walk-forward validation for algorithmic trading

A stock return prediction model trained with standard k-fold would be evaluated on past data using future information in training, an obvious look-ahead bias. Walk-forward validation trains on months 1-12 and validates on month 13, then trains on months 1-13 and validates on month 14, and so on, faithfully simulating live trading deployment.

NLP / Text Classification

Model selection for document classification

Comparing BERT fine-tuning, logistic regression on TF-IDF, and CNN text classification: run 5-fold CV for each, report mean ± std F1. The model with highest CV F1 and lowest std is selected. All fitting (TF-IDF vectorizer, tokenizer fine-tuning) happens inside each fold.

Drug Discovery / Bioinformatics

Leave-one-out CV for small datasets

A molecular property prediction model trained on 50 compounds uses LOOCV: train on 49, predict one, repeat for all 50. This maximizes use of precious experimental data. The LOOCV RMSE is the reported performance metric in the published paper.
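A LOOCV RMSE of this kind can be computed by collecting the single held-out prediction from each round and scoring them together, since a per-fold score on one sample is not meaningful. The sketch assumes 50 synthetic "compounds" with four made-up descriptors and a Ridge model; none of these come from the use case itself.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))   # 50 hypothetical compounds, 4 descriptors
y = X @ np.array([1.5, -2.0, 0.0, 0.7]) + rng.normal(scale=0.3, size=50)

# Each of the 50 predictions comes from a model trained on the other 49 rows.
loo_preds = cross_val_predict(Ridge(alpha=1.0), X, y, cv=LeaveOneOut())

loocv_rmse = float(np.sqrt(np.mean((y - loo_preds) ** 2)))
print(f"LOOCV RMSE: {loocv_rmse:.4f}")
```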

General ML / Kaggle

Out-of-fold predictions for model stacking

cross_val_predict generates out-of-fold predictions for all training samples. These OOF predictions from multiple base models (random forest, XGBoost, LightGBM) are used as features for a meta-learner (logistic regression). This is the standard Kaggle stacking approach.
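A minimal stacking sketch follows. To keep it self-contained it swaps in scikit-learn's RandomForestClassifier and GradientBoostingClassifier as base models in place of XGBoost and LightGBM; the synthetic data and model settings are likewise assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split

X, y = make_classification(n_samples=500, n_features=12, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

base_models = [RandomForestClassifier(n_estimators=50, random_state=0),
               GradientBoostingClassifier(n_estimators=50, random_state=0)]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each column holds one base model's out-of-fold P(y=1) for every training row,
# so the meta-learner never trains on a prediction made for a row the base
# model had already seen.
oof_features = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=cv, method="predict_proba")[:, 1]
    for m in base_models])

meta_learner = LogisticRegression().fit(oof_features, y_train)

# At test time the base models are refit on all training data.
test_features = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in base_models])
stack_test_auc = roc_auc_score(
    y_test, meta_learner.predict_proba(test_features)[:, 1])
print(f"Stacked test AUC: {stack_test_auc:.4f}")
```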

11

Different CV strategies are designed for different data characteristics. Here's how they compare:

Hold-Out (Single Split)

Also evaluates on data the model hasn't seen

Only one train/test split. High variance — sensitive to which samples end up in test. Wastes data on a permanent test partition. Appropriate for very large datasets where variance is low.

n > 500K, training is very expensive (deep learning), or a strict separation between development and evaluation is required by protocol.

LOOCV (Leave-One-Out)

Same principle as k-fold, k = n

Maximum training data per fold (n-1 samples), maximum compute (n models), high variance of the CV estimate (n very correlated folds). Nearly unbiased estimate of performance on the full training set.

n < 50 (very small datasets where any partition is wasteful), or when using linear models where LOOCV has an analytical shortcut.
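For ordinary least squares, the analytical shortcut is that the leave-one-out residual for sample i equals the in-sample residual divided by (1 − hᵢᵢ), where hᵢᵢ is the i-th diagonal entry of the hat matrix. The sketch below checks this against n brute-force refits on assumed synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
X = rng.normal(size=(n, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n)

A = np.column_stack([np.ones(n), X])          # design matrix with intercept
H = A @ np.linalg.pinv(A)                     # hat matrix A (A^T A)^-1 A^T
residuals = y - H @ y                         # ordinary in-sample residuals
loo_fast = residuals / (1.0 - np.diag(H))     # LOO residuals from a single fit

# Brute force: n separate least-squares fits, each omitting one row.
loo_slow = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
    loo_slow[i] = y[i] - A[i] @ beta

loocv_mse = float(np.mean(loo_fast ** 2))
print("shortcut matches brute force:", np.allclose(loo_fast, loo_slow))
print(f"LOOCV MSE: {loocv_mse:.4f}")
```

The shortcut needs one fit instead of n, which is why LOOCV stays cheap for linear models even when it is impractical for others.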

TimeSeriesSplit

Also performs k-fold style splitting

Respects temporal ordering: training always precedes validation in time. Training set expands with each split. No shuffling. Prevents future leakage.

Any temporal data: financial time series, weather, IoT sensor readings, retail demand forecasting.

GroupKFold

Same evaluation principle as k-fold

Ensures no sample from a given group appears in both training and validation. Groups (patients, users, devices) are split, not individual samples.

Multiple correlated samples per entity: patient repeat visits, user browsing sessions, multiple trials per experimental subject.

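| Strategy | Data requirement | Bias | Variance | Compute |
|---|---|---|---|---|
| Hold-Out | i.i.d., large | High | High | 1× train |
| 5-Fold CV | i.i.d. | Low | Medium | 5× train |
| 10-Fold CV | i.i.d. | Lower | Medium | 10× train |
| LOOCV | i.i.d., small | Lowest | High | n× train |
| Stratified K-Fold | Classification, imbalanced | Low | Medium | k× train |
| TimeSeriesSplit | Temporal | Low | Medium | k× train |
| GroupKFold | Grouped | Low | Medium | k× train |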

Data is i.i.d., dataset is small to medium, and you need reliable performance estimates. Use 5-fold stratified CV for classification; 5-fold CV for regression; TimeSeriesSplit for temporal; GroupKFold for grouped observations.

12

CV Mean Score

The primary output of cross-validation. Estimates expected model performance on unseen data drawn from the same distribution. Use as the objective for model selection.

Target: Depends on the metric and domain. More informative when compared to baseline.

CV Standard Deviation

Measures how much fold scores vary. High std = unstable model sensitive to which data is in training vs. validation. A stable model should have std < 0.02 for AUC. High std is a warning sign.

Target: < 0.02 for AUC, < 0.05 for F1 typically indicates stable estimates

Train-Validation Gap

Average difference between training fold score and validation fold score. Large positive gap = overfitting. Near-zero gap = good generalization. Negative gap (validation > training) = underfitting or lucky validation set.

Target: < 0.03 for AUC is acceptable; > 0.10 signals significant overfitting
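The gap can be measured with cross_validate by setting return_train_score=True. The sketch compares an unrestricted decision tree with a depth-limited one on assumed synthetic data; train_val_scores is an illustrative helper name.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

def train_val_scores(model):
    """Return (mean training-fold AUC, mean validation-fold AUC)."""
    res = cross_validate(model, X, y, cv=cv, scoring="roc_auc",
                         return_train_score=True)
    return res["train_score"].mean(), res["test_score"].mean()

deep_train, deep_val = train_val_scores(DecisionTreeClassifier(random_state=0))
shallow_train, shallow_val = train_val_scores(
    DecisionTreeClassifier(max_depth=3, random_state=0))

print(f"unrestricted tree: train={deep_train:.3f} val={deep_val:.3f} "
      f"gap={deep_train - deep_val:.3f}")
print(f"max_depth=3 tree:  train={shallow_train:.3f} val={shallow_val:.3f} "
      f"gap={shallow_train - shallow_val:.3f}")
```

An unrestricted tree memorizes its training folds, so its training AUC reaches 1.0 and the gap to validation exposes the overfitting.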

  1. Confirm your CV strategy matches data characteristics: stratified for imbalanced, temporal for time series, grouped for correlated samples.
  2. Verify no preprocessing leak: all transformations are inside the Pipeline applied within each fold.
  3. Examine per-fold scores (not just the mean) to see whether one fold is consistently much worse, which hints at data quality issues in that fold.
  4. Compare mean train score vs. mean validation score to diagnose overfitting vs. underfitting.
  5. For hyperparameter tuning: use CV score as the selection criterion. Report nested CV score as performance estimate.
  6. Lock the final model/hyperparameters and evaluate on the test set exactly once.
  • Preprocessing outside the CV loop (fitting scaler, encoder, imputer on all data before CV) — this is the most common and harmful CV data leakage.
  • Using CV score as the 'test set performance' in your paper or report — CV is a development metric, not a final estimate.
  • Ignoring standard deviation: two models with mean AUC 0.82 and 0.83 but std 0.01 vs. 0.09 — the first is far more reliable.
  • Reporting non-nested CV performance when hyperparameters were tuned on the same CV folds — optimistically biased estimates.

Fraud detection pipeline: 5-fold stratified CV on 10,000 training samples. Results: AUC = 0.883 ± 0.018. Train AUC = 0.921 ± 0.009. Gap = 0.038, indicating mild overfitting. Tried a stronger ridge (L2) penalty (C=0.1): AUC = 0.876 ± 0.012, gap = 0.012, so better regularized and more stable. Decision: use C=0.1 despite slightly lower mean AUC because the tighter std and smaller gap indicate more reliable generalization. Final test set AUC = 0.871, close to the nested CV estimate of 0.869.

13
  • ×Fitting preprocessing (StandardScaler, imputer) on all data before cross_val_score — the single most common CV mistake in student projects.
  • ×Thinking cross-validation replaces a test set — it doesn't. CV is for development; the test set is for final reporting.
  • ×Using the same random_state for every fold — a common misunderstanding; KFold's random_state controls the shuffling before splitting, not per-fold randomness.
  • ×Running LOOCV on large datasets — computing n training runs on n=10,000 samples runs 10,000 model fits instead of 5, making it 2000× slower.
  • ×Applying SMOTE or other oversampling to the entire training set before cross-validation — synthesized samples from the validation set's neighborhood leak into training folds.
  • ×Passing a bare model to cross_val_score after scaling the data up front: the scaler was already fitted on rows that later serve as validation folds, causing leakage. cross_val_score only refits what it is given, so pass the full Pipeline.
  • ×Reporting GridSearchCV's best_score_ as unbiased test performance — this is the non-nested CV score, optimistically biased by hyperparameter selection.
  • ×Not setting shuffle=True in KFold — if data is ordered by class or by collection time, non-shuffled folds may be systematically non-representative.
  • ×Confusing the validation set (used during training for early stopping or model selection) with the test set (final evaluation, never used during development).
  • ×Not knowing why nested CV is necessary — confusing non-nested CV performance with unbiased generalization performance after hyperparameter tuning.
  • ×Saying 'cross-validation prevents overfitting' — it doesn't prevent it; it detects it by comparing train and validation performance.
  • ×Not knowing what to do with CV results: select the best configuration and retrain on ALL training data (not the best fold's model).
  • ×Using standard KFold on time-series data in production — the model that 'performs well' in CV performs poorly in deployment because CV used future data in training.
  • ×Tuning 50 hyperparameter combinations and reporting the best CV score: the maximum of 50 noisy estimates is biased upward, a multiple-comparisons effect. Use nested CV, or apply a Bonferroni-like correction when comparing configurations with significance tests.
  • ×Not stratifying CV splits for imbalanced data: a fold with 0 positive samples makes metrics such as recall undefined (division by zero), and by default sklearn returns 0 with an UndefinedMetricWarning instead of failing.
  • ×Treating GroupKFold-required data as i.i.d. — if users appear in multiple rows and the model memorizes user-specific patterns, non-group CV scores are wildly optimistic.
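Several of the mistakes above come down to the difference between non-nested and nested CV. The sketch below is one way to set up both with scikit-learn, assuming synthetic data, a logistic regression pipeline, and a small grid over C; all three are illustrative choices, not the configuration of any project described here. The inner loop (GridSearchCV) selects hyperparameters, and the outer loop (cross_val_score) scores the whole selection procedure on folds it never tuned on.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}  # illustrative grid

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring="roc_auc")

# Non-nested: the same folds both pick the best C and score it,
# so best_score_ tends to be optimistic.
search.fit(X, y)
non_nested_score = search.best_score_

# Nested: every outer fold reruns the full search on its own training part,
# then scores the tuned model on the held-out outer fold.
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")

print(f"non-nested best_score_ = {non_nested_score:.3f}")
print(f"nested CV = {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")
```

On an easy synthetic problem the two numbers may be close; the optimism of best_score_ tends to grow with the size of the grid and the noisiness of the folds.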
14

What kind of bias does this model have?

Bias depends on model assumptions and feature expressiveness.

What kind of variance does it have?

Variance grows with model flexibility and weak regularization.

How does it overfit?

Overfitting usually appears as strong train performance but weaker validation/test behavior.

How do we regularize it?

Use complexity constraints, robust validation, and data-centric cleanup.

What kind of data does it like?

Prefers representative, low-leakage data with stable feature definitions.

What kind of data breaks it?

Breaks under leakage, severe distribution drift, noisy labels, and poorly engineered features.

14

Quick Revision Reference

  • K-fold CV: split data into k folds, rotate the validation fold, average k metrics
  • CV estimate = (1/k)Σⱼ metric(model_j, fold_j) — always report mean ± std
  • ALWAYS fit preprocessors inside each fold (use sklearn Pipeline) — never before CV
  • Use StratifiedKFold for classification; TimeSeriesSplit for temporal; GroupKFold for grouped data
  • CV is for model development (selection, tuning) — the test set is for final one-time reporting
  • Nested CV: outer loop estimates performance; inner loop selects hyperparameters — prevents optimism bias
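  • Pooled out-of-fold predictions from cross_val_predict give a single AUC over the full training set; it can differ from the mean of fold AUCs because scores from different fold models are ranked together, so state which of the two you report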
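The splitter choices in the list above can be checked directly on a toy dataset. This is a small sketch, assuming 120 rows sorted by class and by time with 30 hypothetical users of 4 rows each; the data layout is invented purely to make each splitter's guarantee visible.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, StratifiedKFold, TimeSeriesSplit

n = 120
X = np.arange(n).reshape(-1, 1)        # rows assumed to be in time order
y = np.array([1] * 12 + [0] * 108)     # 10% positives, sorted by class
groups = np.repeat(np.arange(30), 4)   # 30 hypothetical users, 4 rows each

# Plain KFold without shuffling: all positives land in the first fold.
plain_pos_counts = [int(y[test].sum()) for _, test in KFold(n_splits=4).split(X)]

# StratifiedKFold keeps the positive rate close to 10% in every fold.
strat_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
strat_pos_rates = [float(y[test].mean()) for _, test in strat_cv.split(X, y)]

# GroupKFold never places the same user on both sides of a split.
group_overlaps = [
    len(set(groups[train]) & set(groups[test]))
    for train, test in GroupKFold(n_splits=5).split(X, y, groups)
]

# TimeSeriesSplit always trains on rows that precede the validation rows.
time_ordered = [
    bool(train.max() < test.min())
    for train, test in TimeSeriesSplit(n_splits=5).split(X)
]

print("KFold positives per fold:     ", plain_pos_counts)
print("Stratified positive rates:    ", strat_pos_rates)
print("GroupKFold shared groups:     ", group_overlaps)
print("TimeSeriesSplit train < test: ", time_ordered)
```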
CV Score
LOOCV Shortcut (OLS)
Train-Val Gap
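The formula behind "LOOCV Shortcut (OLS)" is not reproduced in this section. The sketch below assumes it refers to the standard hat-matrix identity for ordinary least squares, where the leave-one-out residual equals the ordinary residual divided by (1 − h_ii), with h_ii the i-th diagonal entry of the hat matrix. It compares that shortcut against brute-force LOOCV on synthetic data; the data and variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic regression data (assumption).
rng = np.random.default_rng(0)
n, p = 60, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.7 + rng.normal(scale=0.3, size=n)

# Design matrix with an explicit intercept column, matching LinearRegression.
A = np.column_stack([np.ones(n), X])

# Hat matrix H = A (A^T A)^{-1} A^T, computed via the pseudoinverse.
H = A @ np.linalg.pinv(A)
residuals = y - H @ y
leverage = np.diag(H)

# Shortcut: one fit on all data, then rescale each residual by 1 / (1 - h_ii).
loocv_mse_shortcut = np.mean((residuals / (1.0 - leverage)) ** 2)

# Brute force: n separate fits, each leaving one row out.
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error"
)
loocv_mse_brute = -neg_mse.mean()

print(f"shortcut:    {loocv_mse_shortcut:.6f}")
print(f"brute force: {loocv_mse_brute:.6f}")
```

The identity is exact for unregularized least squares only; it is not a general replacement for refitting other models.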
When to Use
  • Small to medium datasets where a single split is unreliable
  • Hyperparameter tuning: use CV as the objective for GridSearchCV
  • Model comparison: prefer the model with the higher CV mean, and weigh the CV std when the means are close
  • Detecting overfitting by examining the train-validation gap
When to Avoid
  • Very large datasets (> 1M rows) where a single large split is cheaper and equally reliable
  • Time series data with standard k-fold (use TimeSeriesSplit)
  • Deep learning with very long training runs (k-fold multiplies cost by k)
  • Grouped data with standard k-fold (use GroupKFold)
Explain why preprocessing must happen inside the CV loop (data leakage)
Explain the difference between nested and non-nested CV
Know when to use stratified, grouped, and time-series CV variants
Know what to do with CV results: refit on all training data, then evaluate on test set once
Explain why CV standard deviation underestimates true uncertainty (fold correlation)
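The "refit on all training data, then evaluate on the test set once" step from the checklist above can be sketched as follows. This is a minimal example, assuming synthetic data, a logistic regression pipeline, and an illustrative grid over C. GridSearchCV with refit=True (its default) retrains the best configuration on the full training set, so no single fold's model is reused.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Set the test set aside before any tuning or preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# All model selection happens with CV on the training set only.
search = GridSearchCV(
    pipe, {"clf__C": [0.01, 0.1, 1.0]}, cv=5, scoring="roc_auc", refit=True
)
search.fit(X_train, y_train)

# best_estimator_ has been refitted on the whole training set.
final_model = search.best_estimator_

# Touch the test set exactly once, for the final report.
test_auc = roc_auc_score(y_test, final_model.predict_proba(X_test)[:, 1])

print(f"selected C = {search.best_params_['clf__C']}")
print(f"CV AUC (development) = {search.best_score_:.3f}")
print(f"test AUC (final)     = {test_auc:.3f}")
```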
15
16

These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.