ML Atlas

Dataset Thinking

Core data concepts that decide whether an ML system works or fails.

Beginner
22 min read
What is Machine Learning? | ML Problem Types
  • Feature/label schema design
  • Train/validation/test splitting
  • Data leakage prevention
  • Imbalance handling in fraud and safety systems
01

In Plain English

A dataset is not just rows and columns. It is a contract between reality, features, labels, and evaluation.

Why It Exists

Most model quality problems start as data definition problems.

Problem It Solves

Gives a rigorous way to build reliable inputs before any algorithm tuning.

Real-Life Analogy

"If the ingredients are wrong or mislabeled, even a great chef cannot save the dish."

When To Use

  • At dataset creation
  • During every model iteration

When NOT To Use

  • Never skip this stage
02

Rows are entities/events; columns are features. The target variable is what you want to predict.

Independent variables are the inputs; the dependent variable is the output. Confusing the two leads to leakage and false confidence.
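A minimal sketch of keeping inputs and target apart, using the example column names that appear later on this page (`customer_age`, `plan_type`, `last_payment_delay`, `churn_label`). The `split_features_target` helper and the sample values are hypothetical and only illustrate the idea.

```python
# Hypothetical sample rows; the values are made up for illustration.
rows = [
    {"customer_age": 34, "plan_type": "basic", "last_payment_delay": 0, "churn_label": 0},
    {"customer_age": 51, "plan_type": "premium", "last_payment_delay": 12, "churn_label": 1},
    {"customer_age": 27, "plan_type": "basic", "last_payment_delay": 3, "churn_label": 0},
]

TARGET = "churn_label"


def split_features_target(rows, target):
    """Return (X, y): feature dicts without the target, and the target values."""
    X = [{k: v for k, v in row.items() if k != target} for row in rows]
    y = [row[target] for row in rows]
    return X, y


X, y = split_features_target(rows, TARGET)

# The target must never appear among the model inputs.
assert all(TARGET not in features for features in X)
print(sorted(X[0]))  # ['customer_age', 'last_payment_delay', 'plan_type']
print(y)             # [0, 1, 0]
```

Keeping the target name in one constant makes it harder for the label to slip into the feature set by accident.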

Noise, missingness, outliers, and imbalance change model behavior more than minor algorithm changes.

The Metaphor

"Dataset thinking is pre-flight checks before takeoff."

Beginner Mental Model

Good features and clean splits beat complex models on broken data.

03

Dataset quality is judged by representativeness, feature validity, label fidelity, and split integrity.

Feature
Input variable used by the model.
Label/Target
Value the model is trained to predict.
Train/Validation/Test
Data partitions for fit, tune, and final unbiased evaluation.
Data Leakage
Using future or label-proxy information during training.
Missing Data
Absent values that need explicit strategy.
Outliers
Extreme values that may be errors or rare but valid cases.
Class Imbalance
Major skew in class frequencies.
  1. Define prediction unit and timestamp.
  2. Separate features vs target.
  3. Classify features: numerical/categorical.
  4. Build split strategy aligned to production.
  5. Audit leakage, missingness, outliers, imbalance.
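One way to make a split "aligned to production" is to split by time, so the model is always evaluated on data that comes after its training data. The sketch below assumes each row carries a timestamp; the `time_based_split` helper, the `event_date` field, and the cut-off dates are hypothetical.

```python
from datetime import date


def time_based_split(rows, time_key, train_end, val_end):
    """Split rows chronologically: train < train_end <= val < val_end <= test."""
    train = [r for r in rows if r[time_key] < train_end]
    val = [r for r in rows if train_end <= r[time_key] < val_end]
    test = [r for r in rows if r[time_key] >= val_end]
    return train, val, test


# Hypothetical events: one per month in 2023.
events = [{"event_date": date(2023, m, 1), "value": m} for m in range(1, 13)]

train, val, test = time_based_split(
    events, "event_date", train_end=date(2023, 9, 1), val_end=date(2023, 11, 1)
)

# Every training row precedes every validation row, which precedes every test row.
assert max(r["event_date"] for r in train) < min(r["event_date"] for r in val)
assert max(r["event_date"] for r in val) < min(r["event_date"] for r in test)
print(len(train), len(val), len(test))  # 8 2 2
```

A random shuffle would mix future rows into training, which is the "random split on time series" mistake this page warns about.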

Input: Raw tabular/event data and label logic.

Output: Model-ready dataset with documented constraints.

01. Splits reflect future serving conditions
02. Labeling process is stable enough
  • Temporal drift
  • Target leakage through engineered features
  • Rare-event collapse
04

This topic is the gatekeeper before EDA, modeling, and evaluation.

  • 01. Numerical: scale, and check distribution and skew.
  • 02. Categorical: encode and handle unseen categories.
  • 03. Missing: impute or model missingness explicitly.
  • 04. Outliers: cap, transform, or use a robust-model strategy.
  • 05. Imbalance: class weights, resampling, threshold tuning.

  • 01. Fit preprocessing on train only.
  • 02. Apply the same transformations to val/test.
  • 03. Monitor per-split distribution drift.
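The categorical and train-only rules above can be sketched together: learn the category vocabulary from training data only, then encode every split with that same vocabulary. The `fit_one_hot` and `transform_one_hot` helpers are hypothetical, and mapping unseen categories to an all-zero row is one possible policy among several. scikit-learn's `OneHotEncoder(handle_unknown="ignore")` behaves similarly.

```python
def fit_one_hot(train_values):
    """Learn the category vocabulary from training values only."""
    return sorted(set(train_values))


def transform_one_hot(values, vocabulary):
    """Encode values as one-hot rows; unseen categories become all zeros."""
    index = {cat: i for i, cat in enumerate(vocabulary)}
    encoded = []
    for value in values:
        row = [0] * len(vocabulary)
        if value in index:
            row[index[value]] = 1
        encoded.append(row)
    return encoded


train_plans = ["basic", "premium", "basic"]
test_plans = ["premium", "enterprise"]  # "enterprise" never appears in train

vocab = fit_one_hot(train_plans)
train_encoded = transform_one_hot(train_plans, vocab)
test_encoded = transform_one_hot(test_plans, vocab)
print(vocab)         # ['basic', 'premium']
print(test_encoded)  # [[0, 1], [0, 0]]
```

Because the vocabulary never sees validation or test rows, the encoded width stays fixed and no information flows backward from evaluation data.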
  1. Schema contract
  2. Split first
  3. Fit preprocessing on train only
  4. Leakage audit
  5. Baseline model
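The first step, a schema contract, can be as simple as a dictionary of expected columns checked against every row. This is a sketch under assumptions: the `SCHEMA` layout, the `validate_row` helper, and the `refund_issued` column are hypothetical, and real projects often use a dedicated validation library instead.

```python
# Hypothetical schema contract: column name -> (expected type, nullable?)
SCHEMA = {
    "customer_age": (int, False),
    "plan_type": (str, False),
    "last_payment_delay": (int, True),
}


def validate_row(row, schema):
    """Return a list of human-readable schema violations for one row."""
    problems = []
    for column, (expected_type, nullable) in schema.items():
        if column not in row:
            problems.append(f"missing column: {column}")
            continue
        value = row[column]
        if value is None:
            if not nullable:
                problems.append(f"null not allowed: {column}")
        elif not isinstance(value, expected_type):
            problems.append(f"wrong type for {column}: {type(value).__name__}")
    # Columns outside the contract may be unavailable at inference time.
    for column in sorted(set(row) - set(schema)):
        problems.append(f"unexpected column: {column}")
    return problems


good = {"customer_age": 40, "plan_type": "basic", "last_payment_delay": None}
bad = {"customer_age": "40", "plan_type": "basic", "refund_issued": True}

print(validate_row(good, SCHEMA))  # []
print(validate_row(bad, SCHEMA))
```

Flagging unexpected columns is a cheap guard against post-outcome fields entering the feature set unnoticed.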
05
06
python
# 1) split first
# 2) fit imputer/encoder/scaler on train
# 3) transform val/test
# 4) leakage checks
# 5) imbalance metrics
customer_age, plan_type, last_payment_delay, churn_label
Train-ready matrix + leakage audit report
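The five-step outline above can be sketched end to end with the standard library alone. Everything concrete here is an assumption for illustration: the sample values, the 60/20/20 chronological split, median imputation, and standard scaling of a single `delay` column (standing in for `last_payment_delay`). The leakage check shown covers only overlapping ids, not every form of leakage.

```python
from statistics import mean, median, pstdev

# Hypothetical rows in time order: (last_payment_delay, churn_label).
# None marks a missing value; all numbers are made up for illustration.
raw = [(0, 0), (2, 0), (None, 0), (1, 0), (15, 1), (3, 0),
       (0, 0), (None, 1), (4, 0), (20, 1)]
rows = [{"id": i, "delay": d, "churn_label": y} for i, (d, y) in enumerate(raw)]

# 1) Split first (chronological 60/20/20), before computing any statistics.
train, val, test = rows[:6], rows[6:8], rows[8:]

# 2) Fit the imputer and the scaler on train only.
observed = [r["delay"] for r in train if r["delay"] is not None]
fill_value = median(observed)
filled_train = [fill_value if r["delay"] is None else r["delay"] for r in train]
center = mean(filled_train)
spread = pstdev(filled_train) or 1.0  # guard against a constant column


def transform(split):
    """3) Apply the train-fitted imputation and scaling to any split."""
    out = []
    for r in split:
        filled = fill_value if r["delay"] is None else r["delay"]
        out.append((filled - center) / spread)
    return out


X_train, X_val, X_test = transform(train), transform(val), transform(test)

# 4) Leakage check: no entity id may appear in more than one split.
ids = [{r["id"] for r in s} for s in (train, val, test)]
assert not (ids[0] & ids[1]) and not (ids[0] & ids[2]) and not (ids[1] & ids[2])

# 5) Imbalance metric: positive rate per split.
positive_rate = {
    name: sum(r["churn_label"] for r in split) / len(split)
    for name, split in (("train", train), ("val", val), ("test", test))
}
print(positive_rate)
```

The validation and test values are scaled with `center` and `spread` from train, so their scaled means will generally not be zero. That is expected and is itself a rough drift signal.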
  • Leakage can create fake high scores that collapse in production.
  • Missingness can carry signal; do not always drop rows blindly.
  • Random split on time series
  • Fitting scaler on full dataset
  • Accuracy-only on imbalanced data
07

Well-defined tabular

Excellent

Best for explicit schema control.

💡 Great for beginner foundations.

High-noise logs

Context-Dependent

Needs stronger cleaning and leakage checks.

💡 Expect more iteration.
08

Train/Validation/Test Split Integrity

Shows why proper split boundaries prevent optimistic evaluation.

Recommended visual: timeline split with strict train -> validation -> test separation and prohibited leakage arrows.

Class Imbalance Effect

Accuracy can stay high while minority recall collapses.

Compare baseline accuracy vs minority recall across imbalance ratios.
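The comparison can be reproduced numerically with a baseline that always predicts the majority class. This is a sketch with synthetic labels; the `accuracy_and_minority_recall` helper and the three imbalance levels are chosen for illustration.

```python
def accuracy_and_minority_recall(y_true, y_pred, minority=1):
    """Return (accuracy, recall on the minority class)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    minority_total = sum(t == minority for t in y_true)
    minority_hits = sum(t == minority and p == minority for t, p in zip(y_true, y_pred))
    recall = minority_hits / minority_total if minority_total else 0.0
    return correct / len(y_true), recall


n = 1000
results = []
for positives in (500, 100, 10):  # minority share of 50%, 10%, 1%
    y_true = [1] * positives + [0] * (n - positives)
    y_pred = [0] * n  # baseline that always predicts the majority class
    acc, rec = accuracy_and_minority_recall(y_true, y_pred)
    results.append((positives / n, acc, rec))
    print(f"minority share {positives / n:.2f}: accuracy={acc:.2f}, minority recall={rec:.2f}")
```

As the minority share shrinks from 50% to 1%, this baseline's accuracy rises from 0.50 to 0.99 while its minority recall stays at 0.00.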
09
  • Reliable Evaluation

    Prevents false confidence from leakage.

  • Faster Debugging

    Systematic data diagnostics reduce random trial-and-error.

  • Upfront Effort

    Requires careful schema and split planning.

  • Process Discipline

    Needs reproducible preprocessing pipelines.

10
Healthcare

Readmission prediction

Strict leakage control is mandatory.

Payments

Fraud detection

Severe imbalance and outlier behavior are central.

11

Workflows driven by dataset thinking tend to be more robust than model-first workflows.

Model-first

Both aim for prediction quality

Jumps to algorithm quickly

Only for rapid prototyping with caution.

Data-first

Same end goal

Prioritizes split/schema/leakage integrity

Production and real impact work.

Aspect | Model-first | Data-first
Failure Rate in Production | Higher | Lower

Default to data-first for durable ML systems.

12

Missing Rate

Coverage and data reliability indicator.

PSI / Drift Stats

Distribution shift indicator across splits/time.

Recall/Precision by Class

Imbalance-aware performance view.
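A minimal sketch of the first two metrics, assuming missing values are stored as `None` and using the common form of the Population Stability Index, PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the reference and comparison shares in bin i. The bin edges, the `eps` floor for empty bins, and the sample values are assumptions chosen for illustration.

```python
import math


def missing_rate(values):
    """Fraction of values that are None."""
    return sum(v is None for v in values) / len(values)


def bin_shares(values, edges, eps=1e-6):
    """Share of values per bin; bin i covers [edges[i], edges[i + 1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    # Floor empty bins at eps so the logarithm below stays defined.
    return [max(c / len(values), eps) for c in counts]


def psi(expected, actual, edges):
    """Population Stability Index between a reference and a comparison sample."""
    e = bin_shares(expected, edges)
    a = bin_shares(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


edges = [float("-inf"), 5, 10, float("inf")]
train_delay = [0, 1, 2, 3, 4, 6, 7, 12]
test_delay = [1, 3, 6, 7, 8, 11, 12, 13]  # shifted toward larger delays

print(missing_rate([0, None, 3, None]))       # 0.5
print(psi(train_delay, train_delay, edges))   # 0.0
print(psi(train_delay, test_delay, edges) > 0)  # True
```

Fixing the bin edges from the training data keeps the comparison consistent across splits and over time.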

  1. Audit splits
  2. Audit leakage
  3. Audit missingness/outliers/imbalance
  4. Then evaluate model
  • Global preprocessing fit before split
  • Ignoring timestamp alignment
  • No per-class metrics

If test AUC is high but minority recall is low, deployment risk remains high.

13
  • × Treating columns as valid features without inference-time checks.
  • × Leaking post-outcome features into training.
  • × Saying "train/test split" without explaining leakage boundaries.
  • × No schema/versioning for features.
14

What kind of bias does this model have?

Bias depends on model assumptions and feature expressiveness.

What kind of variance does it have?

Variance grows with model flexibility and weak regularization.

How does it overfit?

Overfitting usually appears as strong train performance but weaker validation/test behavior.

How do we regularize it?

Use complexity constraints, robust validation, and data-centric cleanup.

What kind of data does it like?

Prefers representative, low-leakage data with stable feature definitions.

What kind of data breaks it?

Breaks under leakage, severe distribution drift, noisy labels, and poorly engineered features.

14

Quick Revision Reference

  • Split strategy is foundational.
  • Leakage invalidates evaluation.
  • Imbalance and missingness need explicit plans.
Imbalance Ratio
  • Building trustworthy ML datasets
  • Rushing to algorithm tuning
Explain leakage examples and train-only preprocessing fit.
15
16

These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.