ML Atlas

Feature Engineering

The art of transforming raw data into the signal your model actually needs.

Intermediate · Supervised
34 min read
Basic statistics: mean, variance, distributions · Understanding of at least one supervised ML algorithm · Familiarity with pandas DataFrames for data manipulation
  • Kaggle grandmasters: top solutions frequently cite feature engineering as a decisive factor, often ahead of algorithm choice
  • Credit scoring: raw transaction amounts engineered into velocity features (spend last 7d, 30d, 90d) by Amex and Capital One
  • Uber surge pricing model: time-of-day, weather, and event proximity features derived from raw timestamps and coordinates
  • Netflix recommendations: user-item interaction features (recency, frequency, diversity of genre) derived from raw watch history
  • Manufacturing defect detection: vibration sensor readings aggregated into statistical features (mean, std, kurtosis) per rolling window
01

In Plain English

Feature engineering is the process of using domain knowledge and mathematical transformations to create new input features from raw data, making patterns easier for machine learning algorithms to learn. Raw data rarely has the ideal representation — feature engineering bridges the gap between data and model.

Why It Exists

Most ML algorithms are mathematically constrained: linear models can only see linear relationships, tree models can't extrapolate beyond the range of values seen in training, and neural networks train best on normalized inputs. Feature engineering re-represents data to expose signal that algorithms can capture.

Problem It Solves

Raw data has missing values, mixed scales, categorical labels, and buried relationships that models can't directly exploit. Feature engineering creates a numeric, complete, appropriately scaled feature matrix where signal is exposed and predictable.

Real-Life Analogy

"A raw ingredient (say, a potato) can be transformed into very different things: mashed, fried, boiled, or as starch. Each transformation makes the potato suitable for a different dish (downstream model). Feature engineering is choosing and applying the right transformation so the resulting 'ingredient' works well in your specific recipe."

When To Use

  • Before any ML model training — feature engineering is always part of the pipeline
  • When model performance plateaus and adding more data or algorithms doesn't help
  • When you have domain knowledge that isn't captured in raw feature values
  • When features are on wildly different scales (need scaling)
  • When categorical features need to be converted to numeric representation
  • When missing values are present (need imputation)

When NOT To Use

  • Deep learning on images/text/audio: the network learns features automatically — manual feature engineering is often counterproductive
  • When you have so little domain knowledge that engineered features are just noise
  • When the raw feature set already perfectly represents the problem (rare)
02

Linear models can only draw straight hyperplanes. If you feed them raw data where the relationship is y = x², they'll fail. But if you engineer a feature x_squared = x², suddenly the linear model can fit it perfectly. Feature engineering is about re-expressing your data in a form that matches what your algorithm can model.
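The x² example above can be checked in a few lines. This is a minimal sketch on synthetic data; the sample size, noise level, and variable names are illustrative assumptions, not part of the original text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = x ** 2 + rng.normal(0, 0.1, size=200)   # quadratic target with a little noise

# Raw feature only: a straight line cannot follow the parabola
raw = x.reshape(-1, 1)
r2_raw = LinearRegression().fit(raw, y).score(raw, y)

# Same model class, plus the engineered x_squared column
engineered = np.column_stack([x, x ** 2])
r2_eng = LinearRegression().fit(engineered, y).score(engineered, y)

print(f"R^2 raw: {r2_raw:.3f} | R^2 with x_squared: {r2_eng:.3f}")
```

The model class never changes; only the representation of the input does.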

Encoding is about converting categorical information (color, city, product type) into numbers without introducing false orderings or information loss. Scaling ensures all features compete on equal mathematical footing — without it, a feature measured in millions will dominate features measured in units, distorting distances and gradients. Imputation ensures your pipeline doesn't fail or silently produce NaN predictions when values are missing.

The highest-leverage feature engineering is domain-driven: creating features that encode expert knowledge about what drives the target. In credit fraud, 'number of transactions in the last 5 minutes' is more predictive than raw transaction amounts — but only someone who understands fraud patterns would think to engineer it. This is the gap between a mediocre model and a state-of-the-art one.

The Metaphor

"Raw data is like speaking in a foreign language to your model. Encoding converts it into the model's native numeric tongue. Scaling gives every word equal volume so the model can hear all of them. Imputation fills in the blanks where words are missing. Polynomial features and interactions teach the model new vocabulary it couldn't express before. Together, they ensure the model and data are speaking the same language fluently."

Beginner Mental Model

Think of feature engineering as a translation and enrichment step: (1) Translate categoricals to numbers (encoding), (2) normalize all numbers to the same scale (scaling), (3) fill in blanks (imputation), (4) add computed combinations that might be more informative than raw values (polynomial features, interactions). Your final feature matrix is like a well-prepared data table that any algorithm can read clearly.

03

Feature engineering is a function φ: X → Z that maps raw feature matrix X ∈ ℝⁿˣᵈ to a transformed feature matrix Z ∈ ℝⁿˣᵈ' (where d' may be larger or smaller than d) such that the functional relationship between Z and the target y is more efficiently learnable by the chosen model class. φ may include encoding, scaling, imputation, and construction of new derived features.

One-Hot Encoding (OHE)
Converts a categorical feature with k unique values into k binary indicator columns. The column for value v contains 1 if the sample has value v, else 0. Avoids implying any ordinal relationship between categories.
Ordinal Encoding
Maps categories to integers (e.g., 'low'→0, 'medium'→1, 'high'→2) preserving order. Only valid when a meaningful ordering exists — using it for non-ordinal features implies false order.
Target Encoding
Replaces a category with the mean of the target variable for that category (e.g., replace city='NYC' with average_house_price_in_NYC). Powerful but risks target leakage — must be computed inside each CV fold.
StandardScaler
Transforms each feature to have zero mean and unit variance: z = (x - μ) / σ. Critical for gradient descent, SVM, PCA, and any distance-based method.
MinMaxScaler
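Scales each feature to a fixed range [0, 1]: z = (x - min) / (max - min). Preserves the shape of the distribution but is sensitive to outliers. Zeros stay at zero only when the feature's minimum is 0 (MaxAbsScaler is the scaler that preserves zeros).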
RobustScaler
Scales using the median and interquartile range: z = (x - median) / IQR. Robust to outliers — extreme values don't distort the scaling of other values.
Imputation
Filling in missing values. Strategies: mean/median/mode (simple), KNN (use k nearest neighbors' values), MICE/IterativeImputer (model-based, iterative).
Polynomial Features
Creating new features as powers (x², x³) and cross-products (x₁·x₂) of existing features. Enables linear models to fit non-linear relationships without changing the model class.
Interaction Terms
Products of pairs of features: x₁·x₂. Captures synergistic effects where the impact of x₁ on y depends on the value of x₂. A linear model cannot express interactions without these explicit terms.
Binning / Discretization
Converting a continuous feature into discrete bins (e.g., age into 'young/middle/senior'). Useful when the relationship is step-function-like or when you want to handle outliers robustly.
  1. Exploratory data analysis: profile raw features (dtypes, missing %, unique values, distributions, correlations with target).
  2. Identify feature types: continuous numeric, discrete numeric, nominal categorical, ordinal categorical, date/time, text.
  3. Impute missing values: choose strategy based on missingness mechanism (MCAR, MAR, MNAR) and model type.
  4. Encode categorical features: OHE for nominal (low cardinality), target encoding for high cardinality, ordinal encoding for ordered categories.
  5. Scale numeric features: StandardScaler for algorithms sensitive to scale (linear models, SVM, neural networks, PCA); MinMaxScaler when inputs must lie in a bounded range (common for neural networks); RobustScaler when outliers are present.
  6. Engineer new features: domain-driven features, polynomial features, interaction terms, aggregations, time-based features.
  7. Handle special cases: binary features (no scaling needed), datetime decomposition (hour, day, month, day-of-week), geospatial features (distance to centroid).
  8. Validate: check for target leakage (features that directly encode the target), remove zero-variance and near-zero-variance features.

Raw feature matrix: mix of numeric (continuous and discrete), categorical (nominal and ordinal), datetime, and potentially text columns. May contain missing values.

Transformed numeric feature matrix Z ∈ ℝⁿˣᵈ' with no missing values, appropriate scales, encoded categoricals, and additional engineered features.

01. Imputation assumes missing values can be estimated from other features (MAR or MCAR mechanism). MNAR (missing not at random) requires more sophisticated handling.
02. StandardScaler works best when feature distributions are roughly symmetric. It does not require Gaussian data, but heavy-tailed distributions (income, log-normal) benefit from a log transformation before scaling.
03. Target encoding assumes categories seen at inference also appeared in training. Unknown categories at inference must be handled with a fallback (global mean).
04. Polynomial feature expansion assumes the true relationship has polynomial structure. Blindly expanding to degree 3+ on many features creates a combinatorial explosion.
  • High cardinality categoricals (city with 10,000 unique values): OHE creates 10,000 columns — too many. Use target encoding, frequency encoding, or embedding.
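  • Unseen categories at inference: OHE and ordinal encoding can fail when test data has categories not seen in training. Set handle_unknown='ignore' in OneHotEncoder, or handle_unknown='use_encoded_value' together with unknown_value in OrdinalEncoder.
  • Features with zero variance (constants): cause division by zero in a naive (x - μ) / σ implementation and carry no signal. sklearn's StandardScaler guards against this by leaving such columns unscaled, but it is cleaner to remove them with VarianceThreshold before scaling.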
  • Non-positive values with log transformation: log(0) = -∞, log(negative) = undefined. Apply log1p (log(1+x)) for zero-containing features or add a constant shift.
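Two of the edge cases above are easy to see on toy data. The sketch below uses made-up city names and counts purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_city = np.array([["NYC"], ["LA"], ["Chicago"]])
test_city = np.array([["LA"], ["Houston"]])   # "Houston" never appears in training

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_city)
encoded = enc.transform(test_city).toarray()
print(enc.categories_[0], encoded)

# log1p computes log(1 + x), so zero counts map to 0 rather than -inf
counts = np.array([0.0, 1.0, 10.0, 1000.0])
log_counts = np.log1p(counts)
print(log_counts)
```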
04

Feature engineering is the first transformation in the ML pipeline, sitting between raw data ingestion and model training. All feature engineering steps must be encapsulated in a sklearn Pipeline or equivalent to prevent data leakage during cross-validation. The output of feature engineering is the feature matrix consumed by the model.

  • 01. Profile the data: df.info(), df.describe(), df.isnull().sum(), value_counts() for each categorical.
  • 02. Understand missingness: is it random (MCAR), depends on other features (MAR), or depends on the missing value itself (MNAR)?
  • 03. Visualize distributions: histograms for continuous, countplots for categorical, scatter plots vs. target. Identify skewness, outliers, multimodal distributions.
  • 04. Check target leakage: any feature with suspiciously high correlation (> 0.98) to the target is potentially leaky.
  • 05. Split train/test BEFORE any feature engineering to ensure no test information influences feature construction.

  • 01. Wrap all transformations in sklearn ColumnTransformer + Pipeline to ensure correct application in CV.
  • 02. Fit all transformers (scalers, encoders, imputers) on training data ONLY, then transform both train and test.
  • 03. Add new engineered features before passing to the ColumnTransformer; computed columns can be added with a FunctionTransformer.
  • 04. Check feature importance post-training: do engineered features rank above raw features? If not, reconsider the engineering logic.
  • 05. Iterate: feature engineering is empirical. Add features, measure CV improvement, keep what helps.

Imputation strategy

How missing values are filled: mean, median, most_frequent, or KNN (n_neighbors).

Median for skewed numerics; most_frequent for categoricals; KNNImputer(n_neighbors=5) for correlated missingness

Polynomial degree

Maximum degree for PolynomialFeatures. Degree 2 adds all x² and x₁x₂ terms; degree 3 adds x³, x₁²x₂, etc.

Degree 2 for most use cases. Degree 3+ causes rapid combinatorial feature growth (C(d + k, k) terms for d features at degree k, counting the bias term) and overfitting.
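To get a feel for how quickly the expansion grows, the sketch below counts the output columns of PolynomialFeatures for 10 raw features. The choice of 10 features and include_bias=False are assumptions made for the illustration.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_raw = 10
X = np.zeros((1, n_raw))   # values are irrelevant; only the column count matters

counts = {}
for degree in (1, 2, 3, 4):
    poly = PolynomialFeatures(degree=degree, include_bias=False).fit(X)
    counts[degree] = poly.n_output_features_
    # Closed form: C(n_raw + degree, degree) monomials, minus 1 for the dropped bias
    closed_form = comb(n_raw + degree, degree) - 1
    print(f"degree={degree}: {counts[degree]} features (closed form {closed_form})")
```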

Target encoding smoothing (m)

Regularization parameter for target encoding: encoded_value = (count × category_mean + m × global_mean) / (count + m).

m = 10 to 300 depending on dataset size. Larger m → smoother encoding → less overfitting.
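A quick worked example of the smoothing formula, with invented numbers: a category mean of 0.8, a global mean of 0.5, and m = 10. The function name is hypothetical.

```python
def smoothed_target_mean(count, category_mean, global_mean, m):
    """Smoothed encoding: (count * category_mean + m * global_mean) / (count + m)."""
    return (count * category_mean + m * global_mean) / (count + m)

# Rare category (5 rows): the estimate is pulled strongly toward the global mean
rare = smoothed_target_mean(count=5, category_mean=0.8, global_mean=0.5, m=10)

# Common category (5000 rows): the estimate stays close to its own mean
common = smoothed_target_mean(count=5000, category_mean=0.8, global_mean=0.5, m=10)

print(f"rare: {rare:.4f} | common: {common:.4f}")
```

With only 5 observations the encoding lands at (4 + 5) / 15 = 0.6, a third of the way from the global mean to the category mean; with 5000 observations the smoothing is almost invisible.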

  1. Import the building blocks: from sklearn.compose import ColumnTransformer; from sklearn.pipeline import Pipeline; from sklearn.impute import SimpleImputer; from sklearn.preprocessing import StandardScaler, OneHotEncoder
  2. Identify numeric and categorical column names (numeric_cols, cat_cols)
  3. Build numeric transformer: num_transformer = Pipeline([('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])
  4. Build categorical transformer: cat_transformer = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')), ('encoder', OneHotEncoder(handle_unknown='ignore'))])
  5. Combine: preprocessor = ColumnTransformer([('num', num_transformer, numeric_cols), ('cat', cat_transformer, cat_cols)])
  6. Add model: full_pipeline = Pipeline([('preprocessor', preprocessor), ('model', model)])
  7. Fit: full_pipeline.fit(X_train, y_train). Transform is automatic.
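The steps above can be assembled into one runnable sketch. The synthetic DataFrame, the column names, and the choice of LogisticRegression as the model are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: two numeric columns and one categorical column
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "age": rng.integers(18, 80, n).astype(float),
    "income": rng.lognormal(10, 1, n),
    "city": rng.choice(["NYC", "LA", "Chicago"], n),
})
y = (X["age"] > 45).astype(int)                      # target built before NaNs are injected
X.loc[rng.choice(n, 30, replace=False), "age"] = np.nan

numeric_cols = ["age", "income"]
cat_cols = ["city"]

num_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", num_transformer, numeric_cols),
    ("cat", cat_transformer, cat_cols),
])
full_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", LogisticRegression()),
])

# Inside cross_val_score, every transformer is refit on the training folds only
scores = cross_val_score(full_pipeline, X, y, cv=5)
print("CV accuracy per fold:", scores.round(3))
```

Because preprocessing lives inside the pipeline, the imputer medians and scaler statistics never see the validation fold.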
05
06
```python
import numpy as np
import pandas as pd

# ── 1. StandardScaler from scratch ────────────────────────────────────────────
class StandardScaler:
    def __init__(self):
        self.mean_ = None
        self.std_  = None

    def fit(self, X: np.ndarray) -> "StandardScaler":
        self.mean_ = X.mean(axis=0)             # (d,)
        # Sample std (ddof=1); sklearn's StandardScaler uses the population std (ddof=0)
        self.std_  = X.std(axis=0, ddof=1)      # (d,)
        self.std_[self.std_ == 0] = 1.0          # avoid division by zero for constant features
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        return (X - self.mean_) / self.std_

    def fit_transform(self, X: np.ndarray) -> np.ndarray:
        return self.fit(X).transform(X)

    def inverse_transform(self, Z: np.ndarray) -> np.ndarray:
        return Z * self.std_ + self.mean_


# ── 2. MinMaxScaler from scratch ──────────────────────────────────────────────
class MinMaxScaler:
    def __init__(self, feature_range=(0, 1)):
        self.min_   = None
        self.range_ = None
        self.lo, self.hi = feature_range

    def fit(self, X: np.ndarray) -> "MinMaxScaler":
        self.min_   = X.min(axis=0)
        self.range_ = X.max(axis=0) - X.min(axis=0)
        self.range_[self.range_ == 0] = 1.0      # constant features map to lo
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        X_std = (X - self.min_) / self.range_
        return X_std * (self.hi - self.lo) + self.lo

    def fit_transform(self, X: np.ndarray) -> np.ndarray:
        return self.fit(X).transform(X)


# ── 3. Target Encoder with smoothing ─────────────────────────────────────────
class TargetEncoder:
    def __init__(self, smoothing: float = 30.0):
        self.smoothing = smoothing
        self.global_mean_ = None
        self.category_stats_ = {}      # {col: {category: encoded_value}}

    def fit(self, X: pd.DataFrame, y: pd.Series,
            cols: list) -> "TargetEncoder":
        self.global_mean_ = y.mean()
        for col in cols:
            stats = {}
            for cat, group in y.groupby(X[col]):
                n_c      = len(group)
                cat_mean = group.mean()
                # Smoothed target encoding: shrink small categories toward the global mean
                smoothed = (n_c * cat_mean + self.smoothing * self.global_mean_) / (n_c + self.smoothing)
                stats[cat] = smoothed
            self.category_stats_[col] = stats
        return self

    def transform(self, X: pd.DataFrame, cols: list) -> pd.DataFrame:
        X_out = X.copy()
        for col in cols:
            # Categories unseen during fit fall back to the global mean
            X_out[col] = X[col].map(self.category_stats_[col]).fillna(self.global_mean_)
        return X_out


# ── 4. KNN Imputer from scratch ───────────────────────────────────────────────
class SimpleKNNImputer:
    """Imputes missing values using the mean of k nearest complete neighbors."""
    def __init__(self, k: int = 5):
        self.k        = k
        self.X_train_ = None

    def fit(self, X: np.ndarray) -> "SimpleKNNImputer":
        # Store only complete rows for reference
        self.X_train_ = X[~np.any(np.isnan(X), axis=1)]
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        X_out = X.copy().astype(float)
        for i, row in enumerate(X):
            missing_mask = np.isnan(row)
            if not missing_mask.any():
                continue
            # Compute distance using observed features only
            observed_mask = ~missing_mask
            dists = np.sqrt(((self.X_train_[:, observed_mask] - row[observed_mask]) ** 2).sum(axis=1))
            knn_idx = np.argsort(dists)[:self.k]
            X_out[i, missing_mask] = self.X_train_[knn_idx][:, missing_mask].mean(axis=0)
        return X_out


# ── Demo: Full pipeline on synthetic data ─────────────────────────────────────
np.random.seed(42)
n = 500
df = pd.DataFrame({
    "age":    np.random.randint(18, 80, n).astype(float),
    "income": np.random.lognormal(10, 1, n),
    "city":   np.random.choice(["NYC", "LA", "Chicago", "Houston"], n),
    "target": np.random.randn(n)
})
# Introduce missing values
df.loc[np.random.choice(n, 50, replace=False), "age"]    = np.nan
df.loc[np.random.choice(n, 30, replace=False), "income"] = np.nan

# Split
train_df = df.iloc[:400].copy()
test_df  = df.iloc[400:].copy()
y_train  = train_df.pop("target")

# Apply transformations (fit on train only)
num_cols = ["age", "income"]

# Impute first: the from-scratch scalers are not NaN-aware, so scaling the raw
# columns would turn every mean and std into NaN
imputer = SimpleKNNImputer(k=5).fit(train_df[num_cols].values)
X_train_imp = imputer.transform(train_df[num_cols].values)
X_test_imp  = imputer.transform(test_df[num_cols].values)

scaler = StandardScaler()
X_train_num = scaler.fit_transform(X_train_imp)
X_test_num  = scaler.transform(X_test_imp)
print("After StandardScaler | train mean:", X_train_num.mean(axis=0).round(3),
      "| std:", X_train_num.std(axis=0, ddof=1).round(3))

enc = TargetEncoder(smoothing=30)
enc.fit(train_df, y_train, cols=["city"])
print("NYC encoding:", enc.category_stats_["city"].get("NYC", "N/A"))
```
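The KNN imputer stores only complete training rows (no NaNs) as reference points, so every neighbor can supply a value for any missing feature. Distance is computed using only the features observed in the query row. This is similar in spirit to sklearn's KNNImputer, which uses a NaN-aware Euclidean distance but can draw neighbors from any training row where the target feature is observed, not only fully complete rows. The TargetEncoder's smoothing parameter pulls the estimates for rare categories toward the global mean, which keeps them from being noisy.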
Raw DataFrame: age (float, 15% missing), income (float, log-normal, skewed), city (categorical, 50 unique values), sex (binary), fare (float, extreme outliers), target (binary).
Transformed matrix: age (standardized, imputed with median), income (log1p then standardized), city (target encoded), sex (one-hot encoded: 2 columns), fare (robust scaled). All NaN eliminated. Final shape: (n, 6+) numeric features ready for modeling.
  • The #1 rule: fit all transformers on training data ONLY. Apply the fitted transformer to test data. Violations cause data leakage — the most common cause of overly optimistic benchmark scores.
  • Wrap everything in a sklearn Pipeline + ColumnTransformer. This enforces the fit-on-train rule automatically and makes your preprocessing pipeline a single serializable object for deployment.
  • Target encoding must be computed inside each CV fold. If you compute target encodings on the full training set before CV, validation fold targets leak into the encodings — the model sees validation targets during training.
  • Log-transform highly skewed features (income, price, count data) before StandardScaler — it makes the distribution more Gaussian and prevents extreme values from dominating gradient updates.
  • RobustScaler is almost always better than StandardScaler or MinMaxScaler when outliers exist — it doesn't require winsorizing as a preprocessing step.
  • Fitting StandardScaler before the train/test split: the test set's mean and std influence the scaler parameters, leaking the test distribution into training.
  • Using OHE for high-cardinality categoricals (city with 5,000 unique values) — creates 5,000 columns, most nearly zero, causing memory issues and sparse model behavior.
  • Forgetting to set handle_unknown='ignore' in OneHotEncoder — fails at inference when test data contains categories not seen in training.
  • Applying log transform to features that can be zero or negative — log(0) = -∞. Use np.log1p (log(1+x)) for zero-containing features, or add a constant shift.
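The rule that target encoding must be computed inside each CV fold can be sketched as an out-of-fold encoder: each row is encoded using target statistics from the other folds only. The function name oof_target_encode and its parameters are hypothetical, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(cat: pd.Series, y: pd.Series, m: float = 10.0,
                      n_splits: int = 5, seed: int = 0) -> pd.Series:
    """Out-of-fold smoothed target encoding for a single categorical column."""
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cat):
        y_fit, cat_fit = y.iloc[fit_idx], cat.iloc[fit_idx]
        global_mean = y_fit.mean()
        stats = y_fit.groupby(cat_fit).agg(["sum", "count"])
        # sum = count * category_mean, so this is the smoothing formula with parameter m
        smoothed = (stats["sum"] + m * global_mean) / (stats["count"] + m)
        # Rows in the held-out fold never contribute their own target value
        held_out = cat.iloc[enc_idx].map(smoothed).fillna(global_mean)
        encoded.iloc[enc_idx] = held_out.to_numpy()
    return encoded

rng = np.random.default_rng(0)
city = pd.Series(rng.choice(["NYC", "LA", "Chicago", "Houston"], 200))
y = pd.Series(rng.normal(size=200))
city_te = oof_target_encode(city, y)
print(city_te.head())
```

Recent scikit-learn releases also ship a TargetEncoder in sklearn.preprocessing whose fit_transform applies a similar cross-fitting scheme, which may be preferable to a hand-rolled version.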
07
📊

Tabular / Structured Data

Excellent

Feature engineering is most impactful here. Raw tabular data almost always needs encoding, scaling, and imputation, plus often benefits from domain-driven derived features.

💡 This is the primary domain for feature engineering. Most real-world tabular ML problems can be significantly improved by careful feature engineering.
📈

Time Series Data

Excellent

Time-based feature extraction is critical: lag features (value at t-1, t-2), rolling statistics (mean/std over past 7d), seasonality indicators (day-of-week, month), Fourier features for periodicity.

💡 All time-based features must respect causality: only use information that would be available at prediction time. Rolling windows that include the current or future target values, or a shuffled (non-chronological) CV split, cause temporal leakage.
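A minimal pandas sketch of causal lag and rolling features follows. The daily "sales" series is invented, and the key detail is the shift(1) before rolling, which keeps the current value out of its own window.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=10, freq="D")
ts = pd.DataFrame({"sales": np.arange(10, dtype=float)}, index=idx)

# Lag features: the value one and seven days earlier
ts["lag_1"] = ts["sales"].shift(1)
ts["lag_7"] = ts["sales"].shift(7)

# Rolling mean over the previous 3 days; shift(1) makes the window end at t-1
ts["roll_mean_3"] = ts["sales"].shift(1).rolling(window=3).mean()

# Seasonality indicator (Monday = 0)
ts["day_of_week"] = ts.index.dayofweek

print(ts.head(5))
```

The first rows contain NaN because there is no history yet; they are usually dropped or imputed before training.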
🖼️

Image Data

Poor

Modern image models (CNNs, Vision Transformers) learn feature representations end-to-end. Manual feature engineering (SIFT, HOG) is largely obsolete for deep learning pipelines. Useful only for classical ML on images.

💡 For classical ML on images: HOG + StandardScaler + SVM is still competitive on small datasets. For deep learning: let the network learn features.
💬

Text / NLP Data

Context-Dependent

Feature engineering for classical NLP (TF-IDF, n-grams, POS tags, sentiment) is highly effective with linear models. For transformer models (BERT, GPT), the model learns text features automatically.

💡 Hybrid approaches: use BERT embeddings as features, then engineer meta-features (text length, punctuation count, entity types) alongside.
📐

High-Dimensional Sparse Data

Good

Feature engineering includes dimensionality reduction (TruncatedSVD), interaction feature selection, and hash-based encoding (HashingVectorizer). StandardScaler is often skipped (MaxAbsScaler preferred for sparse matrices).

💡 Centering sparse data (StandardScaler) destroys sparsity — use MaxAbsScaler instead, which scales without centering.
🗺️

Geospatial / Location Data

Excellent

Distance to landmarks, cluster membership (k-means on lat/lon), grid-cell binning, Haversine distance features. Raw lat/lon coordinates are almost never useful directly — their interaction is the signal.

💡 Geospatial features often require domain knowledge: distance to nearest airport, population density, crime rate by zip code. Public datasets (census, OSM) are invaluable here.
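As an illustration of a distance-to-landmark feature, here is a small Haversine sketch in NumPy. The coordinates are approximate city centers chosen for the example, and the function name is hypothetical.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Vectorized: distance from several listings to one landmark (here, central NYC)
listing_lat = np.array([40.7128, 34.0522, 41.8781])     # NYC, LA, Chicago
listing_lon = np.array([-74.0060, -118.2437, -87.6298])
dist_to_nyc = haversine_km(listing_lat, listing_lon, 40.7128, -74.0060)
print(dist_to_nyc.round(1))
```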
08

Effect of Scaling on Feature Distributions

Comparison of the same feature (income, log-normal distributed) before and after StandardScaler, MinMaxScaler, and RobustScaler. StandardScaler shifts and scales but doesn't change the shape. MinMaxScaler compresses to [0,1] but outliers distort all other values. RobustScaler centers on median — outliers have minimal impact on other values.
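The same comparison can be reproduced numerically on a tiny column with one extreme value. The six income figures below are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Five typical incomes (in thousands) and one extreme outlier
income = np.array([30.0, 35.0, 40.0, 45.0, 50.0, 1000.0]).reshape(-1, 1)

scaled = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    name = type(scaler).__name__
    scaled[name] = scaler.fit_transform(income).ravel()
    print(f"{name:>14}: {scaled[name].round(3)}")

# Spread of the five typical values after scaling: a larger spread means
# the scaler kept them distinguishable despite the outlier
spread = {name: z[:5].max() - z[:5].min() for name, z in scaled.items()}
print(spread)
```

With mean/std or min/max scaling, the five typical values end up squeezed into a narrow band; median/IQR scaling keeps them well separated, while the outlier itself remains large.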

Encoding Cardinality Trade-off

Impact of encoding choice on feature matrix dimensionality and model AUC across different categorical feature cardinalities (number of unique values). OHE becomes impractical for high cardinality; target encoding handles arbitrary cardinality with minimal dimensionality increase.

Imputation Strategy Impact on Model AUC (Varying Missingness %)

Cross-validation AUC of a logistic regression model using different imputation strategies as the percentage of missing values increases. KNN imputation maintains performance longest; mean imputation degrades quickly with high missingness; median imputation is robust for skewed distributions.

09
  • Exposes signal that algorithms can't discover on their own

    A linear model cannot learn y = x₁/x₂ directly, but engineering a ratio feature and feeding it in makes the relationship trivially learnable. Feature engineering is often the only way to capture domain knowledge that the algorithm's hypothesis class can't represent.

  • Improves performance across all algorithm classes

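    Good feature engineering helps linear models, tree-based models, SVMs, and neural networks alike, though not every step matters equally: tree-based models are largely insensitive to monotonic rescaling but still benefit from imputation, encoding, and derived features. Complete, encoded, sensibly scaled data is rarely worse than raw data with missing values and mixed scales.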

  • Reduces model complexity required for the same performance

    A dataset with well-engineered features can often be solved by a simple logistic regression that would otherwise require a complex gradient boosted tree. Simpler models are faster, more interpretable, and more deployable.

  • Handles the data heterogeneity unavoidable in real-world problems

    Real data always mixes categorical, numeric, datetime, and missing values. Feature engineering is the systematic discipline for handling this heterogeneity — turning an unstructured mess into a clean numeric matrix.

  • Creates a documented, auditable, deployable transformation

    A sklearn Pipeline with fitted transformers is a serializable artifact that can be saved and loaded for deployment. Every transformation is explicit, reproducible, and auditable — no magic in the feature matrix.

  • Often more impactful than hyperparameter tuning

    Empirically, investing in feature engineering often provides larger performance gains than exhaustive hyperparameter optimization, particularly on tabular data. A weak model on strong features usually outperforms a strong model on weak features.

  • Requires domain knowledge

    The best features come from understanding the domain — what drives the target variable, what signals are meaningful. Without domain expertise, feature engineering degenerates into blind transformations that may not capture meaningful patterns.

  • Risk of target leakage

    Features that encode information about the target variable (or future information) produce models that perform well in development but fail in production. Leaky features are subtle and hard to detect — e.g., a 'reason for visit' column filled in after the doctor's diagnosis that correlates with the diagnosis.

  • High-dimensional feature spaces increase overfitting risk

    Polynomial expansion and extensive interaction terms can create hundreds of features from a handful of originals. Without regularization, models overfit to these additional degrees of freedom. d >> n scenarios require careful regularization (L1/L2) after expansion.

  • Pipeline complexity and maintenance burden

    A production feature engineering pipeline with 15 transformers, custom encoders, and feature creation functions is complex to maintain. Every new data source or schema change may require pipeline updates — a significant operational burden.

  • Scaler parameters need monitoring in production for distribution drift

    StandardScaler parameters (mean, std) are learned on historical training data. If the data distribution shifts in production (new customer segment, seasonal patterns), the scaler becomes miscalibrated. Regular retraining or online updates are needed.

10
Finance / Credit

Credit risk modeling — velocity features from transaction history

Raw data: individual transactions with amount, merchant category, timestamp. Engineering: total spend in last 7/30/90 days per category, max single transaction, time since last transaction, ratio of international to domestic transactions. These features dramatically outperform raw transaction amounts in predicting default.

Retail / E-Commerce

Customer lifetime value and churn prediction

Raw data: purchase history logs. Engineering: recency (days since last purchase), frequency (purchases in last 90d), monetary (total spend), RFM segments, entropy of product categories (diversity), day-of-week purchase patterns, browsing-to-purchase ratio. RFM is a classic feature engineering framework for customer segmentation.
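The recency, frequency, and monetary features can be derived with a single groupby. This is a simplified sketch: the order log and snapshot date are invented, and frequency counts all orders rather than only the last 90 days.

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "order_date": pd.to_datetime([
        "2024-01-10", "2024-03-01",
        "2024-02-15", "2024-03-20", "2024-03-30",
        "2023-12-01",
    ]),
    "amount": [20.0, 30.0, 15.0, 25.0, 10.0, 100.0],
})

# Features are computed "as of" a snapshot date so that nothing
# after the prediction time leaks in
snapshot = pd.Timestamp("2024-04-01")

rfm = orders.groupby("customer_id").agg(
    last_order=("order_date", "max"),
    frequency=("order_date", "count"),
    monetary=("amount", "sum"),
)
rfm["recency_days"] = (snapshot - rfm["last_order"]).dt.days
print(rfm[["recency_days", "frequency", "monetary"]])
```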

Healthcare

Clinical risk scoring (sepsis, readmission prediction)

Raw data: vitals time series, lab values, diagnosis codes. Engineering: rate of change of vitals (trend features), worst value in past 24h, number of distinct diagnoses in past year (comorbidity count), age × diagnosis interaction terms. Clinical guidelines (SOFA score, Charlson Comorbidity Index) encode expert feature engineering.

Real Estate

House price prediction — location and property features

Raw data: lat/lon, sq_footage, bedrooms, year_built. Engineering: distance to CBD, nearest school score, neighborhood average income, age (current_year - year_built), price per sq_foot of comparable nearby sales, floor-area ratio, room-count ratios. Location features from external datasets (school ratings, walkability scores) are the difference between a mediocre and excellent model.

Manufacturing / IoT

Predictive maintenance — sensor signal feature extraction

Raw data: vibration sensor readings at 10 kHz. Engineering: mean, std, kurtosis, and skewness over rolling 1-second windows, peak-to-peak amplitude, dominant frequency via FFT, ratio of harmonic to fundamental frequency. These signal statistics compress the 10,000 raw readings in each window into ~20 meaningful features while retaining much of the statistical signature of degradation.

NLP / Text Classification

Spam detection and content moderation

Raw data: email text. Engineering: TF-IDF unigrams + bigrams, email length, punctuation density, capitalization ratio, URL count, number of exclamation marks, presence of known spammer domains (domain feature from metadata). Structural text features complement TF-IDF effectively for rule-based filtering alongside ML.

11

Feature engineering encompasses a family of distinct transformation techniques. Here's how the key encoding and scaling methods compare:

One-Hot Encoding vs. Target Encoding

Both convert categorical features to numeric

OHE creates binary indicator columns — safe, no leakage, low cardinality only. Target encoding uses the target mean per category — powerful, risk of leakage, handles any cardinality. Target encoding needs smoothing and must be done inside CV folds.

OHE for cardinality ≤ 20. Target encoding for cardinality > 20, especially when cardinality is in the hundreds or thousands. For genuinely ordered categories, ordinal encoding is usually the simpler choice.

StandardScaler vs. RobustScaler

Both center and scale features to comparable ranges

StandardScaler uses mean and std — fast, standard, sensitive to outliers. RobustScaler uses median and IQR — slower, more robust, ideal when outliers cannot be removed.

StandardScaler when data is clean and approximately Gaussian. RobustScaler when outliers are known to exist or when you can't afford to investigate/remove them.

Mean Imputation vs. KNN Imputation

Both fill in missing values with estimated replacements

Mean imputation uses the feature's training mean — fast, simple, distorts variance, ignores feature correlations. KNN uses nearest neighbor values — slower, captures correlations, better for high missingness with informative neighbors.

Mean/median for < 10% missingness in non-critical features. KNN for > 10% missingness or when missing features are strongly correlated with observed features.

Polynomial Features vs. Interaction Terms

Both create non-linear combinations of existing features

Polynomial features include powers (x²) plus interactions (x₁x₂). interaction_only=True creates only cross-products, not powers. Powers model U-shapes; interactions model synergistic effects.

Polynomial: when you suspect quadratic or cubic relationships (from residual plots). interaction_only=True: when domain knowledge suggests that features interact but squared terms are not needed.

Method | Handles NaN? | Cardinality | Leakage Risk | Speed | Outlier Robust?
--- | --- | --- | --- | --- | ---
One-Hot Encoding | ✓ (mode fill) | Low (≤20) | ✗ None | ⚡ Fast | N/A
Target Encoding | ✓ (global mean) | Any | ✓ High | ⚡ Fast | N/A
Ordinal Encoding | ✓ (mode fill) | Ordered only | ✗ None | ⚡ Fast | N/A
StandardScaler | ✗ No | N/A | ✗ None | ⚡ Fast | ✗ No
RobustScaler | ✗ No | N/A | ✗ None | ⚡ Fast | ✓ Yes
KNN Imputation | ✓ Yes | N/A | ✗ None | 🐢 Slow | Partial
PolynomialFeatures | ✗ No | N/A | ✗ None | ⚡ Fast | ✗ No

You have structured tabular data with mixed feature types, missing values, and varying scales. Feature engineering is the mandatory preprocessing step before any model training on real-world tabular data.

12

Feature Importance (model-based)

Permutation importance: how much does the model's AUC drop when feature j's values are randomly shuffled? A large drop means the model relies on the feature. A drop near zero means the feature adds little. A negative drop (the score improves after shuffling) suggests the feature is introducing noise.

Target: Engineered features should have higher importance than their raw source features
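Permutation importance is available in scikit-learn as `sklearn.inspection.permutation_importance`. The sketch below uses synthetic data with one informative column and one pure-noise column; the data, the model choice, and the split sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
signal = rng.standard_normal(n)   # informative feature
noise = rng.standard_normal(n)    # unrelated to the label
X = np.column_stack([signal, noise])
y = (signal + 0.3 * rng.standard_normal(n) > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
model = LogisticRegression().fit(X_tr, y_tr)

# Shuffle each column on held-out data and record the drop in AUC.
result = permutation_importance(
    model, X_val, y_val, scoring="roc_auc", n_repeats=10, random_state=0
)
for name, mean_drop in zip(["signal", "noise"], result.importances_mean):
    print(f"{name}: mean AUC drop = {mean_drop:.3f}")
```

Measuring on held-out data rather than training data keeps the importance estimate from rewarding features the model has merely memorized.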

Cross-Validation AUC/Metric Improvement

The most reliable evaluation: add a candidate feature, re-run CV, measure the delta. Positive and consistent delta = feature helps. Near-zero delta = feature is redundant. Negative delta = feature introduces noise or leakage.

Target: Any consistent positive CV improvement (> 0.002 AUC for stable estimates) justifies including the feature
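The add-a-feature-and-measure loop might look like the sketch below. The data are synthetic and constructed so that the label depends only on the product of two inputs, which makes the candidate interaction feature clearly useful; real deltas are usually far smaller.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 800
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
y = (x1 * x2 > 0).astype(int)  # label driven purely by the interaction

X_base = np.column_stack([x1, x2])
X_cand = np.column_stack([x1, x2, x1 * x2])  # baseline + candidate feature

# Fixed splits so both feature sets are scored on identical folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

def cv_auc(X):
    """Per-fold AUC with scaling fitted inside each fold via the pipeline."""
    model = make_pipeline(StandardScaler(), LogisticRegression())
    return cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

base_scores = cv_auc(X_base)
cand_scores = cv_auc(X_cand)
delta = cand_scores - base_scores
print(f"baseline AUC:  {base_scores.mean():.3f}")
print(f"candidate AUC: {cand_scores.mean():.3f}")
print(f"per-fold delta: {np.round(delta, 3)}")
```

Looking at the per-fold deltas, not just the mean, is what makes "consistent" checkable: a feature that helps on some folds and hurts on others deserves suspicion.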

Missing Value Rate

Features with > 70% missing values are rarely informative after imputation. However, even high missingness can be informative as a binary indicator: 'was this feature missing?' is sometimes predictive on its own.

Target: < 20% missing for reliable features; consider binary missingness indicator for features with > 30% missing
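A missingness indicator can be added at imputation time with `SimpleImputer(add_indicator=True)`. The small matrix below is made up; the first column is 50% missing and the second is complete.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, 30.0],
    [np.nan, 40.0],
])

# Fraction of missing values per column.
missing_rate = np.isnan(X).mean(axis=0)
print("missing rate per feature:", missing_rate)

# Median-impute and append a 0/1 column for each feature that had missing values.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)
print(X_out)
```

The output has three columns: the two imputed features plus one indicator for the first column, since indicators are only created for features that contained missing values during fitting.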

  1. Baseline: train model on raw features after minimal preprocessing (median impute, OHE). Record CV AUC.
  2. Add one engineered feature category at a time: scaling, then encoding improvement, then domain features.
  3. Measure CV AUC after each addition. Record which features improve the metric.
  4. Check feature importance: permutation importance or model coefficients. Drop features with near-zero importance.
  5. Check for leakage: any feature with correlation > 0.9 to target in test set should be investigated immediately.
  6. Final validation: report the best feature set's performance on the truly held-out test set once.
  • Target leakage: engineering features that encode future information or directly derive from the target. Symptom: suspiciously high CV score, dramatically worse test score.
  • Train-test skew: engineering features using statistics from the full dataset before the split — scalers fitted on full data, target encodings from all data. Fit on train, transform test.
  • Overfitting from too many engineered features: adding 100 polynomial features to 50 samples. Use regularization (Lasso, Ridge) alongside feature expansion.
  • Encoding unseen categories: OHE at inference fails on categories not seen during training. Always set handle_unknown='ignore'.
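Several of the pitfalls above (train-test skew, unseen categories) can be avoided by putting all preprocessing in one fitted object. The following is a minimal sketch with invented column names and data; `sparse_threshold=0` is used only to force a dense array so the encoded rows are easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training and test frames (made-up values); city "Z" never appears in training.
train = pd.DataFrame({
    "income": [40.0, 55.0, np.nan, 80.0, 62.0, 48.0],
    "city": ["A", "B", "A", "C", "B", "A"],
    "label": [0, 1, 0, 1, 1, 0],
})
test = pd.DataFrame({"income": [50.0, np.nan], "city": ["B", "Z"]})

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer(
    [
        ("num", numeric, ["income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
clf = Pipeline([("prep", preprocess), ("model", LogisticRegression())])

# All statistics (median, mean, std, category list) come from the training data only.
clf.fit(train[["income", "city"]], train["label"])

proba = clf.predict_proba(test)[:, 1]
encoded_test = clf.named_steps["prep"].transform(test)
print(encoded_test)
print(proba)
```

The unseen city "Z" is encoded as all zeros instead of raising an error, and because the whole pipeline is a single estimator it can be passed to `cross_val_score` or serialized as one deployable artifact.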

House price prediction: raw features (bedrooms, sqft, year_built, lat, lon) → CV RMSE = $45,200. After StandardScaler + OHE for neighborhood: CV RMSE = $38,100. After adding distance_to_downtown + school_score + age features: CV RMSE = $29,800. After adding sqft/bedroom interaction and log(sqft): CV RMSE = $26,400. Total improvement: about 42% lower RMSE from feature engineering versus raw features, with the same LinearRegression model. Switching the algorithm to XGBoost reduces RMSE by only a further $2,300, to $24,100, so feature engineering was the bigger driver.
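The percentages in this example follow directly from the RMSE figures quoted above; the short calculation below only restates that arithmetic, with stage labels chosen for readability.

```python
# CV RMSE values as quoted in the house price example.
cv_rmse = {
    "raw features": 45_200,
    "+ scaling and OHE": 38_100,
    "+ domain features": 29_800,
    "+ interaction and log(sqft)": 26_400,
    "+ XGBoost": 24_100,
}

baseline = cv_rmse["raw features"]
reduction_pct = {
    stage: round(100 * (baseline - rmse) / baseline, 1)
    for stage, rmse in cv_rmse.items()
}
for stage, pct in reduction_pct.items():
    print(f"{stage:<30} RMSE ${cv_rmse[stage]:>6,}  ({pct}% below raw)")

# Share of the total reduction that came before the algorithm switch.
fe_gain = baseline - cv_rmse["+ interaction and log(sqft)"]
total_gain = baseline - cv_rmse["+ XGBoost"]
fe_share_pct = round(100 * fe_gain / total_gain, 1)
print(f"feature engineering share of total reduction: {fe_share_pct}%")
```

Feature engineering accounts for $18,800 of the $21,100 total reduction, which is the basis for calling it the bigger driver.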

13
  • ×Fitting scalers on the full dataset (including test) before train/test split — the most fundamental preprocessing leakage.
  • ×Using OHE for high-cardinality features like zip code (10,000+ values) without considering target encoding or embeddings.
  • ×Imputing with the mean without checking if the feature is skewed — skewed features should use median imputation.
  • ×Not handling unseen categories in test data (categories not in training) — OHE crashes or silently ignores them depending on implementation.
  • ×Not wrapping preprocessing in a sklearn Pipeline — fitting transformers separately and then combining breaks the automatic fit-on-train guarantee in cross-validation.
  • ×Computing target encoding on the full training set before cross-validation — validation fold target values leak into the encodings.
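  • ×Forgetting to impute before scaling: sklearn's StandardScaler ignores NaN values when fitting and passes them through on transform, so missing values reach the model, where many estimators raise an error.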
  • ×Saving only the model object at deployment but not the fitted transformers — inference fails because raw features can't be transformed without the fitted scaler/encoder.
  • ×Saying 'always use StandardScaler' without knowing when RobustScaler or MinMaxScaler is more appropriate.
  • ×Not knowing why target encoding needs to be applied inside CV folds (the leakage mechanism).
  • ×Confusing one-hot encoding (k columns) with dummy encoding (k-1 columns) and not knowing which sklearn uses (OneHotEncoder emits k columns unless the drop parameter is set).
  • ×Not knowing how to handle unseen categories at inference time (handle_unknown parameter in OneHotEncoder).
  • ×Applying feature engineering transformations at inference time using training-set statistics that have drifted — scaler parameters become stale, model predictions degrade silently.
  • ×Creating engineered features that can't be computed at inference time due to availability lag — e.g., 'average price in last 7 days' computed at midnight vs. the model needing it in real-time.
  • ×Not adding a 'missingness indicator' binary feature when imputing — the fact that a value was missing is often predictive and is discarded by simple imputation.
  • ×Leaking geospatial or temporal features computed from all data — e.g., 'city average house price' computed from the full dataset including test-period transactions.
14

What kind of bias does this model have?

Bias depends on model assumptions and feature expressiveness.

What kind of variance does it have?

Variance grows with model flexibility and weak regularization.

How does it overfit?

Overfitting usually appears as strong train performance but weaker validation/test behavior.

How do we regularize it?

Use complexity constraints, robust validation, and data-centric cleanup.

What kind of data does it like?

Prefers representative, low-leakage data with stable feature definitions.

What kind of data breaks it?

Breaks under leakage, severe distribution drift, noisy labels, and poorly engineered features.

14

Quick Revision Reference

  • ALL preprocessing (scaling, encoding, imputation) must be fit on training data ONLY — apply fitted transformers to test
  • Use sklearn Pipeline + ColumnTransformer to enforce this and create a deployable artifact
  • StandardScaler: z = (x - μ)/σ, giving zero mean and unit variance. Strongly recommended for gradient-descent-based models, SVM, and PCA
  • MinMaxScaler: z = (x - min)/(max - min) — sensitive to outliers. RobustScaler: uses median + IQR — robust to outliers
  • OHE: safe for low cardinality (≤ 20). Target encoding: handles any cardinality but must be inside CV folds
  • Impute: median for skewed numerics, mean for Gaussian, KNN for strongly correlated features
  • Polynomial features expand d features to O(d²) at degree 2; use with regularization to prevent overfitting
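StandardScaler: z = (x - μ)/σ
MinMaxScaler: z = (x - min)/(max - min)
RobustScaler: z = (x - median)/IQR
Target Encoding (smoothed): enc(c) = (n_c · mean_c + m · global_mean)/(n_c + m) in one common form, where n_c is the number of rows in category c and m is the smoothing strength
Degree-2 feature count: (d + 1)(d + 2)/2 for d input features, including the bias term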
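The degree-2 feature count can be checked against scikit-learn. The sketch assumes the closed form C(d + 2, 2) = (d + 1)(d + 2)/2 with the bias column included, and 1 + d + C(d, 2) when only interactions are kept; the helper name `degree2_count` is hypothetical.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def degree2_count(d, interaction_only=False):
    """Number of degree-2 output columns for d inputs, including the bias column."""
    if interaction_only:
        return 1 + d + comb(d, 2)  # bias + originals + pairwise products
    return comb(d + 2, 2)          # bias + originals + squares + pairwise products

d = 10
X = np.zeros((1, d))  # values are irrelevant; only the column count matters
n_full = PolynomialFeatures(degree=2).fit(X).n_output_features_
n_inter = PolynomialFeatures(degree=2, interaction_only=True).fit(X).n_output_features_

print(n_full, degree2_count(d))
print(n_inter, degree2_count(d, interaction_only=True))
```

For 10 inputs this gives 66 columns for the full expansion and 56 with `interaction_only=True`, which is why the expansion is usually paired with regularization.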
Use when:
  • All structured tabular datasets: feature engineering is mandatory, not optional
  • Any dataset with missing values, categorical features, or mixed scales
  • Competitive ML: feature engineering is the #1 differentiator in Kaggle competitions
  • Domains with rich domain knowledge that can be encoded into features
Avoid when:
  • Deep learning on images, text, or audio: the network learns features end-to-end
  • When domain knowledge is unavailable and blind transformations add noise rather than signal
  • When the raw feature set already captures the problem perfectly (rare)
Explain why preprocessing must happen inside the CV loop (data leakage)
Know when to use OHE vs. target encoding vs. ordinal encoding
Compare StandardScaler, MinMaxScaler, RobustScaler: formulas and use cases
Explain smoothed target encoding formula and why smoothing is needed
Describe what polynomial features create and the feature count formula
15
16

These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.