In Plain English
Logistic Regression takes a linear combination of features and squashes it through a sigmoid function to output a number between 0 and 1 — a probability. You then choose a threshold (usually 0.5) to make a binary classification decision.
Why It Exists
Linear regression applied to a binary target (0/1) produces predictions outside [0,1] — meaningless as probabilities. We need a model that naturally outputs valid probabilities and whose loss function is compatible with probability estimation under a Bernoulli likelihood.
Problem It Solves
Given labeled examples (input features, binary label), learn a probabilistic model P(y=1|x) = σ(w·x + b) that correctly separates two classes and outputs calibrated probabilities, enabling both hard classification and soft probability ranking.
Real-Life Analogy
"Imagine a doctor assessing whether a patient has a disease. They consider multiple symptoms (each with a weight reflecting importance), mentally sum up the evidence, and then convert that raw score into a probability like '73% chance of disease.' That conversion from a raw score to a probability is exactly the sigmoid function."
When To Use
- Binary classification where you need calibrated probabilities (spam/not, click/no-click)
- When interpretability matters — coefficients have a direct log-odds interpretation
- As a baseline before trying complex models (SVM, gradient boosting, neural networks)
- When the decision boundary is approximately linear in feature space
- When classes are close to linearly separable (if they are perfectly separable, add regularization so the weights stay finite)
- When you need a fast, reliable model with minimal tuning
When NOT To Use
- Relationship between features and log-odds is strongly non-linear
- You have significant class overlap that no linear boundary can separate
- Features are highly correlated without regularization (unstable coefficients)
- Target has more than 2 classes without extending to softmax (multinomial LR)
- You need to model complex feature interactions without engineering them manually
The linear part of the model gives you a score z = w·x + b that ranges from −∞ to +∞. The sigmoid function σ(z) = 1/(1+e^−z) compresses this score into (0, 1). When z is very positive, σ(z) → 1 (the model is very confident it's class 1). When z is very negative, σ(z) → 0 (the model is very confident it's class 0). At z = 0, σ(z) = 0.5: maximum uncertainty. The decision boundary is exactly the hyperplane w·x + b = 0.
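As a quick check of this behaviour, here is a minimal sketch of the sigmoid in plain Python. The function name `sigmoid` and the sample scores are illustrative choices, not taken from any library.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score z to a probability in (0, 1)."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Scores far below zero map near 0, scores far above zero map near 1.
for z in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"z = {z:+.1f} -> sigma(z) = {sigmoid(z):.4f}")
```

The symmetry σ(−z) = 1 − σ(z) is visible in the printed values: the outputs for z and −z always sum to 1.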
The model is trained by Maximum Likelihood Estimation (MLE): we want to find weights w that make the observed labels as probable as possible under our model. This leads to the Binary Cross-Entropy loss, a mathematically principled loss that penalizes confident wrong predictions extremely harshly (due to the log). Predicting p = 0.01 when y = 1 costs −log(0.01) ≈ 4.6, and predicting p = 0.99 when y = 0 costs −log(1 − 0.99) = −log(0.01) ≈ 4.6 as well, so the penalty is symmetric across the two classes.
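The per-example costs quoted here can be reproduced with a few lines of Python. This is a small sketch; the helper name `bce` is an illustrative choice.

```python
import math

def bce(y: int, p: float) -> float:
    """Binary cross-entropy for one example with label y and predicted P(y=1) = p."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

confident_wrong = bce(1, 0.01)   # true label 1, model says 1%: about 4.6
confident_right = bce(1, 0.99)   # true label 1, model says 99%: about 0.01
mirror_case = bce(0, 0.99)       # true label 0, model says 99%: about 4.6

print(f"{confident_wrong:.3f} {confident_right:.3f} {mirror_case:.3f}")
```

The confident mistake costs several hundred times more than the confident correct prediction, which is the asymmetry the log is responsible for.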
The gradient of the cross-entropy loss with respect to w has a surprisingly clean form: (ŷ - y)·x — exactly the same structure as linear regression's gradient. This means logistic regression can be trained with identical gradient descent machinery as linear regression. The only differences are the prediction step (sigmoid) and the loss function (cross-entropy vs. MSE).
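One way to convince yourself of the (ŷ − y)·x form is to compare it with a finite-difference estimate. The sketch below does this for a single example; the weights, bias, and input values are arbitrary illustrative numbers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Cross-entropy for a single example (x, y)."""
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def analytic_grad(w, b, x, y):
    """Closed-form gradient with respect to w: (y_hat - y) * x."""
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return [(p - y) * xj for xj in x]

def numeric_grad(w, b, x, y, h=1e-6):
    """Central finite differences, used only to check the closed form."""
    grad = []
    for j in range(len(w)):
        w_plus = w[:j] + [w[j] + h] + w[j + 1:]
        w_minus = w[:j] + [w[j] - h] + w[j + 1:]
        grad.append((loss(w_plus, b, x, y) - loss(w_minus, b, x, y)) / (2 * h))
    return grad

w, b, x, y = [0.4, -0.7], 0.1, [1.5, 2.0], 1
ga = analytic_grad(w, b, x, y)
gn = numeric_grad(w, b, x, y)
print(ga, gn)
```

Both gradients agree to several decimal places, and both are negative here because the model under-predicts a positive example.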
The Metaphor
"Think of logistic regression as a bouncer at a club. The bouncer has a checklist (features): age, attire, guest list. They add up all the signals (linear combination), and this raw score determines how likely they think you'll get in. But their decision is binary — in or out. The sigmoid maps the raw score to a probability, and the threshold (0.5) determines who gets in. The club's historical data trains the bouncer's instincts."
Beginner Mental Model
Step 1: compute z = w₁x₁ + w₂x₂ + ... + b (same as linear regression). Step 2: apply sigmoid: p = 1/(1+e^−z). Step 3: if p ≥ 0.5, predict class 1; else predict class 0. Training finds the w's that make probabilities high for correct classes and low for wrong ones.
Formal Definition
Given dataset {(x⁽ⁱ⁾, y⁽ⁱ⁾)}ᵢ₌₁ⁿ with x⁽ⁱ⁾ ∈ ℝᵈ and y⁽ⁱ⁾ ∈ {0,1}, logistic regression models P(y=1|x;w,b) = σ(wᵀx + b) where σ(z) = 1/(1+e^−z). Parameters w ∈ ℝᵈ and b ∈ ℝ are found by maximizing the Bernoulli log-likelihood, equivalently minimizing Binary Cross-Entropy: L(w,b) = −(1/n)Σᵢ[y⁽ⁱ⁾log(ŷ⁽ⁱ⁾) + (1−y⁽ⁱ⁾)log(1−ŷ⁽ⁱ⁾)].
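The step from the Bernoulli likelihood to the loss and its gradient can be written out as follows, using the same symbols as the definition above, with ŷ⁽ⁱ⁾ = σ(wᵀx⁽ⁱ⁾ + b):

```latex
\begin{aligned}
\log P(y \mid X; w, b)
  &= \sum_{i=1}^{n} \left[ y^{(i)} \log \hat{y}^{(i)}
     + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right], \\
L(w, b)
  &= -\frac{1}{n} \log P(y \mid X; w, b), \\
\frac{\partial L}{\partial w}
  &= \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}^{(i)} - y^{(i)} \right) x^{(i)},
  \qquad
\frac{\partial L}{\partial b}
   = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}^{(i)} - y^{(i)} \right).
\end{aligned}
```

The gradient simplifies this way because σ'(z) = σ(z)(1 − σ(z)), which cancels the 1/ŷ and 1/(1 − ŷ) factors produced by differentiating the logs.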
Key Terms
- Sigmoid Function σ(z)
- Maps any real number to (0,1): σ(z) = 1/(1+e^−z). Key properties: σ(0) = 0.5, σ'(z) = σ(z)(1−σ(z)), differentiable everywhere, S-shaped curve.
- Log-Odds (Logit)
- logit(p) = log(p/(1−p)) = wᵀx + b. The linear combination models the log of the odds p/(1−p), not the probability itself. The inverse of the logit is the sigmoid.
- Binary Cross-Entropy (BCE) / Log-Loss
- The loss function: −[y·log(ŷ) + (1−y)·log(1−ŷ)]. Derived from negative log-likelihood of a Bernoulli distribution. Penalizes confident wrong predictions logarithmically.
- Decision Boundary
- The hyperplane wᵀx + b = 0 where p = 0.5. Points on one side are classified as class 1, other side as class 0. It is always linear in the original feature space.
- Maximum Likelihood Estimation (MLE)
- Training principle: find w that maximizes P(y|X;w) = Πᵢ ŷᵢ^yᵢ · (1−ŷᵢ)^(1−yᵢ). Taking the log and negating gives the cross-entropy loss. MLE is asymptotically consistent and efficient.
- Softmax
- Generalization of sigmoid to K classes: softmax(zₖ) = e^zₖ / Σⱼ e^zⱼ. Outputs sum to 1. Reduces to sigmoid when K=2. Used in multinomial logistic regression and neural network output layers.
- Calibration
- A model is calibrated if, among samples given predicted probability p, a fraction of about p actually has y=1. Logistic regression tends to be well calibrated when the model is correctly specified; SVMs do not produce probabilities natively, and decision trees often need explicit calibration.
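Two of the relationships in the terms above are easy to verify numerically: the logit inverts the sigmoid, and a two-class softmax over the scores [0, z] gives the same class-1 probability as σ(z). The sketch below assumes nothing beyond the standard library; the score z = 1.3 is arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def softmax(scores):
    """Softmax with max-subtraction for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

z = 1.3
p = sigmoid(z)
round_trip = logit(p)                 # recovers z up to rounding
two_class = softmax([0.0, z])[1]      # class-1 probability from a 2-class softmax

print(p, round_trip, two_class)
```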
Step-by-Step Working
- 1. Collect training data: n samples (x⁽ⁱ⁾, y⁽ⁱ⁾) where y ∈ {0, 1}.
- 2. Initialize weights w = 0 (or small random values), bias b = 0.
- 3. For each training sample (or mini-batch), compute the linear score: z = wᵀx + b.
- 4. Apply sigmoid: ŷ = σ(z) = 1/(1+e^−z). This is the predicted probability P(y=1|x).
- 5. Compute cross-entropy loss averaged over the n samples: L = −(1/n)Σᵢ[yᵢ·log(ŷᵢ) + (1−yᵢ)·log(1−ŷᵢ)].
- 6. Compute gradient: ∂L/∂w = (1/n)Xᵀ(ŷ − y), ∂L/∂b = (1/n)Σ(ŷᵢ − yᵢ).
- 7. Update: w ← w − α·∂L/∂w, b ← b − α·∂L/∂b.
- 8. Repeat until convergence (loss plateaus or max iterations reached).
- 9. Prediction: compute z for new x, apply sigmoid, threshold at 0.5 (or custom threshold).
Inputs
Feature matrix X ∈ ℝⁿˣᵈ (numeric features, encoded categoricals). Binary labels y ∈ {0,1}ⁿ.
Outputs
Probability P(y=1|x) ∈ (0,1) per sample. Binary prediction via threshold: ŷ ∈ {0,1}.
Model Assumptions
Important Edge Cases
- ▸Perfect separation (linearly separable data): MLE has no finite solution — weights grow to infinity, probabilities go to 0/1. Fix: L2 regularization bounds weights.
- ▸Class imbalance: minority class probabilities are underestimated. Fix: adjust threshold, use class_weight='balanced', or upsample minority class.
- ▸Multicollinearity: coefficients are unstable and uninterpretable. Fix: Ridge (L2) regularization (C parameter in sklearn).
- ▸Extreme feature values push z far from 0 (z >> 0 or z << 0), so the sigmoid saturates and the optimization problem becomes badly conditioned; gradient-based training then converges slowly. Fix: feature scaling.
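The perfect-separation case can be seen on a toy problem. The sketch below runs plain gradient descent on a single weight (no bias) for four hand-made 1-D points; the data, learning rate, and penalty strength are illustrative assumptions.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Toy 1-D data that a threshold at x = 0 separates perfectly.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def fit_weight(n_steps, l2=0.0, lr=0.5):
    """Gradient descent on mean BCE + (l2 / 2) * w**2 for a single weight."""
    w = 0.0
    for _ in range(n_steps):
        grad = sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * (grad + l2 * w)
    return w

w_free_1k = fit_weight(1_000)            # no penalty
w_free_2k = fit_weight(2_000)            # no penalty, twice as many steps
w_l2_1k = fit_weight(1_000, l2=0.1)      # with L2 penalty
w_l2_2k = fit_weight(2_000, l2=0.1)      # with L2 penalty, twice as many steps

print(w_free_1k, w_free_2k, w_l2_1k, w_l2_2k)
```

Without the penalty the weight keeps growing as more steps are taken, because every step can still push the probabilities closer to 0 and 1. With the penalty, the weight settles at a finite value and extra steps no longer change it.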
Role in the ML Pipeline
Logistic Regression sits at the end of the preprocessing pipeline (after encoding, scaling, feature selection) and outputs probabilities for downstream decision-making or ranking. It can also be stacked as a meta-learner in ensemble methods.
Data Preprocessing
- 1. Scale features: StandardScaler is essential, since the sigmoid saturates with extreme z values and unscaled features slow convergence.
- 2. Encode categoricals: one-hot encoding for nominal features, ordinal for ordered categories.
- 3. Handle class imbalance: use class_weight='balanced' or SMOTE oversampling for ratios worse than 1:10.
- 4. Handle missing values: impute before fitting (logistic regression has no native missing value handling).
- 5. Remove or regularize multicollinear features: check VIF; apply stronger L2 regularization (C < 1 in sklearn).
- 6. Feature engineering: add polynomial or interaction terms if the decision boundary is non-linear.
Training Process
- 1. Split data: 80/20 hold-out (train_test_split with stratify=y) or stratified k-fold CV (StratifiedKFold), so class ratios are preserved in every split.
- 2. Fit with regularization: LogisticRegression(C=1.0, solver='lbfgs', max_iter=1000).
- 3. Tune C via cross-validation: LogisticRegressionCV or GridSearchCV over [0.001, 0.01, 0.1, 1, 10, 100].
- 4. Evaluate with AUC-ROC (ranking quality) and log-loss (probability calibration quality).
- 5. Inspect coefficients: after StandardScaler, |wⱼ| gives a rough indication of feature importance.
- 6. Calibrate if needed: CalibratedClassifierCV with method='isotonic' or 'sigmoid'.
Hyperparameters
| Name | Description | Typical |
|---|---|---|
| C (inverse regularization strength) | C = 1/λ. Smaller C = stronger L2 regularization = more shrinkage. Default C=1. | 0.01 to 100 (log scale search) |
| penalty | Type of regularization: 'l2' (Ridge), 'l1' (Lasso, sparse), 'elasticnet', or 'none'. | 'l2' for most cases, 'l1' for feature selection |
| solver | Optimization algorithm: 'lbfgs' (default, L-BFGS-B quasi-Newton), 'saga' (supports L1+elasticnet, large datasets), 'liblinear' (small datasets). | 'lbfgs' for most; 'saga' for L1 or large n |
| class_weight | Weights assigned to each class. 'balanced' sets weight inversely proportional to class frequency. | 'balanced' for imbalanced datasets (ratio > 1:5) |
| max_iter | Maximum number of optimization iterations. Increase if solver fails to converge. | 1000 for most datasets; increase if ConvergenceWarning appears |
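To make the 'balanced' setting for class_weight concrete, the sketch below computes weights as n_samples / (n_classes × class count), which is the formula the scikit-learn documentation gives for that mode. The helper name and the 9:1 label vector are illustrative.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights n_samples / (n_classes * count) for each class label."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

y = [0] * 90 + [1] * 10          # illustrative 9:1 imbalance
weights = balanced_class_weights(y)

# After weighting, both classes contribute the same total weight to the loss.
total_weight = {cls: weights[cls] * y.count(cls) for cls in weights}
print(weights, total_weight)
```

With these weights each minority example counts nine times as much as a majority example, so the two classes carry equal total weight in the loss.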
Implementation Checklist
- 1. pip install scikit-learn numpy pandas
- 2. Load data, inspect class distribution with y.value_counts() (y as a pandas Series)
- 3. Preprocess: StandardScaler, encode categoricals, handle NaN (fit the scaler and imputers on the training split only)
- 4. Stratified train/test split: train_test_split(X, y, stratify=y, test_size=0.2)
- 5. Fit: LogisticRegression(C=1.0, max_iter=1000).fit(X_train, y_train)
- 6. Evaluate: roc_auc_score, log_loss, classification_report
- 7. Tune C with LogisticRegressionCV or GridSearchCV
import numpy as np


class LogisticRegression:
    def __init__(self, learning_rate=0.1, n_iterations=1000, C=1.0):
        """
        C: inverse regularization strength (L2). Larger C = weaker regularization.
        The penalty is scaled by 1/n so that C plays the same role as sklearn's C
        (sklearn minimizes C * sum of losses + 0.5 * ||w||^2).
        """
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.C = C  # regularization parameter
        self.weights = None
        self.bias = None
        self.loss_history = []

    @staticmethod
    def sigmoid(z):
        # Numerically stable sigmoid: exp only ever sees non-positive
        # arguments, so it cannot overflow for large |z|.
        e = np.exp(-np.abs(z))
        return np.where(z >= 0, 1 / (1 + e), e / (1 + e))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        self.loss_history = []  # reset so repeated fit() calls start clean
        lam = 1.0 / (self.C * n_samples)  # L2 strength on the mean-loss scale

        for _ in range(self.n_iter):
            # Forward pass
            z = X @ self.weights + self.bias  # (n,)
            y_hat = self.sigmoid(z)           # (n,) predicted probabilities

            # Binary Cross-Entropy loss (with L2 regularization on the weights)
            eps = 1e-15  # numerical stability: avoid log(0)
            y_hat_clipped = np.clip(y_hat, eps, 1 - eps)
            bce = -np.mean(y * np.log(y_hat_clipped) + (1 - y) * np.log(1 - y_hat_clipped))
            l2_penalty = 0.5 * lam * np.sum(self.weights ** 2)
            self.loss_history.append(bce + l2_penalty)

            # Gradients: error times features, plus the penalty term for the weights
            error = y_hat - y  # (n,)
            dw = X.T @ error / n_samples + lam * self.weights
            db = error.sum() / n_samples  # the bias is not regularized

            # Update
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

        return self

    def predict_proba(self, X):
        """Returns probability of class 1 for each sample."""
        z = X @ self.weights + self.bias
        return self.sigmoid(z)

    def predict(self, X, threshold=0.5):
        """Hard classification at given threshold."""
        return (self.predict_proba(X) >= threshold).astype(int)

    def score(self, X, y):
        """Accuracy."""
        return np.mean(self.predict(X) == y)


# ── Demo ──────────────────────────────────────────────────────────────────────
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, n_informative=6,
                           random_state=42)

# Fixed random_state makes the stratified split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the scaler on the training split only, then reuse it for the test split
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model = LogisticRegression(learning_rate=0.1, n_iterations=500, C=1.0)
model.fit(X_train_s, y_train)

print(f"Weights (top 3): {model.weights[:3].round(4)}")
print(f"Accuracy: {model.score(X_test_s, y_test):.4f}")

# Probability predictions for first 5 test samples
probs = model.predict_proba(X_test_s[:5])
print(f"P(y=1) for samples 0-4: {probs.round(3)}")
Sample Input
X = [[0.5, -1.2, 0.8], [-0.3, 0.7, -0.5], [1.2, -0.8, 1.5]]
y = [1, 0, 1]
Sample Output
Weights: [0.8731, -0.6204, 0.9112]
Bias: 0.0412
P(y=1) = [0.791, 0.213, 0.884]
Predictions: [1, 0, 1]
AUC-ROC (5-fold CV): 0.8734 ± 0.0218
Key Implementation Insights
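- →The sigmoid's derivative σ'(z) = σ(z)(1-σ(z)) is maximized at z=0 (value 0.25) and vanishes for |z| >> 0. With cross-entropy this factor cancels in the gradient, which is (ŷ − y)·x, so confidently wrong predictions still receive a large gradient. Feature scaling remains critical because it improves the conditioning of gradient descent and lets the L2 penalty treat all features comparably.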
- →Cross-entropy loss is not bounded above (unlike MSE). A single confident wrong prediction can dominate the average loss. Always monitor per-sample losses during debugging.
- →Logistic regression with L2 regularization (C < ∞) always converges even with perfect separation. Without regularization, weights diverge to ±∞ with perfectly separable data.
- →predict_proba outputs are calibrated by construction under the Bernoulli likelihood assumption. If the model is misspecified (non-linear boundary), recalibrate with CalibratedClassifierCV.
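- →For multiclass targets, recent scikit-learn versions fit a multinomial (softmax) model by default with solver='lbfgs'; older releases defaulted to one-vs-rest (OvR) and needed multi_class='multinomial'. If you specifically want OvR, wrap the estimator in OneVsRestClassifier. Softmax is often better when classes overlap.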
Common Implementation Mistakes
- ✗Not scaling features — sigmoid saturates with large z values, training stalls.
- ✗Using accuracy as the only metric for imbalanced classes — always report AUC-ROC and log-loss.
- ✗Forgetting to stratify train/test split — small minority class can disappear from one split.
- ✗Setting C too large on small datasets — logistic regression overfits without regularization.
- ✗Interpreting coefficients without considering multicollinearity — correlated features have unreliable individual coefficients.
Small Tabular Dataset (< 1K rows)
Works well with regularization (L2). Small n requires careful cross-validation (leave-one-out or 10-fold stratified CV). Coefficients may be noisy without large n.
Large Tabular Dataset (> 1M rows)
Scales linearly with n via gradient descent. SGD variants (for example SGDClassifier with loss='log_loss', trained incrementally with partial_fit) handle streaming/online data and very large sample counts. Very fast inference: O(d) per prediction.
Imbalanced Classes (1:100 ratio)
Naturally biases toward majority class without adjustment. With class_weight='balanced' or adjusted thresholds, handles imbalance well. AUC-ROC is the right metric, not accuracy.
High-Dimensional Data (d >> n)
Without regularization, coefficients are undefined (infinite solutions). With strong L1 regularization (Lasso), performs feature selection and can handle d > n reasonably.
Non-linearly Separable Data
Decision boundary is always linear. XOR, concentric circles, and other non-linear patterns require feature engineering (polynomial features) or a non-linear model.
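The XOR case shows both halves of this statement. In the sketch below, a brute-force search over a small grid of weights finds no linear rule on (x1, x2) that classifies all four XOR points, while a hand-picked (not fitted) linear rule on the expanded features [x1, x2, x1·x2] classifies them all. The grid and the weights are illustrative choices.

```python
def xor_points():
    """The four XOR inputs with their labels."""
    return [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def linear_rule(features, weights, bias):
    """Predict 1 when the linear score w.x + b is non-negative."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score >= 0 else 0

# 1) Raw features: search a grid of (w1, w2, b) for a perfect linear rule.
grid = [i * 0.5 for i in range(-6, 7)]
any_linear_fit = any(
    all(linear_rule((x1, x2), (w1, w2), b) == label
        for (x1, x2), label in xor_points())
    for w1 in grid for w2 in grid for b in grid
)

# 2) Expanded features [x1, x2, x1*x2] with hand-picked weights.
weights, bias = (1.0, 1.0, -2.0), -0.5
predictions = [
    linear_rule((x1, x2, x1 * x2), weights, bias)
    for (x1, x2), _ in xor_points()
]
labels = [label for _, label in xor_points()]

print(any_linear_fit, predictions, labels)
```

The grid search is only a demonstration, not a proof, but the result matches the known fact that XOR is not linearly separable in the raw inputs.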
Text / Bag-of-Words Features
Historically one of the best models for text classification with TF-IDF features. Sparse, high-dimensional, but linear separability often holds. Fast to train, interpretable coefficients.
Interactive: Sigmoid, Decision Threshold, Confusion Matrix
Sigmoid Function and Decision Threshold
The sigmoid σ(z) maps the linear score z to probability. As z increases from −6 to +6, probability transitions smoothly from 0 to 1. The decision boundary at z=0 gives p=0.5.
Binary Cross-Entropy Loss vs. Predicted Probability
Shows how BCE loss penalizes confidence. For y=1, loss = −log(p): predicting p=0.1 costs 2.3 but predicting p=0.9 costs only 0.1. The logarithmic penalty strongly discourages confident wrong predictions.
Training Log-Loss Convergence
Binary cross-entropy loss decreasing over gradient descent iterations. A smooth, monotonically decreasing curve indicates good learning rate choice. Oscillations indicate learning rate too high.
Advantages
Outputs calibrated probabilities
Unlike SVMs or trees, logistic regression directly models P(y=1|x) as a proper probability. These probabilities are well-calibrated under correct model specification — critical for risk scoring, medical diagnosis, and any application needing uncertainty quantification.
Highly interpretable coefficients
Each coefficient wⱼ corresponds to an odds ratio e^wⱼ: a unit increase in xⱼ multiplies the odds of class 1 by e^wⱼ. This is legally defensible for credit decisions and medically auditable for clinical tools.
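The odds-ratio reading can be checked numerically. In the sketch below, the coefficient value 0.7 and the baseline scores are hypothetical; the point is that a one-unit increase in xⱼ multiplies the odds by the same factor e^wⱼ at every baseline, while the change in probability depends on where you start.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

w_j = 0.7                          # hypothetical coefficient for feature x_j
odds_multiplier = math.exp(w_j)    # about 2.01

ratios = []
for base_score in (-3.0, 0.0, 3.0):
    p_before = sigmoid(base_score)
    p_after = sigmoid(base_score + w_j)   # x_j increased by one unit
    ratio = odds(p_after) / odds(p_before)
    ratios.append(ratio)
    print(f"p: {p_before:.3f} -> {p_after:.3f}, odds ratio = {ratio:.3f}")
```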
Extremely fast training and inference
Training is O(nd) per gradient step — linear in both n and d. Inference is O(d) per prediction. Trained models are trivially small (one float per feature). Deployable in microseconds on any hardware.
No hyperparameter tuning required for quick baseline
Default settings (C=1, L2 regularization, lbfgs) work well on most clean, scaled datasets. A useful baseline is runnable in 3 lines of sklearn code. Contrast with neural networks, GBT, or SVMs which require extensive tuning.
Works excellently on high-dimensional sparse data
With L1 regularization, logistic regression performs automatic feature selection and scales to millions of features (text, genomics). This is why it remains a top choice for NLP bag-of-words classification.
Convex loss function — guaranteed global optimum
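Binary cross-entropy is convex in the weights, and strictly convex once an L2 penalty is added, so there are no spurious local minima or saddle points. With L2 regularization the global minimum is unique, and gradient descent with a suitable learning rate converges to it. Training is reliable and reproducible.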
Limitations
Strictly linear decision boundary
The decision boundary is always a hyperplane: wᵀx + b = 0. Cannot model XOR, circles, spirals, or any non-linear class separation without manual feature engineering. This fundamentally limits expressiveness.
Fails with perfect class separation
When training data is perfectly linearly separable, MLE has no finite solution — weights diverge to ±∞ as the model tries to push probabilities to exactly 0 and 1. L2 regularization prevents this but requires careful tuning.
Sensitive to feature scale and outliers in feature space
Extreme feature values push z far from 0, saturating the sigmoid and slowing gradient-based training. Outliers with extreme features can disproportionately influence the decision boundary.
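Additive in log-odds: no automatic feature interactions
Unlike Naive Bayes, logistic regression does not assume that features are conditionally independent given the class. It does assume that each feature contributes additively to the log-odds, so it struggles when features have complex interactions that matter for classification. Interaction terms must be manually engineered.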
Requires feature engineering for non-linear relationships
To model a curved decision boundary, you must explicitly add polynomial features, radial basis functions, or other transformations. This requires domain knowledge and increases the feature space dramatically.
Credit default scoring
Model P(default) from income, debt-to-income ratio, credit history, age. Regulatory compliance (Basel III, GDPR right-to-explanation) demands interpretable coefficients. Logistic regression is the industry-standard 'scorecard' model.
Disease probability estimation
Predict P(disease|symptoms, lab results, demographics). Coefficients translate directly into statements clinicians can audit, for example (illustrative numbers): 'A 10-unit increase in PSA level multiplies prostate cancer odds by 1.8'. Explainability of this kind is often expected in regulated clinical settings.
Click-through rate (CTR) prediction
Predict P(click|user features, ad features, context). Served billions of times per day — linear inference speed is mandatory. Google and Meta historically used FTRL-optimized logistic regression at massive scale.
Spam and phishing detection
Classify emails as spam/ham based on word frequencies, sender reputation, URL features. Bag-of-words + L1 logistic regression is a classic, interpretable baseline. Coefficients identify the most spam-predictive keywords.
Purchase conversion prediction
Estimate P(purchase|session features, user history, product attributes). Powers real-time personalization — must be fast. Predicted probabilities are used to rank products and personalize email timing.
A/B test outcome modeling
Model conversion rates as a function of variant assignment and user covariates. Logistic regression with interaction terms captures heterogeneous treatment effects (which user segments respond better to variant B).
Logistic regression is the linear classifier that all other classifiers are compared against. Understanding its trade-offs is essential for model selection.
Support Vector Machine (SVM)
Similarity
Both find a linear decision boundary for binary classification
Key Difference
SVM maximizes the margin between classes using a hinge loss. Doesn't output probabilities natively (needs Platt scaling). Better with small datasets and clear margins; doesn't scale as well to n > 100K.
Choose When
When data has small n with large margin separation; kernel SVM for non-linear boundaries with small data.
Decision Tree
Similarity
Both classify binary targets
Key Difference
Tree splits on one feature at a time — non-linear, axis-aligned boundaries. No probability calibration. Interpretable via tree visualization but unstable (high variance). Can overfit without pruning.
Choose When
When features interact strongly and non-linearly; when decision rules must be visualizable as if-then-else logic.
Random Forest
Similarity
Both output probabilities for binary classification
Key Difference
Ensemble of trees — non-linear, robust, handles interactions natively. Not interpretable at coefficient level. Slower training and inference. Requires more hyperparameter tuning.
Choose When
When logistic regression underfits (non-linear patterns); when you don't need single-coefficient interpretability.
Naive Bayes
Similarity
Both linear classifiers (in log space) for binary/multiclass classification
Key Difference
Naive Bayes assumes feature independence given the class label and models P(xⱼ|y) directly (generative). Logistic regression is discriminative (models P(y|x) directly). LR typically outperforms NB with enough data.
Choose When
Naive Bayes for tiny datasets or text classification when speed is paramount and features are truly independent.
| Property | Logistic Reg. | SVM | Random Forest | Naive Bayes |
|---|---|---|---|---|
| Calibrated probabilities | ✓ Yes | ✗ (needs Platt) | Partially | ✓ Often |
| Linear boundary | ✓ Yes | ✓ Yes (linear kernel) | ✗ No | ✓ Yes |
| Interpretable | ✓ Coeff. | Partial (support vectors) | ✗ Limited | ✓ Prior/likelihood |
| Handles non-linearity | ✗ No | ✓ Kernel trick | ✓ Yes | ✗ No |
| Training speed | ⚡ Fast | 🐢 Slow (RBF) | 🐢 Moderate | ⚡ Very fast |
| Requires feature scaling | ✓ Yes | ✓ Yes | ✗ No | ✗ No |
Choose Logistic Regression when:
You need calibrated probabilities, interpretable coefficients, a fast baseline, or are working with sparse high-dimensional features (text). Default first choice for binary classification before trying complex models.
AUC-ROC (Area Under ROC Curve)
Probability that a randomly chosen positive sample is scored higher than a randomly chosen negative. AUC=0.5 is random, AUC=1 is perfect. Threshold-invariant — measures ranking quality.
Target: > 0.8 is typically good; > 0.9 is excellent
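The pairwise definition of AUC can be computed directly on a small example. The sketch below is a quadratic-time illustration for intuition only (library implementations use a sorted-rank approach); the labels and scores are made up.

```python
def auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. O(n_pos * n_neg), so only suitable for small inputs.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]      # illustrative predicted probabilities
auc = auc_pairwise(y_true, scores)

# Any strictly increasing transform of the scores leaves the ranking unchanged.
auc_squared = auc_pairwise(y_true, [s ** 2 for s in scores])
print(auc, auc_squared)
```

Three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75, and squaring the scores does not change it. That is what threshold-invariant means in practice.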
Log-Loss (Binary Cross-Entropy)
Measures probability calibration quality. Log-loss = 0.693 is the random baseline (always predict 0.5). Lower is better. Sensitive to confident wrong predictions.
Target: < 0.3 is good for well-separated classes; compare to random baseline 0.693
Precision / Recall / F1
Precision: of predicted positives, how many are truly positive. Recall: of actual positives, how many did we catch. F1 balances both. Use for imbalanced classes where accuracy is misleading.
Target: Domain-dependent; tune threshold to balance precision vs. recall for your cost structure
Calibration (Brier Score)
Mean squared error between predicted probabilities and actual outcomes. A calibrated model has predicted probability 0.7 truly corresponding to 70% positive rate. Brier score 0 = perfect, 0.25 = random.
Target: < 0.1 indicates good calibration; check with reliability diagram (calibration curve)
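The two baselines quoted above (log-loss 0.693 and Brier score 0.25 for always predicting 0.5) follow directly from the definitions. The sketch below checks them; the labels and the "informed" probabilities are made-up illustrative values.

```python
import math

def log_loss(y_true, probs):
    """Mean binary cross-entropy."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1.0 - p)
        for y, p in zip(y_true, probs)
    ) / len(y_true)

def brier_score(y_true, probs):
    """Mean squared error between predicted probabilities and outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)

y_true = [1, 0, 1, 1, 0, 0]                     # illustrative labels
uninformed = [0.5] * len(y_true)                # always predict 0.5
informed = [0.9, 0.2, 0.7, 0.8, 0.1, 0.3]       # hypothetical model output

baseline_ll = log_loss(y_true, uninformed)      # ln(2), about 0.693
baseline_bs = brier_score(y_true, uninformed)   # 0.25
model_ll = log_loss(y_true, informed)
model_bs = brier_score(y_true, informed)
print(baseline_ll, baseline_bs, model_ll, model_bs)
```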
Evaluation Process
- 1. Use stratified k-fold CV (k=5 or 10), which preserves class distribution in every fold.
- 2. Report AUC-ROC as primary metric (threshold-invariant ranking quality).
- 3. Report log-loss to assess probability calibration.
- 4. Plot ROC curve and Precision-Recall curve (PR curve better for imbalanced data).
- 5. Choose decision threshold based on business cost: F-beta score with β weighting precision vs. recall.
- 6. Plot calibration curve (reliability diagram): compare predicted probability deciles to actual positive rates.
Evaluation Traps
- ▸Using accuracy as the sole metric — trivially 99% accuracy on 99:1 imbalanced data by predicting all negatives.
- ▸Evaluating AUC on training data — always use held-out or CV AUC.
- ▸Not checking calibration — high AUC doesn't guarantee well-calibrated probabilities.
- ▸Optimizing for the wrong threshold — default 0.5 is rarely optimal; tune based on false positive vs. false negative cost.
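The last trap can be made concrete by picking the threshold that minimizes total misclassification cost on validation data. The sketch below uses made-up labels and probabilities, and it assumes a missed positive costs five times as much as a false alarm; all names and numbers are illustrative.

```python
def expected_cost(y_true, probs, threshold, cost_fp, cost_fn):
    """Total cost of the decisions made at a given threshold."""
    cost = 0.0
    for y, p in zip(y_true, probs):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 0:
            cost += cost_fp      # false alarm
        elif pred == 0 and y == 1:
            cost += cost_fn      # missed positive
    return cost

# Illustrative validation labels and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]

# Assumed costs: a missed positive is five times worse than a false alarm.
COST_FP, COST_FN = 1.0, 5.0

candidates = [i / 20 for i in range(1, 20)]   # 0.05, 0.10, ..., 0.95
best_threshold = min(
    candidates,
    key=lambda t: expected_cost(y_true, probs, t, COST_FP, COST_FN),
)
cost_at_default = expected_cost(y_true, probs, 0.5, COST_FP, COST_FN)
cost_at_best = expected_cost(y_true, probs, best_threshold, COST_FP, COST_FN)
print(best_threshold, cost_at_default, cost_at_best)
```

On this toy data the cost-minimizing threshold is below 0.5, because lowering it recovers a positive that the default threshold misses at the price of one extra false alarm.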
Real-World Interpretation Example
Credit default model: AUC-ROC = 0.84, Log-Loss = 0.31, Precision = 0.71, Recall = 0.68, F1 = 0.69. Interpretation: the model ranks a randomly chosen default above a randomly chosen non-default 84% of the time. Log-loss is well below the random baseline (0.693). At the 0.5 threshold, it catches 68% of actual defaults with 71% precision, meaning 29% of 'predicted defaults' are false alarms. For a high-cost loan, you'd lower the threshold to increase recall at the cost of more false alarms.
Students
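- ×Applying a binary logistic regression to a multi-class problem: extend it to multinomial (softmax) regression or one-vs-rest. Recent scikit-learn versions fit a multinomial model automatically for multiclass targets with the default 'lbfgs' solver.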
- ×Interpreting coefficients as probabilities instead of log-odds — a coefficient of 2.0 means the odds multiply by e²≈7.4, not the probability increases by 200%.
- ×Not standardizing features before fitting — sigmoid saturates, gradients vanish, coefficients are uncomparable.
- ×Using accuracy for model selection with imbalanced classes — always use AUC-ROC or F1.
Developers
- ×Fitting StandardScaler on train+test combined before splitting — data leakage that inflates performance metrics.
- ×Setting max_iter too low — ConvergenceWarning means the model hasn't converged, weights are suboptimal.
- ×Ignoring class_weight for imbalanced data — model predicts majority class almost exclusively.
- ×Using solver='lbfgs' with penalty='l1' — lbfgs doesn't support L1; use 'saga' or 'liblinear'.
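The scaling leak in the first item of this list is easy to see on a tiny example. The sketch below standardizes with statistics taken from the training rows only, then shows how the statistics shift if the test row is included. It uses the population standard deviation, and the data and helper names are illustrative.

```python
import statistics

def fit_scaler(rows):
    """Per-column mean and population standard deviation from the given rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]  # guard constant columns
    return means, stds

def transform(rows, means, stds):
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Illustrative split; the test row contains an extreme value.
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test = [[100.0, 25.0]]

# Correct: statistics come from the training rows only.
means, stds = fit_scaler(train)
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)

# Leaky: statistics computed on train + test shift the training features.
leaky_means, leaky_stds = fit_scaler(train + test)
print(means, leaky_means)
```

In the leaky version the mean of the first column jumps from 2.0 to 26.5, so information about the test row has already shaped the features the model trains on.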
In Interviews
- ×Saying logistic regression outputs a class directly — it outputs a probability; the class comes from a threshold.
- ×Not knowing that BCE loss is derived from MLE — saying 'we chose cross-entropy because it works well' misses the probabilistic foundation.
- ×Confusing logistic regression with linear regression applied to classification — the key difference is the sigmoid and the loss function.
- ×Not knowing what 'log-odds' means — interviewers test whether you can interpret coefficients properly.
Real Projects
- ×Deploying without probability calibration check — in recommendation or risk systems, uncalibrated probabilities lead to bad decisions.
- ×Not handling perfect separation: if training data has a feature that perfectly separates the classes and regularization is weak or disabled (very large C), sklearn may not warn you and just return very large weights.
- ×Using logistic regression when the positive rate changes significantly over time — probability outputs become miscalibrated as distribution shifts; retrain regularly.
- ×Not logging predicted probabilities in production — impossible to monitor calibration drift without probability logs.
What kind of bias does this model have?
Linear assumptions create bias when relationships are strongly non-linear.
What kind of variance does it have?
Usually lower variance than high-capacity non-linear models.
How does it overfit?
Overfitting usually appears as strong train performance but weaker validation/test behavior.
How do we regularize it?
Use L1/L2 regularization, feature pruning, and stronger validation controls.
What kind of data does it like?
Works best with clean, informative features and stable train/serve distributions.
What kind of data breaks it?
Breaks under leakage, severe distribution drift, noisy labels, and poorly engineered features.
Quick Revision Reference
Key Takeaways
- Models P(y=1|x) = σ(wᵀx + b) where σ is the sigmoid function
- Trained by minimizing Binary Cross-Entropy = −(1/n)Σ[y·log(ŷ) + (1−y)·log(1−ŷ)]
- Gradient is clean: ∂L/∂w = (1/n)Xᵀ(ŷ − y) — same structure as linear regression
- Decision boundary is linear: wᵀx + b = 0
- Coefficient wⱼ = log-odds ratio; e^wⱼ = odds multiplier per unit increase in xⱼ
- Always use L2 regularization (C parameter) to prevent divergence with separable data
- Use AUC-ROC + log-loss for evaluation; not accuracy alone
- Softmax extends logistic regression to K classes
Critical Formulas
Best For
- ✓Binary classification with interpretability requirement
- ✓Calibrated probability outputs for risk/decision systems
- ✓High-dimensional sparse data (text, genomics) with L1
- ✓Fast production baseline before complex models
Avoid When
- ✗Non-linear decision boundary required
- ✗Severe class overlap or XOR-type patterns
- ✗High-dimensional data without regularization
- ✗Perfect class separation in training data without L2
Interview Must-Know
These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.