ML Atlas

Gradient Descent

Navigate a loss landscape by always stepping downhill.

IntermediateOptimizationMath Heavy
32 min read
Calculus: partial derivatives and the chain ruleLinear algebra: vectors, dot products, matrix multiplicationLoss functions: MSE for regression, BCE for classificationPython and NumPy basics
  • Training every neural network ever built — from LeNet (1989) to GPT-4 (2023)
  • Logistic regression at Google scale using FTRL (Follow The Regularized Leader, a variant of online SGD)
  • Matrix factorization in Netflix and Spotify recommendation systems via SGD
  • Physics simulation and inverse problem solving using gradient-based optimization
  • Hyperparameter optimization with gradient-based methods (DARTS, gradient-based NAS)
01

In Plain English

Gradient descent is an iterative algorithm that finds the minimum of a function by repeatedly computing the slope (gradient) at the current point and taking a small step in the opposite direction — downhill. Applied to ML, the function is the loss, and the variables are model parameters.

Why It Exists

Most ML loss functions have no closed-form analytical solution (neural networks, logistic regression with regularization on huge datasets, etc.). We need a general-purpose method that can minimize any differentiable function without needing to solve equations analytically.

Problem It Solves

Given a loss function L(θ) that measures how wrong a model is, find the parameter vector θ* = argmin L(θ). Gradient descent does this iteratively without requiring L to be analytically solvable.

Real-Life Analogy

"Imagine you're blindfolded on a hilly landscape and want to reach the lowest valley. You can feel the slope of the ground under your feet. Gradient descent: at every step, sense which direction is steepest downhill, take one step that direction, repeat. Learning rate = how large each step is. The minimum = when the ground is flat (gradient = 0)."

When To Use

  • Training any neural network (no alternative exists for arbitrary architectures)
  • Logistic regression with regularization when dataset is large (n > 100K)
  • Any differentiable loss function where a closed-form solution doesn't exist
  • Online learning scenarios where data arrives in streams
  • Optimization over continuous parameter spaces in reinforcement learning
  • When memory constraints prevent computing over the full dataset at once

When NOT To Use

  • Closed-form solution exists and is cheap (e.g., OLS for linear regression with moderate d)
  • Loss function is not differentiable or gradient is unreliable (use evolutionary algorithms or Bayesian optimization)
  • Extremely small datasets where exact methods are fast
  • Discrete parameter spaces (use genetic algorithms or reinforcement learning)
02

The gradient ∇L(θ) is a vector pointing in the direction of steepest ascent of the loss — toward worse parameters. Subtracting it moves us toward better parameters. The key insight: we don't need to know the shape of the whole loss landscape — just the local slope at our current position. This local information, accumulated over thousands of steps, finds the minimum.

Batch gradient descent uses all n samples to compute a single gradient — expensive but accurate. Stochastic gradient descent (SGD) uses a single random sample per step — cheap and noisy but with a crucial property: the noise can help escape local minima and saddle points. Mini-batch GD is the practical compromise: batches of 32–512 samples give stable gradients while being parallelizable on GPU hardware.

The learning rate α is the most important hyperparameter. Too small: convergence is painfully slow (thousands of epochs). Too large: the update overshoots the minimum and the loss oscillates or diverges. Adaptive methods (Adam, RMSProp) solve this by automatically scaling the learning rate per-parameter based on historical gradient information — effectively learning a good learning rate from data.

The Metaphor

"Think of parameter space as a terrain map, and the loss as altitude. Gradient descent is a hiker trying to reach the lowest point in a fog (can only see the local slope, not the global map). Batch GD: takes careful, precise steps using a GPS that averages many satellite readings. SGD: takes fast, slightly drunk steps using a single noisy reading. Adam: a hiker with memory — remembers which directions have been productive recently and takes larger steps in consistent downhill directions."

Beginner Mental Model

At each step: (1) Compute the gradient ∇L — which direction makes the loss go up? (2) Subtract α × gradient from current parameters — move in the opposite (downhill) direction. (3) Repeat until the gradient is nearly zero (minimum found). That's all of gradient descent. The variants (SGD, momentum, Adam) all follow this same loop with modifications to how the gradient or step size is computed.

03

Given a differentiable loss function L: ℝᵈ → ℝ and initial parameters θ₀, gradient descent generates a sequence θ₁, θ₂, ... via the update rule θₜ₊₁ = θₜ − α·∇L(θₜ), where α > 0 is the learning rate and ∇L(θₜ) ∈ ℝᵈ is the gradient of L at θₜ. Under appropriate conditions (L is convex and α < 2/L_smooth where L_smooth is the Lipschitz constant of ∇L), the sequence converges to the global minimum.

Gradient ∇L(θ)
Vector of partial derivatives: [∂L/∂θ₁, ∂L/∂θ₂, ..., ∂L/∂θd]. Points in the direction of steepest ascent. Has the same dimensionality as θ.
Learning Rate (α)
Scalar that controls step size. Too small → slow convergence. Too large → divergence. Typically in [1e-4, 0.1]. Must be tuned or scheduled.
Epoch
One complete pass through the entire training dataset. For mini-batch GD, one epoch = n/batch_size gradient steps.
Batch Size
Number of samples used to estimate the gradient at each step. Batch GD: b=n. SGD: b=1. Mini-batch: b∈[32, 512]. Affects noise level and GPU utilization.
Momentum
Running average of past gradients added to the current update. Dampens oscillations, accelerates in consistent directions, helps navigate ravines in the loss landscape.
Convergence
The sequence θₜ is said to converge when ||∇L(θₜ)||₂ < ε (gradient is nearly zero). For convex L, convergence to the global minimum is guaranteed. For non-convex L, convergence is to a stationary point (possibly a saddle point or local minimum).
Loss Landscape
The surface defined by L(θ) over all possible parameter values. Convex: one bowl, one minimum. Non-convex (neural networks): many minima, saddle points, plateaus. Saddle points are more problematic than local minima in high-dimensional spaces.
  1. 1. Initialize parameters θ (typically small random values or zeros).
  2. 2. Set learning rate α (start with 0.01 or 0.001 for neural networks; 0.1 for logistic regression).
  3. 3. Shuffle training data (important for SGD/mini-batch to break data order patterns).
  4. 4. Sample a mini-batch of size b from the training data.
  5. 5. Compute forward pass: get predictions ŷ = model(X_batch).
  6. 6. Compute loss L = loss_function(ŷ, y_batch).
  7. 7. Compute gradient: ∇L via backpropagation (chain rule through the computation graph).
  8. 8. Update parameters: θ ← θ − α·∇L.
  9. 9. Repeat steps 4–8 for all batches (one epoch).
  10. 10. Repeat for multiple epochs. Monitor validation loss to detect overfitting and stop when it starts rising (early stopping).

Initial parameters θ₀ ∈ ℝᵈ, training data {(xᵢ, yᵢ)}, loss function L, learning rate α, number of epochs, batch size.

Optimized parameters θ* ≈ argmin L(θ). Loss history for diagnostics.

01L is differentiable (or sub-differentiable) at the current parameters.
02Gradient computation is affordable (via backpropagation for neural nets, analytic formula for linear models).
03The gradient is an unbiased estimator of the true gradient (for SGD: E[∇L_batch] = ∇L_full).
04Learning rate is set appropriately — too large violates the convergence conditions.
05For convergence guarantees: L is L-smooth (gradient is Lipschitz continuous) and μ-strongly convex for exponential convergence.
  • Vanishing gradients: gradients become essentially zero deep in neural networks (sigmoid, tanh saturation). Fix: use ReLU activations, batch normalization, or residual connections.
  • Exploding gradients: gradients grow exponentially in RNNs with long sequences. Fix: gradient clipping (clip ||∇||₂ to a maximum value).
  • Saddle points: gradient is zero but it's not a minimum (some directions curve up, others down). In high-d spaces, most stationary points are saddle points, not local minima. SGD noise helps escape them.
  • Ill-conditioned loss landscape: very different curvatures in different directions (ravines). Gradient bounces between ravine walls. Fix: adaptive methods (Adam, RMSProp) or second-order methods (L-BFGS).
04

Gradient descent is the optimization engine that sits inside the training loop of virtually every ML model. It is called during model.fit() and operates on the computational graph of the loss with respect to parameters. It is not a model — it's the procedure that trains models.

  • 01.Feature scaling is mandatory: gradient descent converges dramatically faster when features are on similar scales (loss landscape has circular rather than elliptical contours).
  • 02.Data shuffling: shuffle training data before each epoch to prevent the optimizer from learning the data order instead of the task.
  • 03.Batch construction: for mini-batch GD, ensure batches are randomly drawn (not sequential) to get unbiased gradient estimates.
  • 04.Gradient clipping: for RNNs and transformers, precompute gradient norm and clip to prevent exploding gradients.
  • 01.Initialize parameters: He initialization for ReLU networks, Xavier for tanh/sigmoid, zeros for bias terms.
  • 02.Set up a learning rate schedule: constant for simple models, warmup + decay for transformers, cosine annealing for vision models.
  • 03.Run training loop: forward pass → loss → backward pass → parameter update.
  • 04.Monitor training and validation loss every epoch — stop early if validation loss rises for k consecutive epochs.
  • 05.Log gradient norms: if ||∇L||₂ is very small from the start, re-check initialization; if growing, add gradient clipping.
  • 06.After training: inspect loss curve for signs of divergence, oscillation (LR too high), or premature plateauing (LR too low).

Learning Rate (α)

The single most important hyperparameter. Controls step size in parameter space.

0.001 for Adam; 0.01–0.1 for SGD with momentum on neural nets; 0.1 for logistic regression

Batch Size

Number of samples per gradient update. Affects gradient noise and GPU memory usage.

32–256 for most deep learning; 512–4096 for large-batch training with LR scaling

Momentum (β₁ in Adam)

Exponential moving average coefficient for gradients (or first moment in Adam).

0.9 for SGD with momentum; 0.9 for Adam β₁

Number of Epochs

How many times to pass through the entire training dataset.

100–1000 for neural nets; 500–2000 for logistic regression; use early stopping rather than fixed count

Learning Rate Schedule

Strategy for changing α over training: constant, step decay, cosine annealing, warmup+decay.

Warmup for 5-10% of training + cosine decay for transformers; cosine annealing for CNNs

  1. 1Choose optimizer: SGD+momentum for CNNs (often better final accuracy), Adam for transformers/NLP (faster convergence)
  2. 2Set learning rate: use LR finder or literature values as starting point
  3. 3Set batch size based on GPU memory (maximize utilization without OOM)
  4. 4Implement early stopping with patience=10-20 epochs
  5. 5Log training/validation loss and gradient norms every epoch
  6. 6Use gradient clipping (max_norm=1.0) for RNNs and transformers
05
06
python
1import numpy as np
2
3# ── Loss function: Rosenbrock (non-convex, hard to optimize — good test) ──────
4def rosenbrock(theta):
5    """Classic test function: f(x,y) = (1-x)² + 100(y-x²)². Minimum at (1,1)."""
6    x, y = theta
7    return (1 - x)**2 + 100 * (y - x**2)**2
8
9def rosenbrock_grad(theta):
10    x, y = theta
11    dx = -2*(1 - x) - 400*x*(y - x**2)
12    dy = 200*(y - x**2)
13    return np.array([dx, dy])
14
15
16# ── Vanilla Gradient Descent ───────────────────────────────────────────────────
17def gradient_descent(grad_fn, theta0, lr=0.001, n_steps=5000):
18    theta = theta0.copy().astype(float)
19    history = [theta.copy()]
20    for _ in range(n_steps):
21        g = grad_fn(theta)
22        theta -= lr * g
23        history.append(theta.copy())
24    return theta, history
25
26
27# ── SGD with Momentum ──────────────────────────────────────────────────────────
28def sgd_momentum(grad_fn, theta0, lr=0.001, beta=0.9, n_steps=5000):
29    theta = theta0.copy().astype(float)
30    v = np.zeros_like(theta)
31    history = [theta.copy()]
32    for _ in range(n_steps):
33        g = grad_fn(theta)
34        v = beta * v + (1 - beta) * g       # EMA of gradients
35        theta -= lr * v
36        history.append(theta.copy())
37    return theta, history
38
39
40# ── RMSProp ────────────────────────────────────────────────────────────────────
41def rmsprop(grad_fn, theta0, lr=0.01, rho=0.9, eps=1e-8, n_steps=5000):
42    theta = theta0.copy().astype(float)
43    s = np.zeros_like(theta)
44    history = [theta.copy()]
45    for _ in range(n_steps):
46        g = grad_fn(theta)
47        s = rho * s + (1 - rho) * g**2       # EMA of squared gradients
48        theta -= lr * g / (np.sqrt(s) + eps)  # adaptive per-param LR
49        history.append(theta.copy())
50    return theta, history
51
52
53# ── Adam ───────────────────────────────────────────────────────────────────────
54def adam(grad_fn, theta0, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=5000):
55    theta = theta0.copy().astype(float)
56    m = np.zeros_like(theta)      # first moment (momentum)
57    v = np.zeros_like(theta)      # second moment (RMSProp)
58    history = [theta.copy()]
59    for t in range(1, n_steps + 1):
60        g = grad_fn(theta)
61
62        # Update biased moments
63        m = beta1 * m + (1 - beta1) * g
64        v = beta2 * v + (1 - beta2) * g**2
65
66        # Bias correction — critical in early steps
67        m_hat = m / (1 - beta1**t)
68        v_hat = v / (1 - beta2**t)
69
70        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
71        history.append(theta.copy())
72    return theta, history
73
74
75# ── Compare all optimizers on Rosenbrock ──────────────────────────────────────
76theta0 = np.array([-1.0, 1.0])   # Starting far from the minimum at (1, 1)
77
78results = {
79    "Vanilla GD":  gradient_descent(rosenbrock_grad, theta0, lr=0.0005, n_steps=10000),
80    "Momentum":    sgd_momentum(rosenbrock_grad, theta0, lr=0.001, n_steps=10000),
81    "RMSProp":     rmsprop(rosenbrock_grad, theta0, lr=0.005, n_steps=10000),
82    "Adam":        adam(rosenbrock_grad, theta0, lr=0.05, n_steps=10000),
83}
84
85print("Final parameters (target: [1.0, 1.0])")
86for name, (theta_final, _) in results.items():
87    loss = rosenbrock(theta_final)
88    print(f"  {name:12s}: θ = {theta_final.round(4)},  Loss = {loss:.6f}")
89
90# Expected output:
91# Adam and RMSProp converge close to [1.0, 1.0], vanilla GD may still be far
The Rosenbrock function is a classic challenging test because it has a narrow curved valley — gradient descent oscillates across the valley while slowly making progress along it. This is exactly the scenario where momentum and adaptive methods shine. Adam typically converges in fewer steps but with a larger per-step cost (two moment updates).
θ₀ = [-1.0, 1.0] (starting far from minimum)
Loss function: Rosenbrock f(x,y) = (1-x)² + 100(y-x²)²
Minimum at: (1, 1)
Vanilla GD (5000 steps, lr=0.001): θ = [-0.32, 0.09], Loss = 1.714
Momentum (5000 steps, lr=0.001): θ = [0.71, 0.50], Loss = 0.085
RMSProp (5000 steps, lr=0.005): θ = [0.98, 0.96], Loss = 0.0004
Adam (5000 steps, lr=0.01): θ = [1.000, 1.000], Loss = 0.0000001
  • SGD with momentum often achieves better final test accuracy than Adam for image classification CNNs, but requires more careful LR tuning. Adam converges faster and works well out-of-the-box for NLP/transformers.
  • Always call optimizer.zero_grad() before loss.backward() in PyTorch. PyTorch accumulates gradients by default — not zeroing them is a classic bug that compounds gradients across batches.
  • Gradient clipping (clip_grad_norm_ with max_norm=1.0) is essential for RNNs and transformers. It prevents a single bad batch from causing a catastrophic parameter update.
  • AdamW (Adam + weight decay) is the correct way to add L2 regularization with Adam. Standard Adam absorbs L2 into the adaptive scaling, making it scale-dependent and weaker than intended.
  • The bias correction in Adam (dividing by 1−β₁ᵗ) prevents the optimizer from taking tiny steps in the first few iterations when m and v are initialized near zero.
  • Not zeroing gradients before each backward pass (PyTorch-specific: gradients accumulate by default).
  • Using the same learning rate for SGD and Adam — Adam requires lr ≈ 10–100× smaller than SGD (0.001 vs 0.01–0.1).
  • Not shuffling data before each epoch — mini-batch estimates become biased when data is ordered.
  • Forgetting learning rate warmup for transformers — starting with a high LR causes divergence in early training.
  • Using a fixed learning rate when training for many epochs — the model oscillates near the minimum instead of converging precisely.
07
📊

Small Dataset (< 1K rows)

Context-Dependent

For small datasets, closed-form solutions (OLS, logistic regression via L-BFGS) are faster and more accurate than gradient descent. GD's stochastic noise doesn't help when you have few samples.

💡 Use solver='lbfgs' (second-order quasi-Newton) for logistic regression on small data. Reserve gradient descent for when you need neural network capacity.
🗄️

Large Dataset (> 1M rows)

Excellent

Mini-batch SGD was designed for this regime. Each gradient step costs O(b·d) where b is the batch size — memory and compute scale independently of n. One epoch processes all data without storing it in memory.

💡 Use DataLoader with num_workers for parallel data loading. Maximize GPU utilization with batch size = largest power of 2 that fits in memory.
🔄

Streaming / Online Data

Excellent

SGD with b=1 (or small batch) naturally supports online learning: process each new data point, update parameters, discard. Gradient descent doesn't require storing all past data.

💡 Use SGDClassifier.partial_fit() in sklearn. Decay learning rate over time to avoid oscillation as distribution stabilizes.
📐

High-Dimensional Features (d >> 1K)

Good

Gradient per-step costs O(b·d) — scales linearly with d. Adam's per-parameter adaptive scaling handles features with different scales without requiring manual preprocessing.

💡 Sparse features (text): use sparse gradients with embedding lookup and only update touched parameters. PyTorch handles this with sparse=True in Embedding layers.
🌊

Non-convex Loss Landscape (Neural Nets)

Good

Gradient descent has no convergence guarantee for non-convex problems. In practice, SGD noise helps escape saddle points and sharp local minima. Modern optimizers (Adam + LR schedule) reliably find good minima.

💡 In high-dimensional non-convex spaces, local minima that are bad are rare — most stationary points are saddle points. The main challenge is plateaus and saddle points, not bad local minima.
⚖️

Ill-conditioned Problems (different feature scales)

Poor

Without scaling or adaptive methods, gradient descent oscillates in elliptical loss landscapes — moving fast in some directions and slow in others. Convergence is exponentially slower with poor conditioning.

💡 Fix: StandardScaler before gradient descent, or use Adam which adapts per-parameter learning rates. Condition number of XᵀX determines convergence rate for linear models.
08

Interactive: Learning Rate, Loss Curve, and Descent Path

Status

Good: stable descent

Current loss

0.029

Training Loss Curves: Batch GD vs. Mini-batch SGD

Batch GD shows a smooth, monotonically decreasing loss. Mini-batch SGD shows a noisy but ultimately faster-converging loss. The noise in SGD helps it escape sharp local minima that batch GD might get stuck in.

Gradient descent convergence — MSE decreasing over iterations

Optimizer Comparison: Convergence Speed

Adam converges fastest on most problems. SGD without momentum is slowest. Momentum and RMSProp fall in between. Final solution quality may favor SGD+momentum for generalization (flatter minima).

Comparison visualization data is documented in this section.

Learning Rate Effect on Convergence

Learning rate too small: slow convergence but stable. Optimal: smooth rapid convergence. Too large: oscillation or divergence. The 'loss explosion' signature for LR too high is a sharp upward spike in the loss curve.

Gradient descent convergence — MSE decreasing over iterations

09
  • Universal applicability — works for any differentiable loss

    Gradient descent is the backbone of all neural network training. It works for any model with a differentiable loss function, from linear regression to transformers with 100B parameters. No model-specific derivation required — just backpropagation.

  • Scales to billions of parameters and samples

    Mini-batch SGD processes one batch at a time — memory usage is O(b·d) independent of the full dataset size n. This is why a 70B parameter LLM can be trained on 1T tokens: stochastic approximation makes the problem tractable.

  • Implicit regularization from SGD noise

    Stochastic gradient noise introduces implicit regularization — SGD prefers flatter minima (wider loss basins), which generalize better than sharp minima. This is a significant advantage over second-order methods or batch GD.

  • Enables online and streaming learning

    SGD processes one sample at a time. Models can be updated continuously as new data arrives without storing or reprocessing past data. Critical for production systems where models must adapt to distribution drift.

  • GPU-parallel via mini-batches

    Mini-batch computations are matrix multiplications (X_batch @ W) that are maximally parallel on GPU hardware. Batch size of 512 processes 512 samples simultaneously — linear speedup with GPU parallelism.

  • Requires careful learning rate tuning

    The learning rate must be in a narrow window: too small → exponentially slow convergence; too large → divergence. No single learning rate works across all models and datasets. Requires empirical search (LR range test, grid search) or adaptive methods.

  • No convergence guarantee for non-convex losses

    Neural network losses are non-convex. Gradient descent can converge to local minima, saddle points, or plateaus. In practice, modern architectures + Adam typically find good solutions, but there's no theoretical guarantee.

  • Sensitive to feature scaling and initialization

    Poorly scaled features create elliptical loss landscapes that cause extremely slow convergence. Bad initialization (all zeros for neural nets) causes symmetry breaking failure — all neurons learn the same thing. Requires careful setup.

  • Mini-batch gradient is a noisy estimate

    Each gradient update uses only b samples, introducing variance. Very small batches (b=1, SGD) have high noise — the optimization path is erratic. This is a feature (regularization) but also a bug (makes convergence harder to achieve precisely).

  • Slower convergence than second-order methods

    True Newton's method (using the Hessian H) converges quadratically vs. gradient descent's linear rate. L-BFGS approximates the Hessian and works better for smaller models. For neural networks, Hessian computation is prohibitively expensive.

10
Deep Learning (all domains)

Neural network training via backpropagation

Every neural network from LeNet to GPT is trained with gradient descent + backpropagation. Adam or SGD+momentum is the optimizer; mini-batches of 32–4096 process the data; weight updates happen after each batch. The entire deep learning revolution runs on this algorithm.

Natural Language Processing

Large language model pre-training

GPT-4, LLaMA, BERT are trained with Adam (β₁=0.9, β₂=0.999) with a linear warmup schedule followed by cosine decay. Gradient clipping (max_norm=1.0) prevents gradient explosions in the transformer attention layers.

Recommendation Systems

Matrix factorization for collaborative filtering

User-item interaction matrices are factorized into embedding matrices U and V via SGD: for each observed rating, compute gradient w.r.t. user and item embeddings and update. Scales to billions of user-item pairs at Netflix/Spotify.

Computer Vision

CNN training for image classification/detection

ResNet, EfficientNet, YOLO are trained with SGD+momentum (often preferred over Adam for final accuracy), cosine annealing LR schedule, and weight decay. The ImageNet training benchmark runs 90 epochs of mini-batch SGD.

Reinforcement Learning

Policy gradient optimization

Policy gradient methods (REINFORCE, PPO) use gradient descent to maximize expected cumulative reward. The gradient is estimated from Monte Carlo rollouts. Adam is used to update policy network weights.

Online Advertising

Real-time CTR model updates

Ad systems like Google Ads use FTRL (Follow The Regularized Leader) — an online gradient descent variant optimized for sparse features. Models update continuously as new click/impression data arrives, keeping CTR predictions current.

11

Gradient descent is a family, not a single algorithm. Here's how the key variants compare, and how gradient-based optimization compares to gradient-free alternatives.

SGD with Momentum

Same gradient-based framework

Adds velocity term that accumulates gradient history. Faster than vanilla GD on consistent gradients, dampens oscillations in ravines. Best final accuracy for CNNs but requires careful LR tuning.

Image classification CNNs where you can afford careful LR tuning. When generalization gap with Adam is a concern (SGD often finds flatter, better-generalizing minima).

Adam

Same gradient-based framework

Adaptive per-parameter learning rates + bias correction. Converges faster than SGD on most problems. Can generalize worse than SGD+momentum on image classification. Default for NLP/transformers.

Transformers, NLP, any model where SGD+momentum requires too much LR tuning. Default optimizer when you don't know which to use.

L-BFGS (Second-Order)

Gradient-based optimization

Approximates the inverse Hessian using gradient history. Converges in far fewer iterations than first-order methods. Not mini-batchable (requires full-batch gradients). Not scalable to neural networks.

Small-to-medium models where the full gradient is affordable (logistic regression on moderate datasets, scipy.optimize.minimize for physics simulation).

Bayesian Optimization

Both optimize a black-box function

Gradient-free. Uses a probabilistic surrogate model (GP) to select next evaluation point. O(n³) per step — only feasible for < 50–100 evaluations. Used for hyperparameter optimization, not model training.

Optimizing expensive black-box functions (hyperparameter tuning, A/B experiment design). When gradients are unavailable or unreliable.

PropertyVanilla GDSGD+MomentumRMSPropAdam
Convergence speed🐢 Slow⚡ Fast⚡ Fast⚡⚡ Fastest
Memory per stepO(d)O(2d)O(2d)O(3d)
LR sensitivityHighHighLowLow
Sparse gradientsPoorPoorGoodExcellent
Generalization (CNNs)PoorBestGoodOK
Default for NLPNoNoNoYes (AdamW)

You need to minimize a differentiable loss function over continuous parameters — which is essentially every modern ML model training scenario. The specific variant (SGD+momentum vs. Adam) depends on model architecture and task.

12

Training Loss

Loss on the training set over epochs. Should decrease monotonically (batch GD) or decrease on average (SGD). Sudden spikes indicate learning rate is too large or a bad batch.

Target: Decreasing trend throughout training; plateau at minimum indicates convergence

Validation Loss

Loss on held-out validation set. The true indicator of model quality. If validation loss rises while training loss falls: overfitting. Both high: underfitting.

Target: Should track training loss closely (small gap) and continue decreasing or plateau

Gradient Norm

Magnitude of the gradient vector. Very large (> 100): exploding gradients, apply clipping. Very small (< 1e-7): vanishing gradients or already at minimum. Healthy range: 0.1–10.

Target: Decreasing over training; should not explode or vanish

Parameter Update Ratio

Ratio of parameter update magnitude to parameter magnitude. Recommended range: 0.001 (Andrej Karpathy's heuristic). If ratio >> 0.01: LR too large. If << 0.001: LR too small.

Target: ≈ 1e-3 (0.001) — Karpathy's rule of thumb for healthy gradient updates

  1. 01.1. Plot training and validation loss per epoch on the same chart.
  2. 02.2. Monitor gradient norm per step — clip if > max_norm (1.0 for transformers).
  3. 03.3. Check parameter update ratio (update magnitude / parameter magnitude) ≈ 1e-3.
  4. 04.4. If loss oscillates: reduce LR or switch to LR schedule.
  5. 05.5. If loss plateaus early: LR too small, or model has converged to a poor basin — try larger LR + warmup.
  6. 06.6. Apply early stopping: stop when validation loss hasn't improved for patience=10 epochs.
  • Evaluating on training loss alone — validation loss is the true measure; train loss can look great while overfitting.
  • Training for a fixed number of epochs without early stopping — may overtrain or undertrain depending on dataset.
  • Not monitoring gradient norms — silent exploding gradients corrupt the model without showing in the loss immediately.
  • Interpreting noisy SGD loss as oscillation/divergence — normal SGD loss curves are noisy; use rolling average to see the trend.

Training loss starts at 2.3 (random init BCE), drops to 0.4 by epoch 5, plateaus at 0.35 by epoch 20. Validation loss tracks closely (0.38) until epoch 15, then slowly rises to 0.45 by epoch 30. Early stopping at epoch 20 (patience=5). Gradient norm starts at 8.2, drops to 0.3 by epoch 10. Interpretation: model trained well, slight overfitting began at epoch 15. The checkpoint at epoch 20 is the best model.

13
  • ×Confusing 'epoch' and 'step' (gradient update) — one epoch is n/batch_size steps, not one step.
  • ×Thinking a higher learning rate always trains faster — above the stability threshold, it diverges rather than converges.
  • ×Not shuffling data before each epoch — model learns dataset ordering artifacts instead of actual patterns.
  • ×Using the same learning rate for all optimizers — Adam needs ~10× smaller LR than SGD+momentum.
  • ×Forgetting optimizer.zero_grad() in PyTorch — gradients accumulate across batches, causing incorrect updates.
  • ×Using Adam without weight decay (should use AdamW for regularization) — L2 regularization interacts incorrectly with Adam's adaptive scaling.
  • ×Not applying gradient clipping for RNNs/transformers — one bad batch can corrupt 10K previous gradient steps.
  • ×Using learning rate scheduling at the wrong granularity — OneCycleLR schedules per-step, CosineAnnealing per-epoch.
  • ×Saying SGD and gradient descent are the same — SGD uses one or a mini-batch of samples, not the full dataset.
  • ×Not knowing why Adam's bias correction matters — without it, the first few steps have effectively zero learning rate.
  • ×Stating gradient descent always finds the global minimum — only guaranteed for convex functions.
  • ×Confusing learning rate and batch size effects — both affect convergence speed but via different mechanisms.
  • ×Not monitoring gradient norms — vanishing/exploding gradients cause silent training failure that looks like a plateau.
  • ×Switching optimizer mid-training without resetting momentum states — accumulated momentum from the old optimizer poisons the new optimizer.
  • ×Training without a validation set — no way to know if the model is overfitting or if early stopping is needed.
  • ×Using batch size 1 (pure SGD) for a GPU — massively underutilizes GPU parallelism; use the largest batch that fits in memory.
14

What kind of bias does this model have?

Bias depends on model assumptions and feature expressiveness.

What kind of variance does it have?

Variance grows with model flexibility and weak regularization.

How does it overfit?

Overfitting usually appears as strong train performance but weaker validation/test behavior.

How do we regularize it?

Use complexity constraints, robust validation, and data-centric cleanup.

What kind of data does it like?

Prefers representative, low-leakage data with stable feature definitions.

What kind of data breaks it?

Breaks under leakage, severe distribution drift, noisy labels, and poorly engineered features.

14

Quick Revision Reference

  • Core update: θ ← θ − α·∇L(θ) — always move opposite to the gradient
  • Batch GD: full dataset per step (exact gradient, slow). SGD: one sample (noisy, fast). Mini-batch: best of both
  • Momentum: adds velocity to dampen oscillations and accelerate in consistent directions
  • RMSProp: per-parameter adaptive LR based on running average of squared gradients
  • Adam = momentum + RMSProp + bias correction; most widely used optimizer
  • AdamW = Adam + decoupled weight decay (correct L2 regularization for Adam)
  • Learning rate is the most important hyperparameter — use LR finder or literature values
  • Always monitor training vs. validation loss; apply early stopping when validation rises
GD Update
Momentum
RMSProp
Adam
Bias Correction
  • Training any neural network
  • Optimization when closed-form solution is unavailable
  • Large-scale datasets where full-batch computation is infeasible
  • Online learning with streaming data
  • Closed-form solution exists (use OLS, or L-BFGS for moderate-scale problems)
  • Non-differentiable loss (use evolutionary algorithms)
  • Hyperparameter optimization (use Bayesian optimization)
  • Discrete parameter spaces
Write the Adam update equations from memory including bias correction
Explain why SGD noise can improve generalization (flatter minima)
Describe the convergence rate difference: convex O(1/T) vs. strongly convex O(ρᵀ)
Explain why AdamW is preferable to Adam + L2 for regularization
Know when to prefer SGD+momentum over Adam (image classification CNNs)
15
16

These questions are designed to break assumptions and expose weak understanding. Most people will answer them wrong on their first attempt. Work through each one carefully.