Notes on Gradient Descent
I spent most of this week working through gradient descent from first principles. I wanted to understand not just what it does — move parameters in the direction that reduces loss — but why certain variants work better in practice, and what the geometric intuition actually is.
The core idea is clean: given a loss function $L(\theta)$, we compute the gradient $\nabla_\theta L$ and update parameters in the opposite direction:
$$\theta \leftarrow \theta - \eta \cdot \nabla_\theta L(\theta)$$
The gradient points in the direction of steepest ascent on the loss surface, so stepping against it decreases the loss (provided the step is small enough) and moves us toward a local minimum. The parameter $\eta$ (the learning rate) controls the step size.
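To make the update rule concrete, here is a minimal sketch in Python with NumPy. The quadratic loss, the `target` vector, and the function name `gradient_descent` are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def gradient_descent(grad_fn, theta0, lr=0.1, steps=100):
    """Repeatedly apply theta <- theta - lr * grad_fn(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return theta

# Hypothetical toy loss: L(theta) = ||theta - target||^2,
# whose gradient is 2 * (theta - target).
target = np.array([3.0, -1.0])

def grad(theta):
    return 2.0 * (theta - target)

theta_final = gradient_descent(grad, theta0=[0.0, 0.0], lr=0.1, steps=100)
print(theta_final)  # ends up very close to target
```

On this loss each step shrinks the distance to `target` by a factor of $1 - 2\eta = 0.8$, which is why 100 steps are more than enough.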
Why learning rate matters so much
The learning rate is arguably the most important hyperparameter in training. Too large and the updates overshoot — you bounce around the loss surface and may diverge entirely. Too small and training is painfully slow, sometimes getting trapped in regions with flat gradients.
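The three regimes are easy to reproduce on a one-dimensional toy problem. The sketch below assumes the hypothetical loss $L(\theta) = \theta^2$, for which each update multiplies $\theta$ by $(1 - 2\eta)$. The function name `run` and the specific learning rates are chosen only for illustration.

```python
def run(lr, steps=50, theta=1.0):
    """Gradient descent on L(theta) = theta**2, whose gradient is 2 * theta."""
    for _ in range(steps):
        theta = theta - lr * 2.0 * theta
    return theta

too_large = run(lr=1.1)     # factor per step is 1 - 2.2 = -1.2: sign flips and grows
too_small = run(lr=0.001)   # factor per step is 0.998: barely moves in 50 steps
reasonable = run(lr=0.4)    # factor per step is 0.2: converges quickly

print(too_large, too_small, reasonable)
```

For this loss the boundary is $\eta = 1$: anything above it diverges, because the per-step factor has magnitude greater than one.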
There’s a useful analogy: imagine descending a foggy hillside where you can only feel the slope immediately under your feet. A large step means you might walk off a cliff or overshoot a valley. A tiny step means you’ll eventually get somewhere, but it’ll take forever.
In practice this shows up in training curves. With $\eta$ too high, the loss oscillates or spikes. With $\eta$ too low, the curve decreases so slowly that, within a fixed training budget, it appears to level off well above the loss the model could actually reach. Learning rate schedules such as linear warmup and cosine annealing exist because no single value is ideal throughout training.
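As a rough illustration of what such a schedule looks like, here is one common shape: linear warmup followed by cosine decay. The function name `lr_schedule` and the parameters `peak_lr`, `warmup_steps`, and `min_lr` are hypothetical. Frameworks ship their own schedulers with their own conventions.

```python
import math

def lr_schedule(step, total_steps, peak_lr=1e-3, warmup_steps=100, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so the very first updates are small.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rate for every step of a hypothetical 1000-step run.
lrs = [lr_schedule(s, total_steps=1000) for s in range(1000)]
```

The warmup keeps early steps small while the gradients are still erratic. The cosine tail shrinks the step as the optimizer settles into a basin.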
One useful diagnostic: if loss decreases smoothly in early epochs then plateaus, you probably need to reduce the learning rate. If it’s already unstable in the first few steps, reduce it immediately.
In practice, vanilla gradient descent (computing gradients over the entire dataset) is almost never used. Stochastic Gradient Descent (SGD) computes gradients on small batches, which introduces noise but allows far more frequent updates. Surprisingly, this noise is often beneficial — it acts as a form of regularization, helping models escape sharp minima that generalize poorly.
A variance-versus-cost tradeoff appears here. With uniformly sampled batches, the minibatch gradient is an unbiased estimate of the full gradient, so batch size mainly controls noise: small batches give noisier but cheaper estimates, while large batches give more accurate estimates but fewer update steps per epoch.
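The effect of batch size on gradient noise can be measured directly. The sketch below uses synthetic linear-regression data and compares minibatch gradients against the full-dataset gradient. The data, the helper names `mse_grad` and `minibatch_error`, and the batch sizes are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic regression data: y = X @ w_true + noise.
n, d = 4096, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def mse_grad(w, Xb, yb):
    """Gradient of 0.5 * mean((Xb @ w - yb)**2) with respect to w."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(d)
full_grad = mse_grad(w, X, y)

def minibatch_error(batch_size, trials=200):
    """Average distance between a minibatch gradient and the full gradient."""
    errors = []
    for _ in range(trials):
        idx = rng.choice(n, size=batch_size, replace=False)
        errors.append(np.linalg.norm(mse_grad(w, X[idx], y[idx]) - full_grad))
    return float(np.mean(errors))

err_small = minibatch_error(8)
err_large = minibatch_error(512)
print(f"batch 8: {err_small:.3f}   batch 512: {err_large:.3f}")
```

The error shrinks as the batch grows, but each large-batch gradient costs proportionally more compute, which is the tradeoff described above.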
Momentum and its intuition
Standard SGD treats each update as independent — compute gradient, step, forget. Momentum carries forward a weighted average of past gradients, giving the optimizer a sense of “velocity.”
The update rule becomes:
$$v_t = \beta v_{t-1} + (1 - \beta) \nabla_\theta L(\theta)$$
$$\theta \leftarrow \theta - \eta \cdot v_t$$
With $\beta = 0.9$, the average has an effective window of about $1/(1 - \beta) = 10$ steps, so roughly the last 10 gradients dominate the current direction. This has a few useful effects:
- Dampens oscillation in directions where the gradient sign keeps flipping (common in narrow valleys). The conflicting gradients cancel while the consistent component accumulates.
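- Accelerates in consistent directions: if the gradient keeps pointing the same way, velocity builds up instead of being cancelled. With the $(1 - \beta)$ scaling used above, $v_t$ approaches the gradient itself, so the step tends to $\eta \nabla_\theta L$. In the classical form $v_t = \beta v_{t-1} + \nabla_\theta L(\theta)$, velocity grows toward $\nabla_\theta L / (1 - \beta)$, effectively giving a step up to $1/(1 - \beta)$ times larger without changing $\eta$.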
- Escapes shallow local minima — enough accumulated velocity can carry the optimizer past a small bump.
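A quick check of the "about 10 gradients" rule of thumb for $\beta = 0.9$, assuming $v_0 = 0$ and writing $g_s$ (my notation, not standard) for the gradient at step $s$. Unrolling the recursion gives

$$v_t = (1 - \beta) \sum_{k=0}^{t-1} \beta^{k} g_{t-k}$$

so the gradient from $k$ steps ago carries weight $(1 - \beta)\beta^{k}$, which decays geometrically. The 10 most recent gradients together hold

$$(1 - \beta) \sum_{k=0}^{9} \beta^{k} = 1 - \beta^{10} \approx 0.65$$

of the total weight. The "last 10 steps" is therefore a characteristic timescale rather than a hard cutoff: older gradients still contribute, just with rapidly shrinking weight.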
The geometric intuition: instead of reacting to every local slope, the optimizer builds up inertia. It’s the difference between a ball rolling down a hill (momentum) vs. a person carefully placing each foot (vanilla SGD).
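The damping effect can be seen on a narrow-valley toy problem. The sketch below assumes a hypothetical quadratic loss with curvatures 1 and 25, and deliberately picks a learning rate that is too large for plain descent along the steep axis. The function names and constants are illustrative only.

```python
import numpy as np

# Hypothetical narrow-valley loss: L(x, y) = 0.5 * (1 * x**2 + 25 * y**2).
curvature = np.array([1.0, 25.0])

def grad(theta):
    return curvature * theta

def plain_gd(theta0, lr, steps):
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

def momentum_gd(theta0, lr, steps, beta=0.9):
    """v <- beta * v + (1 - beta) * grad;  theta <- theta - lr * v."""
    theta = np.array(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = beta * v + (1.0 - beta) * grad(theta)
        theta = theta - lr * v
    return theta

start = [1.0, 1.0]
lr = 0.1  # along y, plain descent multiplies by 1 - 0.1 * 25 = -1.5 each step

theta_plain = plain_gd(start, lr, steps=300)    # y coordinate blows up
theta_mom = momentum_gd(start, lr, steps=300)   # sign-flipping gradients get averaged out
print(theta_plain, theta_mom)
```

At this learning rate the plain update overshoots the steep axis more on every step. The averaged velocity cancels most of that back-and-forth, so the momentum run stays stable. This is a demonstration of the damping effect at one setting, not a claim that momentum is faster at every learning rate.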
Nesterov momentum is a refinement that computes the gradient at the “lookahead” position $\theta - \eta \beta v_{t-1}$ rather than at the current position. In theory this improves the convergence rate on smooth convex problems (the accelerated method reaches $O(1/t^2)$ versus $O(1/t)$ for plain gradient descent), and in practice it works noticeably better on some problems.
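A minimal sketch of the lookahead step, written to match the $(1 - \beta)$ form of the momentum update used in these notes. The function name `nesterov_step` and the toy quadratic are assumptions. Libraries often arrange the Nesterov update differently, so treat this as one formulation rather than a reference implementation.

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, lr=0.1, beta=0.9):
    """One Nesterov-style update with the gradient taken at the lookahead point."""
    lookahead = theta - lr * beta * v              # where the old velocity would carry us
    v_new = beta * v + (1.0 - beta) * grad_fn(lookahead)
    theta_new = theta - lr * v_new
    return theta_new, v_new

# Hypothetical toy loss: L(x, y) = 0.5 * (1 * x**2 + 25 * y**2).
curvature = np.array([1.0, 25.0])

def grad(theta):
    return curvature * theta

theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)
for _ in range(300):
    theta, v = nesterov_step(theta, v, grad)

print(theta)  # ends up close to the minimum at the origin
```

The only change from ordinary momentum is where the gradient is evaluated: at the point the existing velocity is about to reach, which lets the update correct course slightly earlier.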
What surprised me most: a lot of the practical wisdom around gradient descent — warmup schedules, gradient clipping, batch size scaling — doesn’t follow cleanly from the theory. It comes from empirical observations about what works. The theory gives intuitions and bounds, but the implementation details come from practitioners learning what breaks.
Next I want to look at Adam and AdamW, adaptive learning rate methods that maintain per-parameter running estimates of the gradient's first and second moments (its mean and uncentered variance). They’re the default in most modern training setups, and there’s interesting work on why they interact differently with weight decay than SGD does.