Momentum

Learn how Momentum gives gradient descent memory, reducing zig-zag movement and building speed in consistent directions.

Image 1: SGD zig-zag path compared with the smoother Momentum path. Momentum reduces oscillation and preserves forward progress.

Background

Plain SGD updates parameters using only the current gradient. This is simple, but it can be unstable when the loss surface has a narrow valley.

In a narrow valley, the gradient may be steep across the valley but weak along the valley. As a result, SGD can bounce from side to side instead of moving smoothly forward.

Momentum adds memory to the optimizer. Instead of trusting only the current gradient, it blends the current gradient into an exponentially weighted running average of past gradients.

Momentum strengthens directions that stay consistent and weakens directions that keep flipping. This helps the optimizer move more smoothly through ravines and curved valleys.
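The zig-zag behaviour can be reproduced on a toy problem. The sketch below is an illustration only: it assumes a hypothetical quadratic loss f(x, y) = 0.5 (10 x^2 + 0.1 y^2), a noise-free gradient instead of a real mini-batch, and a learning rate of 0.19 chosen so that plain SGD overshoots across the valley on every step. It uses the momentum update from the Important Formulas section, with beta = 0 standing in for plain SGD.

```python
def grad(x, y, a=10.0, b=0.1):
    """Gradient of the assumed toy loss f(x, y) = 0.5 * (a*x**2 + b*y**2)."""
    return a * x, b * y


def run(lr, beta, steps=50, start=(1.0, 1.0)):
    """Return the list of (x, y) iterates; beta=0.0 reduces to plain SGD."""
    x, y = start
    ux, uy = 0.0, 0.0  # momentum buffer, one entry per coordinate
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad(x, y)
        ux = beta * ux + (1.0 - beta) * gx
        uy = beta * uy + (1.0 - beta) * gy
        x, y = x - lr * ux, y - lr * uy
        path.append((x, y))
    return path


def sign_flips(path):
    """Count how often the across-valley coordinate x changes sign."""
    xs = [p[0] for p in path]
    return sum(1 for prev, cur in zip(xs, xs[1:]) if prev * cur < 0)


sgd_path = run(lr=0.19, beta=0.0)
momentum_path = run(lr=0.19, beta=0.9)
print("SGD sign flips in x:     ", sign_flips(sgd_path))
print("Momentum sign flips in x:", sign_flips(momentum_path))
```

Under these assumed settings, plain SGD jumps to the other side of the valley on every step, while the momentum run crosses the valley floor far less often. The numbers depend entirely on the chosen curvature and learning rate, so treat them as a demonstration rather than a benchmark.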

Idea

The core idea: blending the current gradient with a memory of past directions gives a smoother update direction.

If gradients keep pointing in a similar direction, the momentum buffer builds in that direction. If gradients keep flipping direction, the buffer weakens that oscillating direction.

Momentum is therefore not only about speed. It changes the optimization path by making consistent movement stronger and unstable movement weaker.

Image 2: The momentum buffer combines the previous momentum with the current gradient, storing direction memory before the parameter update.

Important Formulas

\[g_t=\frac{1}{B}\sum_{i\in\mathcal{B}_t}\nabla_\theta \ell_i(\theta_t)\]

Mini-batch gradient input. Momentum changes how gradients are remembered, not how this gradient is computed.
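As a small illustration of the averaging in g_t, the sketch below assumes a hypothetical one-parameter model with per-example loss 0.5 (w x - y)^2. The model, the data, and the function name are made up for this example; only the "average the per-example gradients" step comes from the formula above.

```python
def minibatch_gradient(w, batch):
    """Average the per-example gradients (w*x - y) * x of 0.5 * (w*x - y)**2."""
    per_example = [(w * x - y) * x for x, y in batch]
    return sum(per_example) / len(batch)  # divide by the mini-batch size B


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 7.0)]  # hypothetical (x, y) pairs
g = minibatch_gradient(w=1.0, batch=batch)
print(g)
```

Momentum takes this g as given; everything it changes happens after the gradient has been computed.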

\[\theta_{t+1}=\theta_t-\eta g_t\]

Plain SGD reacts directly to the current mini-batch gradient.

\[u_t=\beta u_{t-1}+(1-\beta)g_t\]

Momentum buffer: a smoothed gradient direction with memory.

\[\theta_{t+1}=\theta_t-\eta u_t\]

The parameter update uses the momentum buffer instead of the raw gradient.
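The two momentum formulas translate almost line for line into code. The following is a minimal sketch using plain Python lists; the function name, the default values, and the sample numbers are assumptions for illustration.

```python
def momentum_step(theta, u, grad, lr=0.1, beta=0.9):
    """Apply u_t = beta*u_{t-1} + (1-beta)*g_t, then theta_{t+1} = theta_t - lr*u_t."""
    new_u = [beta * ui + (1.0 - beta) * gi for ui, gi in zip(u, grad)]
    new_theta = [ti - lr * ui for ti, ui in zip(theta, new_u)]
    return new_theta, new_u


theta = [1.0, -2.0]  # hypothetical parameters
u = [0.0, 0.0]       # buffer starts at zero
g = [4.0, -2.0]      # hypothetical mini-batch gradient

theta, u = momentum_step(theta, u, g)
print(theta, u)
```

The buffer is returned alongside the parameters because it has to be carried into the next step; discarding it would turn the update into plain SGD with a smaller step.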

\[u_t=(1-\beta)g_t+(1-\beta)\beta g_{t-1}+(1-\beta)\beta^2 g_{t-2}+(1-\beta)\beta^3 g_{t-3}+\cdots\]

Expanded memory view, assuming u_0 = 0: the gradient from k steps ago carries weight (1 - beta) beta^k, so recent gradients matter more, while older gradients fade gradually.
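The expanded form can be checked numerically against the recursive one. This sketch assumes u_0 = 0 and uses a made-up scalar gradient sequence; the function names are hypothetical.

```python
def buffer_recursive(grads, beta):
    """Run u_t = beta*u_{t-1} + (1-beta)*g_t starting from u_0 = 0."""
    u = 0.0
    for g in grads:
        u = beta * u + (1.0 - beta) * g
    return u


def buffer_expanded(grads, beta):
    """Sum (1-beta) * beta**k * g_{t-k} over k = 0 .. t-1."""
    t = len(grads)
    # grads[t - 1 - k] is the gradient from k steps ago
    return sum((1.0 - beta) * beta**k * grads[t - 1 - k] for k in range(t))


grads = [2.0, -1.0, 0.5, 3.0]  # hypothetical gradient sequence
beta = 0.9
weights = [(1.0 - beta) * beta**k for k in range(len(grads))]

print(buffer_recursive(grads, beta), buffer_expanded(grads, beta))
print(weights, sum(weights))
```

After t steps the weights sum to 1 - beta^t rather than 1, so with a zero-initialized buffer the first few values of u_t are scaled down relative to a true average.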

\[u_{dW,t}^{[l]}=\beta u_{dW,t-1}^{[l]}+(1-\beta)dW_t^{[l]}\]

Layer-wise momentum buffer for weight gradients.

\[u_{db,t}^{[l]}=\beta u_{db,t-1}^{[l]}+(1-\beta)db_t^{[l]}\]

Layer-wise momentum buffer for bias gradients.

\[W_{t+1}^{[l]}=W_t^{[l]}-\eta u_{dW,t}^{[l]}\]

Layer-wise weight update.

\[b_{t+1}^{[l]}=b_t^{[l]}-\eta u_{db,t}^{[l]}\]

Layer-wise bias update.
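The four layer-wise formulas can be sketched as one loop over layers. The example below assumes parameters, gradients, and buffers are stored as dictionaries keyed by layer index, with nested Python lists standing in for arrays; this layout and all names are illustrative choices, and a real implementation would normally use an array library.

```python
def ema(buf, grad, beta):
    """Elementwise beta*buf + (1-beta)*grad for floats or nested lists."""
    if isinstance(buf, list):
        return [ema(b, g, beta) for b, g in zip(buf, grad)]
    return beta * buf + (1.0 - beta) * grad


def descend(param, buf, lr):
    """Elementwise param - lr*buf for floats or nested lists."""
    if isinstance(param, list):
        return [descend(p, b, lr) for p, b in zip(param, buf)]
    return param - lr * buf


def momentum_update_layers(params, grads, buffers, lr=0.1, beta=0.9):
    """Update every layer's W and b in place, each with its own buffer."""
    for layer in params:
        for name in ("W", "b"):
            buffers[layer][name] = ema(buffers[layer][name], grads[layer][name], beta)
            params[layer][name] = descend(params[layer][name], buffers[layer][name], lr)


# Hypothetical single layer with a 2x2 weight matrix and a bias vector.
params = {1: {"W": [[1.0, 2.0], [3.0, 4.0]], "b": [0.5, -0.5]}}
grads = {1: {"W": [[10.0, 0.0], [0.0, -10.0]], "b": [1.0, -1.0]}}
buffers = {1: {"W": [[0.0, 0.0], [0.0, 0.0]], "b": [0.0, 0.0]}}

momentum_update_layers(params, grads, buffers)
print(params[1]["W"], params[1]["b"])
```

Each parameter tensor gets a buffer of the same shape, which is where the "one extra buffer per parameter" memory cost comes from.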

Symbols

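  1. g_t: mini-batch gradient at step t.
  2. B: mini-batch size, i.e. the number of examples in the mini-batch \mathcal{B}_t.
  3. ell_i: loss on training example i.
  4. theta_t: model parameters before the update.
  5. eta: learning rate.
  6. u_t: momentum buffer, or smoothed gradient direction.
  7. beta: momentum coefficient, often around 0.9.
  8. dW_t^{[l]} and db_t^{[l]}: weight and bias gradients for layer l at step t.
  9. u_{dW,t}^{[l]} and u_{db,t}^{[l]}: the momentum buffers for those weight and bias gradients.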

Pros

| Pros | Why it helps |
| --- | --- |
| Reduces zig-zag movement | Momentum smooths directions that change too quickly across steps. |
| Speeds up consistent progress | When gradients keep pointing in a useful direction, the buffer builds movement in that direction. |
| Smooths mini-batch noise | One noisy mini-batch has less control over the update direction. |
| Simple and memory-efficient | It only adds one extra buffer for each parameter. |
| Strong practical baseline | SGD with Momentum is still powerful when the learning rate is tuned well. |

Cons

| Cons | Why it matters |
| --- | --- |
| Adds one more hyperparameter | The momentum coefficient beta must be chosen carefully. |
| Can overshoot | If the learning rate or momentum is too large, the optimizer may move past a good region. |
| No per-parameter scaling | Momentum smooths direction, but does not automatically resize each parameter step like RMSProp or Adam. |
| Still needs learning-rate tuning | Momentum helps the path, but a bad learning rate can still make training unstable. |
| Notation varies | Some texts and libraries drop the (1 - beta) factor and write u_t = beta u_{t-1} + g_t, or fold the learning rate and sign into the buffer, so eta and beta values are not always directly comparable across sources. |
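The "Notation varies" point can be made concrete. A common alternative to the averaged form used in this post accumulates gradients without the (1 - beta) factor. The sketch below, on a hypothetical one-dimensional loss theta^2, suggests that the two forms trace the same path when the alternative form's learning rate is scaled by (1 - beta). The function names and the test problem are assumptions for illustration.

```python
def run_averaged_form(theta, lr, beta, steps, grad_fn):
    """This post's convention: u = beta*u + (1-beta)*g, theta -= lr*u."""
    u = 0.0
    for _ in range(steps):
        u = beta * u + (1.0 - beta) * grad_fn(theta)
        theta = theta - lr * u
    return theta


def run_accumulated_form(theta, lr, beta, steps, grad_fn):
    """Alternative convention: v = beta*v + g, theta -= lr*v."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad_fn(theta)
        theta = theta - lr * v
    return theta


def grad_fn(theta):
    """Gradient of the assumed toy loss theta**2."""
    return 2.0 * theta


beta = 0.9
a = run_averaged_form(5.0, lr=0.1, beta=beta, steps=20, grad_fn=grad_fn)
b = run_accumulated_form(5.0, lr=0.1 * (1.0 - beta), beta=beta, steps=20, grad_fn=grad_fn)
print(a, b)
```

The practical consequence is that a learning rate tuned under one convention is off by a factor of 1 / (1 - beta) under the other, which is a factor of 10 at beta = 0.9.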

Quick Example

Suppose the optimizer is moving through a narrow valley. Let beta = 0.9, and assume the first two gradients are g_1 = (10, 1) and g_2 = (-9, 1).

The first coordinate changes direction sharply, but the second coordinate stays positive. Start with u_0 = (0, 0).

Example Calculation

\[u_1=0.9(0,0)+0.1(10,1)=(1,0.1)\]

Because u_0 = (0, 0), the first buffer is just (1 - beta) g_1, one tenth of the first gradient.

\[u_2=0.9(1,0.1)+0.1(-9,1)=(0,0.19)\]

The oscillating first coordinate cancels, while the consistent second coordinate accumulates.
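The same two steps can be replayed in code. This sketch uses Python's `fractions.Fraction` so that the arithmetic is exact and matches the hand calculation without floating-point rounding; the variable names are illustrative.

```python
from fractions import Fraction

beta = Fraction(9, 10)
gradients = [
    (Fraction(10), Fraction(1)),  # g_1
    (Fraction(-9), Fraction(1)),  # g_2
]

u = (Fraction(0), Fraction(0))  # u_0
history = []
for g in gradients:
    # u_t = beta * u_{t-1} + (1 - beta) * g_t, applied per coordinate
    u = tuple(beta * ui + (1 - beta) * gi for ui, gi in zip(u, g))
    history.append(u)

for step, buf in enumerate(history, start=1):
    print(f"u_{step} =", tuple(str(c) for c in buf))
# u_1 = ('1', '1/10')
# u_2 = ('0', '19/100')
```

The exact cancellation in the first coordinate is a feature of these particular numbers; in general an oscillating coordinate is damped rather than zeroed.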

Common Mistakes

  1. Thinking Momentum changes how g_t is computed. It does not; it changes how current and past gradients are combined before the update.
  2. Confusing beta with the learning rate. Eta controls step size; beta controls how much past direction is remembered.
  3. Using too much Momentum with too large a learning rate, which can cause overshooting or unstable movement.
  4. Confusing Momentum's u_t with Adam's v_t. Momentum's u_t is a signed average of gradients, so it stores direction; Adam's v_t is a moving average of squared gradients, so it stores magnitude only.
  5. Thinking Momentum only makes training faster. It also changes the path by reducing oscillation.
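The difference between a momentum-style buffer and Adam's v_t (mistake 4) is easy to see on a gradient that keeps flipping sign. The sketch below tracks both quantities on a made-up sequence. It is not a full Adam implementation: bias correction and the parameter update are omitted, and the second decay rate is set to 0.9 purely for illustration (Adam's usual default is 0.999).

```python
def first_and_second_moment(grads, beta1=0.9, beta2=0.9):
    """Track a signed gradient average (m) and a squared-gradient average (v)."""
    m, v = 0.0, 0.0
    for g in grads:
        m = beta1 * m + (1.0 - beta1) * g      # signed: opposite gradients cancel
        v = beta2 * v + (1.0 - beta2) * g * g  # squared: non-negative, never cancels
    return m, v


m, v = first_and_second_moment([10.0, -10.0, 10.0, -10.0])
print(m, v)
```

The signed average stays close to zero because the gradients cancel, while the squared average keeps growing because it only sees magnitudes. Adam keeps a separate first-moment buffer for the direction role that u_t plays here.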

Takeaway

SGD follows the current gradient directly. Momentum gives SGD memory by keeping an exponential moving average of past gradients.

Directions that stay consistent become stronger, while directions that keep flipping become weaker. This makes the optimization path smoother, especially in narrow valleys where plain SGD tends to zig-zag.