RMSProp

Learn how RMSProp gives each parameter an adaptive step size by tracking recent squared gradients.

Image 1: RMSProp rescales each parameter using recent squared gradients, which balances parameter updates compared with plain SGD.

Background

Plain SGD uses one global learning rate for every parameter. This is simple, but it can be hard to tune when different parameters have very different gradient scales.

If one direction has very large gradients, SGD may take unstable steps in that direction. If another direction has small gradients, SGD may move too slowly there.
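To make the problem concrete, here is a minimal sketch of plain gradient descent on a badly scaled quadratic bowl. The function name `sgd_on_quadratic`, the curvatures, the starting point, and the learning rate are all made up for illustration, and the gradient is exact rather than a noisy mini-batch estimate.

```python
def sgd_on_quadratic(curvatures, start, lr, steps):
    """Run plain gradient descent on f(theta) = 0.5 * sum(c_j * theta_j**2)."""
    theta = list(start)
    for _ in range(steps):
        # Gradient of the quadratic: component j is c_j * theta_j.
        grad = [c * x for c, x in zip(curvatures, theta)]
        # One global learning rate for every component.
        theta = [x - lr * g for x, g in zip(theta, grad)]
    return theta


# Hypothetical badly scaled bowl: direction 0 is 100x steeper than direction 1.
steep, flat = sgd_on_quadratic(
    curvatures=[100.0, 1.0], start=[1.0, 1.0], lr=0.021, steps=50
)
print(abs(steep))  # grows: each step multiplies this coordinate by about -1.1
print(flat)        # shrinks slowly: each step multiplies it by 0.979
```

With this learning rate the steep coordinate overshoots and diverges while the flat coordinate has barely moved. Lowering the learning rate enough to stabilize the steep direction would slow the flat one even further.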

RMSProp addresses this by keeping an exponentially weighted running average of recent squared gradients. Instead of applying the raw gradient directly, it divides each gradient component by the square root of that average, a running root-mean-square scale.

RMSProp is different from Momentum. Momentum remembers direction. RMSProp remembers scale.
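The difference can be sketched with two small accumulators fed the same gradient magnitudes. The moving-average form of Momentum used here is one common variant and is an assumption of this sketch, as are the function names and the sample gradients.

```python
def momentum_buffer(grads, beta=0.9):
    """Exponential moving average of the gradients themselves (keeps sign)."""
    v = 0.0
    for g in grads:
        v = beta * v + (1.0 - beta) * g
    return v


def rmsprop_accumulator(grads, rho=0.9):
    """Exponential moving average of squared gradients (discards sign)."""
    s = 0.0
    for g in grads:
        s = rho * s + (1.0 - rho) * g * g
    return s


steady = [2.0, 2.0, 2.0, 2.0]
flipping = [2.0, -2.0, 2.0, -2.0]

# Momentum distinguishes the two sequences because sign matters to it.
print(momentum_buffer(steady), momentum_buffer(flipping))
# RMSProp's accumulator is identical for both: only magnitude is stored.
print(rmsprop_accumulator(steady), rmsprop_accumulator(flipping))
```

A gradient that keeps flipping sign mostly cancels in the Momentum buffer, but it looks exactly like a steady gradient to the RMSProp accumulator.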

Idea

The core idea is: large recent gradients create a larger denominator, which creates a smaller effective step.

SGD uses one learning rate. RMSProp creates a different effective learning rate for each parameter by looking at the recent squared-gradient magnitude.

This makes RMSProp helpful on badly scaled surfaces, where one parameter direction can otherwise dominate the update.

Important Formulas

\[g_t=\frac{1}{B}\sum_{i\in\mathcal{B}_t}\nabla_\theta \ell_i(\theta_t)\]

Mini-batch gradient input: the average of the per-example loss gradients over the B examples in the current batch. RMSProp starts with the same gradient as SGD.

\[\theta_{t+1}=\theta_t-\eta g_t\]

Plain SGD applies the same learning rate to every parameter.

\[s_t=\rho s_{t-1}+(1-\rho)(g_t\odot g_t)\]

Squared-gradient accumulator. It stores recent gradient magnitude, not direction.

\[\theta_{t+1}=\theta_t-\eta\frac{g_t}{\sqrt{s_t}+\epsilon}\]

RMSProp scales the gradient element-wise before updating the parameters.
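The accumulator and update formulas above translate almost line for line into code. This is a minimal sketch using plain Python lists. The function name `rmsprop_step`, the default hyperparameter values, and the sample gradients are assumptions for illustration, not a reference implementation.

```python
import math


def rmsprop_step(theta, grad, s, lr=0.01, rho=0.9, eps=1e-8):
    """Apply one RMSProp update and return (new_theta, new_s).

    All vectors are plain lists of floats; every operation is element-wise.
    """
    # s_t = rho * s_{t-1} + (1 - rho) * g_t^2
    new_s = [rho * s_j + (1.0 - rho) * g_j * g_j for s_j, g_j in zip(s, grad)]
    # theta_{t+1} = theta_t - lr * g_t / (sqrt(s_t) + eps)
    new_theta = [
        th_j - lr * g_j / (math.sqrt(s_j) + eps)
        for th_j, g_j, s_j in zip(theta, grad, new_s)
    ]
    return new_theta, new_s


# Usage: three steps with a made-up gradient sequence.
theta, s = [0.0, 0.0], [0.0, 0.0]
for grad in ([100.0, 1.0], [80.0, 1.2], [90.0, 0.9]):
    theta, s = rmsprop_step(theta, grad, s)
print(theta, s)
```

The accumulator `s` has to be carried from one call to the next. It is the only state RMSProp keeps besides the parameters themselves.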

\[\eta_{\text{eff},j}=\frac{\eta}{\sqrt{s_{t,j}}+\epsilon}\]

Effective learning rate for one parameter j.
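A quick way to build intuition for this formula is to evaluate it for a few accumulator values. The helper name `effective_learning_rates` and the sample values are hypothetical.

```python
import math


def effective_learning_rates(s, lr=0.01, eps=1e-8):
    """Return lr / (sqrt(s_j) + eps) for each accumulator entry s_j."""
    return [lr / (math.sqrt(s_j) + eps) for s_j in s]


# Hypothetical accumulator values spanning several orders of magnitude.
rates = effective_learning_rates([1000.0, 1.0, 0.001])
print(rates)
```

An accumulator above 1 shrinks the effective rate below eta, and an accumulator below 1 raises it above eta. As an accumulator entry approaches zero, the effective rate is capped at roughly eta divided by epsilon.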

\[s_t=(1-\rho)(g_t\odot g_t)+(1-\rho)\rho(g_{t-1}\odot g_{t-1})+(1-\rho)\rho^2(g_{t-2}\odot g_{t-2})+\cdots\]

Expanded memory view, assuming the accumulator starts at zero: recent squared gradients matter more, while older ones fade by a factor of rho per step.
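The expanded form above and the recursive accumulator should give the same number. Here is a small sketch that checks this on a made-up gradient sequence, assuming the accumulator starts at zero so the sum ends at the first gradient. Both function names are hypothetical.

```python
def accumulator_recursive(grads, rho=0.9):
    """Apply s = rho * s + (1 - rho) * g^2 once per gradient, starting from 0."""
    s = 0.0
    for g in grads:
        s = rho * s + (1.0 - rho) * g * g
    return s


def accumulator_expanded(grads, rho=0.9):
    """Weighted sum: the newest gradient has age 0 and weight (1 - rho)."""
    return sum(
        (1.0 - rho) * rho**age * g * g
        for age, g in enumerate(reversed(grads))
    )


grads = [3.0, -1.0, 0.5, 2.0]
print(accumulator_recursive(grads), accumulator_expanded(grads))

# Weights applied to the newest, second-newest, ... squared gradient.
weights = [(1.0 - 0.9) * 0.9**age for age in range(4)]
print(weights)
```

After t steps the weights sum to 1 - rho^t rather than 1, so early accumulator values are biased toward zero. That is the gap Adam's bias correction is designed to close.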

\[s_{dW,t}^{[l]}=\rho s_{dW,t-1}^{[l]}+(1-\rho)(dW_t^{[l]}\odot dW_t^{[l]})\]

Layer-wise RMSProp accumulator for weight gradients.

\[s_{db,t}^{[l]}=\rho s_{db,t-1}^{[l]}+(1-\rho)(db_t^{[l]}\odot db_t^{[l]})\]

Layer-wise RMSProp accumulator for bias gradients.

\[W_{t+1}^{[l]}=W_t^{[l]}-\eta\frac{dW_t^{[l]}}{\sqrt{s_{dW,t}^{[l]}}+\epsilon}\]

Layer-wise weight update.

\[b_{t+1}^{[l]}=b_t^{[l]}-\eta\frac{db_t^{[l]}}{\sqrt{s_{db,t}^{[l]}}+\epsilon}\]

Layer-wise bias update.
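In a network, the four layer-wise formulas amount to keeping one accumulator per parameter tensor. Here is a minimal sketch in which each tensor is stored as a flat list of floats. The dictionary layout, the key names `"W1"` and `"b1"`, and the function name are assumptions made for this example.

```python
import math


def rmsprop_update_layers(params, grads, state, lr=0.01, rho=0.9, eps=1e-8):
    """Update every tensor in `params` in place, one accumulator per tensor.

    `params`, `grads`, and `state` are dicts with matching keys such as "W1"
    or "b1"; each value is a flat list of floats (a flattened weight matrix
    or a bias vector).
    """
    for name, values in params.items():
        # Create a zero accumulator the first time a tensor is seen.
        acc = state.setdefault(name, [0.0] * len(values))
        for j, g in enumerate(grads[name]):
            acc[j] = rho * acc[j] + (1.0 - rho) * g * g
            values[j] -= lr * g / (math.sqrt(acc[j]) + eps)
    return params, state


# Hypothetical layer with a 2x2 weight matrix (flattened) and 2 biases.
params = {"W1": [0.5, -0.5, 0.1, 0.2], "b1": [0.0, 0.0]}
grads = {"W1": [10.0, -10.0, 0.1, 0.0], "b1": [1.0, -0.01]}
state = {}
rmsprop_update_layers(params, grads, state)
print(params)
```

The entry with a zero gradient stays put: its accumulator is also zero, and epsilon keeps the division well defined.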

Symbols

  1. g_t: mini-batch gradient at step t.
  2. eta: global learning rate.
  3. s_t: running average of recent squared gradients.
  4. rho: decay rate for the accumulator.
  5. epsilon: small value for numerical stability.
  6. odot: element-wise multiplication.
  7. eta_eff,j: effective learning rate for parameter j.

Pros

| Pros | Why it helps |
| --- | --- |
| Adaptive per-parameter scaling | RMSProp adjusts each update based on recent gradient magnitude. |
| Helpful on badly scaled surfaces | It reduces unstable movement in directions with consistently large gradients. |
| Less dependent on one perfect learning rate | The global learning rate still matters, but each parameter gets its own effective step size. |
| Important bridge to Adam | Adam uses a similar squared-gradient accumulator as part of its adaptive scaling. |
| Works with noisy mini-batches | Recent squared-gradient memory smooths sudden gradient-scale changes. |

Cons

| Cons | Why it matters |
| --- | --- |
| Adds hyperparameters | Rho, eta, and epsilon must be chosen sensibly. |
| Can shrink useful directions too much | Large accumulated squared gradients can make a parameter's effective step very small. |
| No direction memory | RMSProp rescales gradients, but Momentum is still the idea that smooths direction. |
| Less transparent than SGD | The effective learning rate changes per parameter over time. |
| Often replaced by Adam | Adam combines Momentum-like direction memory with RMSProp-like scaling. |

Quick Example

Suppose two parameters have the same global learning rate but very different gradient magnitudes. Let theta_t = (0, 0), g_t = (100, 1), and eta = 0.01.

Plain SGD would update by eta g_t = (1, 0.01), so the first parameter dominates the movement.

With RMSProp, use rho = 0.9, s_{t-1} = (0, 0), and ignore epsilon only for this simplified calculation. The squared-gradient accumulator becomes s_t = (1000, 0.1), so the root scale is approximately (31.62, 0.316).

Image 2: RMSProp turns uneven raw gradients into more balanced parameter updates.

Example Calculation

\[\eta g_t=0.01(100,1)=(1,0.01)\]

Plain SGD produces a highly imbalanced update.

\[s_t=0.9(0,0)+0.1(10000,1)=(1000,0.1)\]

RMSProp stores recent squared-gradient magnitude.

\[\sqrt{s_t}\approx(31.62,0.316)\]

The larger-gradient direction receives a larger denominator.

\[\frac{g_t}{\sqrt{s_t}+\epsilon}\approx\frac{(100,1)}{(31.62,0.316)}\approx(3.16,3.16)\]

The scaled gradient becomes balanced: both components are about 3.16, even though the raw gradients differ by a factor of 100.

\[\theta_{t+1}\approx(0,0)-0.01(3.16,3.16)=(-0.0316,-0.0316)\]

The large-gradient direction no longer dominates the step.
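The arithmetic above can be reproduced in a few lines as a sanity check. This sketch uses the same numbers as the example and, like the text, leaves epsilon out. The variable names are chosen only for this illustration.

```python
import math

eta, rho = 0.01, 0.9
theta = (0.0, 0.0)
g = (100.0, 1.0)
s_prev = (0.0, 0.0)

# Plain SGD step: eta * g
sgd_step = tuple(eta * gj for gj in g)

# RMSProp: accumulator, root scale, scaled gradient, parameter update.
s = tuple(rho * sp + (1.0 - rho) * gj * gj for sp, gj in zip(s_prev, g))
root = tuple(math.sqrt(sj) for sj in s)
scaled = tuple(gj / rj for gj, rj in zip(g, root))  # epsilon ignored
theta_next = tuple(tj - eta * cj for tj, cj in zip(theta, scaled))

print(sgd_step, s, root, scaled, theta_next)
```

Both scaled components come out equal here because the accumulator started at zero, so on this first step each one reduces to 1 / sqrt(1 - rho), about 3.16, times the sign of the gradient. On later steps the balance depends on each parameter's gradient history, so the components will generally not match exactly.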

Common Mistakes

  1. Thinking RMSProp changes how g_t is computed. It does not; it changes how the gradient is scaled before the update.
  2. Confusing RMSProp with Momentum. Momentum remembers past gradient directions; RMSProp remembers past squared-gradient magnitudes.
  3. Forgetting that the square, square root, and division are element-wise operations.
  4. Thinking RMSProp removes the need for learning-rate tuning. The global learning rate eta still matters.
  5. Confusing RMSProp with Adam. Adam tracks both a first moment and a second moment, then applies bias correction.
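The last point can be illustrated by comparing the very first update each method makes for a single parameter. This is a simplified sketch: the Adam decay rates `beta1=0.9` and `beta2=0.999` are commonly used defaults assumed here, and both function names are hypothetical.

```python
import math


def rmsprop_first_step(g, lr=0.01, rho=0.9, eps=1e-8):
    """Size of the first RMSProp update for one parameter with gradient g."""
    s = (1.0 - rho) * g * g        # accumulator starts at zero, no correction
    return lr * g / (math.sqrt(s) + eps)


def adam_first_step(g, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first Adam update for one parameter with gradient g."""
    m = (1.0 - beta1) * g          # first moment: average of gradients
    v = (1.0 - beta2) * g * g      # second moment: average of squared gradients
    m_hat = m / (1.0 - beta1)      # bias correction at t = 1
    v_hat = v / (1.0 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


print(rmsprop_first_step(100.0), adam_first_step(100.0))
```

With bias correction, Adam's first step has magnitude close to the learning rate itself. RMSProp's first step is larger by a factor of about 1 / sqrt(1 - rho) because its accumulator is still biased toward zero.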

Takeaway

RMSProp gives each parameter an adaptive step size by tracking recent squared gradients.

Directions with consistently large gradients receive smaller effective steps, while directions with consistently small gradients receive relatively larger ones.

Momentum remembers direction. RMSProp remembers scale. Adam later combines both ideas.