Deep Learning Visualized

Module 03 / Optimization

Adam vs SGD
Optimizer Comparison

Compare how SGD, Momentum, RMSProp, and Adam transform the same gradient into different parameter updates.

3D loss surface

Drag to rotate freely. Each optimizer traces its own path on the same surface. The floor projection shows the (x, y) trajectory from above.

Default settings: learning rate η = 0.010, momentum β1 = 0.90, 120 steps, Rosenbrock loss surface.
Update rules (all three optimizers)
Adam:  m ← β1·m + (1-β1)·g  |  v ← β2·v + (1-β2)·g²  |  θ ← θ - η·m̂/(√v̂+ε)
SGD:   θ ← θ - η·g
SGD+M: u ← β·u + g  |  θ ← θ - η·u
Step metrics: loss and distance to the minimum for Adam, SGD, and SGD+M at each step.
Takeaway - Rosenbrock surface
On the Rosenbrock surface, the narrow, curved "banana" valley is hard to optimize because gradients across the valley are much larger than gradients along it. SGD tends to oscillate across the valley, while Adam's per-parameter scaling lets it move more smoothly toward the minimum at (1, 1). SGD+Momentum accumulates velocity along the valley floor but may overshoot where the valley curves.
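
To make the "across versus along the valley" claim concrete, here is a minimal sketch of the Rosenbrock loss and its gradient. The coefficients a = 1 and b = 100 are an assumption (the usual textbook choice, consistent with a minimum at (1, 1)); the lesson does not state the exact values it uses. The function names are hypothetical.

```python
import numpy as np

def rosenbrock(p, a=1.0, b=100.0):
    """Rosenbrock loss at p = (x, y). a and b are assumed coefficients."""
    x, y = p
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

def rosenbrock_grad(p, a=1.0, b=100.0):
    """Analytic gradient of the Rosenbrock loss at p = (x, y)."""
    x, y = p
    dx = -2.0 * (a - x) - 4.0 * b * x * (y - x ** 2)
    dy = 2.0 * b * (y - x ** 2)
    return np.array([dx, dy])

# A point slightly above the valley floor y = x^2.
p = np.array([0.5, 0.35])
g = rosenbrock_grad(p)

# Unit vectors along the valley floor (tangent to y = x^2) and across it.
along = np.array([1.0, 2.0 * p[0]])
along /= np.linalg.norm(along)
across = np.array([-2.0 * p[0], 1.0])
across /= np.linalg.norm(across)

g_along = g @ along    # gradient component along the valley
g_across = g @ across  # gradient component across the valley
print(abs(g_across) / abs(g_along))
```

At this point the across-valley component is far larger than the along-valley component, which is the situation that makes a single global learning rate awkward.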
Choose a surface (Rosenbrock / Ravine / Saddle), select which optimizers to show, then press Play to watch the trajectories unfold. Drag the 3D plot to rotate 360°.

Background

An optimizer decides how a neural network’s parameters move after each gradient calculation.

The simplest optimizer is Stochastic Gradient Descent, or SGD. It takes the current gradient and moves the parameters in the opposite direction. This is simple and memory-efficient, but the path can zig-zag badly in narrow valleys, where the loss is much steeper in one direction than in another.

Momentum improves SGD by remembering past gradient directions. Instead of reacting only to the current gradient, it builds a smoother direction over time.

RMSProp uses a different idea. It tracks recent squared gradients and rescales the update for each parameter. Directions with consistently large gradients receive smaller effective steps.

Adam combines both ideas. It uses a Momentum-like term to smooth the direction and an RMSProp-like term to adapt the step size. This often makes Adam faster and more stable early in training, but it does not mean Adam is always better than SGD.

The core idea is:

same gradient → different optimizer rule → different update path

Important formulas

\[ g_t = \frac{1}{B} \sum_{i\in \mathcal{B}_t} \nabla_{\theta} \ell_i(\theta_t) \]

Mini-Batch Gradient Input.

  • \(g_t\): The gradient used at training step \(t\).
  • \(B\): The mini-batch size, that is, the number of examples in \(\mathcal{B}_t\).
  • \(\mathcal{B}_t\): The mini-batch used at step \(t\).
  • \(\ell_i(\theta_t)\): The loss from training example \(i\).
  • \(\theta_t\): The model parameters before the update.

All optimizers start from a gradient. The difference is how they transform this gradient into an update.
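
As a small illustration of the mini-batch gradient formula, the sketch below averages per-example gradients for a least-squares loss. The loss \(\ell_i(\theta) = \tfrac{1}{2}(x_i^\top \theta - y_i)^2\), the toy data, and the function name are assumptions chosen for illustration, not part of the lesson.

```python
import numpy as np

def minibatch_gradient(theta, X, y, batch_idx):
    """Average gradient of 0.5 * (x_i @ theta - y_i)**2 over one mini-batch."""
    Xb, yb = X[batch_idx], y[batch_idx]
    residual = Xb @ theta - yb               # one residual per example, shape (B,)
    return Xb.T @ residual / len(batch_idx)  # (1/B) * sum of per-example gradients

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                  # 8 toy examples, 3 parameters
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta

theta = np.zeros(3)                          # parameters before the update
batch_idx = np.array([0, 3, 5, 6])           # the mini-batch B_t, so B = 4
g = minibatch_gradient(theta, X, y, batch_idx)
print(g)
```

Every optimizer in this lesson would receive this same vector `g`; only what happens to it next differs.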

\[ \theta_{t+1} = \theta_t - \eta g_t \]

Plain SGD.

  • \(\theta_t\): Parameters before the update.
  • \(\theta_{t+1}\): Parameters after the update.
  • \(\eta\): The learning rate.
  • \(g_t\): The current mini-batch gradient.

SGD uses the current gradient directly. Every parameter shares the same global learning rate \(\eta\).
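
A minimal sketch of the plain SGD rule, assuming parameters and gradients are NumPy arrays. The function name and the example numbers are hypothetical.

```python
import numpy as np

def sgd_step(theta, grad, lr=0.01):
    """One plain SGD update: theta_next = theta - lr * grad."""
    return theta - lr * grad

theta = np.array([1.0, -2.0])
grad = np.array([0.5, -4.0])
theta_next = sgd_step(theta, grad, lr=0.1)
print(theta_next)  # both coordinates are scaled by the same lr
```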

\[ u_t = \beta u_{t-1} + g_t \]
\[ \theta_{t+1} = \theta_t - \eta u_t \]

Momentum Idea.

  • \(u_t\): The accumulated update direction.
  • \(u_{t-1}\): The previous accumulated direction.
  • \(\beta\): The momentum coefficient.
  • \(g_t\): The current gradient.
  • \(\eta\): The learning rate.

Momentum smooths the update direction. It helps reduce zig-zag movement when gradients change direction quickly.
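
The two Momentum equations can be sketched as follows. The function name is hypothetical, and the constant gradient is an artificial input chosen to show how \(u_t\) builds up when the direction stays the same.

```python
import numpy as np

def momentum_step(theta, u, grad, lr=0.01, beta=0.9):
    """One Momentum update: u = beta * u + grad, then theta = theta - lr * u."""
    u = beta * u + grad
    theta = theta - lr * u
    return theta, u

theta = np.array([0.0])
u = np.zeros(1)                 # accumulated direction starts at zero
grad = np.array([1.0])          # same gradient at every step
history = []
for _ in range(3):
    theta, u = momentum_step(theta, u, grad, lr=0.1, beta=0.9)
    history.append(u[0])
print(history)
```

Under a constant gradient, \(u_t\) keeps growing toward \(g / (1 - \beta)\), which is why Momentum accelerates in consistent directions and can overshoot.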

\[ s_t = \rho s_{t-1} + (1-\rho)(g_t \odot g_t) \]
\[ \theta_{t+1} = \theta_t - \eta \frac{g_t}{\sqrt{s_t}+\epsilon} \]

RMSProp Idea.

  • \(s_t\): The running average of squared gradients.
  • \(\rho\): The decay rate for the squared-gradient average.
  • \(g_t \odot g_t\): Element-wise square of the gradient.
  • \(\epsilon\): A small number for numerical stability.
  • \(\eta\): The global learning rate.

RMSProp rescales each parameter’s update. Parameters with consistently large gradients receive smaller effective steps.
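
A minimal sketch of the RMSProp rule. The defaults \(\rho = 0.9\) and \(\epsilon = 10^{-8}\) are assumptions (the lesson does not give values), and the function name is hypothetical. The example uses two gradient components that differ by four orders of magnitude.

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSProp update with an element-wise running average of grad**2."""
    s = rho * s + (1 - rho) * grad * grad
    theta = theta - lr * grad / (np.sqrt(s) + eps)
    return theta, s

theta = np.zeros(2)
s = np.zeros(2)                      # squared-gradient average starts at zero
grad = np.array([100.0, 0.01])       # very different gradient scales
theta, s = rmsprop_step(theta, s, grad)
print(np.abs(theta))                 # size of the step taken in each coordinate
```

On this first step both coordinates move almost the same distance, because each gradient component is divided by its own running scale.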

\[ m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t \]

Adam First Moment.

  • \(m_t\): The first moment estimate.
  • \(m_{t-1}\): The previous first moment estimate.
  • \(\beta_1\): The decay rate for the first moment.
  • \(g_t\): The current gradient.

This is Adam’s Momentum-like part. It tracks a smoothed gradient direction.

\[ v_t = \beta_2 v_{t-1} + (1-\beta_2)(g_t \odot g_t) \]

Adam Second Moment.

  • \(v_t\): The second moment estimate.
  • \(v_{t-1}\): The previous second moment estimate.
  • \(\beta_2\): The decay rate for the second moment.
  • \(g_t \odot g_t\): Element-wise squared gradient.

This is Adam’s RMSProp-like part. It tracks recent squared gradient magnitudes.

\[ \hat{m}_t = \frac{m_t}{1-\beta_1^t} \]
\[ \hat{v}_t = \frac{v_t}{1-\beta_2^t} \]

Adam Bias Correction.

  • \(\hat{m}_t\): The bias-corrected first moment.
  • \(\hat{v}_t\): The bias-corrected second moment.
  • \(t\): The current training step, counted from 1.
  • \(\beta_1^t\): \(\beta_1\) raised to the power \(t\). It shrinks toward zero as training proceeds.
  • \(\beta_2^t\): \(\beta_2\) raised to the power \(t\). It shrinks toward zero as training proceeds.

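Adam starts \(m_0\) and \(v_0\) at zero, so the raw averages are biased toward zero during the first steps. Dividing by \(1-\beta_1^t\) and \(1-\beta_2^t\) removes that bias, and the correction fades as \(t\) grows.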
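
As a worked example, take the first step \(t = 1\) with \(m_0 = v_0 = 0\), using \(\beta_1 = 0.9\) and \(\beta_2 = 0.999\) as illustrative values (\(\beta_2 = 0.999\) is an assumption; the lesson does not state it):

\[ m_1 = (1-\beta_1) g_1 = 0.1\, g_1, \qquad \hat{m}_1 = \frac{m_1}{1-\beta_1^1} = g_1 \]
\[ v_1 = (1-\beta_2)(g_1 \odot g_1) = 0.001\,(g_1 \odot g_1), \qquad \hat{v}_1 = \frac{v_1}{1-\beta_2^1} = g_1 \odot g_1 \]

Without the correction, \(m_1\) would be ten times smaller than the gradient it is meant to estimate. With it, the first estimates match the first gradient exactly.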

\[ \theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} \]

Adam Update.

  • \(\theta_t\): Parameters before the update.
  • \(\theta_{t+1}\): Parameters after the update.
  • \(\hat{m}_t\): The corrected smoothed direction.
  • \(\hat{v}_t\): The corrected squared-gradient scale.
  • \(\eta\): The global learning rate.
  • \(\epsilon\): A small value that prevents division by zero.

Adam uses \(\hat{m}_t\) for direction and \(\hat{v}_t\) for adaptive scaling.
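
Putting the first moment, second moment, bias correction, and update together gives the sketch below. The defaults \(\beta_2 = 0.999\) and \(\epsilon = 10^{-8}\) are assumptions (commonly used values that the lesson does not state), and the function name is hypothetical.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the step number and must start at 1."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (direction)
    v = beta2 * v + (1 - beta2) * grad * grad    # second moment (scale)
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                 # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
grad = np.array([100.0, -0.01])                  # very different gradient scales
theta, m, v = adam_step(theta, m, v, grad, t=1)
print(theta)
```

On the first step the update is close to \(-\eta \cdot \text{sign}(g)\) in every coordinate, so both parameters move by about the learning rate even though their gradients differ greatly in size.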

\[ \eta_{\text{eff},j} = \frac{\eta}{\sqrt{\hat{v}_{t,j}}+\epsilon} \]

Adam Effective Step Size.

  • \(\eta_{\text{eff},j}\): The effective learning rate for parameter \(j\).
  • \(j\): The index of one parameter.
  • \(\hat{v}_{t,j}\): The corrected second moment for parameter \(j\).
  • \(\eta\): The global learning rate.
  • \(\epsilon\): A small value for numerical stability.

SGD uses one learning rate for all parameters. Adam gives each parameter its own effective step size.
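
The effective step size formula can be evaluated directly. In this sketch the values of \(\hat{v}_t\) are made-up numbers chosen for illustration, and the function name is hypothetical.

```python
import numpy as np

def effective_step_size(v_hat, lr=0.01, eps=1e-8):
    """Per-parameter effective learning rate: lr / (sqrt(v_hat) + eps)."""
    return lr / (np.sqrt(v_hat) + eps)

# Hypothetical corrected second moments for three parameters.
v_hat = np.array([1e4, 1.0, 1e-4])
eta_eff = effective_step_size(v_hat)
print(eta_eff)
```

The parameter with the largest squared-gradient average gets the smallest effective step, and the one with the smallest average gets the largest.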

Pros and cons

SGD

Pros

  • Simple and lightweight: SGD is easy to understand and uses very little extra memory.
  • Strong baseline: With good tuning, SGD can still perform very well in many deep learning tasks.
  • Good final generalization: In some settings, such as image classification, well-tuned SGD or SGD with Momentum has been reported to generalize better than adaptive optimizers.

Cons

  • Sensitive to learning rate: If the learning rate is too large, training can diverge. If it is too small, training becomes slow.
  • Can zig-zag: On narrow valleys, SGD may bounce from side to side instead of moving smoothly forward.
  • Same step scale for every parameter: SGD does not automatically adjust different parameters based on gradient scale.

Momentum

Pros

  • Smoother path: Momentum reduces noisy direction changes by carrying past movement forward.
  • Faster in consistent directions: If gradients point in a similar direction for several steps, Momentum can accelerate progress.
  • Useful bridge between SGD and Adam: It introduces the idea of remembering previous gradients.

Cons

  • Can overshoot: Too much momentum may push the parameters past a good region.
  • Adds another hyperparameter: The momentum coefficient \(\beta\) needs to be chosen carefully.
  • Still uses one global learning rate: Momentum smooths direction, but it does not adapt step size per parameter.

Adam

Pros

  • Fast early training: Adam often makes quick progress at the beginning of training.
  • Adaptive per-parameter scaling: Each parameter receives its own effective step size based on recent squared gradients.
  • More stable on difficult surfaces: Adam often handles ravines, curved valleys, and uneven gradient scales better than plain SGD.
  • Combines two useful ideas: It uses Momentum-like direction smoothing and RMSProp-like adaptive scaling.

Cons

  • Uses more memory: Adam stores both \(m_t\) and \(v_t\) for every parameter, which is two extra values per parameter compared with one for Momentum and none for plain SGD.
  • More moving parts: Adam has \(\eta\), \(\beta_1\), \(\beta_2\), and \(\epsilon\), so it is less simple than SGD.
  • Not always the best final choice: Adam is often strong early, but SGD or Momentum can still match or outperform it when tuned well.
  • Can be overused as a default: Adam is convenient, but it should still be understood and tuned.

Example and mistake

Quick Example

Imagine a narrow curved valley on a loss surface.

SGD follows the current gradient directly. If the gradient is steep across the valley but shallow along the valley, SGD may bounce from one side to the other.

Momentum remembers previous directions. It can reduce the side-to-side bouncing and build speed along the valley, but it may overshoot when the valley curves.

Adam smooths the direction like Momentum and rescales the step size like RMSProp. If one direction has consistently large gradients, Adam shrinks the effective step in that direction. This often makes the trajectory more controlled.

The important difference is not that one optimizer receives a different gradient. They receive the same kind of gradient input. The difference is how each optimizer transforms that gradient into the next parameter update.
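
The sketch below runs the three update rules on the same gradient function to show this. The ravine loss \(0.5\,(x^2 + 100\,y^2)\), the start point, the learning rate, and \(\beta_2 = 0.999\) are all assumptions chosen for illustration; they are not the lesson's own surfaces or settings.

```python
import numpy as np

def ravine_loss(p):
    """Hypothetical ravine: steep in y, shallow in x."""
    return 0.5 * (p[0] ** 2 + 100.0 * p[1] ** 2)

def ravine_grad(p):
    """Gradient of the hypothetical ravine loss."""
    return np.array([p[0], 100.0 * p[1]])

def run(optimizer, steps=120, lr=0.015, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the trajectory of one optimizer as an array of shape (steps + 1, 2)."""
    theta = np.array([-4.0, 1.0])
    u = np.zeros(2)   # Momentum state
    m = np.zeros(2)   # Adam first moment
    v = np.zeros(2)   # Adam second moment
    path = [theta.copy()]
    for t in range(1, steps + 1):
        g = ravine_grad(theta)            # same gradient rule for every optimizer
        if optimizer == "sgd":
            theta = theta - lr * g
        elif optimizer == "momentum":
            u = beta1 * u + g
            theta = theta - lr * u
        elif optimizer == "adam":
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g * g
            m_hat = m / (1 - beta1 ** t)
            v_hat = v / (1 - beta2 ** t)
            theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        else:
            raise ValueError(f"unknown optimizer: {optimizer}")
        path.append(theta.copy())
    return np.array(path)

paths = {name: run(name) for name in ("sgd", "momentum", "adam")}
for name, path in paths.items():
    print(name, "first steps in y:", path[:4, 1], "final loss:", ravine_loss(path[-1]))
```

With these settings, SGD's y coordinate flips sign on every step (the side-to-side bounce), while Adam's first steps have nearly the same size in x and y. Changing the learning rate or the surface changes the picture, so treat the output as one illustration rather than a ranking.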

Common mistakes

Mistake 1: Thinking Adam is always better than SGD

Adam often trains faster early, but SGD or Momentum can still be very competitive when the learning rate is tuned well.

Mistake 2: Forgetting Adam has two memories

Adam tracks both \(m_t\), the smoothed gradient direction, and \(v_t\), the squared-gradient magnitude. These two terms do different jobs.

Mistake 3: Confusing Momentum’s \(u_t\) with Adam’s \(v_t\)

Momentum’s \(u_t\) stores an accumulated direction. Adam’s \(v_t\) stores squared-gradient magnitude. They are not the same concept.

Mistake 4: Ignoring bias correction

Adam starts \(m_0\) and \(v_0\) at zero, so early estimates are biased toward zero. Bias correction helps fix this in the first training steps.

Mistake 5: Thinking adaptive scaling means no tuning

Adam adjusts step sizes automatically, but the learning rate still matters. A bad learning rate can still make Adam unstable or slow.

Takeaway

SGD follows the current gradient directly using one global learning rate.

Momentum smooths the direction by remembering past gradients. RMSProp rescales steps using recent squared gradients.

Adam combines both ideas: it smooths the update direction with \(m_t\), rescales each parameter with \(v_t\), and uses bias correction to make early updates more reliable. This often makes Adam fast and stable early in training, but it is not automatically better than SGD in every situation.