Module 03 / Optimization

Adam vs SGD
Optimizer Comparison

Compare how SGD, Momentum, RMSProp, and Adam transform the same gradient into different parameter updates.

3D loss surface · watch Adam, SGD, and SGD+Momentum converge from the same start · drag to rotate 360°

Key concepts

What to watch for

Adam

Adaptive step scaling helps it stay stable when curvature changes across directions.

SGD

Simple and direct, but it tends to oscillate on steep walls before settling into the valley.

SGD+M

Smoother than plain SGD because momentum carries useful direction across multiple updates.

3D loss surface

Drag to rotate freely. Each optimizer traces its own path on the same surface. The floor projection shows the (x, y) trajectory from above.

Controls: Show (optimizer selection) · Step 0 / 120 · Learning rate η = 0.010 · Momentum β₁ = 0.90 · Steps = 120 · Loss surface = Rosenbrock · Show paths · Floor projection · Highlight step · Speed = 300 ms

Update rules

all three optimizers

Adam: m ← β₁m + (1-β₁)g | v ← β₂v + (1-β₂)g² | m̂ = m/(1-β₁ᵗ), v̂ = v/(1-β₂ᵗ) | θ ← θ - η·m̂/(√v̂+ε)
SGD: θ ← θ - η·g
SGD+M: v ← β·v + η·g | θ ← θ - v
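
The SGD+M rule in this panel folds the learning rate into the velocity (v ← β·v + η·g), while the formula section writes momentum as \(u_t = \beta u_{t-1} + g_t\) followed by \(\theta_{t+1} = \theta_t - \eta u_t\). With a constant learning rate the two should trace the same path, because v = η·u at every step. Note that this v is a velocity, not Adam's second moment. A minimal sketch to check the equivalence, assuming a hypothetical one-dimensional loss 0.5·θ² (so g = θ) and illustrative function names:

```python
def momentum_velocity_form(theta, lr, beta, steps):
    """Panel form: v <- beta*v + lr*g, then theta <- theta - v."""
    v, path = 0.0, [theta]
    for _ in range(steps):
        g = theta              # gradient of the assumed loss 0.5 * theta**2
        v = beta * v + lr * g
        theta = theta - v
        path.append(theta)
    return path


def momentum_accumulator_form(theta, lr, beta, steps):
    """Formula-section form: u <- beta*u + g, then theta <- theta - lr*u."""
    u, path = 0.0, [theta]
    for _ in range(steps):
        g = theta              # same assumed gradient as above
        u = beta * u + g
        theta = theta - lr * u
        path.append(theta)
    return path


a = momentum_velocity_form(1.0, lr=0.01, beta=0.9, steps=120)
b = momentum_accumulator_form(1.0, lr=0.01, beta=0.9, steps=120)
# Largest gap between the two paths; any difference is floating-point rounding.
print(max(abs(x - y) for x, y in zip(a, b)))
```

The equivalence relies on the learning rate staying fixed; if η changes during training, the two forms weight past gradients differently.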

Step metrics

At each step (initially step 0): loss for Adam, SGD, and SGD+M, and each optimizer's distance to the minimum.

Takeaway: Rosenbrock surface

On the Rosenbrock surface, the narrow, curved "banana" valley is notoriously hard to follow. Gradients across the valley are much larger than gradients along it, so SGD oscillates between the walls while Adam's per-parameter scaling lets it move more smoothly toward the minimum at (1, 1). SGD+Momentum accumulates velocity along the valley floor but may overshoot where the valley curves.
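
To make the "across versus along" gap concrete, here is a minimal sketch of the Rosenbrock function and its analytic gradient. It assumes the standard constants a = 1 and b = 100, which put the minimum at (1, 1); the constants used by the lab itself are not stated, so treat these as an assumption.

```python
def rosenbrock(x, y, a=1.0, b=100.0):
    """Rosenbrock loss; with a = 1 the minimum is at (1, 1) with value 0."""
    return (a - x) ** 2 + b * (y - x * x) ** 2


def rosenbrock_grad(x, y, a=1.0, b=100.0):
    """Analytic gradient (dL/dx, dL/dy) of the Rosenbrock loss."""
    dx = -2.0 * (a - x) - 4.0 * b * x * (y - x * x)
    dy = 2.0 * b * (y - x * x)
    return dx, dy


# On the valley floor y = x**2 the gradient is small and points along the valley.
print("on the floor:  ", rosenbrock_grad(0.5, 0.25))
# A step of 0.1 up the wall makes both components roughly 20 times larger.
print("0.1 off floor: ", rosenbrock_grad(0.5, 0.35))
```

A single global learning rate has to be small enough for the steep wall, which leaves it slow along the shallow floor.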

Choose a surface (Rosenbrock / Ravine / Saddle), select which optimizers to show, then press Play to watch the trajectories unfold. Drag the 3D plot to rotate 360°.

Context

Background

An optimizer decides how a neural network’s parameters move after each gradient calculation.

The simplest optimizer is Stochastic Gradient Descent, or SGD. It takes the current gradient and moves the parameters in the opposite direction. This is simple and memory-efficient, but the path can zig-zag badly in narrow valleys or on steep loss surfaces.

Momentum improves SGD by remembering past gradient directions. Instead of reacting only to the current gradient, it builds a smoother direction over time.

RMSProp uses a different idea. It tracks recent squared gradients and rescales the update for each parameter. Directions with consistently large gradients receive smaller effective steps.

Adam combines both ideas. It uses a Momentum-like term to smooth the direction and an RMSProp-like term to adapt the step size. This often makes Adam faster and more stable early in training, but it does not mean Adam is always better than SGD.

The core idea is:

same gradient → different optimizer rule → different update path

Notation

Important formulas

\[ g_t = \frac{1}{B} \sum_{i\in \mathcal{B}_t} \nabla_{\theta} \ell_i(\theta_t) \]

Mini-Batch Gradient Input.

\(g_t\): The gradient used at training step \(t\).
\(B\): The mini-batch size.
\(\mathcal{B}_t\): The mini-batch used at step \(t\).
\(\ell_i(\theta_t)\): The loss from training example \(i\).
\(\theta_t\): The model parameters before the update.

All optimizers start from a gradient. The difference is how they transform this gradient into an update.
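
As an illustration of the averaging in \(g_t\), the sketch below computes a mini-batch gradient for a hypothetical one-parameter model with per-example loss \(\ell_i(\theta) = \tfrac{1}{2}(\theta x_i - y_i)^2\). The model, the data, and the function name are assumptions made for the example.

```python
def minibatch_gradient(theta, batch):
    """Average the per-example gradients over the mini-batch (g_t in the formula)."""
    # d/dtheta of 0.5 * (theta * x - y)**2 is (theta * x - y) * x
    per_example = [(theta * x - y) * x for x, y in batch]
    return sum(per_example) / len(batch)


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # assumed data generated by y = 2x
print(minibatch_gradient(0.0, batch))  # negative: theta should increase
print(minibatch_gradient(2.0, batch))  # zero: theta = 2 fits this batch exactly
```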

\[ \theta_{t+1} = \theta_t - \eta g_t \]

Plain SGD.

\(\theta_t\): Parameters before the update.
\(\theta_{t+1}\): Parameters after the update.
\(\eta\): The learning rate.
\(g_t\): The current mini-batch gradient.

SGD uses the current gradient directly. Every parameter shares the same global learning rate \(\eta\).
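
A minimal sketch of this update, with parameters and gradients held in plain Python lists (the function name and the numbers are illustrative):

```python
def sgd_step(theta, grad, lr):
    """One plain SGD update: every parameter uses the same learning rate."""
    return [p - lr * g for p, g in zip(theta, grad)]


theta = sgd_step([1.0, -2.0], grad=[0.5, -1.0], lr=0.1)
print(theta)  # each parameter moved against its gradient, scaled by lr
```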

\[ u_t = \beta u_{t-1} + g_t \]

\[ \theta_{t+1} = \theta_t - \eta u_t \]

Momentum Idea.

\(u_t\): The accumulated update direction.
\(u_{t-1}\): The previous accumulated direction.
\(\beta\): The momentum coefficient.
\(g_t\): The current gradient.
\(\eta\): The learning rate.

Momentum smooths the update direction. It helps reduce zig-zag movement when gradients change direction quickly.
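
The smoothing effect can be seen by feeding two artificial gradient sequences through the accumulator. In this sketch (illustrative names, with β = 0.9 as an assumed value), a consistent gradient builds \(u_t\) up toward \(1/(1-\beta)\) times the gradient, while a gradient that flips sign every step settles near \(1/(1+\beta)\) in magnitude, well below the size of a single gradient.

```python
def momentum_step(theta, u, grad, lr, beta=0.9):
    """One momentum update: u <- beta*u + g, then theta <- theta - lr*u."""
    u = [beta * ui + g for ui, g in zip(u, grad)]
    theta = [p - lr * ui for p, ui in zip(theta, u)]
    return theta, u


def accumulate(grads, beta=0.9):
    """Feed scalar gradients through the accumulator and return the final u_t."""
    u = 0.0
    for g in grads:
        u = beta * u + g
    return u


consistent = accumulate([1.0] * 200)         # gradient always points the same way
alternating = accumulate([1.0, -1.0] * 100)  # gradient flips sign every step
print(consistent, alternating)
```

This is why momentum speeds up travel along a valley floor and damps side-to-side bouncing across it.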

\[ s_t = \rho s_{t-1} + (1-\rho)(g_t \odot g_t) \]

\[ \theta_{t+1} = \theta_t - \eta \frac{g_t}{\sqrt{s_t}+\epsilon} \]

RMSProp Idea.

\(s_t\): The running average of squared gradients.
\(\rho\): The decay rate for the squared-gradient average.
\(g_t \odot g_t\): Element-wise square of the gradient.
\(\epsilon\): A small number for numerical stability.
\(\eta\): The global learning rate.

RMSProp rescales each parameter’s update. Parameters with consistently large gradients receive smaller effective steps.
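
A minimal sketch of one RMSProp step follows. The defaults ρ = 0.9 and ε = 1e-8 are assumed values for illustration, not taken from the lesson. The example gradient has components that differ by a factor of 10,000, yet after rescaling both parameters move by nearly the same amount.

```python
import math


def rmsprop_step(theta, s, grad, lr, rho=0.9, eps=1e-8):
    """One RMSProp update with a running average s of squared gradients."""
    s = [rho * si + (1.0 - rho) * g * g for si, g in zip(s, grad)]
    theta = [p - lr * g / (math.sqrt(si) + eps)
             for p, g, si in zip(theta, grad, s)]
    return theta, s


theta, s = rmsprop_step([0.0, 0.0], [0.0, 0.0], grad=[100.0, 0.01], lr=0.01)
print(theta)  # the two step sizes are almost equal despite very different gradients
```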

\[ m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t \]

Adam First Moment.

\(m_t\): The first moment estimate.
\(m_{t-1}\): The previous first moment estimate.
\(\beta_1\): The decay rate for the first moment.
\(g_t\): The current gradient.

This is Adam’s Momentum-like part. It tracks a smoothed gradient direction.

\[ v_t = \beta_2 v_{t-1} + (1-\beta_2)(g_t \odot g_t) \]

Adam Second Moment.

\(v_t\): The second moment estimate.
\(v_{t-1}\): The previous second moment estimate.
\(\beta_2\): The decay rate for the second moment.
\(g_t \odot g_t\): Element-wise squared gradient.

This is Adam’s RMSProp-like part. It tracks recent squared gradient magnitudes.

\[ \hat{m}_t = \frac{m_t}{1-\beta_1^t} \]

\[ \hat{v}_t = \frac{v_t}{1-\beta_2^t} \]

Adam Bias Correction.

\(\hat{m}_t\): The bias-corrected first moment.
\(\hat{v}_t\): The bias-corrected second moment.
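\(t\): The current training step, counted from 1 so that \(1-\beta^t\) is never zero.
\(\beta_1^t\): \(\beta_1\) raised to the power \(t\); it shrinks toward zero as training proceeds.
\(\beta_2^t\): \(\beta_2\) raised to the power \(t\); it shrinks toward zero as training proceeds.

Adam starts \(m_0\) and \(v_0\) at zero, so the first few estimates are biased toward zero. Dividing by \(1-\beta_1^t\) and \(1-\beta_2^t\) undoes that bias; as \(t\) grows, both divisors approach 1 and the correction fades.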
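
A small numeric trace makes the effect visible. The sketch below assumes a constant gradient of 2.0 and \(\beta_1 = 0.9\) (both illustrative): the raw \(m_t\) starts at only 10% of the gradient and climbs slowly, while the corrected \(\hat{m}_t\) recovers the gradient from the first step.

```python
def bias_correction_trace(g, beta, steps):
    """Return (t, m_t, m_hat_t) for a constant gradient g, starting from m_0 = 0."""
    m, rows = 0.0, []
    for t in range(1, steps + 1):
        m = beta * m + (1 - beta) * g
        m_hat = m / (1 - beta ** t)
        rows.append((t, m, m_hat))
    return rows


for t, m, m_hat in bias_correction_trace(g=2.0, beta=0.9, steps=5):
    print(t, round(m, 4), round(m_hat, 4))
```

The same reasoning applies to \(v_t\), where the bias lasts longer because \(\beta_2\) is typically closer to 1.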

\[ \theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} \]

Adam Update.

\(\theta_t\): Parameters before the update.
\(\theta_{t+1}\): Parameters after the update.
\(\hat{m}_t\): The corrected smoothed direction.
\(\hat{v}_t\): The corrected squared-gradient scale.
\(\eta\): The global learning rate.
\(\epsilon\): A small value that prevents division by zero.

Adam uses \(\hat{m}_t\) for direction and \(\hat{v}_t\) for adaptive scaling.
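
Putting the pieces together, one Adam step might be sketched as below. The default values (η = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 1e-8) are commonly used settings and are an assumption here; the lesson itself does not fix them. On the first step from zero state, \(\hat{m}_1 = g_1\) and \(\hat{v}_1 = g_1 \odot g_1\), so each parameter moves by roughly η regardless of how large its gradient is.

```python
import math


def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [p - lr * mh / (math.sqrt(vh) + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v


theta, m, v = adam_step([0.0, 0.0], [0.0, 0.0], [0.0, 0.0],
                        grad=[100.0, -0.01], t=1, lr=0.01)
print(theta)  # both parameters move by about lr, in opposite directions
```

The caller is responsible for carrying `m`, `v`, and the step count `t` from one call to the next.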

\[ \eta_{\text{eff},j} = \frac{\eta}{\sqrt{\hat{v}_{t,j}}+\epsilon} \]

Adam Effective Step Size.

\(\eta_{\text{eff},j}\): The effective learning rate for parameter \(j\).
\(j\): The index of one parameter.
\(\hat{v}_{t,j}\): The corrected second moment for parameter \(j\).
\(\eta\): The global learning rate.
\(\epsilon\): A small value for numerical stability.

SGD uses one learning rate for all parameters. Adam gives each parameter its own effective step size.
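
The per-parameter effect of this formula can be computed directly. In the sketch below the \(\hat{v}\) values are made up for illustration: one parameter with consistently large gradients and one with consistently small gradients.

```python
import math


def effective_step_sizes(v_hat, lr, eps=1e-8):
    """Per-parameter effective learning rate: lr / (sqrt(v_hat_j) + eps)."""
    return [lr / (math.sqrt(vj) + eps) for vj in v_hat]


v_hat = [1e4, 1e-4]  # assumed values: gradients of typical size 100 and 0.01
eff = effective_step_sizes(v_hat, lr=0.01)
print(eff)  # small effective rate for the first parameter, large for the second
```

The effective rate multiplies \(\hat{m}_{t,j}\), not the raw gradient, so the actual displacement per step stays on the order of η for both parameters.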

Tradeoffs

Pros and cons

SGD

Pros

Simple and lightweight: SGD is easy to understand and uses very little extra memory.
Strong baseline: With good tuning, SGD can still perform very well in many deep learning tasks.
Good final generalization: In some settings, well-tuned SGD or SGD with Momentum generalizes as well as or better than adaptive optimizers such as Adam.

Cons

Sensitive to learning rate: If the learning rate is too large, training can diverge. If it is too small, training becomes slow.
Can zig-zag: On narrow valleys, SGD may bounce from side to side instead of moving smoothly forward.
Same step scale for every parameter: SGD does not automatically adjust different parameters based on gradient scale.

Momentum

Pros

Smoother path: Momentum reduces noisy direction changes by carrying past movement forward.
Faster in consistent directions: If gradients point in a similar direction for several steps, Momentum can accelerate progress.
Useful bridge between SGD and Adam: It introduces the idea of remembering previous gradients.

Cons

Can overshoot: Too much momentum may push the parameters past a good region.
Adds another hyperparameter: The momentum coefficient \(\beta\) needs to be chosen carefully.
Still uses one global learning rate: Momentum smooths direction, but it does not adapt step size per parameter.

Adam

Pros

Fast early training: Adam often makes quick progress at the beginning of training.
Adaptive per-parameter scaling: Each parameter receives its own effective step size based on recent squared gradients.
More stable on difficult surfaces: Adam often handles ravines, curved valleys, and uneven gradient scales better than plain SGD.
Combines two useful ideas: It uses Momentum-like direction smoothing and RMSProp-like adaptive scaling.

Cons

Uses more memory: Adam stores both \(m_t\) and \(v_t\) for every parameter, which means two extra values per parameter on top of the parameter itself.
More moving parts: Adam has \(\eta\), \(\beta_1\), \(\beta_2\), and \(\epsilon\), so it is less simple than SGD.
Not always the best final choice: Adam is often strong early, but SGD or Momentum can still match or outperform it when tuned well.
Can be overused as a default: Adam is convenient, but it should still be understood and tuned.

Practice

Example and mistake

Quick Example

Imagine a narrow curved valley on a loss surface.

SGD follows the current gradient directly. If the gradient is steep across the valley but shallow along the valley, SGD may bounce from one side to the other.

Momentum remembers previous directions. It can reduce the side-to-side bouncing and build speed along the valley, but it may overshoot when the valley curves.

Adam smooths the direction like Momentum and rescales the step size like RMSProp. If one direction has consistently large gradients, Adam shrinks the effective step in that direction. This often makes the trajectory more controlled.

The optimizers do not differ in the gradient they receive: each one is given the same kind of gradient input. They differ in how they transform that gradient into the next parameter update.
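
The comparison above can be tried out in a few lines. This is a rough sketch, not the lab's implementation: it assumes a hypothetical straight ravine, loss 0.5·(x² + 50y²), rather than a curved valley, and the start point, learning rate, and hyperparameters are illustrative. All three optimizers call the same gradient function; only the update rule differs. The script counts how often y changes sign as a crude measure of side-to-side bouncing.

```python
import math


def grad(x, y):
    """Gradient of the assumed ravine loss 0.5 * (x**2 + 50 * y**2)."""
    return x, 50.0 * y


def sgd(theta, g, lr, t, state):
    return [p - lr * gi for p, gi in zip(theta, g)]


def momentum(theta, g, lr, t, state, beta=0.9):
    u = state.get("u", [0.0] * len(theta))
    u = [beta * ui + gi for ui, gi in zip(u, g)]
    state["u"] = u
    return [p - lr * ui for p, ui in zip(theta, u)]


def adam(theta, g, lr, t, state, beta1=0.9, beta2=0.999, eps=1e-8):
    m = state.get("m", [0.0] * len(theta))
    v = state.get("v", [0.0] * len(theta))
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    state["m"], state["v"] = m, v
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    return [p - lr * mh / (math.sqrt(vh) + eps)
            for p, mh, vh in zip(theta, m_hat, v_hat)]


def run(optimizer, steps=50, lr=0.035, start=(-4.0, 1.0)):
    """Return the list of (x, y) points visited by one optimizer."""
    theta, state = list(start), {}
    path = [tuple(theta)]
    for t in range(1, steps + 1):
        theta = optimizer(theta, grad(*theta), lr, t, state)
        path.append(tuple(theta))
    return path


def sign_flips(path):
    """Count sign changes in y, a rough measure of cross-valley bouncing."""
    ys = [y for _, y in path]
    return sum(1 for a, b in zip(ys, ys[1:]) if a * b < 0)


for name, opt in [("SGD", sgd), ("SGD+M", momentum), ("Adam", adam)]:
    path = run(opt)
    print(name, "y sign flips:", sign_flips(path), "final point:", path[-1])
```

With these settings plain SGD multiplies y by about -0.75 on every step, so it crosses the valley floor each time. The other two results depend on the hyperparameters, so it is worth varying `lr` and `beta` and comparing the printed counts.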

Review

Common mistakes

Mistake 1: Thinking Adam is always better than SGD

Adam often trains faster early, but SGD or Momentum can still be very competitive when the learning rate is tuned well.

Mistake 2: Forgetting Adam has two memories

Adam tracks both \(m_t\), a smoothed gradient direction, and \(v_t\), a running average of squared-gradient magnitude. These two terms do different jobs.

Mistake 3: Confusing Momentum’s \(u_t\) with Adam’s \(v_t\)

Momentum’s \(u_t\) stores an accumulated direction. Adam’s \(v_t\) stores squared-gradient magnitude. They are not the same concept.

Mistake 4: Ignoring bias correction

Adam starts \(m_0\) and \(v_0\) at zero, so early estimates are biased toward zero. Bias correction helps fix this in the first training steps.

Mistake 5: Thinking adaptive scaling means no tuning

Adam adjusts step sizes automatically, but the learning rate still matters. A bad learning rate can still make Adam unstable or slow.

Summary

Takeaway

SGD follows the current gradient directly using one global learning rate.

Momentum smooths the direction by remembering past gradients. RMSProp rescales steps using recent squared gradients.

Adam combines both ideas: it smooths the update direction with \(m_t\), rescales each parameter with \(v_t\), and uses bias correction to make early updates more reliable. This often makes Adam fast and stable early in training, but it is not automatically better than SGD in every situation.
