Adam vs SGD
Optimizer Comparison
Compare how SGD, Momentum, RMSProp, and Adam transform the same gradient into different parameter updates.
3D loss surface
Each optimizer traces its own path on the same surface. The floor projection shows the (x, y) trajectory from above.
SGD: θ ← θ - η·g
SGD+M: u ← β·u + η·g | θ ← θ - u
Context
Background
An optimizer decides how a neural network’s parameters move after each gradient calculation.
The simplest optimizer is Stochastic Gradient Descent, or SGD. It takes the current gradient and moves the parameters in the opposite direction. This is simple and memory-efficient, but the path can zig-zag badly in narrow valleys, where the loss is much steeper across the valley than along it.
Momentum improves SGD by remembering past gradient directions. Instead of reacting only to the current gradient, it builds a smoother direction over time.
RMSProp uses a different idea. It tracks recent squared gradients and rescales the update for each parameter. Directions with consistently large gradients receive smaller effective steps.
Adam combines both ideas. It uses a Momentum-like term to smooth the direction and an RMSProp-like term to adapt the step size. This often makes Adam faster and more stable early in training, but it does not mean Adam is always better than SGD.
The core idea is:
same gradient → different optimizer rule → different update path
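As a rough illustration of this idea, the sketch below feeds one gradient into each rule and prints the resulting first parameter change. The function name `one_step_updates`, the example gradient, and all hyperparameter values are assumptions chosen for illustration, and every optimizer state is assumed to start at zero.

```python
import numpy as np

def one_step_updates(g, lr=0.1, beta=0.9, rho=0.9, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the first parameter change each rule produces from gradient g.

    All optimizer state starts at zero, so this is step t = 1.
    Hyperparameter values are illustrative assumptions.
    """
    g = np.asarray(g, dtype=float)

    # SGD: step straight down the gradient.
    sgd = -lr * g

    # Momentum: with zero initial state the first step equals the SGD step.
    u = beta * np.zeros_like(g) + lr * g
    momentum = -u

    # RMSProp: divide by the root of a running average of squared gradients.
    s = rho * np.zeros_like(g) + (1 - rho) * g * g
    rmsprop = -lr * g / (np.sqrt(s) + eps)

    # Adam: bias-corrected first and second moments at t = 1.
    m_hat = ((1 - beta1) * g) / (1 - beta1 ** 1)
    v_hat = ((1 - beta2) * g * g) / (1 - beta2 ** 1)
    adam = -lr * m_hat / (np.sqrt(v_hat) + eps)

    return {"sgd": sgd, "momentum": momentum, "rmsprop": rmsprop, "adam": adam}

# One gradient with very different scales in its two coordinates.
updates = one_step_updates([10.0, 0.1])
for name, delta in updates.items():
    print(f"{name:9s} {delta}")
```

With these assumed values, the SGD and Momentum steps scale with the gradient, so the first coordinate moves 100 times farther than the second. The RMSProp and Adam steps come out roughly equal in both coordinates because each coordinate is divided by its own gradient scale.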
Notation
Important formulas
Mini-Batch Gradient Input.
- \(g_t\): The gradient used at training step \(t\).
- \(B\): The mini-batch size.
- \(\mathcal{B}_t\): The mini-batch used at step \(t\).
- \(\ell_i(\theta_t)\): The loss from training example \(i\).
- \(\theta_t\): The model parameters before the update.
All optimizers start from a gradient. The difference is how they transform this gradient into an update.
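The displayed formula for this step is not in the text above. Assuming the usual average over the mini-batch, the listed symbols fit together as:

\[
g_t = \frac{1}{B} \sum_{i \in \mathcal{B}_t} \nabla_{\theta}\, \ell_i(\theta_t)
\]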
Plain SGD.
- \(\theta_t\): Parameters before the update.
- \(\theta_{t+1}\): Parameters after the update.
- \(\eta\): The learning rate.
- \(g_t\): The current mini-batch gradient.
SGD uses the current gradient directly. Every parameter shares the same global learning rate \(\eta\).
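Written with the symbols above, and matching the rule shown next to the loss surface, the update is:

\[
\theta_{t+1} = \theta_t - \eta\, g_t
\]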
Momentum Idea.
- \(u_t\): The accumulated update direction.
- \(u_{t-1}\): The previous accumulated direction.
- \(\beta\): The momentum coefficient.
- \(g_t\): The current gradient.
- \(\eta\): The learning rate.
Momentum smooths the update direction. It helps reduce zig-zag movement when gradients change direction quickly.
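One form consistent with the symbols above and with the SGD+M rule shown next to the loss surface is given below. Other texts place the learning rate on the parameter update instead, so treat this as one common convention rather than the only one:

\[
u_t = \beta\, u_{t-1} + \eta\, g_t, \qquad \theta_{t+1} = \theta_t - u_t
\]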
RMSProp Idea.
- \(s_t\): The running average of squared gradients.
- \(\rho\): The decay rate for the squared-gradient average.
- \(g_t \odot g_t\): Element-wise square of the gradient.
- \(\epsilon\): A small number for numerical stability.
- \(\eta\): The global learning rate.
RMSProp rescales each parameter’s update. Parameters with consistently large gradients receive smaller effective steps.
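A standard form that uses these symbols is sketched below. Placing \(\epsilon\) outside the square root is an assumption; some implementations put it inside:

\[
s_t = \rho\, s_{t-1} + (1 - \rho)\,(g_t \odot g_t), \qquad
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{s_t} + \epsilon} \odot g_t
\]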
Adam First Moment.
- \(m_t\): The first moment estimate.
- \(m_{t-1}\): The previous first moment estimate.
- \(\beta_1\): The decay rate for the first moment.
- \(g_t\): The current gradient.
This is Adam’s Momentum-like part. It tracks a smoothed gradient direction.
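In the standard Adam formulation, these symbols combine as an exponential moving average of the gradient:

\[
m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t
\]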
Adam Second Moment.
- \(v_t\): The second moment estimate.
- \(v_{t-1}\): The previous second moment estimate.
- \(\beta_2\): The decay rate for the second moment.
- \(g_t \odot g_t\): Element-wise squared gradient.
This is Adam’s RMSProp-like part. It tracks recent squared gradient magnitudes.
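In the standard Adam formulation, the second moment is the same kind of moving average applied to the squared gradient:

\[
v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\,(g_t \odot g_t)
\]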
Adam Bias Correction.
- \(\hat{m}_t\): The bias-corrected first moment.
- \(\hat{v}_t\): The bias-corrected second moment.
- \(t\): The current training step.
- \(\beta_1^t\): \(\beta_1\) raised to the power \(t\), the first-moment decay factor after \(t\) steps.
- \(\beta_2^t\): \(\beta_2\) raised to the power \(t\), the second-moment decay factor after \(t\) steps.
Adam starts \(m_0\) and \(v_0\) at zero. Bias correction makes early estimates more reliable.
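The standard correction divides each estimate by the fraction of its weight accumulated so far:

\[
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
\]

As \(t\) grows, \(\beta_1^t\) and \(\beta_2^t\) shrink toward zero, so both denominators approach 1 and the correction fades out.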
Adam Update.
- \(\theta_t\): Parameters before the update.
- \(\theta_{t+1}\): Parameters after the update.
- \(\hat{m}_t\): The corrected smoothed direction.
- \(\hat{v}_t\): The corrected squared-gradient scale.
- \(\eta\): The global learning rate.
- \(\epsilon\): A small value that prevents division by zero.
Adam uses \(\hat{m}_t\) for direction and \(\hat{v}_t\) for adaptive scaling.
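In the standard formulation, with the division and square root taken element-wise, the update reads:

\[
\theta_{t+1} = \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\]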
Adam Effective Step Size.
- \(\eta_{\text{eff},j}\): The effective learning rate for parameter \(j\).
- \(j\): The index of one parameter.
- \(\hat{v}_{t,j}\): The corrected second moment for parameter \(j\).
- \(\eta\): The global learning rate.
- \(\epsilon\): A small value for numerical stability.
SGD uses one learning rate for all parameters. Adam gives each parameter its own effective step size.
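Reading the Adam update one parameter at a time suggests the following definition, reconstructed here from the listed symbols:

\[
\eta_{\text{eff},j} = \frac{\eta}{\sqrt{\hat{v}_{t,j}} + \epsilon},
\qquad
\theta_{t+1,j} = \theta_{t,j} - \eta_{\text{eff},j}\, \hat{m}_{t,j}
\]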
Tradeoffs
Pros and cons
SGD
Pros
- Simple and lightweight: SGD is easy to understand and uses very little extra memory.
- Strong baseline: With good tuning, SGD can still perform very well in many deep learning tasks.
- Good final generalization: In some settings, SGD or SGD with Momentum can generalize very well.
Cons
- Sensitive to learning rate: If the learning rate is too large, training can diverge. If it is too small, training becomes slow.
- Can zig-zag: On narrow valleys, SGD may bounce from side to side instead of moving smoothly forward.
- Same step scale for every parameter: SGD does not automatically adjust different parameters based on gradient scale.
Momentum
Pros
- Smoother path: Momentum reduces noisy direction changes by carrying past movement forward.
- Faster in consistent directions: If gradients point in a similar direction for several steps, Momentum can accelerate progress.
- Useful bridge between SGD and Adam: It introduces the idea of remembering previous gradients.
Cons
- Can overshoot: Too much momentum may push the parameters past a good region.
- Adds another hyperparameter: The momentum coefficient \(\beta\) needs to be chosen carefully.
- Still uses one global learning rate: Momentum smooths direction, but it does not adapt step size per parameter.
Adam
Pros
- Fast early training: Adam often makes quick progress at the beginning of training.
- Adaptive per-parameter scaling: Each parameter receives its own effective step size based on recent squared gradients.
- More stable on difficult surfaces: Adam often handles ravines, curved valleys, and uneven gradient scales better than plain SGD.
- Combines two useful ideas: It uses Momentum-like direction smoothing and RMSProp-like adaptive scaling.
Cons
- Uses more memory: Adam stores both \(m_t\) and \(v_t\) for every parameter.
- More moving parts: Adam has \(\eta\), \(\beta_1\), \(\beta_2\), and \(\epsilon\), so it is less simple than SGD.
- Not always the best final choice: Adam is often strong early, but SGD or Momentum can still match or outperform it when tuned well.
- Can be overused as a default: Adam is convenient, but it should still be understood and tuned.
Practice
Example and mistake
Quick Example
Imagine a narrow curved valley on a loss surface.
SGD follows the current gradient directly. If the gradient is steep across the valley but shallow along the valley, SGD may bounce from one side to the other.
Momentum remembers previous directions. It can reduce the side-to-side bouncing and build speed along the valley, but it may overshoot when the valley curves.
Adam smooths the direction like Momentum and rescales the step size like RMSProp. If one direction has consistently large gradients, Adam shrinks the effective step in that direction. This often makes the trajectory more controlled.
The important difference is not that one optimizer receives a different gradient. They receive the same kind of gradient input. The difference is how each optimizer transforms that gradient into the next parameter update.
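The valley picture can be tried out in a few lines. The sketch below is a toy under stated assumptions: it uses a straight quadratic valley rather than a curved one, and the curvature values, start point, step count, and hyperparameters are all hypothetical choices. It counts how often each path crosses the valley floor as a rough measure of side-to-side bouncing.

```python
import numpy as np

# Assumed toy valley: shallow along x, steep along y.
CURVATURE = np.array([1.0, 50.0])

def loss(theta):
    return 0.5 * np.sum(CURVATURE * theta ** 2)

def grad(theta):
    return CURVATURE * theta

def run(optimizer, steps=100, lr=0.038, beta=0.9, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run one optimizer from a fixed start point and return the visited points."""
    theta = np.array([-4.0, 1.0])
    u = np.zeros(2)  # Momentum state
    m = np.zeros(2)  # Adam first moment
    v = np.zeros(2)  # Adam second moment
    path = [theta.copy()]
    for t in range(1, steps + 1):
        g = grad(theta)  # every optimizer sees the same kind of gradient
        if optimizer == "sgd":
            theta = theta - lr * g
        elif optimizer == "momentum":
            u = beta * u + lr * g
            theta = theta - u
        elif optimizer == "adam":
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g * g
            m_hat = m / (1 - beta1 ** t)
            v_hat = v / (1 - beta2 ** t)
            theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        else:
            raise ValueError(f"unknown optimizer: {optimizer}")
        path.append(theta.copy())
    return np.array(path)

def side_flips(path):
    """Count how often the path crosses the valley floor (y changes sign)."""
    y = path[:, 1]
    return int(np.sum(np.sign(y[1:]) != np.sign(y[:-1])))

results = {name: run(name) for name in ("sgd", "momentum", "adam")}
for name, path in results.items():
    print(f"{name:9s} flips across valley: {side_flips(path):3d}  final loss: {loss(path[-1]):.4g}")
```

The learning rate is deliberately set close to SGD's stability limit for the steep direction, so plain SGD crosses the valley floor on every step. Different settings would give different counts, so the numbers say something about this toy setup only, not about the optimizers in general.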
Review
Common mistakes
Mistake 1: Thinking Adam is always better than SGD
Adam often trains faster early, but SGD or Momentum can still be very competitive when the learning rate is tuned well.
Mistake 2: Forgetting Adam has two memories
Adam tracks both \(m_t\), the smoothed gradient direction, and \(v_t\), the squared-gradient magnitude. These two terms do different jobs.
Mistake 3: Confusing Momentum’s \(u_t\) with Adam’s \(v_t\)
Momentum’s \(u_t\) stores an accumulated direction. Adam’s \(v_t\) stores squared-gradient magnitude. They are not the same concept.
Mistake 4: Ignoring bias correction
Adam starts \(m_0\) and \(v_0\) at zero, so early estimates are biased toward zero. Bias correction helps fix this in the first training steps.
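A small numeric check makes the bias visible. The sketch below assumes a constant scalar gradient of 2.0 and the commonly used decay rates; the helper name `adam_moments` is hypothetical.

```python
def adam_moments(grads, beta1=0.9, beta2=0.999):
    """Return (t, m, m_hat, v, v_hat) for each step of a scalar gradient sequence."""
    m = 0.0
    v = 0.0
    rows = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g      # raw first moment, starts near zero
        v = beta2 * v + (1 - beta2) * g * g  # raw second moment, starts near zero
        m_hat = m / (1 - beta1 ** t)         # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)         # bias-corrected second moment
        rows.append((t, m, m_hat, v, v_hat))
    return rows

# Assumed input: the same gradient value on every step.
for t, m, m_hat, v, v_hat in adam_moments([2.0] * 5):
    print(f"t={t}  m={m:.4f}  m_hat={m_hat:.4f}  v={v:.6f}  v_hat={v_hat:.4f}")
```

At the first step the raw estimate \(m_1\) is only 0.2, a tenth of the true gradient, while the corrected \(\hat{m}_1\) recovers 2.0. The raw \(v_t\) is even further from the true squared gradient of 4.0 because \(\beta_2\) is closer to 1.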
Mistake 5: Thinking adaptive scaling means no tuning
Adam adjusts step sizes automatically, but the learning rate still matters. A bad learning rate can still make Adam unstable or slow.
Summary
Takeaway
SGD follows the current gradient directly using one global learning rate.
Momentum smooths the direction by remembering past gradients. RMSProp rescales steps using recent squared gradients.
Adam combines both ideas: it smooths the update direction with \(m_t\), rescales each parameter with \(v_t\), and uses bias correction to make early updates more reliable. This often makes Adam fast and stable early in training, but it is not automatically better than SGD in every situation.