Deep Learning Visualized

Module 02 / Optimization

Mini-Batch Training

Compare full-batch, mini-batch, and stochastic gradient descent to see how batch size shapes optimization paths, update frequency, and training noise.

Mini-batch Training and Batch Size Intuition

Compare optimization paths and update rhythm: full batch should look smooth, mini-batch moderately wavy, and stochastic visibly noisy, yet all three should reach the same minimum.

large landscape comparison

The three trajectories are intentionally separated: blue enters from the upper left, green from below, and orange from the right. They only come close together in the final basin region.

compare update rhythm

FULL BATCH

one update after the whole dataset

MINI-BATCH

an update after each batch chunk

STOCHASTIC

one sample, one update, very frequent updates

Use the method chips (Full Batch / Mini-Batch / Stochastic) and the batch-size slider to explore how batch size changes the optimization path and update rhythm.

Background

Batch size controls how many training examples the model uses before taking one optimization step.

In full-batch gradient descent, the model looks at the entire dataset before each update. The gradient has no sampling noise because it uses every training example, but the model updates only once per epoch.

In stochastic gradient descent, the model updates after seeing one training example. This makes the path very responsive, but also very noisy because one sample may not represent the whole dataset well.

Mini-batch training sits between these two extremes. The model uses a small group of examples before each update, giving a practical balance between stability, speed, memory cost, and GPU efficiency.

The core idea is: batch size changes the gradient estimate → gradient estimate changes the update path → update path changes training behavior.

Important formulas

\[ J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell_i(\theta) \]

Dataset Loss. The average loss over the whole training dataset.

  • \(J(\theta)\): average loss over the whole training dataset
  • \(\theta\): model parameters, such as weights and biases
  • \(N\): total number of training examples
  • \(\ell_i(\theta)\): loss from the \(i\)-th training example
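
As a minimal sketch of the dataset loss, the snippet below averages per-example losses for a one-parameter model. The model `y_hat = theta * x`, the squared-error loss, and the toy data are all assumptions made for illustration; the lesson itself does not fix a particular model or loss.

```python
def per_example_loss(theta, x, y):
    """Squared error of a toy model y_hat = theta * x (an illustrative choice)."""
    return 0.5 * (theta * x - y) ** 2


def dataset_loss(theta, xs, ys):
    """J(theta): the mean of the per-example losses over all N examples."""
    n = len(xs)
    return sum(per_example_loss(theta, x, y) for x, y in zip(xs, ys)) / n


# Hypothetical toy data generated by y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

loss_at_zero = dataset_loss(0.0, xs, ys)  # theta far from the data-generating value
loss_at_two = dataset_loss(2.0, xs, ys)   # theta equal to the data-generating value
print(loss_at_zero, loss_at_two)
```

The loss is large when `theta` is far from the value that generated the data and drops to zero when it matches, which is the quantity every method in this lesson tries to reduce.
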
\[ \theta_{t+1} = \theta_t - \eta\, g_t \]

Gradient Descent Update. The main parameter update rule. Batch size affects how \(g_t\) is estimated.

  • \(\theta_t\): parameters before the update
  • \(\theta_{t+1}\): parameters after the update
  • \(\eta\): learning rate, controls the step size
  • \(g_t\): gradient estimate used at step \(t\)
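
The update rule can be sketched in a few lines. This example assumes the same hypothetical one-parameter model `y_hat = theta * x` with a squared-error loss, and uses the full-batch gradient as \(g_t\); the learning rate `lr = 0.1` is hand-picked for this toy problem, not a recommendation.

```python
def full_gradient(theta, xs, ys):
    """Gradient of the mean squared-error loss for the toy model y_hat = theta * x."""
    n = len(xs)
    return sum((theta * x - y) * x for x, y in zip(xs, ys)) / n


def gd_step(theta, grad, lr):
    """One update: theta_{t+1} = theta_t - eta * g_t."""
    return theta - lr * grad


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2 * x

theta = 0.0
lr = 0.1  # eta, chosen by hand for this toy problem
history = [theta]
for _ in range(20):
    theta = gd_step(theta, full_gradient(theta, xs, ys), lr)
    history.append(theta)

print(history[:3], theta)
```

Swapping `full_gradient` for a mini-batch or single-example estimate leaves `gd_step` unchanged, which is the sense in which batch size only affects how \(g_t\) is estimated.
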
\[ g_{\text{full}} = \nabla_{\theta} J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta}\ell_i(\theta) \]

Full Batch Gradient. Stable, but updates only once per epoch.

  • \(g_{\text{full}}\): gradient computed using the entire dataset
  • \(\nabla_{\theta}J(\theta)\): gradient of the full dataset loss w.r.t. parameters
  • \(\nabla_{\theta}\ell_i(\theta)\): gradient from one training example
\[ J_{\mathcal{B}}(\theta) = \frac{1}{B} \sum_{i \in \mathcal{B}} \ell_i(\theta) \]

Mini-Batch Loss. The loss most deep learning frameworks compute during one training step.

  • \(J_{\mathcal{B}}(\theta)\): average loss over one mini-batch
  • \(\mathcal{B}\): the set of examples inside the current mini-batch
  • \(B\): batch size, the number of examples in the mini-batch
  • \(i \in \mathcal{B}\): example \(i\) belongs to the current mini-batch
\[ g_{\mathcal{B}} = \nabla_{\theta} J_{\mathcal{B}}(\theta) = \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla_{\theta}\ell_i(\theta) \]

Mini-Batch Gradient. An estimate of the full-batch gradient: not exact, but usually stable enough and much cheaper to compute.

  • \(g_{\mathcal{B}}\): gradient estimated from the current mini-batch
  • \(B\): number of examples used to estimate the gradient
  • \(\nabla_{\theta}\ell_i(\theta)\): gradient contribution from one example
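
To make the relationship between the gradients concrete, here is a small sketch that computes all three from one function. The model (`y_hat = theta * x`), the squared-error loss, and the eight data points are hypothetical. It relies on the fact that when the dataset splits into equal-sized batches, the average of the batch gradients equals the full-batch gradient.

```python
import math


def example_gradient(theta, x, y):
    """Gradient of 0.5 * (theta * x - y)**2 with respect to theta."""
    return (theta * x - y) * x


def batch_gradient(theta, batch):
    """g_B: mean of the per-example gradients over the examples in the batch."""
    return sum(example_gradient(theta, x, y) for x, y in batch) / len(batch)


# Hypothetical toy data: eight (x, y) pairs.
data = [(1.0, 2.5), (2.0, 3.5), (3.0, 6.5), (4.0, 7.5),
        (5.0, 10.5), (6.0, 11.5), (7.0, 14.5), (8.0, 15.5)]
theta = 0.5
B = 2

g_full = batch_gradient(theta, data)  # batch = whole dataset
batches = [data[i:i + B] for i in range(0, len(data), B)]
g_batches = [batch_gradient(theta, b) for b in batches]
g_single = batch_gradient(theta, data[:1])  # B = 1 is the stochastic case

print(g_full, g_batches, g_single)
# Averaging the equal-sized batch gradients recovers the full-batch gradient.
print(math.isclose(sum(g_batches) / len(g_batches), g_full))
```

Each individual batch gradient differs from `g_full`, which is the noise the visualization shows, but on average the batches point the same way as the full dataset.
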
\[ g_i = \nabla_{\theta}\ell_i(\theta) \qquad (B = 1) \]

Stochastic Gradient. The special case where \(B = 1\). Very frequent updates, but the gradient direction can be very noisy.

  • \(g_i\): gradient estimated from one training example
  • \(\ell_i(\theta)\): loss from one example
  • \(i\): the single example used for the update
\[ \text{updates per epoch} = \left\lceil \frac{N}{B} \right\rceil \]

Updates Per Epoch. Smaller batch size gives more updates per epoch; larger gives fewer.

  • \(N\): total number of training examples
  • \(B\): batch size
  • \(\lceil \cdot \rceil\): ceiling function, rounds up to the nearest integer
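
The ceiling matters when \(B\) does not divide \(N\) evenly. The sketch below uses illustrative values (\(N = 1000\), \(B = 32\)) and two hypothetical helper names. It assumes the final, smaller batch is kept rather than dropped, which is what the ceiling in the formula implies.

```python
import math


def updates_per_epoch(n_examples, batch_size):
    """Number of parameter updates in one pass over the data: ceil(N / B)."""
    return math.ceil(n_examples / batch_size)


def last_batch_size(n_examples, batch_size):
    """Size of the final batch, which is smaller when B does not divide N."""
    remainder = n_examples % batch_size
    return remainder if remainder else batch_size


N, B = 1000, 32
steps = updates_per_epoch(N, B)
tail = last_batch_size(N, B)
print(steps, tail)
```

Here 31 full batches cover 992 examples, and the remaining 8 examples form a 32nd, smaller batch.
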
\[ \operatorname{Var}(g_{\mathcal{B}}) \approx \frac{\sigma^2}{B} \]

Gradient Noise. Larger batches produce less noisy estimates; smaller batches produce noisier but more responsive updates. The approximation assumes the examples in a batch are drawn roughly independently.

  • \(\operatorname{Var}(g_{\mathcal{B}})\): variance (noise level) of the mini-batch gradient
  • \(\sigma^2\): approximate variance of gradients from individual samples
  • \(B\): batch size
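
The \(1/B\) scaling can be checked with a small simulation. This is a sketch under strong assumptions: per-example gradients are replaced by independent Gaussian draws with a chosen standard deviation `SIGMA`, which is a stand-in for illustration rather than a model of real gradients.

```python
import random
import statistics

rng = random.Random(0)
SIGMA = 2.0    # assumed standard deviation of a single-example gradient
TRIALS = 4000  # number of simulated mini-batches per batch size


def simulated_batch_gradient(batch_size):
    """Mean of `batch_size` independent draws standing in for per-example gradients."""
    return sum(rng.gauss(1.0, SIGMA) for _ in range(batch_size)) / batch_size


measured = {}
for B in (1, 4, 16, 64):
    samples = [simulated_batch_gradient(B) for _ in range(TRIALS)]
    measured[B] = statistics.pvariance(samples)
    # Columns: batch size, measured variance, predicted sigma^2 / B.
    print(B, round(measured[B], 4), SIGMA ** 2 / B)
```

The measured variance should track the predicted column closely. Because variance falls as \(1/B\), the standard deviation of the gradient falls only as \(1/\sqrt{B}\), so quadrupling the batch size halves the noise.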

Pros and cons

Small Batch

Pros

  • Frequent updates. The model updates many times per epoch, so it can react quickly during training.
  • Lower memory cost. Small batches require less GPU memory.
  • Helpful noise. The noisy update path can sometimes help the model avoid overly sharp solutions.

Cons

  • Noisy gradients. Each update may point in a less reliable direction.
  • Less stable training. The loss curve may jump around more, especially early in training.
  • Lower hardware efficiency. Very small batches may not fully use GPU parallelism.

Large Batch

Pros

  • Smoother gradients. The update direction is more stable because it uses more examples.
  • Better GPU parallelism. Large batches can make efficient use of modern hardware.
  • Cleaner loss curve. Training often looks smoother because the gradient estimate has less noise.

Cons

  • Higher memory cost. Large batches require more GPU memory.
  • Fewer updates per epoch. The model updates less often during one pass through the dataset.
  • May generalize worse. Very large batches can sometimes converge to sharper minima, which are often associated with worse generalization, especially if the learning rate is not tuned carefully.

Mini-Batch

Pros

  • Practical balance. Mini-batches are stable enough, frequent enough, and efficient enough for most deep learning tasks.
  • Flexible hyperparameter. Batch size can be adjusted without changing the model architecture.
  • Standard workflow. Most neural networks are trained using mini-batches rather than pure full batch or pure stochastic updates.

Cons

  • Needs tuning. The best batch size depends on the dataset, model, memory limit, and learning rate.
  • Interacts with learning rate. Changing batch size often requires changing the learning rate too.
  • Not automatically better. Common sizes like 32, 64, or 128 are a starting point, not a universal rule.

Quick example

Suppose the dataset has \(N = 1000\) training examples.

\[ B = 1 \implies \left\lceil \frac{1000}{1} \right\rceil = 1000 \text{ updates / epoch} \]

Stochastic gradient descent. Very frequent updates but each gradient is noisy — only one example per step.

\[ B = 100 \implies \left\lceil \frac{1000}{100} \right\rceil = 10 \text{ updates / epoch} \]

Mini-batch training. Each update is more stable than using one example, but still more frequent than full batch.

\[ B = 1000 \implies \left\lceil \frac{1000}{1000} \right\rceil = 1 \text{ update / epoch} \]

Full batch gradient descent. The gradient is very stable, but the training path reacts slowly — only one update per epoch.

The goal is not always to use the largest batch. The goal is to choose a batch size that gives a good balance between stable gradients, frequent updates, memory cost, and generalization.
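
The three cases above can be run side by side on a toy problem. The sketch below assumes a hypothetical one-parameter model `y_hat = theta * x`, synthetic data with slope `TRUE_SLOPE`, a fixed learning rate shared by all three runs, and no shuffling. It is meant only to show the update counts and the progress made within one epoch, not to rank the methods.

```python
import random

rng = random.Random(0)
N = 1000
TRUE_SLOPE = 3.0  # hypothetical data-generating parameter
xs = [rng.uniform(-1.0, 1.0) for _ in range(N)]
ys = [TRUE_SLOPE * x + rng.gauss(0.0, 0.1) for x in xs]


def run_one_epoch(batch_size, lr=0.1):
    """Train y_hat = theta * x for one pass over the data; return (theta, updates)."""
    theta, updates = 0.0, 0
    for start in range(0, N, batch_size):
        batch = list(zip(xs[start:start + batch_size], ys[start:start + batch_size]))
        grad = sum((theta * x - y) * x for x, y in batch) / len(batch)
        theta -= lr * grad
        updates += 1
    return theta, updates


results = {B: run_one_epoch(B) for B in (1, 100, 1000)}
for B, (theta, updates) in results.items():
    print(f"B={B:4d}  updates={updates:4d}  theta after one epoch={theta:.3f}")
```

With the same learning rate, the runs that take more updates move further toward the target within a single epoch. In practice the learning rate would be tuned per batch size and the data reshuffled each epoch, so this comparison should not be read as evidence that \(B = 1\) is the best choice.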

Common mistakes

Thinking one epoch means one update. One epoch means the model has seen the whole dataset once. With mini-batches, the model usually updates many times inside one epoch.

Confusing batch size with dataset size. The dataset size \(N\) is the total number of training examples. The batch size \(B\) is only the number of examples used in one update step.

Thinking larger batch is always better. Larger batches give smoother gradients, but they also need more memory and may not always generalize better.

Keeping the same learning rate after changing batch size. Batch size changes the noise level of the gradient. When batch size changes significantly, the learning rate often needs to be tuned again.

Forgetting that losses are often averaged. Many frameworks average the loss over the batch. Increasing batch size does not simply multiply the gradient magnitude, but it usually makes the gradient estimate less noisy.
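
A small sketch makes the averaging point concrete. It assumes a toy squared-error loss and a hypothetical `reduction` argument that switches between averaging and summing. The batch size is doubled by repeating the same examples, so any change in the gradient comes from the reduction alone.

```python
def per_example_gradients(theta, batch):
    """Gradients of 0.5 * (theta * x - y)**2 for each (x, y) in the batch."""
    return [(theta * x - y) * x for x, y in batch]


def gradient(theta, batch, reduction="mean"):
    """Batch gradient under a mean- or sum-reduced loss."""
    grads = per_example_gradients(theta, batch)
    total = sum(grads)
    return total / len(grads) if reduction == "mean" else total


batch = [(1.0, 2.0), (2.0, 4.0)]  # hypothetical toy examples
doubled = batch + batch           # same examples, twice the batch size

mean_small = gradient(0.0, batch, "mean")
mean_large = gradient(0.0, doubled, "mean")
sum_small = gradient(0.0, batch, "sum")
sum_large = gradient(0.0, doubled, "sum")
print(mean_small, mean_large, sum_small, sum_large)
```

Under the mean, the gradient is unchanged when the batch doubles; under the sum, it doubles too. This is why it is worth checking which reduction a framework's loss function uses before comparing runs with different batch sizes.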

Takeaway

Batch size controls how many examples the model uses before each parameter update.

Small batches give noisy but frequent updates. Large batches give smoother but less frequent updates.

Mini-batch training is the practical middle ground: it balances gradient stability, update frequency, memory cost, and GPU efficiency.