Module 02 / Optimization
Mini-Batch Training
Compare full-batch, mini-batch, and stochastic gradient descent to see how batch size shapes optimization paths, update frequency, and training noise.
Explore
Interactive lesson
Mini-batch Training and Batch Size Intuition
Compare optimization paths and update rhythm: the full-batch path should look smooth, the mini-batch path moderately wavy, and the stochastic path visibly noisy. All three should reach the same minimum.
large landscape comparison
The three trajectories are intentionally separated: blue enters from the upper left, green from below, and orange from the right. They come close to each other only in the final basin region.
compare update rhythm
FULL BATCH
one update after the whole dataset
MINI-BATCH
an update after each batch chunk
STOCHASTIC
one sample, one update, very frequent updates
Context
Background
Batch size controls how many training examples the model uses before taking one optimization step.
In full batch gradient descent, the model looks at the entire dataset before each update. The gradient is stable because it uses all training examples, but the model only updates once per epoch.
In stochastic gradient descent, the model updates after seeing one training example. This makes the path very responsive, but also very noisy because one sample may not represent the whole dataset well.
Mini-batch training sits between these two extremes. The model uses a small group of examples before each update, giving a practical balance between stability, speed, memory cost, and GPU efficiency.
The core idea is: batch size changes the gradient estimate → gradient estimate changes the update path → update path changes training behavior.
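The chain above can be sketched as a single training loop in which only the batch size changes. This is a minimal illustration, not the lesson's own prototype: the toy dataset (y = 3x plus noise), the learning rate, the epoch count, and the function name `train` are all assumptions chosen for the example.

```python
import random

def train(xs, ys, batch_size, lr=0.1, epochs=100, seed=0):
    """Fit y ~ w * x with squared loss; return (w, number_of_updates)."""
    rng = random.Random(seed)
    w, updates = 0.0, 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new sample order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of (w*x - y)^2 over the batch: 2*x*(w*x - y).
            grad = sum(2 * xs[i] * (w * xs[i] - ys[i]) for i in batch) / len(batch)
            w -= lr * grad  # one optimization step per batch
            updates += 1
    return w, updates

# Hypothetical toy data: y = 3x plus a little noise.
data_rng = random.Random(42)
xs = [data_rng.uniform(-1, 1) for _ in range(200)]
ys = [3 * x + data_rng.gauss(0, 0.1) for x in xs]

for name, b in [("full batch", 200), ("mini-batch", 20), ("stochastic", 1)]:
    w, updates = train(xs, ys, batch_size=b)
    print(f"{name:>10}: B={b:<3} updates={updates:<6} w={w:.3f}")
```

With these settings all three runs should end near w = 3, but they take very different numbers of steps to get there: 1 update per epoch for full batch, 10 for the mini-batch run, and 200 for the stochastic run.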
Notation
Important formulas
Dataset Loss. The average loss over the whole training dataset.
\[ J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell_i(\theta) \]
- \(J(\theta)\): average loss over the whole training dataset
- \(\theta\): model parameters, such as weights and biases
- \(N\): total number of training examples
- \(\ell_i(\theta)\): loss from the \(i\)-th training example
Gradient Descent Update. The main parameter update rule. Batch size affects how \(g_t\) is estimated.
\[ \theta_{t+1} = \theta_t - \eta \, g_t \]
- \(\theta_t\): parameters before the update
- \(\theta_{t+1}\): parameters after the update
- \(\eta\): learning rate, which controls the step size
- \(g_t\): gradient estimate used at step \(t\)
Full Batch Gradient. Stable, but updates only once per epoch.
\[ g_{\text{full}} = \nabla_{\theta}J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta}\ell_i(\theta) \]
- \(g_{\text{full}}\): gradient computed using the entire dataset
- \(\nabla_{\theta}J(\theta)\): gradient of the full dataset loss with respect to the parameters
- \(\nabla_{\theta}\ell_i(\theta)\): gradient from one training example
Mini-Batch Loss. The loss most deep learning frameworks compute during one training step.
\[ J_{\mathcal{B}}(\theta) = \frac{1}{B} \sum_{i \in \mathcal{B}} \ell_i(\theta) \]
- \(J_{\mathcal{B}}(\theta)\): average loss over one mini-batch
- \(\mathcal{B}\): the set of examples inside the current mini-batch
- \(B\): batch size, the number of examples in the mini-batch
- \(i \in \mathcal{B}\): example \(i\) belongs to the current mini-batch
Mini-Batch Gradient. An estimate of the full-batch gradient: not exact, but usually stable enough and much cheaper to compute.
\[ g_{\mathcal{B}} = \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla_{\theta}\ell_i(\theta) \]
- \(g_{\mathcal{B}}\): gradient estimated from the current mini-batch
- \(B\): number of examples used to estimate the gradient
- \(\nabla_{\theta}\ell_i(\theta)\): gradient contribution from one example
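To make "not exact, but close" concrete, the sketch below compares individual mini-batch gradients with the full-batch gradient on a toy one-parameter model. The data, the squared loss \((wx - y)^2\), and the helper names `grad_example` and `batch_gradient` are assumptions made for this illustration. The batch size is chosen to divide the dataset evenly, so the average of the mini-batch gradients over one epoch matches the full-batch gradient.

```python
import random

def grad_example(w, x, y):
    """Gradient of the per-example loss (w*x - y)^2 with respect to w."""
    return 2 * x * (w * x - y)

def batch_gradient(w, xs, ys, batch):
    """Average per-example gradient over the indices in `batch`."""
    return sum(grad_example(w, xs[i], ys[i]) for i in batch) / len(batch)

rng = random.Random(0)
N, B, w = 120, 8, 0.5  # hypothetical sizes; B divides N evenly
xs = [rng.uniform(-1, 1) for _ in range(N)]
ys = [3 * x for x in xs]

full_grad = batch_gradient(w, xs, ys, range(N))

# Split one shuffled epoch into disjoint mini-batches.
indices = list(range(N))
rng.shuffle(indices)
batches = [indices[s:s + B] for s in range(0, N, B)]
mini_grads = [batch_gradient(w, xs, ys, b) for b in batches]
mean_of_mini = sum(mini_grads) / len(mini_grads)

print(f"full-batch gradient      : {full_grad:.4f}")
print(f"first three mini-batches : {[round(g, 4) for g in mini_grads[:3]]}")
print(f"mean of mini-batch grads : {mean_of_mini:.4f}")
```

Each individual mini-batch gradient differs from the full-batch value, but the differences average out across the epoch.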
Stochastic Gradient. The special case where \(B = 1\). Very frequent updates, but the gradient direction can be very noisy.
\[ g_i = \nabla_{\theta}\ell_i(\theta) \]
- \(g_i\): gradient estimated from one training example
- \(\ell_i(\theta)\): loss from one example
- \(i\): the single example used for the update
Updates Per Epoch. A smaller batch size gives more updates per epoch; a larger one gives fewer.
\[ \text{updates per epoch} = \left\lceil \frac{N}{B} \right\rceil \]
- \(N\): total number of training examples
- \(B\): batch size
- \(\lceil \cdot \rceil\): ceiling function, which rounds up to the nearest integer
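The update count can be checked with a few lines of Python. The function names `updates_per_epoch` and `batch_sizes` are hypothetical, and the sketch assumes the final, smaller batch is kept rather than dropped. Some data loaders can be configured to drop it, in which case the count rounds down instead.

```python
import math

def updates_per_epoch(n_examples, batch_size):
    """Number of parameter updates in one pass over the data."""
    return math.ceil(n_examples / batch_size)

def batch_sizes(n_examples, batch_size):
    """Sizes of the consecutive batches in one epoch (the last may be smaller)."""
    return [min(batch_size, n_examples - start)
            for start in range(0, n_examples, batch_size)]

print(updates_per_epoch(1000, 1))     # 1000
print(updates_per_epoch(1000, 32))    # 32
print(updates_per_epoch(1000, 1000))  # 1
print(batch_sizes(1000, 32)[-1])      # 8
```

With \(N = 1000\) and \(B = 32\), the first 31 batches hold 32 examples each and the last holds the remaining 8, which is why the count rounds up.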
Gradient Noise. Larger batches produce less noisy estimates; smaller batches produce noisier but more responsive updates.
\[ \operatorname{Var}(g_{\mathcal{B}}) \approx \frac{\sigma^2}{B} \]
- \(\operatorname{Var}(g_{\mathcal{B}})\): variance (noise level) of the mini-batch gradient
- \(\sigma^2\): approximate variance of gradients from individual samples
- \(B\): batch size
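The inverse relationship between batch size and gradient variance can be checked empirically. The sketch below is an illustration under stated assumptions: a toy dataset, a one-parameter squared loss, and batches drawn with replacement so that the examples in a batch are independent. The function name `minibatch_gradient_variance` is hypothetical.

```python
import random
import statistics

rng = random.Random(0)
w = 0.0
# Hypothetical dataset: y = 3x + noise, with per-example loss (w*x - y)^2.
xs = [rng.uniform(-1, 1) for _ in range(2000)]
ys = [3 * x + rng.gauss(0, 0.5) for x in xs]
per_example = [2 * x * (w * x - y) for x, y in zip(xs, ys)]
sigma2 = statistics.pvariance(per_example)  # variance of single-example gradients

def minibatch_gradient_variance(batch_size, trials=4000):
    """Empirical variance of the batch-mean gradient, sampling with replacement."""
    means = [statistics.fmean(rng.choices(per_example, k=batch_size))
             for _ in range(trials)]
    return statistics.pvariance(means)

for b in (1, 4, 16, 64):
    measured = minibatch_gradient_variance(b)
    print(f"B={b:<3} measured={measured:.4f}  sigma^2/B={sigma2 / b:.4f}")
```

The measured column should track \(\sigma^2 / B\) closely. When batches are drawn without replacement from a finite dataset, as in ordinary epoch-based training, the variance is somewhat smaller, which is one reason the relation is stated as an approximation.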
Tradeoffs
Pros and cons
Small Batch
Pros
- Frequent updates. The model updates many times per epoch, so it can react quickly during training.
- Lower memory cost. Small batches require less GPU memory.
- Helpful noise. The noisy update path can sometimes help the model avoid overly sharp solutions.
Cons
- Noisy gradients. Each update may point in a less reliable direction.
- Less stable training. The loss curve may jump around more, especially early in training.
- Lower hardware efficiency. Very small batches may not fully use GPU parallelism.
Large Batch
Pros
- Smoother gradients. The update direction is more stable because it uses more examples.
- Better GPU parallelism. Large batches can make efficient use of modern hardware.
- Cleaner loss curve. Training often looks smoother because the gradient estimate has less noise.
Cons
- Higher memory cost. Large batches require more GPU memory.
- Fewer updates per epoch. The model updates less often during one pass through the dataset.
- May generalize worse. Very large batches can sometimes converge to sharper solutions if the learning rate is not tuned carefully.
Mini-Batch
Pros
- Practical balance. Mini-batches are stable enough, frequent enough, and efficient enough for most deep learning tasks.
- Flexible hyperparameter. Batch size can be adjusted without changing the model architecture.
- Standard workflow. Most neural networks are trained using mini-batches rather than pure full batch or pure stochastic updates.
Cons
- Needs tuning. The best batch size depends on the dataset, model, memory limit, and learning rate.
- Interacts with learning rate. Changing batch size often requires changing the learning rate too.
- Not automatically better. Common sizes like 32, 64, or 128 are a starting point, not a universal rule.
Practice
Quick example
Suppose the dataset has \(N = 1000\) training examples.
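Stochastic gradient descent. With \(B = 1\), there are \(\lceil 1000 / 1 \rceil = 1000\) updates per epoch. Updates are very frequent, but each gradient is noisy because it uses only one example.
Mini-batch training. With batch size \(B\), there are \(\lceil 1000 / B \rceil\) updates per epoch. Each update is more stable than one computed from a single example, and updates are still more frequent than with full batch.
Full batch gradient descent. With \(B = N = 1000\), there is exactly one update per epoch. The gradient is very stable, but the training path reacts slowly.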
The goal is not always to use the largest batch. The goal is to choose a batch size that gives a good balance between stable gradients, frequent updates, memory cost, and generalization.
Watch out
Common mistakes
Thinking one epoch means one update. One epoch means the model has seen the whole dataset once. With mini-batches, the model usually updates many times inside one epoch.
Confusing batch size with dataset size. The dataset size \(N\) is the total number of training examples. The batch size \(B\) is only the number of examples used in one update step.
Thinking a larger batch is always better. Larger batches give smoother gradients, but they also need more memory and do not always generalize better.
Keeping the same learning rate after changing batch size. Batch size changes the noise level of the gradient. When batch size changes significantly, the learning rate often needs to be tuned again.
Forgetting that losses are often averaged. Many frameworks average the loss over the batch. Increasing batch size does not simply multiply the gradient magnitude, but it usually makes the gradient estimate less noisy.
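The effect of averaging can be seen by computing the same batch gradient under both reductions. This is a small sketch with hypothetical toy data and a hypothetical helper `batch_gradient`. It does not call any framework, although many frameworks expose a similar choice between mean and sum reduction.

```python
import random

rng = random.Random(0)
w = 0.0
xs = [rng.uniform(0.5, 1.5) for _ in range(1024)]  # hypothetical toy data
ys = [3 * x for x in xs]

def batch_gradient(idx, reduction):
    """Gradient of the batch loss for (w*x - y)^2 under 'mean' or 'sum' reduction."""
    total = sum(2 * xs[i] * (w * xs[i] - ys[i]) for i in idx)
    return total / len(idx) if reduction == "mean" else total

results = {}
for b in (8, 64, 512):
    idx = rng.sample(range(len(xs)), b)  # one random batch of size b
    results[b] = (batch_gradient(idx, "mean"), batch_gradient(idx, "sum"))
    print(f"B={b:<3} mean={results[b][0]:9.2f}  sum={results[b][1]:10.2f}")
```

Under mean reduction the gradient stays at a similar scale as \(B\) grows and mainly becomes less noisy. Under sum reduction it grows roughly in proportion to \(B\), so the same learning rate would produce much larger steps.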
Summary
Takeaway
Batch size controls how many examples the model uses before each parameter update.
Small batches give noisy but frequent updates. Large batches give smoother but less frequent updates.
Mini-batch training is the practical middle ground: it balances gradient stability, update frequency, memory cost, and GPU efficiency.