Loss Functions & Training Signal
How neural networks turn wrong predictions into gradients for learning.
Background
A neural network does not automatically know whether its prediction is good or bad. It only produces an output, such as a number, a probability, or a vector of class scores.
A loss function compares the model prediction \(\hat{y}\) with the true target \(y\), then turns the mistake into one scalar value.
The core idea is:
The loss tells the model how wrong it is. The gradient tells the optimizer how the weights should change.
During training, the model usually does not learn from one sample at a time. It learns from a mini-batch, where the losses from several samples are averaged into one training objective.
Notation
Important formulas
General loss
\[ L = L(\hat{y}, y) \]
The loss function compares the prediction \(\hat{y}\) with the true target \(y\) and returns one scalar.
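Mini-batch loss
\[ L_{\text{batch}} = \frac{1}{m} \sum_{i=1}^{m} L\big(\hat{y}^{(i)}, y^{(i)}\big) \]
Where \(m\) is the number of samples in the mini-batch, and \(\hat{y}^{(i)}\) and \(y^{(i)}\) are the prediction and target for sample \(i\).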
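Mean Squared Error
\[ L_{\text{MSE}}(\hat{y}, y) = (\hat{y} - y)^2 \]
Commonly used for regression tasks. This is the per-sample form; averaging it over a mini-batch gives the mean of the squared errors.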
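Binary Cross-Entropy
\[ L_{\text{BCE}}(\hat{y}, y) = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\big] \]
Commonly used for binary classification. Here \(y \in \{0, 1\}\) and \(\hat{y}\) is the predicted probability of the positive class.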
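Multi-class Cross-Entropy
\[ L_{\text{CE}}(\hat{y}, y) = -\sum_{k=1}^{K} y_k \log \hat{y}_k \]
Commonly used for multi-class classification. Here \(K\) is the number of classes, \(y\) is a one-hot target vector, and \(\hat{y}_k\) is the predicted probability of class \(k\).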
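As a rough sketch, the three losses named above can be written per sample in plain Python. The function names and the `eps` clamp (which keeps `log` away from zero) are illustrative choices made for this example, not part of the lesson itself.

```python
import math

def mse(y_hat, y):
    """Squared error for one regression sample."""
    return (y_hat - y) ** 2

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """y is 0 or 1; y_hat is the predicted probability of class 1."""
    # Clamp the probability so log() never receives exactly 0 (assumed safeguard).
    p = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def cross_entropy(y_hat, y, eps=1e-12):
    """y_hat is a list of class probabilities; y is a one-hot list of the same length."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(y_hat, y))

print(mse(1.2, 2.4))                                   # regression sample
print(binary_cross_entropy(0.9, 1))                    # confident and correct
print(cross_entropy([0.7, 0.2, 0.1], [1, 0, 0]))       # true class is index 0
```

Each function returns one scalar per sample, which is what the mini-batch average then combines.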
Training signal
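The gradient \(\frac{\partial L}{\partial W}\) tells the optimizer how changing the weights \(W\) will change the loss. For Mean Squared Error on a single sample, \(L = (\hat{y} - y)^2\), the gradient with respect to the prediction is \(\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y)\). Backpropagation carries this through the chain rule, \(\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial W}\), and the optimizer steps against the gradient, which moves the prediction toward the target.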
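To make the gradient concrete, here is a minimal sketch for a one-weight model \(\hat{y} = w x\). The model, the values of `w`, `x`, `y`, and the learning rate `lr` are all assumptions chosen for illustration. The analytic gradient is compared against a finite-difference estimate, then used for one gradient-descent step.

```python
def predict(w, x):
    # Hypothetical one-weight linear model: y_hat = w * x
    return w * x

def mse(y_hat, y):
    return (y_hat - y) ** 2

def grad_w(w, x, y):
    """Chain rule: dL/dw = dL/dy_hat * dy_hat/dw."""
    y_hat = predict(w, x)
    dL_dyhat = 2.0 * (y_hat - y)   # gradient of squared error w.r.t. the prediction
    dyhat_dw = x                   # derivative of w * x w.r.t. w
    return dL_dyhat * dyhat_dw

w, x, y = 0.6, 2.0, 2.4            # illustrative values; the prediction is about 1.2

analytic = grad_w(w, x, y)

# Central finite difference as an independent check of the analytic gradient.
h = 1e-6
numeric = (mse(predict(w + h, x), y) - mse(predict(w - h, x), y)) / (2 * h)

# One gradient-descent step: move against the gradient.
lr = 0.1
w_new = w - lr * analytic

print(analytic, numeric)
print(mse(predict(w, x), y), mse(predict(w_new, x), y))
```

The gradient is negative here, so stepping against it increases `w`, raises the prediction toward the target, and lowers the loss.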
Tradeoffs
Pros and cons
Pros
- Gives the model a clear objective. The model knows what it is trying to minimize during training.
- Turns errors into gradients. Loss functions create differentiable signals that backpropagation can use.
- Works naturally with mini-batches. Losses from multiple samples can be averaged into one stable training objective.
- Can match different task types. Regression, binary classification, and multi-class classification each have suitable loss functions.
Cons
- Lower loss does not always mean better real-world performance. A model can reduce training loss but still perform poorly on new data.
- Wrong loss can teach the wrong behavior. Using a loss that does not match the task can make training inefficient or misleading.
- Sensitive to outliers or imbalance. Some losses can be dominated by extreme values or majority classes.
- Not every metric is easy to optimize. Metrics like accuracy are useful for evaluation, but accuracy changes only when a prediction crosses the decision threshold, so it gives little or no gradient for gradient-based training.
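The outlier sensitivity mentioned in the list above can be seen with a few made-up residuals. The numbers below are hypothetical; the point is only that squaring lets one large error dominate the mean.

```python
# Hypothetical residuals (prediction minus target); the last one is an outlier.
errors = [0.1, -0.2, 0.15, -0.1, 5.0]

squared = [e ** 2 for e in errors]
mean_loss = sum(squared) / len(squared)

# Mean squared error over the same batch with the outlier left out.
mean_loss_without = sum(squared[:-1]) / len(squared[:-1])

# Fraction of the total squared error contributed by the outlier alone.
outlier_share = squared[-1] / sum(squared)

print(mean_loss, mean_loss_without, outlier_share)
```

In this batch the single outlier accounts for almost all of the squared error, so its gradient would also dominate the update.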
Practice
Quick example and common mistakes
Quick Example
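Suppose a regression model predicts \(\hat{y} = 1.2\) and the true target is \(y = 2.4\). Using Mean Squared Error:
\[ L = (\hat{y} - y)^2 = (1.2 - 2.4)^2 = 1.44 \]
The loss value is \(1.44\), but the model does not update weights directly from this number. Backpropagation first computes:
\[ \frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y) = 2(1.2 - 2.4) = -2.4 \]
The negative gradient means the prediction is too low: increasing \(\hat{y}\) would reduce the loss. For a mini-batch, each sample has its own loss, and the training objective is the average of those losses.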
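A small mini-batch can be worked through the same way. In the sketch below the first sample is the one from the example above; the other two samples are invented purely for illustration.

```python
# First pair is the worked example; the remaining pairs are hypothetical.
preds = [1.2, 3.0, 0.5]
targets = [2.4, 2.5, 0.5]
m = len(preds)

# Each sample has its own squared-error loss.
per_sample_losses = [(p - t) ** 2 for p, t in zip(preds, targets)]

# The training objective is the average over the mini-batch.
batch_loss = sum(per_sample_losses) / m

# Gradient of the averaged loss w.r.t. each prediction: 2 * (p - t) / m.
grads = [2.0 * (p - t) / m for p, t in zip(preds, targets)]

print(per_sample_losses)
print(batch_loss)
print(grads)
```

The sample that is already correct contributes zero loss and zero gradient, while the under-predicted and over-predicted samples pull in opposite directions.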
Common Mistakes
- Thinking loss is the same as accuracy. Accuracy only counts whether each prediction is right or wrong. Loss also measures how far the prediction is from the target and, for probabilistic outputs, how confident it was.
- Using Mean Squared Error for every task. Mean Squared Error is natural for regression. For classification, cross-entropy is usually a better choice because it works directly with probabilities.
- Forgetting mini-batch averaging. Training usually optimizes the average loss across a mini-batch, not just the loss from one sample.
- Thinking the loss updates weights directly. The loss is only a scalar objective. Weights are updated after backpropagation computes gradients.
- Believing lower training loss always means a better model. A model can memorize the training data and achieve low training loss, but still fail on unseen examples.
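The first mistake in the list above, treating loss and accuracy as the same thing, can be illustrated with two hypothetical sets of predicted probabilities. Both classify every sample correctly at a 0.5 threshold, yet their binary cross-entropy differs. The labels, probabilities, and helper names are assumptions made for this sketch.

```python
import math

def bce(p, y):
    # Binary cross-entropy for one sample; assumes 0 < p < 1.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def mean_bce(probs, labels):
    return sum(bce(p, y) for p, y in zip(probs, labels)) / len(labels)

def accuracy(probs, labels, threshold=0.5):
    # A prediction counts as correct when it falls on the right side of the threshold.
    correct = sum((p >= threshold) == bool(y) for p, y in zip(probs, labels))
    return correct / len(labels)

labels = [1, 0, 1, 0]
confident = [0.95, 0.05, 0.90, 0.10]   # hypothetical confident model
hesitant = [0.55, 0.45, 0.60, 0.40]    # hypothetical barely-correct model

print(accuracy(confident, labels), mean_bce(confident, labels))
print(accuracy(hesitant, labels), mean_bce(hesitant, labels))
```

Accuracy cannot tell the two apart, while the loss still rewards the more confident correct predictions.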
Takeaway
A loss function converts prediction errors into a scalar objective.
Backpropagation converts that objective into gradients.
Those gradients become the training signal that tells the optimizer how to update the model.
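The whole chain (loss, gradient, update) fits in a few lines for a toy problem. This is only a sketch under stated assumptions: a one-weight model \(\hat{y} = w x\), made-up data generated from \(y = 2x\), plain gradient descent, and a hand-picked learning rate.

```python
# Hypothetical data following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def batch_loss(w):
    """Mean squared error of y_hat = w * x over the whole batch."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def batch_grad(w):
    """Gradient of the batch loss w.r.t. w, derived with the chain rule."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.0      # initial weight (arbitrary)
lr = 0.05    # learning rate (assumed; small enough to be stable on this data)
history = []

for step in range(50):
    history.append(batch_loss(w))   # scalar objective
    w -= lr * batch_grad(w)         # optimizer step against the gradient

print(w)
print(history[0], history[-1])
```

The loss only reports how wrong the model is; it is the gradient step that actually changes `w`, and here it drives the weight toward 2.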