Loss Functions & Training Signal


Loss family

Choose a loss that matches the prediction type and target format.

Regression values: prediction \(\hat{y} = 1.20\), target \(y = 2.40\).

Loss computation walkthrough

Step 1: 1.20. Current prediction from the model.
Step 2: 2.40. Ground-truth target or label.
Step 3: -1.20. Comparison term used by the loss.
Step 4: 1.44. Per-sample loss for this one example.
Step 5: 1.52. Average loss across the current batch.

Background

A neural network does not automatically know whether its prediction is good or bad. It only produces an output, such as a number, a probability, or a vector of class scores.

A loss function compares the model prediction \(\hat{y}\) with the true target \(y\), then turns the mistake into one scalar value.

The core idea is:

\[\text{prediction} \rightarrow \text{loss} \rightarrow \text{gradient} \rightarrow \text{update}\]

The loss tells the model how wrong it is. The gradient tells the optimizer how the weights should change.
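As a rough illustration of the chain above, the sketch below runs one prediction, loss, gradient, and update step. The one-weight model \(\hat{y} = w \cdot x\), the squared-error loss, the input values, and the learning rate are all assumptions chosen for illustration; they are not taken from the lesson.

```python
# Hypothetical one-weight model: y_hat = w * x (illustration only).
def predict(w, x):
    return w * x


def squared_error(y_hat, y):
    return (y_hat - y) ** 2


def grad_w(w, x, y):
    # Chain rule: dL/dw = dL/dy_hat * dy_hat/dw = 2 * (y_hat - y) * x
    return 2.0 * (predict(w, x) - y) * x


w, x, y = 0.6, 2.0, 2.4   # assumed values; the initial prediction is 1.2
learning_rate = 0.1       # assumed step size

loss_before = squared_error(predict(w, x), y)      # prediction -> loss
g = grad_w(w, x, y)                                # loss -> gradient
w_new = w - learning_rate * g                      # gradient -> update
loss_after = squared_error(predict(w_new, x), y)   # loss after one step

print(loss_before, g, w_new, loss_after)
```

With these assumed values, the step moves the weight against the gradient and the loss on this sample decreases.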

During training, the model usually does not learn from one sample at a time. It learns from a mini-batch, where the losses from several samples are averaged into one training objective.

Notation

Important formulas

General loss

\[L(\hat{y}, y)\]

The loss function compares the prediction \(\hat{y}\) with the true target \(y\).

Mini-batch loss

\[J = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})\]

Where \(m\) is the number of samples in the mini-batch.
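The averaging in \(J\) can be sketched in a few lines. The helper name `minibatch_loss`, the squared-error loss passed into it, and the sample values are illustrative assumptions.

```python
def squared_error(y_hat, y):
    return (y_hat - y) ** 2


def minibatch_loss(predictions, targets, loss_fn):
    """Average per-sample losses: J = (1/m) * sum_i L(y_hat_i, y_i)."""
    m = len(predictions)
    return sum(loss_fn(p, t) for p, t in zip(predictions, targets)) / m


predictions = [1.2, 0.5, 3.0]   # illustrative predictions
targets = [2.4, 0.5, 2.0]       # illustrative targets

J = minibatch_loss(predictions, targets, squared_error)
print(J)
```

Any per-sample loss with the signature `loss_fn(y_hat, y)` could be passed in place of `squared_error`.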

Mean Squared Error

\[L = (\hat{y} - y)^2\]

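Commonly used for regression tasks. For a single sample this is the squared error; the "mean" comes from averaging over the mini-batch, as in \(J\) above.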

Binary Cross-Entropy

\[L = -\left[y\log(\hat{y}) + (1-y)\log(1-\hat{y})\right]\]

Commonly used for binary classification.
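A minimal sketch of the binary cross-entropy formula follows, showing how the loss grows as a prediction for a positive example (\(y = 1\)) becomes less correct. The `eps` clamp is an assumed safeguard against \(\log(0)\) and is not part of the formula itself.

```python
import math


def binary_cross_entropy(y_hat, y, eps=1e-12):
    # Clamp y_hat away from 0 and 1 so log() is always defined (assumed safeguard).
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


# Three predicted probabilities for the same true label y = 1.
confident_right = binary_cross_entropy(0.9, 1)
unsure = binary_cross_entropy(0.6, 1)
confident_wrong = binary_cross_entropy(0.1, 1)

print(confident_right, unsure, confident_wrong)
```

The confidently wrong prediction receives by far the largest loss, which is the behavior the formula is designed to produce.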

Multi-class Cross-Entropy

\[L = -\sum_{k=1}^{K} y_k \log(\hat{y}_k)\]

Commonly used for multi-class classification.
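The multi-class formula can be sketched the same way. The example assumes \(\hat{y}\) is already a vector of probabilities and \(y\) is a one-hot target; the three-class values and the `eps` guard are illustrative assumptions.

```python
import math


def cross_entropy(probs, one_hot, eps=1e-12):
    # L = -sum_k y_k * log(y_hat_k); eps guards against log(0) (assumed safeguard).
    return -sum(t * math.log(max(p, eps)) for p, t in zip(probs, one_hot))


probs = [0.7, 0.2, 0.1]   # predicted class probabilities (illustrative)
target = [1, 0, 0]        # one-hot target: the true class is index 0

loss = cross_entropy(probs, target)
print(loss)
```

With a one-hot target, only the term for the true class survives, so the loss reduces to \(-\log\) of the probability assigned to the correct class.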

Training signal

\[\frac{\partial J}{\partial W}\]

The gradient tells the optimizer how changing the weight \(W\) will change the loss. For Mean Squared Error: \(\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y)\). Backpropagation carries this signal back to \(W\) through the chain rule, and the optimizer steps against the gradient so that the loss decreases.
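One way to sanity-check the Mean Squared Error derivative is to compare it with a finite-difference estimate. The sketch below does this for a single sample; the step size `h` and the function names are assumptions made for illustration.

```python
def squared_error(y_hat, y):
    return (y_hat - y) ** 2


def analytic_grad(y_hat, y):
    # dL/dy_hat = 2 * (y_hat - y)
    return 2.0 * (y_hat - y)


def numeric_grad(y_hat, y, h=1e-6):
    # Central-difference approximation of dL/dy_hat.
    return (squared_error(y_hat + h, y) - squared_error(y_hat - h, y)) / (2 * h)


y_hat, y = 1.2, 2.4
a = analytic_grad(y_hat, y)
n = numeric_grad(y_hat, y)
print(a, n)
```

The two values should agree up to floating-point error, which suggests the closed-form derivative is the one backpropagation would propagate.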

Tradeoffs

Pros and cons

Pros

Gives the model a clear objective. The model knows what it is trying to minimize during training.
Turns errors into gradients. Loss functions create differentiable signals that backpropagation can use.
Works naturally with mini-batches. Losses from multiple samples can be averaged into one stable training objective.
Can match different task types. Regression, binary classification, and multi-class classification each have suitable loss functions.

Cons

Lower loss does not always mean better real-world performance. A model can reduce training loss but still perform poorly on new data.
Wrong loss can teach the wrong behavior. Using a loss that does not match the task can make training inefficient or misleading.
Sensitive to outliers or imbalance. Some losses can be dominated by extreme values or majority classes.
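Not every metric is easy to optimize. Metrics like accuracy are useful for evaluation, but accuracy only changes when a prediction crosses the decision boundary, so its gradient is zero almost everywhere and gives gradient-based training nothing to follow.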
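The outlier sensitivity listed in the cons can be made concrete with a small sketch. The four error values below are invented for illustration; one of them is deliberately extreme.

```python
# Hypothetical per-sample errors (y_hat - y); the last one is an outlier.
errors = [0.1, -0.2, 0.1, 5.0]

squared = [e ** 2 for e in errors]
mse = sum(squared) / len(squared)

# Fraction of the total squared error contributed by the largest term.
outlier_share = max(squared) / sum(squared)

print(mse, outlier_share)
```

Because the error is squared, the single extreme sample accounts for nearly all of the batch loss, and therefore for most of the gradient as well.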

Practice

Quick example and common mistakes

Quick Example

Suppose a regression model predicts \(\hat{y} = 1.2\) and the true target is \(y = 2.4\). Using Mean Squared Error:

\[L = (\hat{y} - y)^2\]

\[L = (1.2 - 2.4)^2 = (-1.2)^2 = 1.44\]

The loss value is \(1.44\), but the model does not update weights directly from this number. Backpropagation first computes:

\[\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y) = 2(1.2 - 2.4) = -2.4\]

The negative gradient means that increasing \(\hat{y}\) would reduce the loss, so the prediction is too low. For a mini-batch, each sample has its own loss, and the batch objective is their average:

\[L_1 = 5.76,\quad L_2 = 2.40,\quad L_3 = 0.42,\quad L_4 = 0.02\]

\[J = \frac{5.76 + 2.40 + 0.42 + 0.02}{4} = 2.15\]
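The arithmetic in this example can be reproduced with a short script. It uses only the numbers given above; the variable names are illustrative.

```python
# Single-sample squared error and its derivative with respect to y_hat.
y_hat, y = 1.2, 2.4
loss = (y_hat - y) ** 2
grad = 2 * (y_hat - y)

# Per-sample losses from the mini-batch in the example.
per_sample_losses = [5.76, 2.40, 0.42, 0.02]
J = sum(per_sample_losses) / len(per_sample_losses)

# Rounded to two decimals; compare with the values in the worked example.
print(round(loss, 2), round(grad, 2), round(J, 2))
```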

Common Mistakes

Thinking loss is the same as accuracy. Accuracy only counts whether each prediction is right or wrong. Loss also reflects how far off a prediction is and, for probabilistic outputs, how confident it was.
Using Mean Squared Error for every task. Mean Squared Error is natural for regression. For classification, cross-entropy is usually a better choice because it works directly with probabilities.
Forgetting mini-batch averaging. Training usually optimizes the average loss across a mini-batch, not just the loss from one sample.
Thinking the loss updates weights directly. The loss is only a scalar objective. Weights are updated after backpropagation computes gradients.
Believing lower training loss always means a better model. A model can memorize the training data and achieve low training loss, but still fail on unseen examples.
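The first mistake in the list, treating loss as accuracy, can be illustrated with a small sketch. Two hypothetical sets of predicted probabilities classify every example correctly, yet their binary cross-entropy differs. The labels, probabilities, and the 0.5 decision threshold are assumptions made for illustration.

```python
import math


def bce(y_hat, y):
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


def accuracy(probs, labels, threshold=0.5):
    # A prediction counts as correct when thresholding it recovers the label.
    hits = sum((p >= threshold) == bool(t) for p, t in zip(probs, labels))
    return hits / len(labels)


def mean_bce(probs, labels):
    return sum(bce(p, t) for p, t in zip(probs, labels)) / len(labels)


labels = [1, 0, 1, 0]
confident = [0.95, 0.05, 0.90, 0.10]   # correct and confident
hesitant = [0.55, 0.45, 0.60, 0.40]    # correct but barely

print(accuracy(confident, labels), mean_bce(confident, labels))
print(accuracy(hesitant, labels), mean_bce(hesitant, labels))
```

Both sets reach the same accuracy, but the hesitant predictions have a noticeably higher loss, so the loss still provides a training signal where accuracy shows no difference.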

Takeaway

A loss function converts prediction errors into a scalar objective.

Backpropagation converts that objective into gradients.

Those gradients become the training signal that tells the optimizer how to update the model.