Interactive prototype

Activation Functions Comparison

Compare ReLU, Leaky ReLU, Sigmoid, and Tanh to see how each nonlinearity shapes a neuron's response.

Structured teaching notes

Connect the interaction to the core idea.

These notes are written to sit below the interactive prototype, preserve the same teaching flow, and help the learner name what the visualization is showing.

Background

Activation functions introduce nonlinearity into a neural network, allowing it to model complex relationships instead of collapsing into a single linear mapping. After a neuron computes a weighted sum, the activation decides how strongly that signal should pass forward.
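The "collapsing into a single linear mapping" point can be checked numerically. The sketch below is a minimal illustration, not taken from the prototype: the weight matrices W1 and W2, the input x, and the helper names are made-up values chosen for easy arithmetic, and biases are left out to keep it short.

```python
# Two stacked linear layers with no activation equal one linear layer.
# W1, W2, and x are arbitrary illustrative values (assumptions, not from the notes).

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def relu_vec(v):
    """Apply ReLU to each element of a vector."""
    return [max(0.0, vi) for vi in v]

W1 = [[1.0, -2.0], [0.5, 3.0]]
W2 = [[2.0, 1.0], [-1.0, 4.0]]
x = [1.0, 2.0]

# No activation: layer 2 applied to layer 1.
two_layers = matvec(W2, matvec(W1, x))

# The same result from a single combined matrix W2 @ W1.
one_layer = matvec(matmul(W2, W1), x)

# With ReLU between the layers, the result is no longer that linear map.
with_relu = matvec(W2, relu_vec(matvec(W1, x)))

print(two_layers)  # [0.5, 29.0]
print(one_layer)   # [0.5, 29.0]
print(with_relu)   # [6.5, 26.0]
```

Without an activation, the second layer adds no expressive power. With ReLU in between, the negative hidden value is zeroed and the output changes.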

Different activations suit different roles. ReLU is common in hidden layers because it is cheap to compute and usually trains well. Sigmoid and tanh are bounded, which is useful for probabilities (sigmoid) or zero-centered outputs (tanh), but both saturate for inputs of large magnitude, which can slow learning in deep networks.

Important formulas

ReLU(x) = max(0, x)

ReLU outputs zero for negative inputs and returns the input itself for positive inputs.

σ(x) = 1 / (1 + e^(-x))

Sigmoid maps a real-valued input to the interval (0, 1).

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Hyperbolic tangent maps inputs to the interval (-1, 1) and is zero-centered.
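The formulas above translate almost directly into code. This is a plain transcription for illustration, not a production implementation: the sigmoid as written can overflow for very negative inputs, and the Leaky ReLU slope alpha=0.01 is an assumed default, since these notes do not give a value for it.

```python
import math

def relu(x):
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: a small slope alpha (assumed 0.01 here) for negative inputs."""
    return x if x > 0 else alpha * x

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^(-x)), with output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)), with output in (-1, 1)."""
    ex, enx = math.exp(x), math.exp(-x)
    return (ex - enx) / (ex + enx)

# Sanity checks at the center and against the standard library.
print(relu(-3.0), relu(3.0))   # 0.0 3.0
print(sigmoid(0.0))            # 0.5
print(tanh(0.0))               # 0.0
print(abs(tanh(1.5) - math.tanh(1.5)) < 1e-12)  # True
```

In practice, `math.tanh` (or a numerical library's equivalent) is preferable to the hand-written version, which is shown only to mirror the formula.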

Pros

Introduce nonlinearity so networks can model patterns beyond linear transformations.
ReLU has a constant gradient of 1 on the positive side, so it does not saturate there, which often speeds up training.
Sigmoid and tanh give bounded outputs, which is useful when a value must be read as a probability or kept within a fixed range.

Cons

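Sigmoid and tanh saturate for inputs of large magnitude, where their gradients approach zero (the vanishing gradient problem).
ReLU can produce "dead neurons": a unit whose input is negative for every training example always outputs zero, receives zero gradient, and stops learning.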
Using the wrong activation in the wrong place can slow or destabilize optimization.
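Saturation and dead units both come down to the size of the derivative. The sketch below uses the standard derivative identities σ'(x) = σ(x)(1 - σ(x)) and tanh'(x) = 1 - tanh(x)^2. The function names and sample inputs are illustrative choices, and the ReLU derivative at exactly 0 is set to 0 here by convention.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s), which peaks at 0.25 when x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which peaks at 1 when x = 0."""
    t = math.tanh(x)
    return 1.0 - t * t

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise (0 at x = 0 by convention)."""
    return 1.0 if x > 0 else 0.0

for x in (-10.0, 0.0, 10.0):
    print(f"x={x:+5.1f}  sigmoid'={sigmoid_grad(x):.2e}  "
          f"tanh'={tanh_grad(x):.2e}  relu'={relu_grad(x):.0f}")
```

At x = 10 the sigmoid and tanh gradients are close to zero while the ReLU gradient is still 1. At x = -10 the ReLU gradient is 0, which is the situation a "dead" unit is stuck in.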

Quick example

For x = -2, ReLU gives 0, sigmoid gives about 0.12, and tanh gives about -0.96. The same weighted sum can therefore produce very different downstream behavior depending on the activation.
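The numbers in this example can be reproduced with a few lines. This is a small check using the standard library; the dictionary name `outputs` is just a convenience for printing.

```python
import math

x = -2.0

outputs = {
    "relu": max(0.0, x),
    "sigmoid": 1.0 / (1.0 + math.exp(-x)),
    "tanh": math.tanh(x),
}

# Print each activation's output rounded to two decimals.
for name, value in outputs.items():
    print(f"{name:8s} {value:+.2f}")
```

Changing `x` to a positive value such as 2.0 shows the opposite picture: ReLU passes the input through unchanged while sigmoid and tanh stay bounded.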

Common mistake

A common error is to use the same activation everywhere without thinking about the task. For example, a sigmoid on the final layer of a single-label multi-class problem gives the wrong output structure: each class gets an independent score in (0, 1), and the scores do not sum to 1. In hidden layers, relying heavily on saturating activations such as sigmoid and tanh can make training slow and unstable.
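To make the output-structure point concrete, the sketch below compares an elementwise sigmoid with softmax on the same scores. Softmax is not named in these notes; it is brought in here as the usual choice for single-label multi-class outputs, and the `logits` values are invented for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    """Softmax with the max subtracted first for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical final-layer scores for three classes.
logits = [2.0, 1.0, 0.1]

sigmoid_scores = [sigmoid(z) for z in logits]
softmax_probs = softmax(logits)

print("sigmoid sum:", round(sum(sigmoid_scores), 3))  # greater than 1
print("softmax sum:", round(sum(softmax_probs), 3))   # 1.0
```

The sigmoid scores are each valid on their own but do not form a distribution over the classes, whereas the softmax outputs sum to 1. Independent sigmoids remain a reasonable fit when several labels can be true at once.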