XOR, hidden-layer representations, the chain rule made visible, and depth as a compositional shortcut.
A single-layer perceptron (SLP) computes a weighted sum of its inputs and thresholds the result. Geometrically, the set of inputs where that sum equals the threshold is a hyperplane — in two dimensions, a straight line. The SLP can only distinguish two classes if some line separates them. That’s called linear separability, and it’s the SLP’s expressive ceiling.
| x₁ | x₂ | y |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
AND requires both inputs on. The single “on” corner is easily cut off by a line.
| x₁ | x₂ | y |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |
OR is the mirror image: one “off” corner, easily cut away.
| x₁ | x₂ | y |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
XOR puts the two “on” corners on opposite diagonals. No straight line works.
XOR is the simplest boolean function that’s not linearly separable, which is why it became the canonical counterexample in Minsky & Papert’s 1969 Perceptrons. The impossibility is easy to see: both diagonals share the midpoint (½, ½), so any line that puts (0,1) and (1,0) on its “on” side also puts that midpoint there — and with it at least one of the opposite corners (0,0) and (1,1).
What the hidden layer does: each hidden unit draws its own line in input space — those are the dashed lines in the Input Space card above. With two hidden units, the combined response in the (h₁, h₂) plane — the Hidden-Unit Space card — is a remapping under which the four XOR points become linearly separable. The output unit then does nothing more than an SLP on that new representation. The hidden layer is doing exactly one identifiable thing: learning a feature space where the target is linearly separable. That framing generalizes directly — deep learning is representation learning.
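That remapping can be made concrete with a hand-wired 2-2-1 network. This is a minimal sketch with weights chosen by hand rather than learned; the thresholds and the hard step activation are illustrative, not what training would produce:

```python
def step(z):
    """Hard threshold: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # Hidden unit h1 fires for OR(x1, x2); h2 fires for AND(x1, x2).
    h1 = step(x1 + x2 - 0.5)
    h2 = step(x1 + x2 - 1.5)
    # In the (h1, h2) plane the four points are now linearly separable:
    # the output fires when h1 is on but h2 is off, i.e. exactly one input on.
    return step(h1 - 2 * h2 - 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), xor_net(x1, x2))  # prints the XOR truth table
```

The output unit is itself just an SLP: a weighted sum of (h₁, h₂) against a threshold.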
Try: toggle the hidden layer off and train — the network becomes an SLP and the loss plateaus at chance. Turn it back on and watch both the input-space boundary curl around and the four points in hidden space untangle. With only 2 hidden units, bad initializations sometimes get stuck; hit Reset and try again. Bumping to 3 hidden units makes the problem robustly solvable across all activations.
Every MLP that follows in this explorer is described by a short piece of notation. 2-8-3 means 2 input units, 8 hidden units, 3 output units. Extra numbers just extend the string: 2-8-8-3 adds a second hidden layer. The first number is always the input size (how many features each example has); the last is always the output size (how many classes, or 1 for regression); everything in between is hidden.
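As a sanity check on the notation, here is a short sketch that counts the parameters a fully connected architecture string implies (weights plus biases; the function name is our own):

```python
def param_count(arch):
    """Weights + biases for a fully connected net written as '2-8-3'."""
    sizes = [int(s) for s in arch.split("-")]
    # Each consecutive pair contributes fan_in * fan_out weights + fan_out biases.
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))

print(param_count("2-8-3"))    # (2*8 + 8) + (8*3 + 3) = 51
print(param_count("2-8-8-3"))  # adds an (8*8 + 8) block: 123
```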
This tab sets up a 2-8-3 network on a three-class 2-D problem.
Each hidden unit computes a pre-activation z = w·x + b — a linear combination of its inputs plus a bias — and then passes z through a nonlinear activation function. This tab lets you pick from three standard choices:
Plotted over z ∈ [−3, 3]: σ(z) = 1 / (1 + e−z), tanh(z) = (ez − e−z) / (ez + e−z), and ReLU(z) = max(0, z). Sigmoid and tanh saturate at their bounds (flat tails, near-zero derivatives); ReLU is unbounded above with slope 1 for z > 0.

Sigmoid / Logistic is smooth and bounded to [0, 1] (“sigmoid” is a general term for any S-shaped curve; in ML shorthand it usually means the logistic function shown here). Tanh (hyperbolic tangent) is smooth, bounded to [−1, 1], and zero-centered — a helpful property for gradient flow. ReLU (Rectified Linear Unit, max(0, z)) is the modern default: unbounded on the positive side, so gradients don’t vanish when z is large; trivially cheap to compute; and its only nonlinearity is a kink at zero. Sigmoid and tanh suffer from vanishing gradients in deep networks because their derivatives shrink to zero at the saturated tails.
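The three activations are simple enough to evaluate directly; a quick sketch of their values at the edges of the plotted range shows the saturation behavior:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

# At the tails of the plotted range, sigmoid and tanh have flattened out
# (values near their bounds), while ReLU still grows with slope 1.
for z in (-3.0, 0.0, 3.0):
    print(f"z={z:+.0f}  sigmoid={sigmoid(z):.3f}  tanh={math.tanh(z):.3f}  relu={relu(z):.1f}")
```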
The output layer has one unit per class — three in this tab. Each output unit produces its own pre-activation zk (a linear combination of the hidden-layer activations plus a bias), which we call a logit. The raw logits z1, z2, z3 are unbounded real numbers representing each class’s “strength”; what we actually want is a probability distribution over classes (three numbers in [0, 1] that sum to 1). During training, we compare that predicted distribution to a one-hot target vector y: a 3-vector with a 1 at the true-class position and 0 elsewhere. A class-1 example, for instance, has y = (0, 1, 0).
The output layer uses softmax to turn the 3-tuple of logits into that probability distribution. Each logit is divided by a temperature T, exponentiated, and normalized so the three results sum to 1: pk = exp(zk/T) / ∑j exp(zj/T). This is the Boltzmann distribution from statistical mechanics — the logits play the role of negative energies, and the inverse temperature β = 1/T sets how peaky the distribution is. Low T (high β, “cold”) concentrates probability on the largest logit, approaching a hard argmax. High T (low β, “hot”) spreads probability toward uniform, approaching maximum entropy over the classes. This tab uses T = 1 (the standard classification default); our transformers widget makes T interactive so you can see how it shapes next-token sampling.
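A numerically stable sketch of temperature softmax follows; subtracting the max logit before exponentiating leaves the result unchanged (the constant factor cancels in the ratio) but avoids overflow. The logits here are arbitrary:

```python
import math

def softmax(logits, T=1.0):
    # Shifting by the max is safe: exp((z - m)/T) differs from exp(z/T)
    # by a constant factor that cancels when we normalize.
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax(logits, T=1.0))    # the standard classification default
print(softmax(logits, T=0.01))   # cold: almost all mass on the largest logit
print(softmax(logits, T=100.0))  # hot: close to uniform over the classes
```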
Softmax pairs with cross-entropy loss, which measures how far the predicted distribution p is from the true distribution y. For a single example with K classes, it is the negative log-probability the model assigned to the correct class (from p), weighted by the target probability (from y). That is, during training, for a specific input example that produces a K-class probability estimate p = (p1, …, pK), the loss for that one training input is:
L = −∑k=1..K yk log pk = −log pc
Here “log” means natural log, by convention in machine learning. Because y is one-hot, every yk = 0 except at the index k = c of the correct class, where yc = 1, which is why the per-example loss collapses to L = −log pc, the negative log of the probability for the correct class. If the model assigns high probability to the correct class (pc → 1), the loss for that example goes to 0; if it assigns low probability (pc → 0), the loss blows up, punishing confident wrong answers hard.
Over a training batch of N examples where each example n ∈ {1, …, N} has per-example loss Ln, the training procedure uses the average per-example loss Lbatch = (1/N) ∑n=1..N Ln.
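Both the per-example collapse and the batch average take only a few lines; the probabilities and targets below are made up for illustration:

```python
import math

def cross_entropy(p, y):
    """Per-example CE; with a one-hot y this collapses to -log p[c]."""
    return -sum(yk * math.log(pk) for pk, yk in zip(p, y))

p = [0.2, 0.7, 0.1]   # predicted distribution
y = [0, 1, 0]         # one-hot target: true class is index 1
print(cross_entropy(p, y))          # -log(0.7), about 0.357

# Batch loss: the average of the per-example losses.
batch = [([0.2, 0.7, 0.1], [0, 1, 0]),
         ([0.9, 0.05, 0.05], [1, 0, 0])]
print(sum(cross_entropy(pi, yi) for pi, yi in batch) / len(batch))
```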
With T = 1, the softmax probabilities combined with the cross-entropy loss mean that the gradient propagated back from the output into the network simplifies cleanly to p − y (the predicted distribution minus the one-hot target) — the same elegant form sigmoid + binary cross-entropy has in Tabs 3 and 4.
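The p − y form is easy to verify numerically: differentiate the composed softmax-plus-cross-entropy loss with central finite differences and compare. The logits and class index below are arbitrary:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [x / s for x in e]

def ce_loss(z, c):
    """Cross-entropy of softmax(z) against one-hot class c."""
    return -math.log(softmax(z)[c])

z, c = [1.2, -0.3, 0.8], 1          # arbitrary logits, true class 1
p = softmax(z)
analytic = [p[k] - (1.0 if k == c else 0.0) for k in range(len(z))]

# Central finite differences on each logit.
eps = 1e-6
numeric = []
for k in range(len(z)):
    zp, zm = z[:], z[:]
    zp[k] += eps
    zm[k] -= eps
    numeric.append((ce_loss(zp, c) - ce_loss(zm, c)) / (2 * eps))

print(analytic)
print(numeric)   # agrees with p - y to several decimal places
```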
Pick a hidden activation, pick a learning rate, and click Train to watch the 2-8-3 network learn a three-ring classification problem. Reset gives fresh random weights on the same problem. The interesting comparison isn’t just how fast each activation drives loss down — it’s also how differently the decision regions are shaped once training settles.
Its derivative σ′(z) = σ(z)(1−σ(z)) peaks at just 0.25 and decays to zero on both tails. Gradients are small from the start and shrink further as hidden units saturate, so loss descends sluggishly. Hidden activations are all in [0, 1], so the next layer sees a positive-mean input — a mild handicap for optimization. Decision boundaries come out smoothly curved.
Saturates the same way, but with derivative tanh′(z) = 1 − tanh²(z) peaking at 1.0 — four times sigmoid’s peak. Gradient signal is stronger near zero, so the loss drops more briskly. The zero-centered range [−1, 1] lets hidden activations balance positive and negative, which is easier on subsequent layers. Boundaries are smooth, often nearly circular on this ring problem.
Derivative is exactly 1 for z > 0 — no saturation, no gradient decay. Training typically descends fastest of the three when it works. The signature visual: boundaries are piecewise linear, straight segments meeting at kinks where individual hidden units cross zero. Watch for dead neurons, too: a unit stuck with z < 0 for every training example outputs 0 and receives zero gradient, permanently. This manifests as loss plateaus or “missing” wedges in the boundary.
| x₁ | x₂ | y |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 0 | 1 |
| 0 | 1 | 1 |
| 1 | 1 | 0 |
Training a network means adjusting each weight wkij (from the i-th neuron in layer k−1 to the j-th neuron in layer k) to reduce the loss L. The SGD update for a single weight is
wkij ← wkij − η · ∂L/∂wkij
So the one thing we really need is ∂L/∂wkij — the gradient of the loss with respect to that one weight. Backpropagation is the efficient recipe for computing it at every weight in the network.
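On a toy 1-1-1 network (one input, one sigmoid hidden unit, one sigmoid output, squared-error loss; all numbers invented), the gradient for the first weight can be assembled factor by factor with the chain rule and checked against a finite difference:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forward pass: x -> h = sigmoid(w1*x + b1) -> yhat = sigmoid(w2*h + b2)
x, y = 1.5, 1.0
w1, b1, w2, b2 = 0.4, -0.1, 0.7, 0.2

z1 = w1 * x + b1;  h = sigmoid(z1)
z2 = w2 * h + b2;  yhat = sigmoid(z2)

# Chain rule for dL/dw1, outermost factor first (L = 0.5*(yhat - y)^2):
dL_dyhat = yhat - y
dyhat_dz2 = yhat * (1 - yhat)
dz2_dh = w2
dh_dz1 = h * (1 - h)
dz1_dw1 = x
grad_w1 = dL_dyhat * dyhat_dz2 * dz2_dh * dh_dz1 * dz1_dw1

# Finite-difference check: nudge w1 and re-run the forward pass.
eps = 1e-6
def loss_at(w1_):
    h_ = sigmoid(w1_ * x + b1)
    return 0.5 * (sigmoid(w2 * h_ + b2) - y) ** 2
numeric = (loss_at(w1 + eps) - loss_at(w1 - eps)) / (2 * eps)

print(grad_w1, numeric)   # the two agree closely
```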
Hover over a weight in the map below to identify it; click a weight to see how the chain rule assembles its gradient. The zoomed view updates to match.
a(z) and a′(z)

| Name | a(z) | a′(z) |
|---|---|---|
| Logistic sigmoid σ | 1 / (1 + e−z) | σ(z) · (1 − σ(z)) |
| Hyperbolic tangent tanh | (ez − e−z) / (ez + e−z) | 1 − tanh²(z) |
| Rectified linear unit ReLU | max(0, z) | 1 if z > 0, else 0 |
| Linear (identity) | z | 1 |
In practice, all neurons in a given hidden layer typically use the same activation function (most often ReLU for deep networks, or tanh for shallower ones). The output layer’s activation is chosen to match the task: sigmoid for binary classification (paired with BCE loss), softmax for multi-class classification (paired with cross-entropy), or linear for regression (paired with squared error).
L and ∂L/∂ŷ

| Name | L | ∂L/∂ŷ |
|---|---|---|
| Binary cross-entropy BCE | −y·log(ŷ) − (1−y)·log(1−ŷ) | (ŷ − y) / (ŷ·(1−ŷ)) |
| Squared error SE | ½(ŷ − y)2 | ŷ − y |
Simplification when pairings align. At an output neuron, the product ∂L/∂ŷ · a′(z) often collapses. BCE + sigmoid: the sigmoid's ŷ(1−ŷ) cancels BCE's denominator, giving the clean ∂L/∂z = ŷ − y. Squared error + linear output: the linear derivative is 1, so ∂L/∂z = ŷ − y too. These conveniences are why those pairings are standard.
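The BCE + sigmoid cancellation can be checked directly, multiplying the two table entries and comparing against a finite difference on the logit (the logit and target values are arbitrary):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(z, y):
    yhat = sigmoid(z)
    return -y * math.log(yhat) - (1 - y) * math.log(1 - yhat)

z, y = 0.9, 1.0
yhat = sigmoid(z)

# Multiply the two table entries: the yhat*(1 - yhat) factors cancel.
dL_dyhat = (yhat - y) / (yhat * (1 - yhat))
dyhat_dz = yhat * (1 - yhat)
print(dL_dyhat * dyhat_dz, yhat - y)   # identical: yhat - y

# Confirm against a finite difference on the logit.
eps = 1e-6
numeric = (bce_from_logit(z + eps, y) - bce_from_logit(z - eps, y)) / (2 * eps)
print(numeric)
```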
Everything above assumes each weight appears once in the computation graph. Convolutional nets and recurrent nets deliberately share weights — the same kernel is applied at many spatial positions, or the same recurrent matrix at every time step. When a weight appears in multiple places, the chain rule naturally sums contributions from each occurrence: ∂L/∂w becomes a sum over locations rather than a single-path product. The per-location derivative is computed the same way; all those per-location gradients are added together before the SGD update.
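A toy version of this sum: one shared weight plays the role of a size-1 kernel applied at every position, and the gradient with respect to it is the sum of the per-position contributions (inputs and targets are invented):

```python
# A single shared weight w applied at every position of a 1-D input.
# The loss sums over positions, so dL/dw sums the per-position gradients.
x = [1.0, -2.0, 0.5]
targets = [0.5, -1.0, 0.0]
w = 0.7

def loss(w_):
    return sum(0.5 * (w_ * xi - ti) ** 2 for xi, ti in zip(x, targets))

# Per-occurrence gradient at position i: (w*x[i] - t[i]) * x[i]; then sum.
per_position = [(w * xi - ti) * xi for xi, ti in zip(x, targets)]
grad = sum(per_position)

# Finite-difference check on the shared weight.
eps = 1e-6
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)
print(per_position, grad, numeric)
```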
The chain rule's three-factor shape works at every weight in every layer. What differs is how the upstream factor ∂L/∂aki is computed — for output neurons it comes directly from the loss; for hidden neurons it comes from a sum of downstream gradients. That's the whole recursion. The specific clean form ∂L/∂zKi = ŷ − y you'll see in the Backprop Walkthrough tab is an arithmetic simplification that happens when BCE pairs with sigmoid output — those two factors multiply into that tidy difference. Convenient, not structural.
The universal approximation theorem guarantees that a single hidden layer can represent any continuous function — given enough width. But UAT makes no promise about how that representation is found or how many parameters it needs. Two separate questions live below the surface:
(1) Can SGD actually find the weights? UAT is an existence proof, not a training guarantee. Sometimes the required solution sits in a region of weight-space that gradient descent doesn’t reach from typical random inits.
(2) Is the representation compact? Some functions a deep network expresses with a handful of neurons require exponentially more neurons in a shallow one. Matching parameter counts, the deeper architecture can reach solutions the shallow one simply can’t afford.
Below, three networks with identical parameter budgets (37 each) train on the same problem with the same learning rate. The story lives in the loss curves at the bottom: watch how quickly each architecture drives its loss down and how low each plateaus. On easy problems like noisy XOR all three converge similarly. On harder ones like spirals, the single-hidden-layer network often stalls while the deeper networks keep descending — same parameter budget, different training outcomes.
A convolutional network is, structurally, just an MLP with most of its weights frozen at zero and the remaining ones tied to share a single value. No new machinery — two constraints on a vanilla fully-connected network:

(1) Local connectivity: each hidden unit connects only to a small window of neighboring inputs; every weight outside that window is frozen at zero.

(2) Weight sharing: the surviving weights are tied across positions, so every window applies the same kernel.
These constraints drop the parameter count dramatically. They also encode an inductive bias: a feature detector that helps at one position probably helps at others. This bias is true for images (translation equivariance), speech (time-shift invariance), and many other signals where what matters is the shape of a local pattern, not where it sits.
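The equivalence is easy to demonstrate in one dimension: build the fully connected weight matrix that the two constraints leave behind, and check that it computes the same thing as sliding the kernel. The kernel and input values below are made up:

```python
# A 1-D convolution written two ways: as a sliding kernel, and as a full
# weight matrix that is mostly zero with the survivors tied to the kernel.
kernel = [0.5, -1.0, 0.5]          # hypothetical 3-tap filter
x = [0.0, 1.0, 3.0, 1.0, 0.0, 2.0]
n_out = len(x) - len(kernel) + 1   # "valid" convolution, no padding

# Regime 1: slide the kernel across the input.
conv = [sum(k * x[i + j] for j, k in enumerate(kernel)) for i in range(n_out)]

# Regime 2: the equivalent fully connected weight matrix. Row i is zero
# everywhere except columns i..i+2, which hold the shared kernel values.
W = [[kernel[j - i] if i <= j < i + len(kernel) else 0.0
      for j in range(len(x))] for i in range(n_out)]
fc = [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(n_out)]

print(conv)
print(fc)   # identical: the CNN is this FC layer plus the two constraints
```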
Below: a small 1D network shown in all three regimes. Toggle between them to watch the weight matrix lose entries and then tie its survivors together. Click any hidden or output neuron to see its receptive field.
Click any hidden-layer or output-layer neuron to highlight its receptive field back through the active connections. In CNN mode, shared weights share a color.
A smaller parameter count earns two practical payoffs at once. First, the training burden shrinks — with eight weights to adjust instead of 142, there’s less gradient noise per update, less data needed to pin down each weight, and a much smaller loss landscape for the optimizer to search. Second, the inductive bias generalizes when the problem has the right kind of regularity. Here all three networks learn a 1-D bump-detection task on a 10-position input vector: decide whether the vector contains an elevated value somewhere. Critically, the training set only has bumps at positions 3–6 (four of the ten positions), while the test set has bumps at positions 0–2 and 7–9 (the remaining six) — positions no network has ever seen a bump at during training.
Translation equivariance (the same feature detector useful everywhere) is the inductive bias built into the CNN’s weight-sharing constraint. A CNN’s filter “bump somewhere in this window” works no matter where the window sits. The FC network has to learn position-specific detectors from scratch — and it has only seen four of ten positions in training. Local connectivity alone gets partway there.
Two panels, two stories. In the training-loss panel, notice that the CNN fits the task with just eight free parameters — the constraint doesn’t cripple learning, it focuses it, and often reaches low loss faster than the FC network whose 142 parameters give the optimizer much more surface to explore. In the test-loss panel, the generalization gap: the FC network has no reason to expect features learned at positions 3–6 to transfer to positions 0–2 or 7–9, and its test loss typically plateaus well above the CNN’s. The CNN, applying the same filter everywhere by construction, transfers for free.