Multi-Layer Perceptron & Backpropagation

XOR, hidden-layer representations, the chain rule made visible, and depth as a compositional shortcut.

© 2026 Theodore P. Pavlic · MIT License
Notation

Every MLP that follows in this explorer is described by a short piece of notation. 2-8-3 means 2 input units, 8 hidden units, 3 output units. Extra numbers just extend the string: 2-8-8-3 adds a second hidden layer. The first number is always the input size (how many features each example has); the last is always the output size (how many classes, or 1 for regression); everything in between is hidden.
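To make the notation concrete, here is a small sketch (plain Python; `mlp_shapes` is a name invented for this example) that expands a size string into per-layer weight and bias shapes plus a total parameter count:

```python
def mlp_shapes(spec):
    """Parse an architecture string like '2-8-3' into per-layer
    (weight_shape, bias_shape) pairs and a total parameter count."""
    sizes = [int(s) for s in spec.split("-")]
    shapes = [((n_in, n_out), (n_out,))
              for n_in, n_out in zip(sizes, sizes[1:])]
    n_params = sum(n_in * n_out + n_out for (n_in, n_out), _ in shapes)
    return shapes, n_params

shapes, n = mlp_shapes("2-8-3")
print(shapes)  # [((2, 8), (8,)), ((8, 3), (3,))]
print(n)       # 51  (2*8 + 8 weights+biases, then 8*3 + 3)
```

A longer string like 2-8-8-3 works the same way: each adjacent pair of sizes becomes one weight matrix.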

Diagrams of the 2-8-3 and 2-8-8-3 networks: input units on the left, output units on the right, hidden units in the middle. Output units are tinted to match the three classes the rest of this tab will use.

This tab sets up a 2-8-3 network on a three-class 2-D problem.

Activation functions

Each hidden unit computes a pre-activation z = w·x + b — a linear combination of its inputs plus a bias — and then passes z through a nonlinear activation function. This tab lets you pick from three standard choices:

Sigmoid / Logistic: σ(z) = 1 / (1 + e−z)
Tanh: tanh(z) = (ez − e−z) / (ez + e−z)
ReLU (Rectified Linear Unit): max(0, z)
The three curves, plotted over z ∈ [−3, 3]. Sigmoid and tanh saturate at their bounds (flat tails, near-zero derivatives); ReLU is unbounded above with slope 1 for z > 0.

Sigmoid / Logistic is smooth and bounded to [0, 1] (“sigmoid” is a general term for any S-shaped curve; in ML shorthand, however, it usually means specifically the logistic function shown here). Tanh (hyperbolic tangent) is smooth and bounded to [−1, 1], and zero-centered — a helpful property for gradient flow. ReLU (Rectified Linear Unit, max(0, z)) is the modern default: unbounded on the positive side, so gradients don’t vanish when z is large; trivially cheap to compute; and its only nonlinearity is the kink at zero. Sigmoid and tanh suffer from vanishing gradients in deep networks because their derivatives shrink to zero at the saturated tails.
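All three activations are one-liners in NumPy; a minimal sketch (the function names are mine):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: smooth, bounded to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Hyperbolic tangent: smooth, zero-centered, bounded to (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """Rectified linear unit: 0 for z < 0, identity for z > 0."""
    return np.maximum(0.0, z)

z = np.linspace(-3, 3, 7)   # the same range as the plot above
print(sigmoid(z).round(3))
print(tanh(z).round(3))
print(relu(z))
```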

Output layer: softmax and cross-entropy

The output layer has one unit per class — three in this tab. Each output unit produces its own pre-activation zk (a linear combination of the hidden-layer activations plus a bias), which we call a logit. The raw logits z1, z2, z3 are unbounded real numbers representing each class’s “strength”; what we actually want is a probability distribution over classes (three numbers in [0, 1] that sum to 1). During training, we compare that predicted distribution to a one-hot target vector y: a 3-vector with a 1 at the true-class position and 0 elsewhere. A class-1 example, for instance, has y = (0, 1, 0).

The output layer uses softmax to turn the 3-tuple of logits into that probability distribution. Each logit is divided by a temperature T, exponentiated, and normalized so the three results sum to 1: pk = exp(zk/T) / ∑j exp(zj/T). This is the Boltzmann distribution from statistical mechanics — the logits play the role of negative energies, and the inverse temperature β = 1/T sets how peaky the distribution is. Low T (high β, “cold”) concentrates probability on the largest logit, approaching a hard argmax. High T (low β, “hot”) spreads probability toward uniform, approaching maximum entropy over the classes. This tab uses T = 1 (the standard classification default); our transformers widget makes T interactive so you can see how it shapes next-token sampling.
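A temperature-scaled softmax is a few lines of NumPy. This sketch (the helper name is mine) subtracts the max logit before exponentiating — a standard numerical-stability trick that leaves the result unchanged:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; T = 1 is the classification default."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()            # stability: shift so the largest logit is 0
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.1]       # made-up logits for three classes
p_cold = softmax(logits, T=0.1)   # "cold": probability piles onto the max logit
p_hot = softmax(logits, T=10.0)   # "hot": spreads toward uniform
print(p_cold.round(3), p_hot.round(3))
```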

Softmax pairs with cross-entropy loss, which measures how far the predicted distribution p is from the true distribution y. For a single example with K classes, it sums the negative log-probabilities the model assigned (from p), each weighted by the corresponding target probability (from y). That is, for a single training input whose forward pass produces the K-class probability estimate p = (p1, …, pK), the loss is:

L = −∑k=1..K yk log pk = −log pc

Here “log” means natural log, by convention in machine learning. Because y is one-hot, every yk = 0 except at the index k = c of the correct class, where yc = 1, which is why the per-example loss collapses to L = −log pc, the negative log of the probability for the correct class. If the model assigns high probability to the correct class (pc → 1), the loss for that example goes to 0; if it assigns low probability (pc → 0), the loss blows up, punishing confident wrong answers hard.

Over a training batch of N examples where each example n ∈ {1, …, N} has per-example loss Ln, the training procedure uses the average per-example loss Lbatch = (1/N) ∑n=1..N Ln.
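Both the per-example loss and the batch average can be sketched directly from the formulas above (NumPy; the helper names are mine):

```python
import numpy as np

def cross_entropy(p, y):
    """Per-example loss: -sum_k y_k log p_k, i.e. -log p_c for one-hot y."""
    return -np.sum(y * np.log(p))

def batch_loss(P, Y):
    """Average per-example loss over a batch (one row per example)."""
    return np.mean([cross_entropy(p, y) for p, y in zip(P, Y)])

y = np.array([0.0, 1.0, 0.0])   # class-1 one-hot target
p = np.array([0.1, 0.8, 0.1])   # a confident, correct prediction
print(cross_entropy(p, y))       # -log 0.8 ≈ 0.223
```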

With T = 1, the softmax probabilities combined with the cross-entropy loss mean that the gradient propagated back from the output into the network simplifies cleanly to p − y (the predicted distribution minus the one-hot target) — the same elegant form sigmoid + binary cross-entropy has in Tabs 3 and 4.
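The p − y form is easy to check numerically: differentiate the softmax-plus-cross-entropy loss with central finite differences and compare (a NumPy sketch with made-up logits; no claim about this tab's internals):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, y):
    """Cross-entropy of softmax(z) against a one-hot target y."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.5, -0.3, 0.8])   # made-up logits
y = np.array([0.0, 1.0, 0.0])    # one-hot target, class 1

analytic = softmax(z) - y        # the claimed p - y gradient
h = 1e-5
numeric = np.array([(loss(z + h * e, y) - loss(z - h * e, y)) / (2 * h)
                    for e in np.eye(3)])
# analytic and numeric agree to within finite-difference error
print(np.max(np.abs(analytic - numeric)))
```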

Controls

Pick a hidden activation, pick a learning rate, and click Train to watch the 2-8-3 network learn a three-ring classification problem. Reset gives fresh random weights on the same problem. The interesting comparison isn’t just how fast each activation drives loss down — it’s also how differently the decision regions are shaped once training settles.

Interactive demo: decision-region plot (Class 0, Class 1, Class 2) and cross-entropy loss curve, updated each training epoch.
What to watch for
Sigmoid / Logistic: slow, smooth

Its derivative σ′(z) = σ(z)(1−σ(z)) peaks at just 0.25 and decays to zero on both tails. Gradients are small from the start and shrink further as hidden units saturate, so loss descends sluggishly. Hidden activations are all in [0, 1], so the next layer sees a positive-mean input — a mild handicap for optimization. Decision boundaries come out smoothly curved.

Tanh: smooth, faster than sigmoid

Saturates the same way, but with derivative tanh′(z) = 1 − tanh²(z) peaking at 1.0, four times sigmoid’s peak. Gradient signal is stronger near zero, so the loss drops more briskly. The zero-centered range [−1, 1] lets hidden activations balance positive and negative, which is easier on subsequent layers. Boundaries are smooth, often nearly circular on this ring problem.
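The two derivative peaks quoted above — 0.25 for sigmoid, 1.0 for tanh, both at z = 0 — can be verified directly:

```python
import numpy as np

z = np.linspace(-5, 5, 2001)          # dense grid including z = 0
sig = 1.0 / (1.0 + np.exp(-z))
d_sig = sig * (1.0 - sig)             # sigmoid derivative, peaks at 0.25
d_tanh = 1.0 - np.tanh(z) ** 2        # tanh derivative, peaks at 1.0
print(d_sig.max(), d_tanh.max())      # ≈ 0.25, ≈ 1.0
```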

ReLU: fast; piecewise-linear

The derivative is exactly 1 for z > 0 — no saturation, no gradient decay. Of the three, training typically descends fastest when it works. The signature visual: boundaries are piecewise linear, straight segments meeting at kinks where individual hidden units cross zero. Watch for dead neurons, too: a unit stuck with z < 0 for every training example outputs 0 and gets zero gradient, permanently. This manifests as loss plateaus or “missing” wedges in the boundary.
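A dead ReLU unit is easy to construct by hand. In this toy sketch (the inputs and weights are made up), the pre-activation is negative on every example, so the unit's output and its gradient mask are all zeros — upstream gradients multiplied by that mask vanish, and the unit's weights never update:

```python
import numpy as np

X = np.array([[0.5, 1.0],
              [1.5, -0.2],
              [-0.4, 0.7]])            # toy training inputs
w, b = np.array([-2.0, -2.0]), -5.0    # chosen so z < 0 for every row

z = X @ w + b                          # pre-activations: all negative
a = np.maximum(0.0, z)                 # ReLU output: 0 for every example
grad_mask = (z > 0).astype(float)      # local derivative: 0 everywhere
print(z, a, grad_mask)
```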