XOR, hidden-layer representations, the chain rule made visible, and depth as a compositional shortcut.
A single-layer perceptron (SLP) computes a weighted sum of its inputs and thresholds the result. Geometrically, the set of inputs where that sum equals the threshold is a hyperplane — in two dimensions, a straight line. The SLP can only distinguish two classes if some line separates them. That’s called linear separability, and it’s the SLP’s expressive ceiling.
| x₁ | x₂ | y |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
AND requires both inputs on. The single “on” corner is easily cut off by a line.
| x₁ | x₂ | y |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |
OR is the mirror image: one “off” corner, easily cut away.
| x₁ | x₂ | y |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
XOR puts the two “on” corners on opposite diagonals. No straight line works.
XOR is the simplest boolean function that’s not linearly separable, which is why it became the canonical counterexample in Minsky & Papert’s 1969 Perceptrons. The impossibility is easy to see: both diagonals share the midpoint (½, ½), so any line that puts (0,1) and (1,0) on its “on” side also puts that midpoint there — and with it at least one of the opposite corners (0,0) and (1,1).
What the hidden layer does: each hidden unit draws its own line in input space — those are the dashed lines in the Input Space card above. With two hidden units, the combined response in the (h₁, h₂) plane — the Hidden-Unit Space card — is a remapping under which the four XOR points become linearly separable. The output unit then does nothing more than an SLP on that new representation. The hidden layer is doing exactly one identifiable thing: learning a feature space where the target is linearly separable. That framing generalizes directly — deep learning is representation learning.
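That remapping can be made concrete with a hand-wired 2-2-1 network. This is a minimal sketch with weights chosen by hand rather than learned; the thresholds and the hard step activation are illustrative, not what training would produce:

```python
def step(z):
    """Hard threshold: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # Hidden unit h1 fires for OR(x1, x2); h2 fires for AND(x1, x2).
    h1 = step(x1 + x2 - 0.5)
    h2 = step(x1 + x2 - 1.5)
    # In the (h1, h2) plane the four points are now linearly separable:
    # the output fires when h1 is on but h2 is off, i.e. exactly one input on.
    return step(h1 - 2 * h2 - 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), xor_net(x1, x2))  # prints the XOR truth table
```

The output unit is itself just an SLP: a weighted sum of (h₁, h₂) against a threshold.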
Try: toggle the hidden layer off and train — the network becomes an SLP and the loss plateaus at chance. Turn it back on and watch both the input-space boundary curl around and the four points in hidden space untangle. With only 2 hidden units, bad initializations sometimes get stuck; hit Reset and try again. Bumping to 3 hidden units makes the problem robustly solvable across all activations.
Every MLP that follows in this explorer is described by a short piece of notation. 2-8-3 means 2 input units, 8 hidden units, 3 output units. Extra numbers just extend the string: 2-8-8-3 adds a second hidden layer. The first number is always the input size (how many features each example has); the last is always the output size (how many classes, or 1 for regression); everything in between is hidden.
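As a sanity check on the notation, here is a short sketch that counts the parameters a fully connected architecture string implies (weights plus biases; the function name is our own):

```python
def param_count(arch):
    """Weights + biases for a fully connected net written as '2-8-3'."""
    sizes = [int(s) for s in arch.split("-")]
    # Each consecutive pair contributes fan_in * fan_out weights + fan_out biases.
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))

print(param_count("2-8-3"))    # (2*8 + 8) + (8*3 + 3) = 51
print(param_count("2-8-8-3"))  # adds an (8*8 + 8) block: 123
```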
This tab sets up a 2-8-3 network on a three-class 2-D problem.
Each hidden unit computes a pre-activation z = w·x + b — a linear combination of its inputs plus a bias — and then passes z through a nonlinear activation function. This tab lets you pick from three standard choices:
Plotted over z ∈ [−3, 3]: σ(z) = 1 / (1 + e−z), tanh(z) = (ez − e−z) / (ez + e−z), and ReLU(z) = max(0, z). Sigmoid and tanh saturate at their bounds (flat tails, near-zero derivatives); ReLU is unbounded above with slope 1 for z > 0.

Sigmoid / Logistic is smooth and bounded to [0, 1] (“sigmoid” is a general term for any S-shaped curve; in ML shorthand it usually means the logistic function shown here). Tanh (hyperbolic tangent) is smooth, bounded to [−1, 1], and zero-centered — a helpful property for gradient flow. ReLU (Rectified Linear Unit, max(0, z)) is the modern default: unbounded on the positive side, so gradients don’t vanish when z is large; trivially cheap to compute; and its only nonlinearity is a kink at zero. Sigmoid and tanh suffer from vanishing gradients in deep networks because their derivatives shrink to zero at the saturated tails.
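The three activations are simple enough to evaluate directly; a quick sketch of their values at the edges of the plotted range shows the saturation behavior:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

# At the tails of the plotted range, sigmoid and tanh have flattened out
# (values near their bounds), while ReLU still grows with slope 1.
for z in (-3.0, 0.0, 3.0):
    print(f"z={z:+.0f}  sigmoid={sigmoid(z):.3f}  tanh={math.tanh(z):.3f}  relu={relu(z):.1f}")
```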
The output layer has one unit per class — three in this tab. Each output unit produces its own pre-activation zk (a linear combination of the hidden-layer activations plus a bias), which we call a logit. The raw logits z1, z2, z3 are unbounded real numbers representing each class’s “strength”; what we actually want is a probability distribution over classes (three numbers in [0, 1] that sum to 1). During training, we compare that predicted distribution to a one-hot target vector y: a 3-vector with a 1 at the true-class position and 0 elsewhere. A class-1 example, for instance, has y = (0, 1, 0).
The output layer uses softmax to turn the 3-tuple of logits into that probability distribution. Each logit is divided by a temperature T, exponentiated, and normalized so the three results sum to 1: pk = exp(zk/T) / ∑j exp(zj/T). This is the Boltzmann distribution from statistical mechanics — the logits play the role of negative energies, and the inverse temperature β = 1/T sets how peaky the distribution is. Low T (high β, “cold”) concentrates probability on the largest logit, approaching a hard argmax. High T (low β, “hot”) spreads probability toward uniform, approaching maximum entropy over the classes. This tab uses T = 1 (the standard classification default); our transformers widget makes T interactive so you can see how it shapes next-token sampling.
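A numerically stable sketch of temperature softmax follows; subtracting the max logit before exponentiating leaves the result unchanged (the constant factor cancels in the ratio) but avoids overflow. The logits here are arbitrary:

```python
import math

def softmax(logits, T=1.0):
    # Shifting by the max is safe: exp((z - m)/T) differs from exp(z/T)
    # by a constant factor that cancels when we normalize.
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax(logits, T=1.0))    # the standard classification default
print(softmax(logits, T=0.01))   # cold: almost all mass on the largest logit
print(softmax(logits, T=100.0))  # hot: close to uniform over the classes
```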
Softmax pairs with cross-entropy loss, which measures how far the predicted distribution p is from the true distribution y. For a single example with K classes, it is the negative log-probability the model assigned to the correct class (from p), weighted by the target probability (from y). That is, during training, for a specific input example that produces a K-class probability estimate p = (p1, …, pK), the loss for that one training input is:
L = −∑k=1..K yk log pk = −log pc
Here “log” means natural log, by convention in machine learning. Because y is one-hot, every yk = 0 except at the index k = c of the correct class, where yc = 1, which is why the per-example loss collapses to L = −log pc, the negative log of the probability for the correct class. If the model assigns high probability to the correct class (pc → 1), the loss for that example goes to 0; if it assigns low probability (pc → 0), the loss blows up, punishing confident wrong answers hard.
Over a training batch of N examples where each example n ∈ {1, …, N} has per-example loss Ln, the training procedure uses the average per-example loss Lbatch = (1/N) ∑n=1..N Ln.
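Both the per-example collapse and the batch average take only a few lines; the probabilities and targets below are made up for illustration:

```python
import math

def cross_entropy(p, y):
    """Per-example CE; with a one-hot y this collapses to -log p[c]."""
    return -sum(yk * math.log(pk) for pk, yk in zip(p, y))

p = [0.2, 0.7, 0.1]   # predicted distribution
y = [0, 1, 0]         # one-hot target: true class is index 1
print(cross_entropy(p, y))          # -log(0.7), about 0.357

# Batch loss: the average of the per-example losses.
batch = [([0.2, 0.7, 0.1], [0, 1, 0]),
         ([0.9, 0.05, 0.05], [1, 0, 0])]
print(sum(cross_entropy(pi, yi) for pi, yi in batch) / len(batch))
```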
With T = 1, the softmax probabilities combined with the cross-entropy loss mean that the gradient propagated back from the output into the network simplifies cleanly to p − y (the predicted distribution minus the one-hot target) — the same elegant form sigmoid + binary cross-entropy has in Tabs 3 and 4.
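The p − y form is easy to verify numerically: differentiate the composed softmax-plus-cross-entropy loss with central finite differences and compare. The logits and class index below are arbitrary:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [x / s for x in e]

def ce_loss(z, c):
    """Cross-entropy of softmax(z) against one-hot class c."""
    return -math.log(softmax(z)[c])

z, c = [1.2, -0.3, 0.8], 1          # arbitrary logits, true class 1
p = softmax(z)
analytic = [p[k] - (1.0 if k == c else 0.0) for k in range(len(z))]

# Central finite differences on each logit.
eps = 1e-6
numeric = []
for k in range(len(z)):
    zp, zm = z[:], z[:]
    zp[k] += eps
    zm[k] -= eps
    numeric.append((ce_loss(zp, c) - ce_loss(zm, c)) / (2 * eps))

print(analytic)
print(numeric)   # agrees with p - y to several decimal places
```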
Pick a hidden activation, pick a learning rate, and click Train to watch the 2-8-3 network learn a three-ring classification problem. Reset gives fresh random weights on the same problem. The interesting comparison isn’t just how fast each activation drives loss down — it’s also how differently the decision regions are shaped once training settles.
Its derivative σ′(z) = σ(z)(1−σ(z)) peaks at just 0.25 and decays to zero on both tails. Gradients are small from the start and shrink further as hidden units saturate, so loss descends sluggishly. Hidden activations are all in [0, 1], so the next layer sees a positive-mean input — a mild handicap for optimization. Decision boundaries come out smoothly curved.
Saturates the same way, but with derivative tanh′(z) = 1 − tanh²(z) peaking at 1.0 — four times sigmoid’s peak. Gradient signal is stronger near zero, so the loss drops more briskly. The zero-centered range [−1, 1] lets hidden activations balance positive and negative, which is easier on subsequent layers. Boundaries are smooth, often nearly circular on this ring problem.
Derivative is exactly 1 for z > 0 — no saturation, no gradient decay. Training typically descends fastest of the three when it works. The signature visual: boundaries are piecewise linear, straight segments meeting at kinks where individual hidden units cross zero. Watch for dead neurons, too: a unit stuck with z < 0 for every training example outputs 0 and receives zero gradient, permanently. This manifests as loss plateaus or “missing” wedges in the boundary.
| x₁ | x₂ | y |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 0 | 1 |
| 0 | 1 | 1 |
| 1 | 1 | 0 |
Training a network means adjusting each weight wkij (from the i-th neuron in layer k−1 to the j-th neuron in layer k) to reduce the loss L. The SGD update for a single weight is
wkij ← wkij − η · ∂L/∂wkij
So the one thing we really need is ∂L/∂wkij — the gradient of the loss with respect to that one weight. Backpropagation is the efficient recipe for computing it at every weight in the network.
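On a toy 1-1-1 network (one input, one sigmoid hidden unit, one sigmoid output, squared-error loss; all numbers invented), the gradient for the first weight can be assembled factor by factor with the chain rule and checked against a finite difference:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forward pass: x -> h = sigmoid(w1*x + b1) -> yhat = sigmoid(w2*h + b2)
x, y = 1.5, 1.0
w1, b1, w2, b2 = 0.4, -0.1, 0.7, 0.2

z1 = w1 * x + b1;  h = sigmoid(z1)
z2 = w2 * h + b2;  yhat = sigmoid(z2)

# Chain rule for dL/dw1, outermost factor first (L = 0.5*(yhat - y)^2):
dL_dyhat = yhat - y
dyhat_dz2 = yhat * (1 - yhat)
dz2_dh = w2
dh_dz1 = h * (1 - h)
dz1_dw1 = x
grad_w1 = dL_dyhat * dyhat_dz2 * dz2_dh * dh_dz1 * dz1_dw1

# Finite-difference check: nudge w1 and re-run the forward pass.
eps = 1e-6
def loss_at(w1_):
    h_ = sigmoid(w1_ * x + b1)
    return 0.5 * (sigmoid(w2 * h_ + b2) - y) ** 2
numeric = (loss_at(w1 + eps) - loss_at(w1 - eps)) / (2 * eps)

print(grad_w1, numeric)   # the two agree closely
```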
Hover over a weight in the map below to identify it; click a weight to see how the chain rule assembles its gradient. The zoomed view updates to match.
a(z) and a′(z)

| Name | a(z) | a′(z) |
|---|---|---|
| Logistic sigmoid σ | 1 / (1 + e−z) | σ(z) · (1 − σ(z)) |
| Hyperbolic tangent tanh | (ez − e−z) / (ez + e−z) | 1 − tanh²(z) |
| Rectified linear unit ReLU | max(0, z) | 1 if z > 0, else 0 |
| Linear (identity) | z | 1 |
In practice, all neurons in a given hidden layer typically use the same activation function (most often ReLU for deep networks, or tanh for shallower ones). The output layer’s activation is chosen to match the task: sigmoid for binary classification (paired with BCE loss), softmax for multi-class classification (paired with cross-entropy), or linear for regression (paired with squared error).
L and ∂L/∂ŷ

| Name | L | ∂L/∂ŷ |
|---|---|---|
| Binary cross-entropy BCE | −y·log(ŷ) − (1−y)·log(1−ŷ) | (ŷ − y) / (ŷ·(1−ŷ)) |
| Squared error SE | ½(ŷ − y)2 | ŷ − y |
Simplification when pairings align. At an output neuron, the product ∂L/∂ŷ · a′(z) often collapses. BCE + sigmoid: the sigmoid's ŷ(1−ŷ) cancels BCE's denominator, giving the clean ∂L/∂z = ŷ − y. Squared error + linear output: the linear derivative is 1, so ∂L/∂z = ŷ − y too. These conveniences are why those pairings are standard.
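The BCE + sigmoid cancellation can be checked directly, multiplying the two table entries and comparing against a finite difference on the logit (the logit and target values are arbitrary):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(z, y):
    yhat = sigmoid(z)
    return -y * math.log(yhat) - (1 - y) * math.log(1 - yhat)

z, y = 0.9, 1.0
yhat = sigmoid(z)

# Multiply the two table entries: the yhat*(1 - yhat) factors cancel.
dL_dyhat = (yhat - y) / (yhat * (1 - yhat))
dyhat_dz = yhat * (1 - yhat)
print(dL_dyhat * dyhat_dz, yhat - y)   # identical: yhat - y

# Confirm against a finite difference on the logit.
eps = 1e-6
numeric = (bce_from_logit(z + eps, y) - bce_from_logit(z - eps, y)) / (2 * eps)
print(numeric)
```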
Everything above assumes each weight appears once in the computation graph. Convolutional nets and recurrent nets deliberately share weights — the same kernel is applied at many spatial positions, or the same recurrent matrix at every time step. When a weight appears in multiple places, the chain rule naturally sums contributions from each occurrence: ∂L/∂w becomes a sum over locations rather than a single-path product. The per-location derivative is computed the same way; all those per-location gradients are added together before the SGD update.
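A toy version of this sum: one shared weight plays the role of a size-1 kernel applied at every position, and the gradient with respect to it is the sum of the per-position contributions (inputs and targets are invented):

```python
# A single shared weight w applied at every position of a 1-D input.
# The loss sums over positions, so dL/dw sums the per-position gradients.
x = [1.0, -2.0, 0.5]
targets = [0.5, -1.0, 0.0]
w = 0.7

def loss(w_):
    return sum(0.5 * (w_ * xi - ti) ** 2 for xi, ti in zip(x, targets))

# Per-occurrence gradient at position i: (w*x[i] - t[i]) * x[i]; then sum.
per_position = [(w * xi - ti) * xi for xi, ti in zip(x, targets)]
grad = sum(per_position)

# Finite-difference check on the shared weight.
eps = 1e-6
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)
print(per_position, grad, numeric)
```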
The chain rule's three-factor shape works at every weight in every layer. What differs is how the upstream factor ∂L/∂aki is computed — for output neurons it comes directly from the loss; for hidden neurons it comes from a sum of downstream gradients. That's the whole recursion. The specific clean form ∂L/∂zKi = ŷ − y you'll see in the Backprop Walkthrough tab is an arithmetic simplification that happens when BCE pairs with sigmoid output — those two factors multiply into that tidy difference. Convenient, not structural.
The universal approximation theorem guarantees that a single hidden layer can represent any continuous function — given enough width. But UAT makes no promise about how that representation is found or how many parameters it needs. Two separate questions live below the surface:
(1) Can SGD actually find the weights? UAT is an existence proof, not a training guarantee. Sometimes the required solution sits in a region of weight-space that gradient descent doesn’t reach from typical random inits.
(2) Is the representation compact? Some functions a deep network expresses with a handful of neurons require exponentially more neurons in a shallow one. Matching parameter counts, the deeper architecture can reach solutions the shallow one simply can’t afford.
Below, three networks with identical parameter budgets (37 each) train on the same problem with the same learning rate. The story lives in the loss curves at the bottom: watch how quickly each architecture drives its loss down and how low each plateaus. On easy problems like noisy XOR all three converge similarly. On harder ones like spirals, the single-hidden-layer network often stalls while the deeper networks keep descending — same parameter budget, different training outcomes.
A convolutional network is, structurally, just an MLP with most of its weights frozen at zero and the remaining ones tied to share a single value. No new machinery — two constraints on a vanilla fully-connected network:

(1) Local connectivity: each hidden unit connects only to a small window of neighboring inputs; every weight outside that window is frozen at zero.

(2) Weight sharing: the surviving weights are tied across positions, so every window applies the same kernel.
These constraints drop the parameter count dramatically. They also encode an inductive bias: a feature detector that helps at one position probably helps at others. This bias is true for images (translation equivariance), speech (time-shift invariance), and many other signals where what matters is the shape of a local pattern, not where it sits.
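The equivalence is easy to demonstrate in one dimension: build the fully connected weight matrix that the two constraints leave behind, and check that it computes the same thing as sliding the kernel. The kernel and input values below are made up:

```python
# A 1-D convolution written two ways: as a sliding kernel, and as a full
# weight matrix that is mostly zero with the survivors tied to the kernel.
kernel = [0.5, -1.0, 0.5]          # hypothetical 3-tap filter
x = [0.0, 1.0, 3.0, 1.0, 0.0, 2.0]
n_out = len(x) - len(kernel) + 1   # "valid" convolution, no padding

# Regime 1: slide the kernel across the input.
conv = [sum(k * x[i + j] for j, k in enumerate(kernel)) for i in range(n_out)]

# Regime 2: the equivalent fully connected weight matrix. Row i is zero
# everywhere except columns i..i+2, which hold the shared kernel values.
W = [[kernel[j - i] if i <= j < i + len(kernel) else 0.0
      for j in range(len(x))] for i in range(n_out)]
fc = [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(n_out)]

print(conv)
print(fc)   # identical: the CNN is this FC layer plus the two constraints
```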
Below: a small 1D network shown in all three regimes. Toggle between them to watch the weight matrix lose entries and then tie its survivors together. Click any hidden or output neuron to see its receptive field.
Click any hidden-layer or output-layer neuron to highlight its receptive field back through the active connections. In CNN mode, shared weights share a color.
A smaller parameter count earns two practical payoffs at once. First, the training burden shrinks — with eight weights to adjust instead of 142, there’s less gradient noise per update, less data needed to pin down each weight, and a much smaller loss landscape for the optimizer to search. Second, the inductive bias generalizes when the problem has the right kind of regularity. Here all three networks learn a 1-D bump-detection task on a 10-position input vector: decide whether the vector contains an elevated value somewhere. Critically, the training set only has bumps at positions 3–6 (four of the ten positions), while the test set has bumps at positions 0–2 and 7–9 (the remaining six) — positions no network has ever seen a bump at during training.
Translation equivariance (the same feature detector useful everywhere) is the inductive bias built into the CNN’s weight-sharing constraint. A CNN’s filter “bump somewhere in this window” works no matter where the window sits. The FC network has to learn position-specific detectors from scratch — and it has only seen four of ten positions in training. Local connectivity alone gets partway there.
Two panels, two stories. In the training-loss panel, notice that the CNN fits the task with just eight free parameters — the constraint doesn’t cripple learning, it focuses it, and often reaches low loss faster than the FC network whose 142 parameters give the optimizer much more surface to explore. In the test-loss panel, the generalization gap: the FC network has no reason to expect features learned at positions 3–6 to transfer to positions 0–2 or 7–9, and its test loss typically plateaus well above the CNN’s. The CNN, applying the same filter everywhere by construction, transfers for free.