From Squared Error to Cross-Entropy

From logit scores to cross-entropy loss: The natural information-theoretic loss function for classification
© 2026 Theodore P. Pavlic
MIT License

From Regression Outputs to Class Probabilities

When you train a neural network for function approximation or linear regression, you aim to make the network's outputs match the target values in your training data as closely as possible. A natural choice of loss function for this is the mean squared error (MSE), although other ways of penalizing the mismatch between the network's output and the training target are certainly imaginable. However, it can be shown that outputs with minimal MSE are also maximum-likelihood estimates when the noise in the training data is normally distributed (Gaussian noise).
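
To see the MSE–Gaussian connection concretely, here is a minimal NumPy sketch (the target values, predictions, and noise level σ are made up for illustration): the average Gaussian negative log-likelihood is just the MSE scaled by 1/(2σ²) plus a constant, so the same predictions minimize both.

```python
import numpy as np

# Illustrative regression targets and model outputs (made-up numbers).
y_true = np.array([1.2, -0.4, 3.1, 0.8])
y_pred = np.array([1.0, -0.1, 2.9, 1.1])
sigma = 0.5  # assumed (fixed, known) standard deviation of the Gaussian noise

mse = np.mean((y_true - y_pred) ** 2)

# Average Gaussian negative log-likelihood of the targets given the predictions.
nll = np.mean(0.5 * ((y_true - y_pred) / sigma) ** 2
              + 0.5 * np.log(2 * np.pi * sigma ** 2))

# NLL = MSE / (2 * sigma^2) + a constant that does not depend on y_pred,
# so whatever predictions minimize the MSE also maximize the likelihood.
print(mse, nll, mse / (2 * sigma ** 2) + 0.5 * np.log(2 * np.pi * sigma ** 2))
```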

But what do you do when you don't want to approximate a function (or fit a curve) but you want to classify an input as one of K different possible discrete labels? For such K-class classification, we structure the neural network to have K separate output neurons (one per class). Each neuron produces an unbounded real number called a logit score that represents how consistent the input data are with that particular class:

Logit Scores
z_k is a continuous-valued evidence variable for class k — think of it as the network's raw "vote" for that class. There is no constraint on its sign or magnitude; it can be any real number. These are exactly the kind of outputs regression already gives you.

To decide which class to predict, we can't just pick the largest zk in a differentiable way — we need a smooth, probabilistic assignment so that gradients flow during training. The softmax converts the relative evidence across all K scores into a proper probability distribution:

Softmax with Temperature
q_k = exp(z_k / τ) / ∑_j exp(z_j / τ)
The temperature τ > 0 controls how sharply the distribution peaks. The standard softmax uses τ = 1. The entropy tab lets you explore this directly — the temperature slider there uses exactly this formula with the same underlying logit scores. A small code sketch of the formula appears after the list below.
τ↓
Low temperature — probabilities sharpen toward the highest-scoring class; in the limit τ→0 the output becomes a one-hot argmax.
τ=1
Unit temperature — the standard softmax. Model outputs reflect the raw relative magnitude of the logit scores.
τ↑
High temperature — probabilities flatten toward a uniform distribution over all classes; even the class with the largest logit receives barely more probability than the others.

We call the resulting output distribution Q. The true class is encoded as distribution P — for labeled training data this is a one-hot vector. We now need a loss that measures how far Q is from P, which is where information theory comes in.

Why not MSE on the probabilities? You could minimize ∑_i (p_i − q_i)² directly. It sometimes works, but it treats probabilities as plain numbers, ignores the probabilistic geometry of the problem, and — crucially — creates vanishing gradients when the softmax output saturates near 0 or 1. Information theory gives us the principled alternative, which is cross-entropy.

Entropy: Measuring Uncertainty in a Distribution

Shannon entropy H(P) measures the average amount of information (or uncertainty) in a distribution. Equivalently, it is the minimum average number of bits required to encode samples drawn from P.

Shannon Entropy
H(P) = −∑_i p_i log p_i
Base-2 log → bits. Base-e log → nats. Convention: 0 · log 0 = 0. Units differ only by a constant factor.

Perplexity is a closely related quantity that is often easier to interpret — it re-expresses entropy as an effective number of equally likely outcomes:

Perplexity
Perplexity(P) = 2^H(P), with H(P) measured in bits.
For a uniform distribution over K classes, perplexity = K exactly. For a one-hot distribution, perplexity = 1. You can think of perplexity as "how many choices does this distribution behave like it has?" — it is always between 1 and K, making it more concretely interpretable than entropy in bits. In language modeling, for example, a perplexity of 50 means the model's uncertainty at each step is equivalent to picking uniformly from 50 equally likely words — lower perplexity means less uncertainty, i.e., a more confident (and hopefully more accurate) model. A short code sketch of both quantities follows the two summary lines below.
High entropy / high perplexity — near-uniform distribution; high uncertainty; many bits needed per sample.
Low entropy / low perplexity — peaked distribution; low uncertainty; you nearly know the answer before looking.
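
The following sketch computes Shannon entropy (in bits) and perplexity for a few illustrative distributions over K = 4 classes; the helper names and the example distributions are my own.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits, using the convention 0 * log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(p[nz])))

def perplexity(p):
    """Effective number of equally likely outcomes: 2 ** H(p)."""
    return 2.0 ** entropy_bits(p)

uniform = [0.25, 0.25, 0.25, 0.25]   # maximum entropy for K = 4
one_hot = [1.0, 0.0, 0.0, 0.0]       # zero entropy, perplexity 1
peaked  = [0.70, 0.20, 0.07, 0.03]   # somewhere in between

for p in (uniform, one_hot, peaked):
    print(p, round(entropy_bits(p), 3), "bits,",
          round(perplexity(p), 3), "effective classes")
```
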
🎙 Interactive — Entropy & Perplexity Explorer (K = 4 classes)

The distribution is produced by a softmax over fixed logit scores using temperature τ. Drag left for low τ (peaked, low entropy) or right for high τ (flat, high entropy). At τ = 1 you get the standard softmax.

Widget controls and readouts: a temperature slider (τ = 1.00 shown); the softmax output Q over {Cat, Dog, Bird, Fish} computed from the fixed logit scores z_Cat = 2.4, z_Dog = 1.1, z_Bird = 0.7, z_Fish = 0.1; the entropy of that distribution in bits and its perplexity 2^H in effective classes. Both maxima occur at the uniform distribution: entropy 2.000 bits and perplexity 4.000 = K.

For K classes: H ranges from 0 to log₂(K) bits; perplexity ranges from 1 to K. Both hit their extremes at the same distributions (one-hot and uniform), but perplexity's scale is concrete — it always equals K for a uniform distribution over K classes.

One-Hot Entropy — The Case That Matters for Classification
When the true label is known with certainty, the entropy of P is zero. This will simplify things dramatically in Section 5.

Cross-Entropy: Measuring Decoder Quality

A neural network trained for K-class classification acts as the final component in an information channel: it decodes class-label information that has been encoded into the input data vector fed to the network.

True label k* ~ P → encoder (data vector x) → neural net (decoder) → output distribution Q over K classes. Goal: Q = P.

For any given input, the distribution P represents the true probability that the input belongs to each of the K classes. In Tab 4 we will see that in standard labeled training data, P concentrates 100% of its probability on a single known label — but in principle it need not.

If the network is a perfect decoder, its output distribution Q will match P exactly. If it is insensitive to the input, or has biases that make it over-confident in certain outputs, Q will deviate from P. We want a single scalar that measures how well Q matches P — specifically, how many bits per sample are needed on average if we use a code optimized for Q to encode class labels that are actually distributed as P. This is the cross-entropy H(P, Q), which reduces to the entropy H(P) of P when Q = P. When Q is any other distribution, H(P, Q) > H(P): the lossy decoder has corrupted the signal and thus introduced more uncertainty than was present in the original encoding.

Cross-Entropy
H(P, Q) = −∑_i p_i log₂ q_i
When Q = P, H(P, Q) = H(P) — no extra cost, the code is optimal. When Q ≠ P, H(P, Q) > H(P) — there is always a penalty for using the wrong code. Training the network means adjusting its weights to minimize H(P, Q), driving Q as close to P as possible. Note on units: throughout this widget we use log₂, giving units of bits. In practice, cross-entropy loss is computed with the natural logarithm ln, giving units of nats. The two differ only by a constant factor (log₂ e ≈ 1.44), so the loss surface and its gradients are identical up to rescaling.
Connection to KL divergence: The gap H(P, Q) − H(P) is called the Kullback–Leibler divergence KL(P‖Q) — the extra bits wasted by using Q's code instead of P's optimal code. Because H(P) depends only on the data and not on the model, minimizing H(P, Q) over the network weights is identical to minimizing KL(P‖Q). This equivalence also connects cross-entropy training directly to maximum likelihood estimation. You will encounter KL divergence frequently in probabilistic machine learning, variational inference, and information theory.
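
Here is a small sketch, with made-up distributions, that computes H(P), H(P, Q), and the gap KL(P‖Q) in bits, so you can check numerically that H(P, Q) = H(P) + KL(P‖Q) and that the gap is never negative; the helper names are mine.

```python
import numpy as np

def cross_entropy_bits(p, q):
    """H(P, Q) = -sum_i p_i * log2(q_i); terms with p_i = 0 contribute nothing."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(q[nz])))

def kl_bits(p, q):
    """KL(P||Q) = H(P, Q) - H(P); zero iff Q = P, positive otherwise."""
    return cross_entropy_bits(p, q) - cross_entropy_bits(p, p)

p = [0.70, 0.20, 0.07, 0.03]   # "true" distribution (illustrative)
q = [0.40, 0.30, 0.20, 0.10]   # decoder's output distribution (illustrative)

print("H(P)     =", round(cross_entropy_bits(p, p), 4), "bits")
print("H(P, Q)  =", round(cross_entropy_bits(p, q), 4), "bits")
print("KL(P||Q) =", round(kl_bits(p, q), 4), "bits")   # the gap, always >= 0
```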

The Classification Case: One-Hot Labels

In standard supervised classification, the true distribution P is a one-hot vector: probability 1 on the correct class k*, zero everywhere else.

One-Hot Label
p_i = 1 if i = k*, and p_i = 0 otherwise

Plugging this one-hot P into the cross-entropy formula, every term with p_i = 0 vanishes, leaving only the single term for the correct class k*:

Classification Cross-Entropy Loss (per sample)
L = H(P, Q) = −log₂ q_k*
The loss is the negative log-probability the model assigns to the correct class. Minimizing this loss pushes q_k* toward 1. Note that when the decoder is perfect (q_k* = 1), H(P, Q) = −log₂(1) = 0 — which equals H(P), the entropy of the deterministic one-hot distribution. Any deviation from perfect classification inflates the loss.
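
A quick numerical check of this simplification, using an illustrative model output Q and a one-hot P on class k* = 0: the full cross-entropy sum and the single-term shortcut −log₂ q_k* give the same number.

```python
import numpy as np

# One-hot label: the true class is k* = 0 ("Cat") out of K = 4 classes.
p = np.array([1.0, 0.0, 0.0, 0.0])
q = np.array([0.62, 0.21, 0.11, 0.06])   # illustrative model output over the 4 classes

full_sum = -np.sum(p[p > 0] * np.log2(q[p > 0]))   # full cross-entropy H(P, Q)
shortcut = -np.log2(q[0])                          # -log2 of the correct-class probability

print(full_sum, shortcut)   # identical: only the k* term survives the sum
```
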
🎙 Interactive — One-Hot Loss (K = 4)

Assume a training example's true class is Cat (i.e., k* corresponds to the Cat label). Use the slider to adjust the probability q_Cat that the network assigns to the correct class. The remaining probability is split equally among the other classes.

Widget controls and readouts: a slider for q_Cat (0.50 shown) and the resulting loss −log₂ q_Cat in bits.
Intuition: If the model assigns q_Cat = 0.90 and the true class is Cat, the loss is −log₂(0.90) ≈ 0.15 bits — small. If q_Cat = 0.02, the loss is −log₂(0.02) ≈ 5.6 bits — large. The logarithm grows without bound as q_k* approaches zero, providing a strong gradient signal exactly when the model is most confidently wrong.

Binary Cross-Entropy (BCE)

For binary classification (K = 2), the true label is either class 0 or class 1. Using the notation from Tab 3, the true distribution P is fully described by a single number y ∈ {0, 1}: P = [y, 1−y]. The network produces a single probability via a sigmoid activation, which we write ŷ = q_1, giving Q = [ŷ, 1−ŷ]. Plugging these directly into the cross-entropy formula:

Binary Cross-Entropy
L = −[ y ln(ŷ) + (1−y) ln(1−ŷ) ]
This is H(P, Q) with K = 2, P = [y, 1−y], Q = [ŷ, 1−ŷ], written using the natural logarithm (nats) as is standard in practice. The y and (1−y) factors act as switches — only one term survives per sample: when y = 1, the loss is −ln(ŷ); when y = 0, it is −ln(1−ŷ).
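
A minimal BCE sketch in NumPy, matching the formula above (natural log, so the result is in nats); the small eps clip is a common practical guard against log(0) and is my addition, not part of the formula.

```python
import numpy as np

def bce(y, y_hat, eps=1e-12):
    """Binary cross-entropy in nats; eps guards against taking log(0)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return float(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

print(bce(1, 0.9))   # ~0.105 nats: confident and correct -> small loss
print(bce(1, 0.1))   # ~2.303 nats: confident and wrong   -> large loss
print(bce(0, 0.1))   # ~0.105 nats: the y = 0 branch mirrors the y = 1 branch
```
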
🎙 Interactive — BCE Loss Curves

Select the true label y, then drag ŷ. The gold dot tracks your position on the active curve.

Widget controls and readouts: a toggle for the true label y ∈ {0, 1} (y = 1 shown), a slider for the model output ŷ (0.500 shown), the two loss curves −ln(ŷ) for y = 1 and −ln(1−ŷ) for y = 0, and the current BCE loss in nats.

MSE vs. Cross-Entropy — Why the choice matters

Property | MSE Loss | Cross-Entropy Loss
Appropriate for | Continuous regression targets | Probability outputs (classification)
Output activation | Linear | Sigmoid (K = 2) or Softmax (K > 2)
Gradient at saturation | Vanishes — multiplied by σ'(z) ≈ 0 | Linear in error — no vanishing gradient
Probabilistic meaning | MLE under Gaussian likelihood | MLE under Bernoulli / Categorical likelihood
Information-theoretic meaning | None directly | Minimizes KL divergence from data to model
The gradient advantage in detail: For a logistic unit with sigmoid σ(z) and BCE loss, ∂L/∂z = ŷ − y — clean and linear in the prediction error. With MSE instead, the gradient picks up a factor of σ'(z) = ŷ(1−ŷ), which vanishes when ŷ ≈ 0 or ŷ ≈ 1 — exactly the regime where a confident-but-wrong model lives, and where you most need a strong corrective signal. Cross-entropy cancels this saturation effect entirely.
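
You can verify this numerically. The sketch below compares the two gradients for a single sigmoid unit at a few logit values (the MSE here uses the common 0.5·(ŷ − y)² convention, an assumption on my part): at a confidently wrong logit the BCE gradient stays near −1 while the MSE gradient has collapsed toward zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0                        # the true label
for z in (-6.0, 0.0, 6.0):     # logits: confidently wrong, unsure, confidently right
    y_hat = sigmoid(z)
    grad_bce = y_hat - y                             # dL/dz for BCE through a sigmoid
    grad_mse = (y_hat - y) * y_hat * (1.0 - y_hat)   # dL/dz for 0.5*(y_hat - y)^2
    print(f"z = {z:+.0f}  y_hat = {y_hat:.4f}  "
          f"BCE grad = {grad_bce:+.4f}  MSE grad = {grad_mse:+.6f}")
# At z = -6 (confidently wrong) the BCE gradient is still about -1.0,
# while the MSE gradient has collapsed to roughly -0.0025.
```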