Backprop Review: Chain Rule & Local Error Signals
To update any weight w, we need ∂L/∂w. Because the network is a composition of functions, we apply the chain rule. For a weight at step k that affects h[k+1] through the pre-activation z = wᵣ·h[k] + wₓ·x[k]:
∂L/∂w = (∂L/∂h[k+1]) · (∂h[k+1]/∂z) · (∂z/∂w)
The middle factor ∂h[k+1]/∂z is the derivative of tanh (the chosen activation function) —
because h[k+1] = tanh(z):
∂h[k+1]/∂z = 1 − tanh²(z) = 1 − h[k+1]²
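This identity is easy to sanity-check numerically. A minimal sketch (the function names are ours, not from the text) compares 1 − tanh²(z) against a central finite difference:

```python
import math

# Check that d/dz tanh(z) = 1 - tanh(z)^2 at a few points,
# comparing the closed form against a central finite difference.
def tanh_deriv(z):
    h = math.tanh(z)
    return 1.0 - h * h

def finite_diff(f, z, eps=1e-6):
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

for z in (-2.0, -0.5, 0.0, 1.3):
    assert abs(tanh_deriv(z) - finite_diff(math.tanh, z)) < 1e-8
```

Note that the derivative can be written entirely in terms of the output h[k+1], which is why no extra quantity needs to be stored for the backward pass.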
Combining the first two factors gives a local error signal:
e[k+1] = (∂L/∂h[k+1]) · (1 − h[k+1]²)
So ∂L/∂w = e[k+1] · (∂z/∂w). The local error from one step — fully computable
from quantities known at that step — simply scales by the partial derivative of z
with respect to whatever quantity we are differentiating. Since z = wᵣ·h[k] + wₓ·x[k],
those partials are:
- Contribution to ∂L/∂wᵣ is e[k+1] · ∂z/∂wᵣ = e[k+1] · h[k]
- Contribution to ∂L/∂wₓ is e[k+1] · ∂z/∂wₓ = e[k+1] · x[k]
- Contribution to ∂L/∂h[k] (passed left to the next step) is e[k+1] · ∂z/∂h[k] = e[k+1] · wᵣ
The third bullet is the key to the recursion: h[k] is not a trainable weight, but it
connects step k to step k+1, so its gradient carries the error signal one step further left.
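The three contributions can be collected into one backward step. A minimal sketch, assuming the text's scalar setup (the name backward_step is ours):

```python
import math

# One backward step at position k of the scalar RNN:
# given dL/dh[k+1], return the w_r and w_x gradient contributions
# and the gradient dL/dh[k] passed one step further left.
def backward_step(dL_dh_next, h_next, h_k, x_k, w_r):
    e = dL_dh_next * (1.0 - h_next * h_next)  # local error signal e[k+1]
    return e * h_k, e * x_k, e * w_r          # dL/dw_r, dL/dw_x, dL/dh[k]

# Spot check: with dL/dh[k+1] = 1, the w_r contribution should equal
# the finite-difference derivative of h[k+1] = tanh(w_r*h_k + w_x*x_k).
h_k, x_k, w_r, w_x = 0.3, 0.5, 0.8, 0.4
h_next = math.tanh(w_r * h_k + w_x * x_k)
d_wr, d_wx, d_hk = backward_step(1.0, h_next, h_k, x_k, w_r)
eps = 1e-6
fd = (math.tanh((w_r + eps) * h_k + w_x * x_k)
      - math.tanh((w_r - eps) * h_k + w_x * x_k)) / (2.0 * eps)
assert abs(d_wr - fd) < 1e-8
```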
This pattern is recursive. Computing ∂L/∂h[k+1] itself requires the same
chain-rule step one position to the right. At the very top, the derivative we ultimately
need is ∂L/∂wᵣ or ∂L/∂wₓ (the weights that appear at every step and therefore require
the full BPTT treatment). We also update wᵧ, but its gradient ∂L/∂wᵧ = (y−y*)·h[T]
is a single term with no chain through time — so it does not illustrate the challenge
of gradient flow and is not our focus here.
For wᵣ and wₓ, the chain must reach all the way back through the h-nodes. The full
decomposition at the output step is:
∂L/∂w = (∂L/∂y) · (∂y/∂h[T]) · (∂h[T]/∂z[T]) · (∂z[T]/∂w)
where ∂L/∂y = (y−y*) is the prediction error; ∂y/∂h[T] = wᵧ because y = wᵧ·h[T]
(no activation on the output); ∂h[T]/∂z[T] = 1 − tanh²(z[T]) = 1 − h[T]² is the tanh derivative;
and ∂z[T]/∂w is h[T−1] for wᵣ or x[T−1] for wₓ. Grouping the first three factors
gives the first local error signal:
e[T] = (y−y*) · (1 − h[T]²) · wᵧ ← error · activation deriv · weight
This is the seed that propagates leftward. At every subsequent step k, the incoming
gradient ∂L/∂h[k+1] plays the role that (y−y*)·wᵧ played here — the recursion
simply repeats: compute e[k+1], use it with h[k] and x[k] to get the weight gradient
contributions, then use e[k+1]·wᵣ = ∂L/∂h[k] (from the third line above) to seed
the next step to the left.
The ∂L/∂h[k] values shown below each node in the graph are exactly these propagated signals.
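Putting the seed and the recursion together, the whole backward pass can be sketched in a few lines (our code, assuming the text's setup h[k+1] = tanh(wᵣ·h[k] + wₓ·x[k]), y = wᵧ·h[T], L = ½(y−y*)²):

```python
import math

def bptt(xs, h0, w_r, w_x, w_y, y_star):
    # Forward pass: keep every hidden state for the backward pass.
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w_r * hs[-1] + w_x * x))
    y = w_y * hs[-1]

    # Seed: dL/dh[T] = (y - y*) * w_y, then walk right-to-left.
    dL_dh = (y - y_star) * w_y
    g_wr = g_wx = 0.0
    for k in range(len(xs) - 1, -1, -1):
        e = dL_dh * (1.0 - hs[k + 1] ** 2)  # local error signal e[k+1]
        g_wr += e * hs[k]                   # accumulate shared-weight grads
        g_wx += e * xs[k]
        dL_dh = e * w_r                     # propagate one step left
    g_wy = (y - y_star) * hs[-1]            # single-term gradient for w_y
    return y, g_wr, g_wx, g_wy
```

The accumulated g_wr and g_wx can be verified against finite differences of the loss with respect to wᵣ and wₓ, which is a useful habit when implementing any backward pass by hand.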
Gradient Flow Through Time: Shared Weights & Accumulation
This network uses squared-error loss: L = ½(y − y*)². The ½ is a convenience that cancels the exponent when differentiating: ∂L/∂y = y − y*.
As derived in the left card, the local error signal at each step is:
e[k+1] = (∂L/∂h[k+1]) · (1 − h[k+1]²)
— the gradient of the loss with respect to h[k+1] scaled by the activation derivative.
It captures how much the loss would change if h[k+1] changed, after accounting for
the nonlinearity. Summing e[k+1] terms across k is therefore summing the loss-sensitivity
at each time step, weighted by how much each step's input (h[k] or x[k]) contributed.
In a feedforward net each weight appears once, so ∂L/∂w is a single chain-rule term.
In an RNN, wᵣ and wₓ appear at every step, so their gradients are sums across all T steps:
∂L/∂wᵣ = Σₖ e[k+1] · h[k]
∂L/∂wₓ = Σₖ e[k+1] · x[k]
∂L/∂wᵧ = (y−y*) · h[T] (wᵧ appears once)
All T contributions must be accumulated before any weight moves.
The gradient ∂L/∂h[k] has been multiplied by wᵣ (and by a tanh derivative, which is at most 1) a total of T−k times on its journey from h[T]. If |wᵣ| < 1, each multiplication shrinks it, so early steps contribute almost nothing. Arrow color encodes magnitude: gray = small · orange ≈ 1 · red = large.
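The decay can be made concrete with a small sketch (our construction; for simplicity it assumes the same hidden value at every step):

```python
# Track |dL/dh[k]| as the signal moves left: each step multiplies by
# w_r and by a tanh derivative (1 - h^2), both of magnitude <= 1 here.
def propagated_magnitudes(T, w_r, h_val=0.5, seed=1.0):
    mags, dL_dh = [], seed
    for _ in range(T):
        e = dL_dh * (1.0 - h_val * h_val)
        dL_dh = e * w_r
        mags.append(abs(dL_dh))
    return mags

mags = propagated_magnitudes(T=10, w_r=0.5)
assert all(a > b for a, b in zip(mags, mags[1:]))  # strictly shrinking
```

With w_r = 0.5 and h_val = 0.5, each step scales the signal by 0.375, so after ten steps the magnitude has dropped by roughly four orders of magnitude.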
Each weight is then updated once:
w ← w − η · ∂L/∂w
where η is the learning rate, set by the slider.
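The update itself is one line per weight; a minimal sketch (names are ours), applied only after all T contributions have been accumulated:

```python
# One gradient-descent step: w <- w - eta * dL/dw.
def sgd_step(w, grad, eta=0.1):
    return w - eta * grad

w_r = sgd_step(0.6, 0.25)  # 0.6 - 0.1 * 0.25 = 0.575
```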