Recurrent Networks & Temporal Supervision Explorer

© 2026 Theodore P. Pavlic  ·  MIT License
Shift Register & Fan-in to Output  (k = 3)
Input x[k] and Output y[k] — drag to inspect any time step
Register & Feedback (k = 3)
Input x[k] and Output y[k] — drag to inspect, click x[k] to toggle
System Diagram  (k = 3)
x[k], h₁[k], h₂[k], y[k] over time — drag to inspect, click x[k] to toggle
State-space trajectory: h₁[k] vs h₂[k]  (k = 0–50; x[k] held at last value for k > 20)
System Update Equations
h[k+1] = A·h[k] + B·x[k]
y[k] = Cᵀ·h[k]

A transforms the 2D hidden state each step. B injects the scalar input x[k] into h. Cᵀ projects the hidden state to a scalar output y[k].

Applying tanh to h[k+1] wraps the update: h[k+1] = tanh(A·h[k] + B·x[k]). This does two things: it bounds h to (−1, 1) regardless of ρ(A), and it nonlinearly mixes the input history at each step — creating richer representations that can be more separable than the purely linear case. Without tanh, the system is a linear map and h grows without bound whenever ρ(A) > 1.

Overflow guard: without tanh, h can grow without bound when ρ(A) > 1. The display clamps h to ±15 per step to prevent NaN — it does not change the mathematics.
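The update equations can be sketched in a few lines of numpy (a minimal illustration of the widget's math; the function and variable names here are mine, not the widget's):

```python
import numpy as np

def simulate(A, B, C, x, use_tanh=False, h0=None):
    """Run h[k+1] = A·h[k] + B·x[k] (optionally wrapped in tanh) and
    y[k] = Cᵀ·h[k]; return the state and output trajectories."""
    h = np.zeros(2) if h0 is None else np.asarray(h0, dtype=float)
    hs, ys = [], []
    for xk in x:
        hs.append(h.copy())
        ys.append(C @ h)           # scalar readout y[k] = Cᵀ·h[k]
        z = A @ h + B * xk         # linear update
        h = np.tanh(z) if use_tanh else z
    return np.array(hs), np.array(ys)

# Scaled-rotation mode A = α·R(θ) with α < 1: the state spirals inward,
# so even the purely linear system stays bounded under a step input.
alpha, theta = 0.9, 0.5
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
A = alpha * R
B = np.array([1.0, 0.0])
C = np.array([0.0, 1.0])
hs, ys = simulate(A, B, C, np.ones(50))
```

With `use_tanh=True` the same trajectory is squashed into (−1, 1) at every step, matching the bounding behavior described above.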
Modes for the A Matrix
  • Scaled rotation α·R(θ) — A is constrained to rotate the hidden state by θ radians and scale it by α each step. Spectral radius ρ = α. Clean parameterization but limited to orthogonal-like structures.
  • Free (4 entries) — all four entries of A can be set independently, enabling structures a rotation cannot represent: Jordan blocks, shears, asymmetric coupling, double integrators.
  • Random A (scaled to ρ) — a random 2×2 matrix is drawn and rescaled so its spectral radius equals the target ρ. This mimics how an Echo State Network reservoir is initialized: random, frozen, and scaled to just below ρ = 1.
ρ(A) and the Echo State Property
The spectral radius ρ(A) is the magnitude of the largest eigenvalue of A. It controls the long-run behavior of the hidden state:
  • ρ < 1 — h eventually forgets any initial condition and reflects only recent inputs. This is the echo state property: the reservoir is driven entirely by the input signal.
  • ρ = 1 — h neither grows nor decays. Memory is indefinitely long but fragile.
  • ρ > 1 — h grows without bound. Initial conditions dominate; the echo state property is lost.
In a real ESN, A is large (hundreds of neurons), frozen after random initialization, and only the output weights Cᵀ are trained — by simple linear regression, not backprop. This widget is a 2D preview of that reservoir idea.
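The "Random A (scaled to ρ)" recipe is simple to state in code — draw a random matrix, measure its spectral radius, rescale. A sketch (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_A(rho_target, n=2):
    """Draw a random n×n matrix and rescale it so that its spectral
    radius (largest eigenvalue magnitude) equals rho_target — the same
    recipe used to initialize an ESN reservoir."""
    A = rng.standard_normal((n, n))
    rho = np.max(np.abs(np.linalg.eigvals(A)))   # current spectral radius
    return A * (rho_target / rho)                # scale eigenvalues to target

A = random_A(0.9)
rho = np.max(np.abs(np.linalg.eigvals(A)))       # ≈ 0.9 by construction
```

Scaling by a positive scalar scales every eigenvalue by the same factor, which is why this one-line rescaling hits the target exactly.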

Try this: click Random A with ρ ≈ 0.9, then cycle through the four input signals (step, impulse, 01 alt, square). Watch how the state-space trajectory changes shape and location for each signal — the cloud of (h₁, h₂) points lands in a different region of state space depending on the temporal structure of the input. The random mixing of A is sufficient to create this separation even without tanh. This is exactly why ESNs can classify time series: different patterns produce geometrically separable reservoir states, and a linear readout (Cᵀ) can then separate them.

Now repeat the experiment with a rotation-mode A (e.g. Oscillatory preset). The state-space trajectory traces the same orbit regardless of input signal — only the position along the orbit changes. The marginal distributions of h₁ and h₂ look nearly identical across all inputs, making linear classification much harder. The random A breaks this symmetry and creates the rich, input-dependent geometry that makes reservoir computing useful.
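The separation experiment can be reproduced numerically. The four signal definitions below are my reading of the widget's labels (step, impulse, 01 alt, square), not its actual source:

```python
import numpy as np

# One fixed random reservoir with ρ(A) ≈ 0.9, driven by four input signals.
rng = np.random.default_rng(1)
A = rng.standard_normal((2, 2))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))
B = np.array([1.0, 0.5])

def run(x):
    h, states = np.zeros(2), []
    for xk in x:
        h = A @ h + B * xk        # purely linear — no tanh needed
        states.append(h.copy())
    return np.array(states)

signals = {
    "step":    np.ones(40),
    "impulse": np.r_[1.0, np.zeros(39)],
    "01 alt":  np.arange(40) % 2.0,
    "square":  np.where(np.arange(40) % 8 < 4, 1.0, 0.0),
}
# Centroid of each (h1, h2) cloud: different temporal patterns land in
# different regions of state space, so a linear readout can separate them.
centroids = {name: run(x).mean(axis=0) for name, x in signals.items()}
```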
Unrolled Computation Graph
Gradient Accumulation — ∂L/∂wᵣ = Σ e[k+1]·h[k]    ∂L/∂wₓ = Σ e[k+1]·x[k]    where e[k+1] = (∂L/∂h[k+1])·σ′
Backward pass not yet run — step through or press Full Pass to see accumulation.
Loss over Training Iterations
Backprop Review: Chain Rule & Local Error Signals
To update any weight w, we need ∂L/∂w. Because the network is a composition of functions, we apply the chain rule. For a weight at step k that affects h[k+1] through the pre-activation z = wᵣ·h[k] + wₓ·x[k]:

∂L/∂w = (∂L/∂h[k+1]) · (∂h[k+1]/∂z) · (∂z/∂w)

The middle factor ∂h[k+1]/∂z is the derivative of tanh (the chosen activation function) — because h[k+1] = tanh(z):
∂h[k+1]/∂z = 1 − tanh²(z) = 1 − h[k+1]²

Combining the first two factors gives a local error signal:
e[k+1] = (∂L/∂h[k+1]) · (1 − h[k+1]²)

So ∂L/∂w = e[k+1] · (∂z/∂w). The local error from one step — fully computable from quantities known at that step — simply scales by the partial derivative of z with respect to whatever quantity we are differentiating. Since z = wᵣ·h[k] + wₓ·x[k], those partials are:
  • Contribution to ∂L/∂wᵣ  is  e[k+1] · ∂z/∂wᵣ  =  e[k+1] · h[k]
  • Contribution to ∂L/∂wₓ  is  e[k+1] · ∂z/∂wₓ  =  e[k+1] · x[k]
  • Contribution to ∂L/∂h[k]  (passed left to the next step)  is  e[k+1] · ∂z/∂h[k]  =  e[k+1] · wᵣ
The third bullet is the key to the recursion: h[k] is not a trainable weight, but it connects step k to step k+1, so its gradient carries the error signal one step further left.
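The three bullets above amount to one small function. A sketch of a single local backward step (scalar weights, illustrative names):

```python
import numpy as np

def backward_step(dL_dh_next, h_next, h_k, x_k, wr):
    """Given the incoming gradient dL/dh[k+1], form the local error
    signal e[k+1] and peel off the three partials of
    z = wr·h[k] + wx·x[k]."""
    e = dL_dh_next * (1.0 - h_next**2)   # e[k+1] = dL/dh[k+1]·(1 − h[k+1]²)
    grad_wr = e * h_k                    # contribution to dL/dwr
    grad_wx = e * x_k                    # contribution to dL/dwx
    dL_dh_k = e * wr                     # passed left to seed the next step
    return grad_wr, grad_wx, dL_dh_k

g_wr, g_wx, dL_dh = backward_step(dL_dh_next=0.5, h_next=np.tanh(0.3),
                                  h_k=0.2, x_k=1.0, wr=0.8)
```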

This pattern is recursive. Computing ∂L/∂h[k+1] itself requires the same chain-rule step one position to the right. At the very top, the derivative we ultimately need is ∂L/∂wᵣ or ∂L/∂wₓ (the weights that appear at every step and therefore require the full BPTT treatment). We also update wᵧ, but its gradient ∂L/∂wᵧ = (y−y*)·h[T] is a single term with no chain through time — so it does not illustrate the challenge of gradient flow and is not our focus here.

For wᵣ and wₓ, the chain must reach all the way back through the h-nodes. The full decomposition at the output step is:
∂L/∂w = (∂L/∂y) · (∂y/∂h[T]) · (∂h[T]/∂z[T]) · (∂z[T]/∂w)
where ∂L/∂y = (y−y*) is the prediction error; ∂y/∂h[T] = wᵧ because y = wᵧ·h[T] (no activation on the output); ∂h[T]/∂z[T] = σ′(h[T]) = 1−h[T]² is the tanh derivative; and ∂z[T]/∂w is h[T−1] for wᵣ or x[T−1] for wₓ. Grouping the first three factors gives the first local error signal:
e[T] = (y−y*) · σ′(h[T]) · wᵧ   ← error · activation deriv · weight
This is the seed that propagates leftward. At every subsequent step k, the incoming gradient ∂L/∂h[k+1] plays the role that (y−y*)·wᵧ played here — the recursion simply repeats: compute e[k+1], use it with h[k] and x[k] to get the weight gradient contributions, then use e[k+1]·wᵣ = ∂L/∂h[k] (from the third line above) to seed the next step to the left. The ∂L/∂h[k] values shown below each node in the graph are exactly these propagated signals.
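The whole recursion fits in a short loop. A sketch of forward plus BPTT for the scalar network, with illustrative weight values (the T=6 input window mirrors the widget's setup):

```python
import numpy as np

def forward(x, wr, wx, wy, h0=0.0):
    """h[k+1] = tanh(wr·h[k] + wx·x[k]); y = wy·h[T]."""
    hs = [h0]
    for xk in x:
        hs.append(np.tanh(wr * hs[-1] + wx * xk))
    return hs, wy * hs[-1]

def bptt(x, y_star, wr, wx, wy):
    hs, y = forward(x, wr, wx, wy)
    T = len(x)
    dL_dh = (y - y_star) * wy            # (y − y*)·wy seeds dL/dh[T]
    g_wr = g_wx = 0.0
    for k in range(T - 1, -1, -1):       # sweep right to left
        e = dL_dh * (1.0 - hs[k + 1]**2) # local error e[k+1]
        g_wr += e * hs[k]                # accumulate Σ e[k+1]·h[k]
        g_wx += e * x[k]                 # accumulate Σ e[k+1]·x[k]
        dL_dh = e * wr                   # pass the error one step left
    return g_wr, g_wx

x = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]       # T = 6 window
g_wr, g_wx = bptt(x, y_star=0.5, wr=0.8, wx=0.6, wy=1.2)
```

The accumulated gradients agree with a finite-difference check on L = ½(y − y*)², which is a good way to convince yourself the recursion is right.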
Gradient Flow Through Time: Shared Weights & Accumulation
This network uses squared-error loss: L = ½(y − y*)². The ½ is a convenience that cancels the exponent when differentiating: ∂L/∂y = y − y*.

As derived in the left card, the local error signal at each step is: e[k+1] = (∂L/∂h[k+1]) · (1 − h[k+1]²) — the gradient of the loss with respect to h[k+1] scaled by the activation derivative. It captures how much the loss would change if h[k+1] changed, after accounting for the nonlinearity. Summing e[k+1] terms across k is therefore summing the loss-sensitivity at each time step, weighted by how much each step's input (h[k] or x[k]) contributed.

In a feedforward net each weight appears once, so ∂L/∂w is a single chain-rule term. In an RNN, wᵣ and wₓ appear at every step, so their gradients are sums across all T steps:
∂L/∂wᵣ = Σₖ e[k+1] · h[k]
∂L/∂wₓ = Σₖ e[k+1] · x[k]
∂L/∂wᵧ = (y−y*) · h[T]  (wᵧ appears once)


All T contributions must be accumulated before any weight moves.

The gradient ∂L/∂h[k] has been multiplied by wᵣ (along with an activation derivative σ′ ≤ 1) a total of T−k times on its journey from h[T]. If |wᵣ| < 1, each multiplication shrinks it — early steps contribute almost nothing. Arrow color encodes magnitude: gray = small, orange ≈ 1, red = large. Each weight is then updated once: w ← w − η·∂L/∂w, where η is the learning-rate slider.
Training Objective & the T=6 Window
The network is trained to produce a target output y* at step T=6 given the input sequence x[0..5]. The three weights — wᵣ (recurrent), wₓ (input), wᵧ (output) — are all updated by gradient descent after each forward+backward pass.

The choice of T matters: longer T lets the network use earlier inputs but makes gradients harder to propagate (more multiplications by wᵣ). In practice, truncated BPTT uses a fixed window (like T=6 here) that slides through long sequences, trading some long-range memory for computational tractability.

σ (sigma) is the activation function applied at each step — here σ = tanh. Its derivative σ′ = 1−h² appears at every backward step. When h≈±1 (saturated), σ′≈0, compounding with the wᵣ factor to make vanishing gradients even worse.
Rolled Network (What It Actually Looks Like)
The unrolled graph above is a teaching device — the actual network has a single hidden node h with a self-loop: recurrent weight wᵣ on the loop, input x[k] entering through wₓ, and output y[k] read out through wᵧ, with update h[k+1] = tanh(wᵣ·h[k] + wₓ·x[k]). The same three weights are reused at every step. Unrolling copies this diagram T times with shared weights, creating the chain above.
BPTT vs. Reservoir / ESN Training
BPTT trains all weights — wᵣ, wₓ, and wᵧ — by propagating gradients back through the unrolled chain. This is expensive and unstable when gradients vanish or explode.

Echo State Networks (ESNs) sidestep this entirely. The recurrent matrix A (and thus wᵣ here) is fixed at random initialization and never trained. Only the readout weights (wᵧ / Cᵀ) are trained — by simple linear regression, not backprop. This works because a well-chosen random reservoir with ρ(A) < 1 already projects the input history into a rich, separable state space (as you can see in the Tab 3 state-space trajectory).

The tradeoff: ESNs are fast and stable but the reservoir is fixed. BPTT-trained RNNs can learn task-specific dynamics but require gradient clipping or LSTMs to train reliably.
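The "linear regression, not backprop" readout step is worth seeing concretely. A sketch of ESN training on a synthetic two-class task (reservoir size, signals, and all names are illustrative assumptions, not the widget's):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20                                            # reservoir size (illustrative)
A = rng.standard_normal((n, n))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))   # ρ(A) = 0.9, then frozen
B = rng.standard_normal(n)

def final_state(x):
    """Drive the frozen reservoir with input sequence x; keep h[T]."""
    h = np.zeros(n)
    for xk in x:
        h = np.tanh(A @ h + B * xk)
    return h

# Two input patterns (step vs. alternating), several noisy examples each.
X, y = [], []
for _ in range(20):
    X.append(final_state(np.ones(30) + 0.1 * rng.standard_normal(30)))
    y.append(1.0)
    X.append(final_state((np.arange(30) % 2) + 0.1 * rng.standard_normal(30)))
    y.append(-1.0)
X, y = np.array(X), np.array(y)

# The only training step: fit the readout Cᵀ by least squares.
C, *_ = np.linalg.lstsq(X, y, rcond=None)
acc = np.mean(np.sign(X @ C) == y)
```

Everything except the final `lstsq` call is a forward pass through frozen random weights — that single linear solve is the entire training procedure.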
Vanilla RNN (from Tab ④)
h[k+1] = tanh(wᵣ·h[k] + wₓ·x[k]), with input weight wₓ, recurrent weight wᵣ, and readout weight wᵧ — gradient ∝ (wᵣ·σ′)^T → vanishes/explodes

Every backward step multiplies the gradient by wᵣ·σ′ — once per step. After T steps this factor has been applied T times. If |wᵣ|<1 the gradient vanishes; if |wᵣ|>1 it explodes.
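The geometric collapse or blow-up is easy to quantify. A one-liner sketch (the σ′ value 0.9 is an illustrative stand-in for a typical unsaturated tanh derivative):

```python
def gradient_factor(wr, T, sigma_prime=0.9):
    """Total backward scaling after T steps: (wr·σ′)^T."""
    return (wr * sigma_prime) ** T

small = gradient_factor(0.5, 20)   # |wr| < 1: the gradient vanishes
big = gradient_factor(2.0, 20)     # |wr| > 1: the gradient explodes
```

Already at T = 20 the two regimes differ by roughly twelve orders of magnitude, which is why vanilla RNNs struggle with even moderately long dependencies.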

LSTM Cell — adding a gradient highway
c — CELL STATE (gradient highway — additive updates): c[k] = f⊙c[k−1] + i⊙g, where g = tanh(Wg·[h[k−1]; x[k]]), f = σ(Wf·[h[k−1]; x[k]]), and i = σ(Wi·[h[k−1]; x[k]]).
h — HIDDEN STATE (output each step; feeds back as h[k−1] in the next cell): h[k] = o⊙tanh(c[k]), where o = σ(Wo·[h[k−1]; x[k]]).

The cell state c flows across the top with only additive modifications — no tanh squashing, no fixed wᵣ multiplication. Gradients traveling back through c are multiplied by the forget gate f (learned, ≈1 for things worth remembering) rather than by wᵣ·σ′.

The Gradient Highway (Cell State)
In the vanilla RNN, the gradient at h[k] is obtained by multiplying the gradient at h[k+1] by wᵣ·σ′(h[k+1]) at every step — the same factor repeated T−k times.

The LSTM introduces a second state variable c[k] that is updated additively: c[k] = f⊙c[k−1] + i⊙g. The gradient of the loss flowing back through c is: ∂L/∂c[k] = (∂L/∂c[k+1]) · f[k+1].

Because f is a learned sigmoid output (between 0 and 1), the network can learn to keep f≈1 for information it needs to retain — making the gradient factor close to 1 step after step, rather than shrinking by wᵣ·σ′<1 at every step. This is the core insight of LSTM: replace the fixed, uncontrolled wᵣ multiplication with a learnable, task-specific gating factor.
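The contrast can be made numeric. A sketch comparing the two per-step factors over a long sequence (the specific values 0.8, 0.9, and 0.97 are illustrative):

```python
import numpy as np

T = 100
# Vanilla RNN path: multiply by wr·σ′ at every step (wr=0.8, σ′=0.9).
vanilla = np.prod(np.full(T, 0.8 * 0.9))
# LSTM cell-state path: multiply by a learned forget gate f ≈ 1 (0.97).
highway = np.prod(np.full(T, 0.97))
```

After 100 steps the vanilla factor is numerically negligible while the gated factor is still a usable fraction of its starting value — the whole point of the highway.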
The Four LSTM Gates
All four gates receive the same concatenated input [h[k−1]; x[k]], each with its own learned weight matrix:

f = σ(Wf·[h[k−1]; x[k]])  (forget gate)
i = σ(Wi·[h[k−1]; x[k]])  (input gate)
g = tanh(Wg·[h[k−1]; x[k]])  (cell gate / candidate)
o = σ(Wo·[h[k−1]; x[k]])  (output gate)


f (forget gate) — sigmoid output in (0,1). Decides what fraction of the old cell state c[k−1] to keep. f≈0 erases; f≈1 preserves.

i (input gate) — sigmoid, controls how much of the candidate g to write into the cell state.

g (cell gate / candidate) — tanh output in (−1,1). The proposed new content to write into c. Added as i⊙g.

o (output gate) — sigmoid, controls how much of tanh(c[k]) is exposed as the hidden state h[k].

Cell state: c[k] = f⊙c[k−1] + i⊙g
Hidden state: h[k] = o⊙tanh(c[k])
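The four gates and two state updates translate directly into code. A sketch of one cell step with scalar h and x, so each weight matrix is a 1×2 row; the weights are arbitrary illustrative values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x, Wf, Wi, Wg, Wo):
    u = np.array([h_prev, x])     # concatenated input [h[k−1]; x[k]]
    f = sigmoid(Wf @ u)           # forget gate: keep fraction of old c
    i = sigmoid(Wi @ u)           # input gate: how much candidate to write
    g = np.tanh(Wg @ u)           # candidate content in (−1, 1)
    o = sigmoid(Wo @ u)           # output gate: how much of c to expose
    c = f * c_prev + i * g        # additive cell-state update
    h = o * np.tanh(c)            # hidden state fed to the next step
    return h, c

rng = np.random.default_rng(0)
Wf, Wi, Wg, Wo = rng.standard_normal((4, 2))   # one 1×2 row per gate
h, c = 0.0, 0.0
for x in [1.0, 0.0, 1.0]:
    h, c = lstm_step(h, c, x, Wf, Wi, Wg, Wo)
```

Note that h is always in (−1, 1) (a sigmoid times a tanh), while c can accumulate beyond that range — it is the unbounded, additively-updated highway.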
GRU: Simplified Gated Alternative to LSTM
The Gated Recurrent Unit (GRU) achieves similar gradient-flow benefits with fewer parameters by merging the forget and input gates into a single update gate z, and eliminating the separate cell state:

z = σ(Wz·[h[k−1]; x[k]])  (update gate)
r = σ(Wr·[h[k−1]; x[k]])  (reset gate)
g = tanh(Wg·[r⊙h[k−1]; x[k]])
h[k] = (1−z)⊙h[k−1] + z⊙g


z (update gate) — combines LSTM's forget and input gates into one: (1−z) keeps the old state, z writes new content. z≈0 copies h[k−1] unchanged; z≈1 replaces it with g. The network learns when to update and when to hold.

r (reset gate) — controls how much of h[k−1] is visible when computing the candidate g. r≈0 ignores history (fresh start); r≈1 uses it fully. This has no direct LSTM equivalent — it lets the GRU selectively forget past state when computing new content without a separate cell state variable.

g (candidate) — like LSTM's cell gate, but conditioned on r⊙h[k−1] instead of h[k−1] directly. The reset gate decides how much past context shapes the proposed new content.

The update h[k] = (1−z)⊙h[k−1] + z⊙g has the same additive structure as the LSTM cell state — gradient flows back through (1−z), which can stay near 1. GRUs are faster to train and often match LSTM performance on shorter sequences.
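The GRU equations above make an equally compact sketch, again with scalar h and x and arbitrary illustrative weights (one 1×2 row per gate):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x, Wz, Wr, Wg):
    u = np.array([h_prev, x])
    z = sigmoid(Wz @ u)                           # update gate: hold vs. rewrite
    r = sigmoid(Wr @ u)                           # reset gate: gate the history
    g = np.tanh(Wg @ np.array([r * h_prev, x]))   # candidate sees r⊙h[k−1]
    return (1.0 - z) * h_prev + z * g             # convex blend of old and new

rng = np.random.default_rng(0)
Wz, Wr, Wg = rng.standard_normal((3, 2))
h = 0.0
for x in [1.0, 0.0, 1.0]:
    h = gru_step(h, x, Wz, Wr, Wg)
```

Because the update is a convex combination of h[k−1] and a tanh candidate, the GRU hidden state stays in (−1, 1) without any separate cell variable.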
Training LSTMs: Same BPTT, Better Gradients
LSTMs are trained with exactly the same BPTT algorithm you saw in Tab ④ — forward pass, then chain-rule gradients backward through the unrolled computation graph.

The improvement is not algorithmic but structural: the gradient path through c[k] involves multiplication by f rather than by wᵣ·σ′, and f is learned to be favorable. In practice this means gradients can usefully propagate hundreds of steps rather than vanishing after ~10.

The gradient still vanishes eventually (if f<1 consistently) or can explode (gradient clipping is still routinely used). LSTMs are not a complete solution — they are a significantly better initialization point for the same underlying optimization problem.
LSTM vs ESN: Two Ends of a Spectrum
The three architectures from Tabs ③, ④, and ⑤ represent a spectrum of how the recurrent weights are treated:

ESN (Tab ③) — recurrent weights A fixed at random init, only the readout Cᵀ is trained (linear regression, no backprop). Fast, stable, but capacity is limited by the fixed reservoir.

Vanilla RNN (Tab ④) — all weights trained by BPTT. Full expressiveness in principle, but the gradient problem makes this hard in practice.

LSTM — all weights trained by BPTT, but the gated cell state provides a controlled gradient path. The gates themselves add parameters and expressiveness — the network learns not just what to compute but what to remember.

For many sequence tasks where the relevant context spans long time windows, LSTMs or GRUs are the practical choice. For tasks where a well-seeded random projection suffices, ESNs remain faster and more interpretable.