Transformer Architecture Explorer

Attention · Self-Supervision · Vision Transformers
© 2026 Theodore P. Pavlic ·
MIT License

Transformer Architecture

Click any block in the encoder diagram to learn what it does. Despite its power, a transformer is fundamentally a feedforward neural network — at inference it has no persistent internal state, and its only "memory" is whatever fits in the current context window. What makes it distinctive is how it mixes information: rather than compressing history into a hidden state (RNN) or restricting interactions to a local patch (CNN), every token can directly attend to every other token in a single parallel step. The architecture's inductive bias is deliberately weak — it makes almost no assumptions about where relevant information will be — which is both its strength and the reason it needs large amounts of training data to work well. See Tab ② for a deeper comparison with CNNs and RNNs.

Why "transformer"? Unlike an RNN, which passes context through a fixed-size hidden state sequentially, a transformer directly connects every token to every other token in a single step. Each encoder layer transforms the entire set of token representations in parallel, enabling rich, long-range contextual mixing from the very first layer — hence the name. The transformation at each layer is what's distinctive: global, parallel, and context-mixing, rather than sequential (RNN) or local (CNN).
Input Tokens
Token Embeddings
Positional Encoding
× N layers
Multi-Head Self-Attention
↓ residual skip connection
Add & Norm (1)
Cross-Attention (optional)
encoder-decoder only
Feed-Forward Network (FFN)
↓ residual skip connection
Add & Norm (2)
Output Representations

← Click a block to explore it


Encoder · Decoder · Encoder-Decoder. The diagram above shows a transformer encoder (BERT-style): every token attends to every other token, making it ideal for understanding tasks. A decoder (GPT-style) uses a causal mask so each token can only attend to tokens that came before it — enabling autoregressive generation (see Tab ③). A full encoder-decoder model (the original "Attention Is All You Need" architecture, T5, BART) uses both halves connected by a cross-attention bridge, letting a generated output sequence attend to an encoded input sequence.

Parameter Count at a Glance

For a standard encoder with vocabulary size V, model dimension d (also written dmodel — the dimensionality of every token's representation throughout the network), N layers, and learned positional encodings up to sequence length L:

Total ≈ V·d + L·d + N·12d²
• Token embeddings: V·d
• Positional encodings: L·d (if learned; zero if sinusoidal)
• Per encoder layer: 4d² (MHSA: W_Q, W_K, W_V, W_O) + 8d² (FFN: W_1, W_2, including biases) + 4d (LayerNorm γ,β ×2 — negligible) ≈ 12d² per layer

The N·12d² term dominates for large models — parameter count scales with depth and the square of model dimension. Doubling d quadruples per-layer cost.

Model      | V   | d     | N  | h  | ~Params
BERT-base  | 30k | 768   | 12 | 12 | ~110M
BERT-large | 30k | 1024  | 24 | 16 | ~340M
GPT-2      | 50k | 768   | 12 | 12 | ~117M
GPT-3      | 50k | 12288 | 96 | 96 | ~175B

BERT-base check: 30k×768 + 12×12×768² ≈ 23M + 85M ≈ 108M ✓
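The check above can be run directly. This is a minimal sketch of the approximation formula (the helper name `approx_params` is mine; LayerNorm and bias terms are dropped as negligible):

```python
def approx_params(V, d, N, L=0):
    """Rough transformer-encoder parameter count:
    V*d token embeddings + L*d learned positional encodings
    + N layers of ~12*d**2 (4*d**2 attention + 8*d**2 FFN)."""
    return V * d + L * d + N * 12 * d**2

# BERT-base: ~108M, matching the check above.
print(f"{approx_params(V=30_000, d=768, N=12) / 1e6:.0f}M")
```

Plugging in the table's other rows gives the same ballpark figures (e.g., BERT-large lands near 330M with sinusoidal-free, learned-position terms omitted).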

Transformer Topologies

Encoder Only
↕ bidirectional self-attention
Input Tokens
Encoder Layer ×N — each token attends to all others
Contextual Representations

The output is a rich contextual vector for every token — not a generated word. These vectors encode meaning geometrically: similar meanings land nearby in the vector space, so a query and a relevant page will have similar representations even if they share no keywords. For more demanding tasks (e.g., Q&A), the query and candidate passage can be concatenated as a single input so their tokens attend to each other directly, and the resulting [CLS] vector is trained to score how well they match.

BERT · RoBERTa · ELECTRA
📌 Google Search (2019) — BERT transformed how Google understands query intent and page meaning, one of the largest NLP deployments in history
Classification · NER · Embeddings · Extractive QA
Decoder Only
causal attention — left tokens only
(each token predicts the next; can't peek right)
Tokens generated so far
Decoder Layer ×N — each token attends to left tokens only
Next Token Distribution

Each forward pass predicts the next token given all previous ones, appends it, then repeats. At training time this is self-supervised (the sequence is its own label); at inference time it generates. There is no separate "task head" — the generation IS the output.
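The predict-append-repeat loop can be sketched in a few lines. This is an illustration, not any production decoder: `model` is a stand-in returning one logit row per position, and the toy model below simply counts upward so the loop's behavior is easy to follow:

```python
import numpy as np

def generate(model, tokens, eos_id, max_new=50):
    """Autoregressive decoding: each forward pass predicts the next
    token given all previous ones; append it and repeat until [EOS]."""
    tokens = list(tokens)
    for _ in range(max_new):
        logits = model(tokens)                # (len(tokens), vocab): one row per position
        next_id = int(np.argmax(logits[-1]))  # only the LAST position's vector is used
        tokens.append(next_id)
        if next_id == eos_id:                 # model signals completion itself
            break
    return tokens

# Toy "model": always predicts (last token + 1), so generation counts up to EOS.
VOCAB, EOS = 10, 5
toy = lambda toks: np.eye(VOCAB)[[min(t + 1, VOCAB - 1) for t in toks]]
print(generate(toy, [0], eos_id=EOS))  # [0, 1, 2, 3, 4, 5]
```

Greedy argmax is used here for determinism; real systems typically sample from the distribution (temperature, top-p) instead.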

GPT-2/3/4 · LLaMA · Mistral · Claude
📌 ChatGPT / Claude / Copilot — virtually every modern AI assistant and code completion tool uses this architecture
Text generation · Chat · Code · Language modeling
Encoder–Decoder
cross-attention bridge
Source tokens
Encoder ×N
cross-attn
Target tokens
Decoder ×N
Next Token Distribution

The encoder runs first on the full source sequence, producing N precomputed output vectors — one per source token. These are then held fixed. The decoder generates the output sequence token by token, with causal self-attention over the target tokens generated so far, plus cross-attention that queries into the encoder's precomputed vectors. Source and target tokens never occupy the same sequence — they only interact through the cross-attention lookup.

T5 · BART · "Attention Is All You Need"
📌 Google Translate & DeepL — the encoder reads the source language, the decoder generates the translation token by token
Translation · Summarization · Seq2Seq · Document Q&A

Inference in Practice: Concrete Examples

Regardless of architecture, a transformer always does the same core thing: it takes N input tokens and produces N output vectors in ℝd, one per token — what differs is which vectors are used, and how.

N can be very large. In a chat application, every prior message in the conversation is part of the context — by the time you are deep into a long conversation, N might be tens of thousands of tokens. GPT-4 supports up to 128,000 tokens; some models go beyond 1 million. Every one of those tokens is present in the forward pass, which is why long contexts are expensive and why the quadratic cost of attention (the N×N matrix) is such an active research problem.
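A back-of-envelope calculation shows why the N×N matrix dominates at long context (float32 scores, one head, one layer; real implementations often use fused kernels that avoid materializing the full matrix):

```python
# Memory to materialize one N×N attention-score matrix in float32.
for N in (1_000, 32_000, 128_000):
    gib = N * N * 4 / 2**30          # 4 bytes per score
    print(f"N={N:>7,}: {gib:6.1f} GiB per head per layer")
```

At N = 128,000 the single score matrix alone exceeds 60 GiB, which is why quadratic attention cost is such an active research problem.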

Encoder Only
[CLS] the cat sat on the mat
N = 7 tokens in ([CLS] prepended)
Encoder (N layers)
full bidirectional attention
7 vectors in ℝd
N vectors out — all at once
Task head:
• [CLS] vector → classifier (sentiment, topic)
• All vectors → label per token (Named Entity Recognition: person, location, etc.)
• Query+passage → relevance score (e.g., for search)
Decoder Only
"the cat sat"
N = 3 tokens so far
Decoder (N layers)
causal mask: left tokens only
3 vectors in ℝd
each predicts its next token
only the last vector is output
→ sample "on" → appendrepeat
→ sample "the" → appendrepeat
→ sample "mat" → appendrepeat
→ sample [EOS]stop
[EOS] is a learned special token — the model generates it when the output is complete
💡 K and V for previous tokens are cached — only the new token's Q, K, V are computed each step
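The cache trick can be verified numerically: the newest token's attention output is identical whether K and V are recomputed for every token or reused from a cache (a minimal single-head NumPy sketch; all matrix names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(Q, K, V):
    """Scaled dot-product attention; softmax over keys, row-wise."""
    s = Q @ K.T / np.sqrt(K.shape[1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

X = rng.standard_normal((5, d))              # embeddings for 5 tokens

# Full recompute: project all 5 tokens, keep the last position's output.
full = attend(X @ Wq, X @ Wk, X @ Wv)[-1]

# Cached: K and V for the first 4 tokens are reused verbatim; only the
# new token's Q, K, V are computed this step.
K_cache, V_cache = X[:4] @ Wk, X[:4] @ Wv
q, k, v = X[4:] @ Wq, X[4:] @ Wk, X[4:] @ Wv
cached = attend(q, np.vstack([K_cache, k]), np.vstack([V_cache, v]))[0]

assert np.allclose(full, cached)             # identical output, less compute
```

The equivalence holds because earlier tokens' keys and values never depend on later tokens under a causal mask, so they can be computed once and frozen.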
Encoder–Decoder
"le chat s'est assis"
source: N tokens — encoder runs first
Encoder → N vectors
precomputed once; held fixed
↕ cross-attention
"the cat sat"
target so far: M = 3 tokens
Decoder (N layers)
causal self-attn + cross-attn to source
3 vectors in ℝd
each predicts its next token
only the last vector is output
→ sample "on" → appendrepeat
→ sample "the" → appendrepeat
→ sample "mat" → appendrepeat
→ sample [EOS]stop
[EOS] is a learned special token — the model generates it when the output is complete

Key References

Vaswani, A. et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 5998–6008. NeurIPS proceedings
Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019, pp. 4171–4186. ACL Anthology N19-1423
Radford, A. et al. (2019). Language Models are Unsupervised Multitask Learners. OpenAI technical report. — GPT-2; no formal venue. OpenAI PDF
Brown, T. et al. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems 33 (NeurIPS 2020). NeurIPS proceedings
Liu, Y. et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. — arXiv preprint; no formal conference venue. arXiv:1907.11692

Attention Mechanism

Self-attention lets each token "query" all other tokens to build a contextual representation. All matrix operations shown below use real (toy) numbers computed in your browser — no fake data.

What is an attention head, concretely? A head is nothing more than its own set of three weight matrices: WQ, WK, and WV (each d×dk). That's the entire implementation — there is no other structure. For h heads, the model has h×3 such matrices. They are initialized randomly and trained end-to-end; the "concept" a head represents (subject↔verb, determiner↔noun, etc.) is not specified in advance — it emerges from gradient descent because specialization turns out to be useful. After training, a single shared WO matrix (hdk×d) projects the concatenated outputs of all heads back into the original d-dimensional space.

How to use the controls below: The sentence is the full input to the transformer — all tokens are processed simultaneously, not one at a time. The attention weights reveal how the model internally relates each token to every other token in the context of that sentence. Select a head, then click a token in the sentence to see, from that token's perspective, which other tokens it attends to most strongly under the concept that head has learned. Switching heads shows how different learned subspaces pick out different relationships within the same sentence.

Note: the four heads here are hand-crafted illustrations of the kinds of relationships that might emerge from training — not a guarantee of what will. Real trained transformers may carve up the space quite differently, and individual heads often don't correspond cleanly to a single human-interpretable concept.

Tokens & Embeddings

Click a token to set it as the query — you are selecting a viewpoint within the sentence, not entering a search term. The heatmap and score bars then show how strongly that token attends to every other token in the sentence under the currently selected head.

Query Token: Scores & Weights

For the selected query token, raw dot-product scores (Q·K / √dk) and their softmax-normalized attention weights. The highlighted row in the heatmap shows the same weights.

Click a token chip above.

Attention Heatmap

Each cell shows softmax(Q·KT / √dk) for that query–key pair. Brighter = higher weight. The selected query's row is highlighted in orange — matching the bars on the left.

Legend: low · medium · high · selected row (rows are indexed by query token)

💡 Most rows are flat for a given head — only tokens with the right features show clear patterns. Try clicking "cat" with Head 1, "on" with Head 3, or "old" (sentence 3) with Head 4.

Token Embeddings (toy, 8-dimensional)

Each token is represented as a vector with 8 hand-crafted dimensions. Darker = higher value. These named axes are illustrative only — in a real trained model, embedding dimensions have no human-interpretable labels. They are arbitrary directions in ℝd that gradient descent found useful; interpretable structure (if any) is emergent, not designed.

The embedding table connects directly to the attention scores. Here Q and K are matrices — every token's embedding is projected through WQ and WK respectively to produce one row each. For Head 1, WQ picks out the animate (×2) and subj (×1) dimensions; WK picks out verb (×2). The score for a single (query, key) pair is one cell of the full Q·KT matrix — e.g., for "cat" querying "sat":

Qcat = cat·WQ = [0.9×2 + 0.8×1, 0, 0, 0] = [2.6, 0, 0, 0] (WQ — animate col: ×2, subj col: ×1, all other cols: ×0)
Ksat = sat·WK = [0.9×2, 0, 0, 0] = [1.8, 0, 0, 0] (WK — verb col: ×2, all other cols: ×0)
Qcat·Ksat / √dk = 2.6 × 1.8 / √4 = 2.34 ✓ (matches score panel above)

For "the" querying "sat": "the" has animate ≈ 0 and subj ≈ 0.1, giving Q ≈ [0.1, 0, 0, 0], so Q·K / √dk ≈ 0.09 — nearly zero, confirming the flat row you see for function words in Head 1.

The Database Query Analogy

Think of attention as a soft database lookup — unlike a hard lookup returning exactly one row, attention returns a weighted blend of all values, with weights determined by query–key similarity.

Query (Q) — "What information am I looking for?" A learned projection of the current token into a subspace that encodes what kind of context it needs.

Key (K) — "What do I advertise that I contain?" A learned projection of each token into a matching subspace. The dot product Q·Kᵀ scores compatibility — how much a query should attend to each key.

Value (V) — "What I actually return when retrieved." A third projection, independent of the scoring. The final output is a weighted sum of all value vectors, with weights given by softmax(Q·Kᵀ / √dk).

The separation of K and V is what makes this powerful: a token can advertise one thing (key) and return something else (value). The scoring and the content retrieval are deliberately decoupled.
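The soft-vs-hard contrast can be made concrete in a few lines (toy 2-d keys and scalar values, all invented for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

keys   = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # what each row "advertises"
values = np.array([[10.0], [20.0], [30.0]])              # what each row returns
query  = np.array([1.0, 0.2])

# Hard lookup: exactly one row comes back.
hard = values[np.argmax(keys @ query)]

# Soft lookup: a similarity-weighted blend of ALL values.
soft = softmax(keys @ query) @ values
print(hard, soft)
```

The hard lookup returns only the best-matching value, while the soft output is pulled toward every value in proportion to query-key similarity; that blending is what makes attention differentiable end to end.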

The Math

In the full matrix formulation, Q, K, and V are matrices — every token's embedding is projected through WQ, WK, and WV respectively, producing one row per token. Q·KT is an N×N matrix of all pairwise raw scores; after dividing by √dk and applying softmax row-wise, each row sums to 1 — these normalized weights are what the Attention Heatmap above shows. Multiplying by V then produces one output vector per token.

Attention(Q, K, V) = softmax( Q·Kᵀ / √dₖ ) · V

√dₖ scaling: dot products grow in magnitude with dimension, pushing softmax toward near-one-hot distributions with vanishingly small gradients. Dividing by √dₖ keeps the variance stable.
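The variance claim is easy to check empirically: dot products of random dₖ-dimensional vectors have variance that grows with dₖ, while the scaled scores stay near variance 1 regardless of dimension (a NumPy sketch with standard-normal vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (4, 64, 1024):
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    raw = (q * k).sum(axis=1)                 # 10,000 raw dot-product scores
    print(d_k, round(raw.var(), 1), round((raw / np.sqrt(d_k)).var(), 2))
```

The unscaled variance tracks dₖ almost exactly; fed through a softmax, scores with variance in the hundreds would collapse to near-one-hot weights with vanishing gradients.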

Multi-Head: MultiHead(Q, K, V) = concat(head₁, …, headₕ) · W_O, where headᵢ = Attention(Q·Wᵢ^Q, K·Wᵢ^K, V·Wᵢ^V)

Each head projects into a different learned subspace, allowing different heads to capture different relationship types simultaneously — syntactic, semantic, positional, co-reference, etc.
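The multi-head formula can be assembled as a minimal NumPy sketch (dimensions and random weights are arbitrary; a real implementation would batch all heads into single tensor operations rather than a Python loop):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 16, 4
d_k = d // h                                 # each head works in a smaller subspace
X = rng.standard_normal((6, d))              # 6 token representations

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for _ in range(h):
    # Each head is nothing but its own W_Q, W_K, W_V (d × d_k each).
    Wq, Wk, Wv = (rng.standard_normal((d, d_k)) for _ in range(3))
    w = softmax_rows((X @ Wq) @ (X @ Wk).T / np.sqrt(d_k))
    heads.append(w @ (X @ Wv))               # (6, d_k) output per head

W_O = rng.standard_normal((h * d_k, d))      # shared output projection
out = np.concatenate(heads, axis=1) @ W_O    # concatenated heads back to (6, d)
print(out.shape)  # (6, 16)
```

Note that the per-head cost shrinks with d_k = d/h, so using h heads costs roughly the same as one full-width head while allowing h independent learned subspaces.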

Why multiple heads? In "The bank near the river overflowed its banks," the word bank is ambiguous. One head might track subject–verb relationships, another might disambiguate nouns via surrounding context, another might track long-distance dependencies. Multiple heads let the model hold several structural hypotheses simultaneously without committing to one projection subspace.
Inductive biases: CNN vs. RNN vs. Transformer. Every architecture encodes assumptions about where useful structure lives in the input:
  • CNN — assumes features are local and spatially invariant. Nearby tokens matter most; the same pattern is worth detecting anywhere. Works well for images and short n-gram features, but capturing global context requires stacking many layers.
  • RNN — assumes inputs arrive sequentially and compresses all prior history into a fixed-size hidden state. LSTMs partially address the memory bottleneck with learned gates that adaptively control what to write, keep, and forget.
  • Transformer — makes almost no structural assumption. Any token can attend to any other token regardless of distance, with weights computed fresh for each input. Attention weights play the same functional role as LSTM gates — both adaptively route information — but unlike an RNN, the transformer has no persistent state between inputs: all context must fit in the window. Maximum flexibility, but needs large data to learn where to look, and scales quadratically in sequence length.

Self-Supervised Pre-Training

One of the transformer's most powerful features: training labels are generated automatically from raw text — no human annotation required. This is the connection to latent learning: mere exposure to structured experience drives the formation of rich internal representations.

The key insight: Language carries its own supervision signal. By hiding parts of the input and training the model to recover them, we force it to build an internal model of grammar, semantics, and world knowledge — all from raw text, without any labeled data.

Masked Language Modeling (MLM)

Click tokens to mask them (toggle). In BERT-style MLM, 15% of tokens are randomly selected for prediction. Of those: 80% are replaced with the special [MASK] token, 10% are replaced with a random vocabulary token, and 10% are left unchanged. All 15% of selected tokens contribute to the training loss; the remaining 85% pass through the forward pass but produce none. The three-way split prevents the model from over-specializing to inputs containing [MASK], since no such tokens appear at fine-tuning or inference time. The full corrupted sequence is processed in a single forward pass, and predictions for all selected positions are made simultaneously.
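The 80/10/10 corruption procedure might be sketched like this (a simplified illustration; the sentinel MASK value and the helper name `corrupt` are mine, and real tokenizers reserve a proper [MASK] id):

```python
import numpy as np

rng = np.random.default_rng(0)
MASK, VOCAB = -1, 1000

def corrupt(tokens, p_select=0.15):
    """BERT-style corruption: select ~15% of positions; of those,
    80% -> [MASK], 10% -> random token, 10% -> left unchanged.
    Returns (corrupted tokens, boolean loss mask)."""
    tokens = np.array(tokens)
    selected = rng.random(len(tokens)) < p_select
    for i in np.flatnonzero(selected):
        r = rng.random()
        if r < 0.8:
            tokens[i] = MASK                 # 80%: substitute the mask sentinel
        elif r < 0.9:
            tokens[i] = rng.integers(VOCAB)  # 10%: substitute a random token
        # else: 10% left unchanged; still contributes to the loss
    return tokens, selected

corrupted, loss_mask = corrupt(list(range(20)))
print(corrupted, loss_mask.sum(), "positions in the loss")
```

The loss mask, not the presence of [MASK] tokens, determines which positions are scored, which is why the 10% "unchanged" tokens still train the model.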

Bidirectional context: Because masks are placed randomly and the model sees the full sequence around each mask, BERT-style models learn deeply contextualized representations. They excel at understanding tasks (question answering, classification, named entity recognition) but cannot straightforwardly generate text — you can't unmask one token, then the next, because the masks were the training signal, not a generation procedure.

MLM Attention Mask

Every position can attend to every other position — there is no blocking. Columns for masked tokens are highlighted orange: every row attending to a [M] column is attending to a substituted token, not the real word. Green rows are the 15% of selected positions whose predictions contribute to the training loss — this includes all three substitution types (80% [MASK], 10% random token, 10% unchanged). The remaining 85% of unselected positions pass through the forward pass but produce no training loss.

Legend: normal → normal · [M] row → normal col · normal row → [M] col · [M] row → [M] col (rows are indexed by query token)

Connection to Latent Learning

🐀 Latent Learning (Tolman, 1948)

Rats explore a maze without reward, apparently building no useful associations. When reward is introduced, they immediately exploit a near-optimal path — revealing that a rich cognitive map had been forming silently from exposure to structure alone.

🤖 Self-Supervised Pre-Training

A transformer is exposed to billions of tokens without task-specific labels. Internal representations form solely from statistical structure. When fine-tuned on a downstream task, the model immediately leverages its pre-trained representations — the latent structure it built during pre-training.

Self-Supervised vs. Supervised: A Parallel Controversy

The debate about whether self-supervised learning is genuinely different from supervised learning has a striking parallel in the history of psychology. Associative learning — encompassing both classical conditioning (Pavlov, Watson) and operant conditioning (Skinner) — argues that all learning reduces to connections formed through reinforcement or pairing. In classical conditioning, the organism already produces the correct response; training pairs it with a new stimulus. In operant conditioning, a consequence (reward or punishment) shapes the frequency of a behavior. Supervised learning maps onto both: the loss function acts like continuous negative reinforcement removed when the correct output is produced, while the pairing of input contexts with target outputs mirrors classical associative pairing. In all cases, an external agent decides what the correct response is and applies a signal that drives behavior toward it.

Tolman countered that animals build internal cognitive maps through mere exposure, without reinforcement, and that these latent representations transfer flexibly to new tasks. Behaviorists pushed back: isn't "latent learning" just associative learning with internal states as hidden mediators? The argument has never been fully resolved and remains live in behavioral neuroscience today.

The same spirit animates the supervised vs. self-supervised debate. The mechanistic skeptic's view: self-supervised learning is just supervised learning where the labels are auto-generated — inputs, targets, loss, backprop. The key distinction is where the labels come from: in supervised learning they are produced by an external agent at real cost; in self-supervised learning they emerge automatically from the structure of the data itself, with no external agent deciding what is correct.

This matters for scale — you can train on all text ever written — and for generality — predicting the next token forces the model to implicitly learn grammar, facts, coreference, and reasoning, not just whatever a human decided to label.

Causal LM (GPT-style) is the most borderline case. In MLM (BERT-style), the masking is random and artificially imposed — the model can't anticipate which tokens will be masked, making it feel more like genuine exploration. In causal LM, the targets are entirely predictable from the data structure: previous tokens in, next token out. The strongest defenses are: (1) no label file exists — the supervision signal is a mechanical consequence of sequential structure, not a human decision; (2) density — a sentence of N tokens yields N−1 training signals in one pass; (3) the targets aren't chosen by the researcher — you predict whatever natural language produces next, which encodes everything.

The honest conclusion, much like in the associative vs. latent learning debate: the line is real but blurry. Self-supervised is a genuinely different paradigm — annotation-free, scale-unlimited, general-purpose, and driven by environmental structure rather than an external agent — even if the gradient math is not fundamentally different from supervised learning.

Contrastive learning (SimCLR, MoCo, CLIP, DINO) takes this further by making the representational goal explicit: similar inputs should produce similar [CLS] vectors; dissimilar inputs should not. CLIP trains two encoders — one for images, one for text — so that matching image-caption pairs are close in the joint embedding space and mismatched pairs are far apart. The supervision signal comes entirely from naturally occurring image-caption pairs on the internet, with no human labels. This is self-supervised learning in its purest form: the structure of the world (images tend to co-occur with descriptive captions) provides the signal.
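A CLIP-style symmetric contrastive loss can be sketched directly from this description (a NumPy illustration, not CLIP's actual implementation; the batch size, embedding width, and temperature are arbitrary):

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matching image-caption pairs (the diagonal of
    the similarity matrix) should outscore all mismatched pairs."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (B, B) cosine similarities
    diag = np.arange(len(logits))

    def xent(l):                              # cross-entropy of each row vs its diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[diag, diag].mean()

    return (xent(logits) + xent(logits.T)) / 2  # image->text and text->image

rng = np.random.default_rng(0)
emb = rng.standard_normal((8, 32))
aligned = clip_loss(emb, emb)                  # perfectly matched pairs
shuffled = clip_loss(emb, rng.standard_normal((8, 32)))  # unrelated pairs
print(aligned, shuffled)
```

Matched pairs drive the loss toward zero while unrelated pairs leave it high, so minimizing it pulls true image-caption pairs together in the joint space and pushes mismatches apart.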

Key References

Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019, pp. 4171–4186. ACL Anthology N19-1423
Liu, Y. et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. — arXiv preprint; no formal conference venue. arXiv:1907.11692
Chen, T. et al. (2020). A Simple Framework for Contrastive Learning of Visual Representations. ICML 2020, PMLR 119:1597–1607. PMLR proceedings
Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021, PMLR 139:8748–8763. PMLR proceedings
Tolman, E.C. (1948). Cognitive maps in rats and men. Psychological Review, 55(4), 189–208. doi:10.1037/h0061626

Vision Transformer (ViT)

A Vision Transformer applies the exact same self-attention from Tab ② — but to image patches instead of words. An image is split into a grid of patches, each flattened into a vector ("patch token"), and then processed by a standard transformer encoder.

The key insight: There is nothing inherently linguistic about self-attention. Any collection of objects that might have meaningful relationships to each other can be processed with it — image patches, graph nodes, protein residues, musical notes, simulation grid cells. The mechanism is modality-agnostic.

Image + Patch Grid (click a patch to select)

Attention Map for Selected Patch

ViT vs. CNN: A CNN builds global context gradually through depth, with a strong inductive bias toward local spatial structure. A ViT can attend globally from layer one but must learn all spatial relationships from data. On large datasets ViT tends to match or beat CNNs; on smaller datasets the CNN's built-in locality assumption is an advantage — less to learn.

Token Sequence Input to the Transformer

Patches are numbered in raster scan order — left to right, row by row (like reading text). Without positional encodings, self-attention is permutation-invariant and the model would have no way of knowing that patch 3 is to the right of patch 2. A learned positional encoding vector is added to each patch token before the transformer so the model knows where each patch came from spatially. The original ViT paper (Dosovitskiy et al., 2021) found that these learned encodings spontaneously develop a 2D grid structure — nearby patches end up with similar positional embeddings.
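The patch-extraction step is a pure reshape, and sketching it makes the raster-scan ordering concrete (NumPy, with standard ViT numbers: a 224×224 RGB image and 16×16 patches; the helper name `patchify` is mine):

```python
import numpy as np

def patchify(img, p=16):
    """Split an (H, W, C) image into a raster-scan sequence of flattened
    p×p patch tokens: left to right, row by row."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0
    return (img.reshape(H // p, p, W // p, p, C)
               .transpose(0, 2, 1, 3, 4)      # (rows, cols, p, p, C)
               .reshape(-1, p * p * C))       # one flat vector per patch

img = np.zeros((224, 224, 3))
print(patchify(img).shape)  # (196, 768): 14×14 patches, 16·16·3 = 768 dims each
```

Each 768-dimensional patch vector is then linearly projected to the model dimension and has its positional encoding added before entering the encoder.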

A [CLS] token is prepended; its output is used for classification. In supervised ViT training, [CLS]'s output feeds a linear classifier trained on labeled images. In self-supervised ViT training (DINO), no labels are used — instead, two differently augmented views of the same image are passed through the network and their [CLS] outputs are trained to be similar, while [CLS] outputs from different images are pushed apart. This is contrastive learning applied to vision — the same principle as CLIP (Tab ③), but within a single modality.

Key References

Dosovitskiy, A. et al. (2021). An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. ICLR 2021. OpenReview
Caron, M. et al. (2021). Emerging Properties in Self-Supervised Vision Transformers. ICCV 2021. CVF proceedings
Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021, PMLR 139:8748–8763. PMLR proceedings
Clark, K. et al. (2019). What Does BERT Look at? An Analysis of BERT's Attention. ACL 2019 BlackboxNLP Workshop. ACL Anthology W19-4828