Transformer Architecture Explorer

Attention · Self-Supervision · Vision Transformers
© 2026 Theodore P. Pavlic ·
MIT License

Transformer Architecture

Click any block in the encoder diagram to learn what it does. Despite its power, a transformer is fundamentally a feedforward neural network — at inference it has no persistent internal state, and its only "memory" is whatever fits in the current context window. What makes it distinctive is how it mixes information: rather than compressing history into a hidden state (RNN) or restricting interactions to a local patch (CNN), every token can directly attend to every other token in a single parallel step. The architecture's inductive bias is deliberately weak — it makes almost no assumptions about where relevant information will be — which is both its strength and the reason it needs large amounts of training data to work well. See Tab ② for a deeper comparison with CNNs and RNNs.

Why "transformer"? Unlike an RNN, which passes context through a fixed-size hidden state sequentially, a transformer directly connects every token to every other token in a single step. Each encoder layer transforms the entire set of token representations in parallel, enabling rich, long-range contextual mixing from the very first layer — hence the name. The transformation at each layer is what's distinctive: global, parallel, and context-mixing, rather than sequential (RNN) or local (CNN).
Input Tokens
Token Embeddings
Positional Encoding
× N layers
Multi-Head Self-Attention
↓ residual skip connection
Add & Norm (1)
Cross-Attention (optional)
encoder-decoder only
Feed-Forward Network (FFN)
↓ residual skip connection
Add & Norm (2)
Output Representations

← Click a block to explore it


Encoder · Decoder · Encoder-Decoder. The diagram above shows a transformer encoder (BERT-style): every token attends to every other token, making it ideal for understanding tasks. A decoder (GPT-style) uses a causal mask so each token can only attend to tokens that came before it — enabling autoregressive generation (see Tab ③). A full encoder-decoder model (the original "Attention Is All You Need" architecture, T5, BART) uses both halves connected by a cross-attention bridge, letting a generated output sequence attend to an encoded input sequence.

Parameter Count at a Glance

For a standard encoder with vocabulary size V, model dimension d (also written dmodel — the dimensionality of every token's representation throughout the network), N layers, and learned positional encodings up to sequence length L:

Total ≈ V·d + L·d + N·12d²
• Token embeddings: V·d
• Positional encodings: L·d (if learned; zero if sinusoidal)
• Per encoder layer: 4d² (MHSA: W_Q, W_K, W_V, W_O) + 8d² (FFN: W_1, W_2, including biases) + 4d (LayerNorm γ,β ×2 — negligible) ≈ 12d² per layer

The N·12d² term dominates for large models — parameter count scales with depth and the square of model dimension. Doubling d quadruples per-layer cost.

Model      | V   | d     | N  | h  | ~Params
BERT-base  | 30k | 768   | 12 | 12 | ~110M
BERT-large | 30k | 1024  | 24 | 16 | ~340M
GPT-2      | 50k | 768   | 12 | 12 | ~117M
GPT-3      | 50k | 12288 | 96 | 96 | ~175B

BERT-base check: 30k×768 + 12×12×768² ≈ 23M + 85M ≈ 108M ✓
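The check above can be run directly. This is a minimal sketch of the approximation formula (the helper name `approx_params` is mine; LayerNorm and bias terms are dropped as negligible):

```python
def approx_params(V, d, N, L=0):
    """Rough transformer-encoder parameter count:
    V*d token embeddings + L*d learned positional encodings
    + N layers of ~12*d**2 (4*d**2 attention + 8*d**2 FFN)."""
    return V * d + L * d + N * 12 * d**2

# BERT-base: ~108M, matching the check above.
print(f"{approx_params(V=30_000, d=768, N=12) / 1e6:.0f}M")
```

Plugging in the table's other rows gives the same ballpark figures (e.g., BERT-large lands near 330M with sinusoidal-free, learned-position terms omitted).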

Transformer Topologies

Encoder Only
↕ bidirectional self-attention
Input Tokens
Encoder Layer ×N — each token attends to all others
Contextual Representations

The output is a rich contextual vector for every token — not a generated word. These vectors encode meaning geometrically: similar meanings land nearby in the vector space, so a query and a relevant page will have similar representations even if they share no keywords. For more demanding tasks (e.g., Q&A), the query and candidate passage can be concatenated as a single input so their tokens attend to each other directly, and the resulting [CLS] vector is trained to score how well they match.

BERT · RoBERTa · ELECTRA
📌 Google Search (2019) — BERT transformed how Google understands query intent and page meaning, one of the largest NLP deployments in history
Classification · NER · Embeddings · Extractive QA
Decoder Only
causal attention — left tokens only
(each token predicts the next; can't peek right)
Tokens generated so far
Decoder Layer ×N — each token attends to left tokens only
Next Token Distribution

Each forward pass predicts the next token given all previous ones, appends it, then repeats. At training time this is self-supervised (the sequence is its own label); at inference time it generates. There is no separate "task head" — the generation IS the output.
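The predict-append-repeat loop can be sketched in a few lines. This is an illustration, not any production decoder: `model` is a stand-in returning one logit row per position, and the toy model below simply counts upward so the loop's behavior is easy to follow:

```python
import numpy as np

def generate(model, tokens, eos_id, max_new=50):
    """Autoregressive decoding: each forward pass predicts the next
    token given all previous ones; append it and repeat until [EOS]."""
    tokens = list(tokens)
    for _ in range(max_new):
        logits = model(tokens)                # (len(tokens), vocab): one row per position
        next_id = int(np.argmax(logits[-1]))  # only the LAST position's vector is used
        tokens.append(next_id)
        if next_id == eos_id:                 # model signals completion itself
            break
    return tokens

# Toy "model": always predicts (last token + 1), so generation counts up to EOS.
VOCAB, EOS = 10, 5
toy = lambda toks: np.eye(VOCAB)[[min(t + 1, VOCAB - 1) for t in toks]]
print(generate(toy, [0], eos_id=EOS))  # [0, 1, 2, 3, 4, 5]
```

Greedy argmax is used here for determinism; real systems typically sample from the distribution (temperature, top-p) instead.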

GPT-2/3/4 · LLaMA · Mistral · Claude
📌 ChatGPT / Claude / Copilot — virtually every modern AI assistant and code completion tool uses this architecture
Text generation · Chat · Code · Language modeling
Encoder–Decoder
cross-attention bridge
Source tokens
Encoder ×N
cross-attn
Target tokens
Decoder ×N
Next Token Distribution

The encoder runs first on the full source sequence, producing N precomputed output vectors — one per source token. These are then held fixed. The decoder generates the output sequence token by token, with causal self-attention over the target tokens generated so far, plus cross-attention that queries into the encoder's precomputed vectors. Source and target tokens never occupy the same sequence — they only interact through the cross-attention lookup.

T5 · BART · "Attention Is All You Need"
📌 Google Translate & DeepL — the encoder reads the source language, the decoder generates the translation token by token
Translation · Summarization · Seq2Seq · Document Q&A

Inference in Practice: Concrete Examples

Regardless of architecture, a transformer always does the same core thing: it takes N input tokens and produces N output vectors in ℝd, one per token — what differs is which vectors are used, and how.

N can be very large. In a chat application, every prior message in the conversation is part of the context — by the time you are deep into a long conversation, N might be tens of thousands of tokens. GPT-4 supports up to 128,000 tokens; some models go beyond 1 million. Every one of those tokens is present in the forward pass, which is why long contexts are expensive and why the quadratic cost of attention (the N×N matrix) is such an active research problem.
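A back-of-envelope calculation shows why the N×N matrix dominates at long context (float32 scores, one head, one layer; real implementations often use fused kernels that avoid materializing the full matrix):

```python
# Memory to materialize one N×N attention-score matrix in float32.
for N in (1_000, 32_000, 128_000):
    gib = N * N * 4 / 2**30          # 4 bytes per score
    print(f"N={N:>7,}: {gib:6.1f} GiB per head per layer")
```

At N = 128,000 the single score matrix alone exceeds 60 GiB, which is why quadratic attention cost is such an active research problem.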

Encoder Only
[CLS] the cat sat on the mat
N = 7 tokens in ([CLS] prepended)
Encoder (N layers)
full bidirectional attention
7 vectors in ℝd
N vectors out — all at once
Task head:
• [CLS] vector → classifier (sentiment, topic)
• All vectors → label per token (Named Entity Recognition: person, location, etc.)
• Query+passage → relevance score (e.g., for search)
Decoder Only
"the cat sat"
N = 3 tokens so far
Decoder (N layers)
causal mask: left tokens only
3 vectors in ℝd
each predicts its next token
only the last vector is output
→ sample "on" → appendrepeat
→ sample "the" → appendrepeat
→ sample "mat" → appendrepeat
→ sample [EOS]stop
[EOS] is a learned special token — the model generates it when the output is complete
💡 K and V for previous tokens are cached — only the new token's Q, K, V are computed each step
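The cache trick can be verified numerically: the newest token's attention output is identical whether K and V are recomputed for every token or reused from a cache (a minimal single-head NumPy sketch; all matrix names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(Q, K, V):
    """Scaled dot-product attention; softmax over keys, row-wise."""
    s = Q @ K.T / np.sqrt(K.shape[1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

X = rng.standard_normal((5, d))              # embeddings for 5 tokens

# Full recompute: project all 5 tokens, keep the last position's output.
full = attend(X @ Wq, X @ Wk, X @ Wv)[-1]

# Cached: K and V for the first 4 tokens are reused verbatim; only the
# new token's Q, K, V are computed this step.
K_cache, V_cache = X[:4] @ Wk, X[:4] @ Wv
q, k, v = X[4:] @ Wq, X[4:] @ Wk, X[4:] @ Wv
cached = attend(q, np.vstack([K_cache, k]), np.vstack([V_cache, v]))[0]

assert np.allclose(full, cached)             # identical output, less compute
```

The equivalence holds because earlier tokens' keys and values never depend on later tokens under a causal mask, so they can be computed once and frozen.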
Encoder–Decoder
"le chat s'est assis"
source: N tokens — encoder runs first
Encoder → N vectors
precomputed once; held fixed
↕ cross-attention
"the cat sat"
target so far: M = 3 tokens
Decoder (N layers)
causal self-attn + cross-attn to source
3 vectors in ℝd
each predicts its next token
only the last vector is output
→ sample "on" → appendrepeat
→ sample "the" → appendrepeat
→ sample "mat" → appendrepeat
→ sample [EOS]stop
[EOS] is a learned special token — the model generates it when the output is complete

Key References

Vaswani, A. et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 5998–6008. NeurIPS proceedings
Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019, pp. 4171–4186. ACL Anthology N19-1423
Radford, A. et al. (2019). Language Models are Unsupervised Multitask Learners. OpenAI technical report. — GPT-2; no formal venue. OpenAI PDF
Brown, T. et al. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems 33 (NeurIPS 2020). NeurIPS proceedings
Liu, Y. et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. — arXiv preprint; no formal conference venue. arXiv:1907.11692

Attention Mechanism

Self-attention lets each token "query" all other tokens to build a contextual representation. All matrix operations shown below use real (toy) numbers computed in your browser — no fake data.

What is an attention head, concretely? A head is nothing more than its own set of three weight matrices: WQ, WK, and WV (each d×dk). That's the entire implementation — there is no other structure. For h heads, the model has h×3 such matrices. They are initialized randomly and trained end-to-end; the "concept" a head represents (subject↔verb, determiner↔noun, etc.) is not specified in advance — it emerges from gradient descent because specialization turns out to be useful. After training, a single shared WO matrix (hdk×d) projects the concatenated outputs of all heads back into the original d-dimensional space.

How to use the controls below: The sentence is the full input to the transformer — all tokens are processed simultaneously, not one at a time. The attention weights reveal how the model internally relates each token to every other token in the context of that sentence. Select a head, then click a token in the sentence to see, from that token's perspective, which other tokens it attends to most strongly under the concept that head has learned. Switching heads shows how different learned subspaces pick out different relationships within the same sentence.

Note: the four heads here are hand-crafted illustrations of the kinds of relationships that might emerge from training — not a guarantee of what will. Real trained transformers may carve up the space quite differently, and individual heads often don't correspond cleanly to a single human-interpretable concept.

Tokens & Embeddings

Click a token to set it as the query — you are selecting a viewpoint within the sentence, not entering a search term. The heatmap and score bars then show how strongly that token attends to every other token in the sentence under the currently selected head.

Query Token: Scores & Weights

For the selected query token, raw dot-product scores (Q·K / √dk) and their softmax-normalized attention weights. The highlighted row in the heatmap shows the same weights.

Click a token chip above.

Attention Heatmap

Each cell shows softmax(Q·KT / √dk) for that query–key pair. Brighter = higher weight. The selected query's row is highlighted in orange — matching the bars on the left.

Legend: low · medium · high · selected row (rows are indexed by query token)

💡 Most rows are flat for a given head — only tokens with the right features show clear patterns. Try clicking "cat" with Head 1, "on" with Head 3, or "old" (sentence 3) with Head 4.

Token Embeddings (toy, 8-dimensional)

Each token is represented as a vector with 8 hand-crafted dimensions. Darker = higher value. These named axes are illustrative only — in a real trained model, embedding dimensions have no human-interpretable labels. They are arbitrary directions in ℝd that gradient descent found useful; interpretable structure (if any) is emergent, not designed.

The embedding table connects directly to the attention scores. Here Q and K are matrices — every token's embedding is projected through WQ and WK respectively to produce one row each. For Head 1, WQ picks out the animate (×2) and subj (×1) dimensions; WK picks out verb (×2). The score for a single (query, key) pair is one cell of the full Q·KT matrix — e.g., for "cat" querying "sat":

Qcat = cat·WQ = [0.9×2 + 0.8×1, 0, 0, 0] = [2.6, 0, 0, 0] (WQ — animate col: ×2, subj col: ×1, all other cols: ×0)
Ksat = sat·WK = [0.9×2, 0, 0, 0] = [1.8, 0, 0, 0] (WK — verb col: ×2, all other cols: ×0)
Qcat·Ksat / √dk = 2.6 × 1.8 / √4 = 2.34 ✓ (matches score panel above)

For "the" querying "sat": "the" has animate ≈ 0 and subj ≈ 0.1, giving Q ≈ [0.1, 0, 0, 0], so Q·K / √dk ≈ 0.09 — nearly zero, confirming the flat row you see for function words in Head 1.

The Database Query Analogy

Think of attention as a soft database lookup — unlike a hard lookup returning exactly one row, attention returns a weighted blend of all values, with weights determined by query–key similarity.

Query (Q) — "What information am I looking for?" A learned projection of the current token into a subspace that encodes what kind of context it needs.

Key (K) — "What do I advertise that I contain?" A learned projection of each token into a matching subspace. The dot product Q·Kᵀ scores compatibility — how much a query should attend to each key.

Value (V) — "What I actually return when retrieved." A third projection, independent of the scoring. The final output is a weighted sum of all value vectors, with weights given by softmax(Q·Kᵀ / √dk).

The separation of K and V is what makes this powerful: a token can advertise one thing (key) and return something else (value). The scoring and the content retrieval are deliberately decoupled.
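The soft-vs-hard contrast can be made concrete in a few lines (toy 2-d keys and scalar values, all invented for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

keys   = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # what each row "advertises"
values = np.array([[10.0], [20.0], [30.0]])              # what each row returns
query  = np.array([1.0, 0.2])

# Hard lookup: exactly one row comes back.
hard = values[np.argmax(keys @ query)]

# Soft lookup: a similarity-weighted blend of ALL values.
soft = softmax(keys @ query) @ values
print(hard, soft)
```

The hard lookup returns only the best-matching value, while the soft output is pulled toward every value in proportion to query-key similarity; that blending is what makes attention differentiable end to end.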

The Math

In the full matrix formulation, Q, K, and V are matrices — every token's embedding is projected through WQ, WK, and WV respectively, producing one row per token. Q·KT is an N×N matrix of all pairwise raw scores; after dividing by √dk and applying softmax row-wise, each row sums to 1 — these normalized weights are what the Attention Heatmap above shows. Multiplying by V then produces one output vector per token.

Attention(Q, K, V) = softmax( Q·Kᵀ / √dₖ ) · V

√dₖ scaling: dot products grow in magnitude with dimension, pushing softmax toward near-one-hot distributions with vanishingly small gradients. Dividing by √dₖ keeps the variance stable.
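The variance claim is easy to check empirically: dot products of random dₖ-dimensional vectors have variance that grows with dₖ, while the scaled scores stay near variance 1 regardless of dimension (a NumPy sketch with standard-normal vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (4, 64, 1024):
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    raw = (q * k).sum(axis=1)                 # 10,000 raw dot-product scores
    print(d_k, round(raw.var(), 1), round((raw / np.sqrt(d_k)).var(), 2))
```

The unscaled variance tracks dₖ almost exactly; fed through a softmax, scores with variance in the hundreds would collapse to near-one-hot weights with vanishing gradients.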

Multi-Head: MultiHead(Q, K, V) = concat(head₁, …, headₕ) · W_O, where headᵢ = Attention(Q·Wᵢ^Q, K·Wᵢ^K, V·Wᵢ^V)

Each head projects into a different learned subspace, allowing different heads to capture different relationship types simultaneously — syntactic, semantic, positional, co-reference, etc.
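The multi-head formula can be assembled as a minimal NumPy sketch (dimensions and random weights are arbitrary; a real implementation would batch all heads into single tensor operations rather than a Python loop):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 16, 4
d_k = d // h                                 # each head works in a smaller subspace
X = rng.standard_normal((6, d))              # 6 token representations

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for _ in range(h):
    # Each head is nothing but its own W_Q, W_K, W_V (d × d_k each).
    Wq, Wk, Wv = (rng.standard_normal((d, d_k)) for _ in range(3))
    w = softmax_rows((X @ Wq) @ (X @ Wk).T / np.sqrt(d_k))
    heads.append(w @ (X @ Wv))               # (6, d_k) output per head

W_O = rng.standard_normal((h * d_k, d))      # shared output projection
out = np.concatenate(heads, axis=1) @ W_O    # concatenated heads back to (6, d)
print(out.shape)  # (6, 16)
```

Note that the per-head cost shrinks with d_k = d/h, so using h heads costs roughly the same as one full-width head while allowing h independent learned subspaces.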

Why multiple heads? In "The bank near the river overflowed its banks," the word bank is ambiguous. One head might track subject–verb relationships, another might disambiguate nouns via surrounding context, another might track long-distance dependencies. Multiple heads let the model hold several structural hypotheses simultaneously without committing to one projection subspace.
Inductive biases: CNN vs. RNN vs. Transformer. Every architecture encodes assumptions about where useful structure lives in the input:
  • CNN — assumes features are local and spatially invariant. Nearby tokens matter most; the same pattern is worth detecting anywhere. Works well for images and short n-gram features, but capturing global context requires stacking many layers.
  • RNN — assumes inputs arrive sequentially and compresses all prior history into a fixed-size hidden state. LSTMs partially address the memory bottleneck with learned gates that adaptively control what to write, keep, and forget.
  • Transformer — makes almost no structural assumption. Any token can attend to any other token regardless of distance, with weights computed fresh for each input. Attention weights play the same functional role as LSTM gates — both adaptively route information — but unlike an RNN, the transformer has no persistent state between inputs: all context must fit in the window. Maximum flexibility, but needs large data to learn where to look, and scales quadratically in sequence length.

Self-Supervised Pre-Training

One of the transformer's most powerful features: training labels are generated automatically from raw text — no human annotation required. This is the connection to latent learning: mere exposure to structured experience drives the formation of rich internal representations.

The key insight: Language carries its own supervision signal. By hiding parts of the input and training the model to recover them, we force it to build an internal model of grammar, semantics, and world knowledge — all from raw text, without any labeled data.

Masked Language Modeling (MLM)

Click tokens to mask them (toggle). In BERT-style MLM, 15% of tokens are randomly selected for prediction. Of those: 80% are replaced with the special [MASK] token, 10% are replaced with a random vocabulary token, and 10% are left unchanged. All 15% of selected tokens contribute to the training loss; the remaining 85% pass through the forward pass but produce none. The three-way split prevents the model from over-specializing to inputs containing [MASK], since no such tokens appear at fine-tuning or inference time. The full corrupted sequence is processed in a single forward pass, and predictions for all selected positions are made simultaneously.
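The 80/10/10 corruption procedure might be sketched like this (a simplified illustration; the sentinel MASK value and the helper name `corrupt` are mine, and real tokenizers reserve a proper [MASK] id):

```python
import numpy as np

rng = np.random.default_rng(0)
MASK, VOCAB = -1, 1000

def corrupt(tokens, p_select=0.15):
    """BERT-style corruption: select ~15% of positions; of those,
    80% -> [MASK], 10% -> random token, 10% -> left unchanged.
    Returns (corrupted tokens, boolean loss mask)."""
    tokens = np.array(tokens)
    selected = rng.random(len(tokens)) < p_select
    for i in np.flatnonzero(selected):
        r = rng.random()
        if r < 0.8:
            tokens[i] = MASK                 # 80%: substitute the mask sentinel
        elif r < 0.9:
            tokens[i] = rng.integers(VOCAB)  # 10%: substitute a random token
        # else: 10% left unchanged; still contributes to the loss
    return tokens, selected

corrupted, loss_mask = corrupt(list(range(20)))
print(corrupted, loss_mask.sum(), "positions in the loss")
```

The loss mask, not the presence of [MASK] tokens, determines which positions are scored, which is why the 10% "unchanged" tokens still train the model.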

Bidirectional context: Because masks are placed randomly and the model sees the full sequence around each mask, BERT-style models learn deeply contextualized representations. They excel at understanding tasks (question answering, classification, named entity recognition) but cannot straightforwardly generate text — you can't unmask one token, then the next, because the masks were the training signal, not a generation procedure.

MLM Attention Mask

Every position can attend to every other position — there is no blocking. Columns for masked tokens are highlighted orange: every row attending to a [M] column is attending to a substituted token, not the real word. Green rows are the 15% of selected positions whose predictions contribute to the training loss — this includes all three substitution types (80% [MASK], 10% random token, 10% unchanged). The remaining 85% of unselected positions pass through the forward pass but produce no training loss.

Legend: normal → normal · [M] row → normal col · normal row → [M] col · [M] row → [M] col (rows are indexed by query token)

Connection to Latent Learning

🐀 Latent Learning (Tolman, 1948)

Rats explore a maze without reward, apparently building no useful associations. When reward is introduced, they immediately exploit a near-optimal path — revealing that a rich cognitive map had been forming silently from exposure to structure alone.

🤖 Self-Supervised Pre-Training

A transformer is exposed to billions of tokens without task-specific labels. Internal representations form solely from statistical structure. When fine-tuned on a downstream task, the model immediately leverages its pre-trained representations — the latent structure it built during pre-training.

Self-Supervised vs. Supervised: A Parallel Controversy

The debate about whether self-supervised learning is genuinely different from supervised learning has a striking parallel in the history of psychology. Associative learning — encompassing both classical conditioning (Pavlov, Watson) and operant conditioning (Skinner) — argues that all learning reduces to connections formed through reinforcement or pairing. In classical conditioning, the organism already produces the correct response; training pairs it with a new stimulus. In operant conditioning, a consequence (reward or punishment) shapes the frequency of a behavior. Supervised learning maps onto both: the loss function acts like continuous negative reinforcement removed when the correct output is produced, while the pairing of input contexts with target outputs mirrors classical associative pairing. In all cases, an external agent decides what the correct response is and applies a signal that drives behavior toward it.

Tolman countered that animals build internal cognitive maps through mere exposure, without reinforcement, and that these latent representations transfer flexibly to new tasks. Behaviorists pushed back: isn't "latent learning" just associative learning with internal states as hidden mediators? The argument has never been fully resolved and remains live in behavioral neuroscience today.

The same spirit animates the supervised vs. self-supervised debate. The mechanistic skeptic's view: self-supervised learning is just supervised learning where the labels are auto-generated — inputs, targets, loss, backprop. The key distinction is where the labels come from: in supervised learning they are produced by an external agent at real cost; in self-supervised learning they emerge automatically from the structure of the data itself, with no external agent deciding what is correct.

This matters for scale — you can train on all text ever written — and for generality — predicting the next token forces the model to implicitly learn grammar, facts, coreference, and reasoning, not just whatever a human decided to label.

Causal LM (GPT-style) is the most borderline case. In MLM (BERT-style), the masking is random and artificially imposed — the model can't anticipate which tokens will be masked, making it feel more like genuine exploration. In causal LM, the targets are entirely predictable from the data structure: previous tokens in, next token out. The strongest defenses are: (1) no label file exists — the supervision signal is a mechanical consequence of sequential structure, not a human decision; (2) density — a sentence of N tokens yields N−1 training signals in one pass; (3) the targets aren't chosen by the researcher — you predict whatever natural language produces next, which encodes everything.

The honest conclusion, much like in the associative vs. latent learning debate: the line is real but blurry. Self-supervised is a genuinely different paradigm — annotation-free, scale-unlimited, general-purpose, and driven by environmental structure rather than an external agent — even if the gradient math is not fundamentally different from supervised learning.

Contrastive learning (SimCLR, MoCo, CLIP, DINO) takes this further by making the representational goal explicit: similar inputs should produce similar [CLS] vectors; dissimilar inputs should not. CLIP trains two encoders — one for images, one for text — so that matching image-caption pairs are close in the joint embedding space and mismatched pairs are far apart. The supervision signal comes entirely from naturally occurring image-caption pairs on the internet, with no human labels. This is self-supervised learning in its purest form: the structure of the world (images tend to co-occur with descriptive captions) provides the signal.
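A CLIP-style symmetric contrastive loss can be sketched directly from this description (a NumPy illustration, not CLIP's actual implementation; the batch size, embedding width, and temperature are arbitrary):

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matching image-caption pairs (the diagonal of
    the similarity matrix) should outscore all mismatched pairs."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (B, B) cosine similarities
    diag = np.arange(len(logits))

    def xent(l):                              # cross-entropy of each row vs its diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[diag, diag].mean()

    return (xent(logits) + xent(logits.T)) / 2  # image->text and text->image

rng = np.random.default_rng(0)
emb = rng.standard_normal((8, 32))
aligned = clip_loss(emb, emb)                  # perfectly matched pairs
shuffled = clip_loss(emb, rng.standard_normal((8, 32)))  # unrelated pairs
print(aligned, shuffled)
```

Matched pairs drive the loss toward zero while unrelated pairs leave it high, so minimizing it pulls true image-caption pairs together in the joint space and pushes mismatches apart.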

Key References

Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019, pp. 4171–4186. ACL Anthology N19-1423
Liu, Y. et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. — arXiv preprint; no formal conference venue. arXiv:1907.11692
Chen, T. et al. (2020). A Simple Framework for Contrastive Learning of Visual Representations. ICML 2020, PMLR 119:1597–1607. PMLR proceedings
Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021, PMLR 139:8748–8763. PMLR proceedings
Tolman, E.C. (1948). Cognitive maps in rats and men. Psychological Review, 55(4), 189–208. doi:10.1037/h0061626

Vision Transformer (ViT)

A Vision Transformer applies the exact same self-attention from Tab ② — but to image patches instead of words. An image is split into a grid of patches, each flattened into a vector ("patch token"), and then processed by a standard transformer encoder.

The key insight: There is nothing inherently linguistic about self-attention. Any collection of objects that might have meaningful relationships to each other can be processed with it — image patches, graph nodes, protein residues, musical notes, simulation grid cells. The mechanism is modality-agnostic.

Image + Patch Grid (click a patch to select)

Attention Map for Selected Patch

ViT vs. CNN: A CNN builds global context gradually through depth, with a strong inductive bias toward local spatial structure. A ViT can attend globally from layer one but must learn all spatial relationships from data. On large datasets ViT tends to match or beat CNNs; on smaller datasets the CNN's built-in locality assumption is an advantage — less to learn.

Token Sequence Input to the Transformer

Patches are numbered in raster scan order — left to right, row by row (like reading text). Without positional encodings, self-attention is permutation-invariant and the model would have no way of knowing that patch 3 is to the right of patch 2. A learned positional encoding vector is added to each patch token before the transformer so the model knows where each patch came from spatially. The original ViT paper (Dosovitskiy et al., 2021) found that these learned encodings spontaneously develop a 2D grid structure — nearby patches end up with similar positional embeddings.
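The patch-extraction step is a pure reshape, and sketching it makes the raster-scan ordering concrete (NumPy, with standard ViT numbers: a 224×224 RGB image and 16×16 patches; the helper name `patchify` is mine):

```python
import numpy as np

def patchify(img, p=16):
    """Split an (H, W, C) image into a raster-scan sequence of flattened
    p×p patch tokens: left to right, row by row."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0
    return (img.reshape(H // p, p, W // p, p, C)
               .transpose(0, 2, 1, 3, 4)      # (rows, cols, p, p, C)
               .reshape(-1, p * p * C))       # one flat vector per patch

img = np.zeros((224, 224, 3))
print(patchify(img).shape)  # (196, 768): 14×14 patches, 16·16·3 = 768 dims each
```

Each 768-dimensional patch vector is then linearly projected to the model dimension and has its positional encoding added before entering the encoder.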

A [CLS] token is prepended; its output is used for classification. In supervised ViT training, [CLS]'s output feeds a linear classifier trained on labeled images. In self-supervised ViT training (DINO), no labels are used — instead, two differently augmented views of the same image are passed through the network and their [CLS] outputs are trained to be similar, while [CLS] outputs from different images are pushed apart. This is contrastive learning applied to vision — the same principle as CLIP (Tab ③), but within a single modality.

Key References

Dosovitskiy, A. et al. (2021). An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. ICLR 2021. OpenReview
Caron, M. et al. (2021). Emerging Properties in Self-Supervised Vision Transformers. ICCV 2021. CVF proceedings
Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021, PMLR 139:8748–8763. PMLR proceedings
Clark, K. et al. (2019). What Does BERT Look at? An Analysis of BERT's Attention. ACL 2019 BlackboxNLP Workshop. ACL Anthology W19-4828