Toward Multimodal AI

© 2026 Theodore P. Pavlic · MIT License

Large multimodal models — systems that can see images and read text and reason across both — did not emerge all at once. They rest on three ideas built in sequence: a way to turn a visual scene into a flexible token-based representation; a way to train image and text encoders to share a common semantic space; and the insight that once visual information is in that space, a language model's reasoning machinery applies unchanged. This module builds that story from the ground up.

1. From Pixels to Tokens
   Why the shift from CNNs to Vision Transformers was necessary — and what the [CLS] token is.
2. Joint Embedding Spaces
   How contrastive training (CLIP) maps images and text into a shared geometry where cosine similarity equals semantic similarity.
3. Multimodal Reasoning
   How visual tokens are projected into the same space as text tokens so a language model's attention can reason over both.

Before any multimodal model can read and see simultaneously, it needs a way to encode images into a compact global embedding vector — the same currency text models already use. In principle, any image encoder can serve this role: CLIP's original paper (Radford et al. 2021) shipped both ResNet and ViT variants. But CNNs produce local features bound to spatial positions, with no single vector naturally representing "the whole image" — a global summary has to be forced out of pooling operations the architecture was not designed around. Vision Transformers (ViTs) solve this more cleanly through attention-based patch processing, producing a special [CLS] token — a vector with no spatial assignment that is explicitly trained to aggregate the entire scene. That clean global summary is why ViTs have become the preferred image encoder for CLIP and for the multimodal LLMs in Tab 3. Click any pixel or patch and use the controls to see how each architecture handles spatial information.

Convolutional Neural Network
Input · no convolution
Click any pixel, then increase depth. The orange region shows the receptive field — every pixel contributing to that neuron's activation.
Vision Transformer (ViT)
Click any patch. The color intensity shows attention weight. Unlike CNNs, attention is global from the first layer — heads specialize, not layers.
A trained ViT often attends strongly to nearby patches — similar to what a CNN filter would do. ViTs can learn locality when it is useful, but they are not forced to.

CNN Inductive Biases

Locality means every neuron connects only to a small spatial neighborhood; the receptive field grows only as depth increases, through composition of local windows. Shared filter weights enforce translation equivariance: a feature detector fires identically wherever the feature appears, and the feature map shifts along with the input. These structural priors are enormously helpful when training data are scarce. A global image summary can be extracted from a CNN — typically by average-pooling the final feature map — but this is a workaround, not a designed output. The architecture was built to produce spatial feature maps, and collapsing them into one vector is a fixed, unlearned aggregation that discards spatial structure — nothing like the attention-based aggregation that makes transformers effective at producing global summaries.
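The receptive-field arithmetic above can be sketched in a few lines. The layer configurations below are hypothetical, chosen only to show how slowly the receptive field grows through composition of small windows:

```python
# Receptive-field growth for a stack of conv layers (illustrative sketch;
# the layer configurations below are hypothetical, not from the text).
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, input-to-output order.
    Returns the receptive-field size (in input pixels) of one output neuron."""
    rf, jump = 1, 1  # jump = distance between neighboring neurons in input coords
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Three 3x3 stride-1 convs: the receptive field grows only linearly with depth.
print(receptive_field([(3, 1)] * 3))            # 7
# Strided layers grow it faster, at the cost of spatial resolution.
print(receptive_field([(3, 2), (3, 2), (3, 2)]))  # 15
```

Contrast this with a transformer, where every token can attend to every other token at layer one — there is no depth requirement before information becomes global.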

ViT and the [CLS] Token

A ViT splits the image into patches, embeds each as a token, and prepends a special [CLS] (classification) token with no spatial assignment. Through multi-head self-attention, [CLS] learns to aggregate information from every patch simultaneously. After the final layer, the [CLS] vector is a compact, position-independent embedding of the entire scene — exactly the shape a language model can work with. This is the token that CLIP and later multimodal models project into a joint image–text embedding space.
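A shape-level sketch of that tokenization, using typical ViT-Base numbers (224×224 input, 16×16 patches, 768-D embeddings); the random weights and the zero-initialized [CLS] vector are stand-ins for learned parameters:

```python
import numpy as np

# Shape-level sketch of ViT tokenization. Dimensions are ViT-Base-style
# (224x224 image, 16x16 patches, 768-D embeddings); weights are random
# stand-ins for learned parameters.
rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))

P, D = 16, 768
# Split into non-overlapping 16x16 patches; flatten each to a 768-D vector.
patches = img.reshape(224 // P, P, 224 // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * 3)              # (196, 768)

W_embed = rng.standard_normal((P * P * 3, D)) * 0.02  # learned patch projection
tokens = patches @ W_embed                            # (196, 768)

cls = np.zeros((1, D))           # learned [CLS] token (shown as zeros here)
pos = rng.standard_normal((197, D)) * 0.02  # learned positional embeddings
seq = np.concatenate([cls, tokens]) + pos   # (197, 768) -> transformer input
print(seq.shape)                            # (197, 768)
```

After the transformer layers run, row 0 of the output — the [CLS] position — is the global embedding the rest of this module builds on.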

Performance vs. Training Data Scale

The CNN's locality and weight-sharing priors help most at small scale. The ViT's unconstrained global attention becomes an advantage once data are plentiful — and ViTs scale better because they have no built-in spatial bottleneck.

A global image embedding is the bridge to Tab 2 →

Everything in the next tab depends on having a single embedding vector that represents an entire image. Any encoder can produce one — CLIP was released with both ResNet and ViT variants — but the ViT's [CLS] token is architecturally the cleanest solution: it is a vector with no spatial assignment, explicitly trained by the attention mechanism to aggregate the whole scene. CNNs require average-pooling a spatial feature map instead, a step that discards structural information the architecture never intended to compress into one vector. In practice, ViT-based CLIP models substantially outperform their ResNet counterparts at the same compute budget, and ViT encoders have become the standard for all major multimodal LLMs. The [CLS] vector that results — where cosine similarity measures semantic similarity — is what makes zero-shot classification and, ultimately, multimodal reasoning possible.

References
  1. LeCun, Y. et al. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551. doi:10.1162/neco.1989.1.4.541
  2. He, K. et al. (2016). Deep residual learning for image recognition. Proc. CVPR 2016, 770–778. doi:10.1109/CVPR.2016.90
  3. Dosovitskiy, A. et al. (2021). An image is worth 16×16 words: Transformers for image recognition at scale. ICLR 2021.
  4. Caron, M. et al. (2021). Emerging properties in self-supervised vision transformers. ICCV 2021. doi:10.1109/ICCV48922.2021.00951

The [CLS] token from Tab 1 is a vector in a high-dimensional space — but so is a word embedding. CLIP (Contrastive Language–Image Pretraining, Radford et al. 2021) demonstrated that you can co-train a ViT image encoder and a text encoder so their outputs land at the same location whenever image and text describe the same concept. The resulting space has a remarkable property: cosine similarity measures semantic proximity. A photograph of a dog and the phrase "a dog" are angularly close; a photograph and "a red sports car" are angularly far apart. This geometry enables zero-shot classification — and, as Tab 3 shows, it makes multimodal reasoning tractable.

circle = image embedding · square = text embedding · dashed line = matched image–text pair
Selection Details

Click any point in the embedding space to see cosine similarity scores and rankings.
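The "angularly close / angularly far" geometry can be made concrete with a toy computation (the 3-D vectors are invented for the sketch; real CLIP embeddings are 512-D or larger):

```python
import numpy as np

# Toy illustration of "cosine similarity = semantic proximity".
# The 3-D vectors are made up for the sketch.
def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

dog_img  = np.array([0.9, 0.1, 0.0])   # stand-in for an image embedding
dog_text = np.array([0.8, 0.2, 0.1])   # stand-in for the caption "a dog"
car_text = np.array([0.0, 0.2, 0.9])   # stand-in for "a red sports car"

print(cos(dog_img, dog_text))  # close to 1: angularly near
print(cos(dog_img, car_text))  # close to 0: angularly far
```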

The [CLS] Token & CLIP Embeddings

CLIP's image encoder is a ViT. The vector it contributes to the joint embedding space is the [CLS] token from Tab 1 — a global, position-free summary of the entire image. The text encoder produces an analogous aggregate vector at the end of the text sequence. Co-training forces these two vectors to be geometrically aligned when image and caption describe the same thing. The result: any downstream task that can be framed as a similarity comparison — retrieval, classification, grounding — becomes a single vector dot product.
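Framed as a similarity comparison, zero-shot classification reduces to a normalized dot product and an argmax. In the sketch below the "encoders" are random stand-ins for CLIP's, with the image embedding planted near one caption so the lookup has a clear winner:

```python
import numpy as np

# Zero-shot classification as a single similarity lookup. Random stand-ins
# replace the real CLIP encoders; labels and dimensions are illustrative.
rng = np.random.default_rng(0)
D = 512

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_emb = l2norm(rng.standard_normal((3, D)))    # "text encoder" outputs
# "Image encoder" output, planted near the first caption for the demo.
img_emb = l2norm(text_emb[0] + 0.1 * rng.standard_normal(D))

scores = text_emb @ img_emb       # cosine similarities (unit vectors)
print(labels[int(np.argmax(scores))])  # "a photo of a dog"
```

With real CLIP weights the only change is where `text_emb` and `img_emb` come from; the classification step itself stays this simple.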

InfoNCE Loss

Training is driven by the InfoNCE (Noise-Contrastive Estimation) loss. Given a batch of N matched image–text pairs, the model computes all N×N pairwise cosine similarities. A softmax cross-entropy then pushes each of the N diagonal entries (true matches) up relative to the N−1 off-diagonal entries in its row, and CLIP applies the loss symmetrically over rows (image→text) and columns (text→image). With a learned temperature τ, the image→text direction is:

L = −(1/N) Σᵢ log[ exp(sim(iᵢ, tᵢ)/τ) / Σⱼ exp(sim(iᵢ, tⱼ)/τ) ]

CLIP used batch sizes up to 32,768, flooding the loss with hard negatives and sharpening the learned geometry considerably.
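A minimal numpy version of the loss, including the symmetric image→text / text→image averaging CLIP uses (batch size, dimensions, and embeddings are toy stand-ins):

```python
import numpy as np

# Toy InfoNCE over a small batch, averaging the image->text and text->image
# directions as CLIP does. Embeddings are random stand-ins.
rng = np.random.default_rng(0)
N, D, tau = 4, 32, 0.07

def l2norm(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

img = l2norm(rng.standard_normal((N, D)))
txt = l2norm(img + 0.05 * rng.standard_normal((N, D)))  # matched pairs lie nearby

def info_nce(logits):
    # Row-wise softmax cross-entropy with the diagonal as the target.
    m = logits.max(axis=1, keepdims=True)               # numerical stability
    logp = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(logp)))

logits = img @ txt.T / tau                              # (N, N) similarity matrix
loss = 0.5 * (info_nce(logits) + info_nce(logits.T))    # symmetric CLIP loss
print(round(loss, 4))
```

Because the matched pairs were constructed to lie near each other, the diagonal dominates each row's softmax and the loss is small; shuffling the pairing would make it large.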

The joint embedding space points toward Tab 3 →

Both paradigms here — text-to-text classification and text-to-image retrieval — treat the embedding space as a lookup: find the nearest neighbor and return a label. But what if you wanted to reason freely about the content of an image? Ask "is the dog in the foreground or background?" or "describe what's happening in this scene?" That requires a language model that can attend jointly over visual tokens and text. In Tab 3 you'll see how a single linear projection maps this embedding space into a language model's token vocabulary — making open-ended visual reasoning possible without changing the architectures you've already explored.

References
  1. Radford, A. et al. (2021). Learning transferable visual models from natural language supervision. ICML 2021, PMLR 139.
  2. Chen, T. et al. (2020). A simple framework for contrastive learning of visual representations. ICML 2020, PMLR 119.
  3. Dosovitskiy, A. et al. (2021). An image is worth 16×16 words: Transformers for image recognition at scale. ICLR 2021.

CLIP (2021) proved that a joint image–text embedding space was achievable. The next step — taken by models like LLaVA, BLIP-2, and GPT-4V (2022–2023) — was to ask: what if we fed visual information directly into a language model alongside text?

The key is understanding what a language model actually processes. It never sees discrete words: each text subword is looked up in an embedding table and converted to a continuous vector (say, 4,096-D). Those vectors are what flow through the transformer's attention layers. The linear projection (shown below) maps each ViT patch embedding into that same 4,096-D space — producing continuous vectors with similar statistical properties to text token embeddings.

The projection is the only component trained during multimodal adaptation: with both the ViT and the LLM frozen, its weights are learned via supervised next-word prediction on image–caption pairs. The LLM's generation loss flows back through the frozen LLM and updates the projection until the visual vectors it produces let the LLM generate accurate descriptions. The LLM then attends over the entire concatenated sequence — visual embedding vectors and text embedding vectors — using the same weights, with no awareness of which came from which source. No new architecture is required. Both input streams become the same kind of continuous vector before they reach the transformer.

1. Input image, divided into patches
2. Patch embedding vectors + [CLS] vector output by the ViT
3. Linear Projection — the only component trained during multimodal adaptation
4. Input text query
5. Integer token IDs produced by the tokenizer
6. Embedding Table — learned during LLM pretraining, frozen during multimodal adaptation
Trace the pipeline for a specific query

Select a scenario below to watch the full pipeline animate step by step — from raw image patches through the ViT encoder, linear projection, and joint reasoning in the LLM.

Select a scenario
Pipeline trace

What the linear projection actually does — and how it learns

It is a single learned matrix multiplication: each ViT patch embedding (e.g., 768-D) is multiplied by a (4,096 × 768) weight matrix, producing a 4,096-D vector. These are not "text tokens" in any vocabulary sense — they are continuous vectors in the same high-dimensional space that text token embeddings occupy. The LLM's transformer layers process any sequence of 4,096-D vectors; they cannot tell whether a vector came from the vocabulary lookup table or from the projection.
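The matrix multiply, and the contrast with the text side's vocabulary lookup, can be made concrete. Dimensions follow the text (768-D patches, 4,096-D LLM space); the sequence lengths, token IDs, and 1,000-entry toy vocabulary are illustrative:

```python
import numpy as np

# The projection as a single matrix multiply, followed by the concatenation
# the LLM actually sees. Weights, token IDs, and the 1,000-entry vocabulary
# are toy stand-ins; dimensions follow the text (768 -> 4,096).
rng = np.random.default_rng(0)
D_vit, D_llm = 768, 4096

patch_emb = rng.standard_normal((196, D_vit))        # frozen ViT patch embeddings
W_proj = rng.standard_normal((D_llm, D_vit)) * 0.01  # the trained projection
visual_tokens = patch_emb @ W_proj.T                 # (196, 4096)

# Text path: tokenizer IDs -> frozen embedding-table lookup.
embed_table = rng.standard_normal((1000, D_llm)) * 0.01
token_ids = np.array([5, 17, 391])                   # illustrative token IDs
text_tokens = embed_table[token_ids]                 # (3, 4096)

# One undifferentiated sequence of 4,096-D vectors enters the transformer.
llm_input = np.concatenate([visual_tokens, text_tokens])
print(llm_input.shape)                               # (199, 4096)
```

Nothing in `llm_input` marks which rows came from the projection and which from the lookup table — which is exactly the point.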

The projection is also the only component being trained during multimodal adaptation. The ViT (CLIP pretrained) is frozen. The entire LLM — including its embedding table — is frozen. Backpropagation on image–caption generation tasks sends gradients backward through the frozen LLM layers into the projected visual embeddings, and those gradients update the projection matrix. It learns to produce whatever continuous vectors make the frozen LLM generate accurate text about image content — not "find the nearest vocabulary token," but find the point in 4,096-D space that best serves the downstream task.
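A toy gradient step showing the "only the projection trains" setup. A squared-error loss stands in for the LLM's generation loss, and the 8/16 dimensions stand in for 768/4,096 — everything here is illustrative:

```python
import numpy as np

# One training loop in which only the projection receives updates. A squared-
# error loss is a stand-in for backpropagating through a frozen LLM; all
# dimensions and values are toy.
rng = np.random.default_rng(0)
D_vit, D_llm = 8, 16                  # tiny stand-ins for 768 / 4,096

vit_out = rng.standard_normal(D_vit)  # frozen ViT output: never updated
target = rng.standard_normal(D_llm)   # stand-in for "what the frozen LLM needs"
W = rng.standard_normal((D_llm, D_vit)) * 0.1  # the projection: trainable

for _ in range(200):
    err = W @ vit_out - target        # forward pass through the projection
    grad_W = np.outer(err, vit_out)   # dL/dW for L = 0.5 * ||err||^2
    W -= 0.05 * grad_W                # only W changes; encoders stay frozen

print(float(np.linalg.norm(W @ vit_out - target)))  # residual shrinks toward 0
```

The point of the sketch is the update rule: gradients reach `W` and stop there, so the projection alone learns to place visual vectors where the downstream loss wants them.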

References
  1. Radford, A. et al. (2021). Learning transferable visual models from natural language supervision. ICML 2021, PMLR 139.
  2. Li, J., Li, D., Savarese, S., & Hoi, S. (2023). BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ICML 2023, PMLR 202, 20351–20383.
  3. Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual instruction tuning (LLaVA). NeurIPS 2023 (oral).