Toward Multimodal AI

© 2026 Theodore P. Pavlic · MIT License

Large multimodal models — systems that can see images and read text and reason across both — did not emerge all at once. They rest on three ideas built in sequence: a way to turn a visual scene into a flexible token-based representation; a way to train image and text encoders to share a common semantic space; and the insight that once visual information is in that space, a language model's reasoning machinery applies unchanged. This module builds that story from the ground up.

1. From Pixels to Tokens
   Why the shift from CNNs to Vision Transformers was necessary — and what the [CLS] token is.
2. Joint Embedding Spaces
   How contrastive training (CLIP) maps images and text into a shared geometry where cosine similarity equals semantic similarity.
3. Multimodal Reasoning
   How visual tokens are projected into the same space as text tokens so a language model's attention can reason over both.

Before any multimodal model can read and see simultaneously, it needs a way to encode images into a compact global embedding vector — the same currency text models already use. In principle, any image encoder can serve this role: CLIP's original paper (Radford et al. 2021) shipped both ResNet and ViT variants. But CNNs produce local features bound to spatial positions, with no single vector naturally representing "the whole image" — a global summary has to be forced out of pooling operations the architecture was not designed around. Vision Transformers (ViTs) solve this more cleanly through attention-based patch processing, producing a special [CLS] token — a vector with no spatial assignment that is explicitly trained to aggregate the entire scene. That clean global summary is why ViTs have become the preferred image encoder for CLIP and for the multimodal LLMs in Tab 3. Click any pixel or patch and use the controls to see how each architecture handles spatial information.

Convolutional Neural Network
Input · no convolution
Click any pixel, then increase depth. The orange region shows the receptive field — every pixel contributing to that neuron's activation.
Vision Transformer (ViT)
Click any patch. The color intensity shows attention weight. Unlike CNNs, attention is global from the first layer — heads specialize, not layers.
A trained ViT often attends strongly to nearby patches — similar to what a CNN filter would do. ViTs can learn locality when it is useful, but they are not forced to.

CNN Inductive Biases

Locality means every neuron connects only to a small spatial neighborhood; the receptive field grows only as depth increases, through composition of local windows. Shared filter weights enforce translation equivariance: a feature detector fires identically wherever the feature appears, and the feature map shifts along with the input. These structural priors are enormously helpful when training data are scarce. A global image summary can be extracted from a CNN — typically by average-pooling the final feature map — but this is a workaround, not a designed output. The architecture was built to produce spatial feature maps, and collapsing them into one vector is a fixed, unlearned aggregation that discards spatial structure — nothing like the attention-based aggregation that makes transformers effective at producing global summaries.
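The receptive-field arithmetic above can be sketched in a few lines. The layer configurations below are hypothetical, chosen only to show how slowly the receptive field grows through composition of small windows:

```python
# Receptive-field growth for a stack of conv layers (illustrative sketch;
# the layer configurations below are hypothetical, not from the text).
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, input-to-output order.
    Returns the receptive-field size (in input pixels) of one output neuron."""
    rf, jump = 1, 1  # jump = distance between neighboring neurons in input coords
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Three 3x3 stride-1 convs: the receptive field grows only linearly with depth.
print(receptive_field([(3, 1)] * 3))            # 7
# Strided layers grow it faster, at the cost of spatial resolution.
print(receptive_field([(3, 2), (3, 2), (3, 2)]))  # 15
```

Contrast this with a transformer, where every token can attend to every other token at layer one — there is no depth requirement before information becomes global.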

ViT and the [CLS] Token

A ViT splits the image into patches, embeds each as a token, and prepends a special [CLS] (classification) token with no spatial assignment. Through multi-head self-attention, [CLS] learns to aggregate information from every patch simultaneously. After the final layer, the [CLS] vector is a compact, position-independent embedding of the entire scene — exactly the shape a language model can work with. This is the token that CLIP and later multimodal models project into a joint image–text embedding space.
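A shape-level sketch of that tokenization, using typical ViT-Base numbers (224×224 input, 16×16 patches, 768-D embeddings); the random weights and the zero-initialized [CLS] vector are stand-ins for learned parameters:

```python
import numpy as np

# Shape-level sketch of ViT tokenization. Dimensions are ViT-Base-style
# (224x224 image, 16x16 patches, 768-D embeddings); weights are random
# stand-ins for learned parameters.
rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))

P, D = 16, 768
# Split into non-overlapping 16x16 patches; flatten each to a 768-D vector.
patches = img.reshape(224 // P, P, 224 // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * 3)              # (196, 768)

W_embed = rng.standard_normal((P * P * 3, D)) * 0.02  # learned patch projection
tokens = patches @ W_embed                            # (196, 768)

cls = np.zeros((1, D))           # learned [CLS] token (shown as zeros here)
pos = rng.standard_normal((197, D)) * 0.02  # learned positional embeddings
seq = np.concatenate([cls, tokens]) + pos   # (197, 768) -> transformer input
print(seq.shape)                            # (197, 768)
```

After the transformer layers run, row 0 of the output — the [CLS] position — is the global embedding the rest of this module builds on.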

Performance vs. Training Data Scale

The CNN's locality and weight-sharing priors help most at small scale. The ViT's unconstrained global attention becomes an advantage once data are plentiful — and ViTs scale better because they have no built-in spatial bottleneck.

A global image embedding is the bridge to Tab 2 →

Everything in the next tab depends on having a single embedding vector that represents an entire image. Any encoder can produce one — CLIP was released with both ResNet and ViT variants — but the ViT's [CLS] token is architecturally the cleanest solution: it is a vector with no spatial assignment, explicitly trained by the attention mechanism to aggregate the whole scene. CNNs require average-pooling a spatial feature map instead, a step that discards structural information the architecture never intended to compress into one vector. In practice, ViT-based CLIP models substantially outperform their ResNet counterparts at the same compute budget, and ViT encoders have become the standard for all major multimodal LLMs. The [CLS] vector that results — where cosine similarity measures semantic similarity — is what makes zero-shot classification and, ultimately, multimodal reasoning possible.

References
  1. LeCun, Y. et al. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551. doi:10.1162/neco.1989.1.4.541
  2. He, K. et al. (2016). Deep residual learning for image recognition. Proc. CVPR 2016, 770–778. doi:10.1109/CVPR.2016.90
  3. Dosovitskiy, A. et al. (2021). An image is worth 16×16 words: Transformers for image recognition at scale. ICLR 2021.
  4. Caron, M. et al. (2021). Emerging properties in self-supervised vision transformers. ICCV 2021. doi:10.1109/ICCV48922.2021.00951

The [CLS] token from Tab 1 is a vector in a high-dimensional space — but so is a word embedding. CLIP (Contrastive Language–Image Pretraining, Radford et al. 2021) demonstrated that you can co-train a ViT image encoder and a text encoder so their outputs land at the same location whenever image and text describe the same concept. The resulting space has a remarkable property: cosine similarity measures semantic proximity. A photograph of a dog and the phrase "a dog" are angularly close; a photograph and "a red sports car" are angularly far apart. This geometry enables zero-shot classification — and, as Tab 3 shows, it makes multimodal reasoning tractable.

circle = image embedding · square = text embedding · dashed line = matched image–text pair
Selection Details

Click any point in the embedding space to see cosine similarity scores and rankings.
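The "angularly close / angularly far" geometry can be made concrete with a toy computation (the 3-D vectors are invented for the sketch; real CLIP embeddings are 512-D or larger):

```python
import numpy as np

# Toy illustration of "cosine similarity = semantic proximity".
# The 3-D vectors are made up for the sketch.
def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

dog_img  = np.array([0.9, 0.1, 0.0])   # stand-in for an image embedding
dog_text = np.array([0.8, 0.2, 0.1])   # stand-in for the caption "a dog"
car_text = np.array([0.0, 0.2, 0.9])   # stand-in for "a red sports car"

print(cos(dog_img, dog_text))  # close to 1: angularly near
print(cos(dog_img, car_text))  # close to 0: angularly far
```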

The [CLS] Token & CLIP Embeddings

CLIP's image encoder is a ViT. The vector it contributes to the joint embedding space is the [CLS] token from Tab 1 — a global, position-free summary of the entire image. The text encoder produces an analogous aggregate vector at the end of the text sequence. Co-training forces these two vectors to be geometrically aligned when image and caption describe the same thing. The result: any downstream task that can be framed as a similarity comparison — retrieval, classification, grounding — becomes a single vector dot product.
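Framed as a similarity comparison, zero-shot classification reduces to a normalized dot product and an argmax. In the sketch below the "encoders" are random stand-ins for CLIP's, with the image embedding planted near one caption so the lookup has a clear winner:

```python
import numpy as np

# Zero-shot classification as a single similarity lookup. Random stand-ins
# replace the real CLIP encoders; labels and dimensions are illustrative.
rng = np.random.default_rng(0)
D = 512

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_emb = l2norm(rng.standard_normal((3, D)))    # "text encoder" outputs
# "Image encoder" output, planted near the first caption for the demo.
img_emb = l2norm(text_emb[0] + 0.1 * rng.standard_normal(D))

scores = text_emb @ img_emb       # cosine similarities (unit vectors)
print(labels[int(np.argmax(scores))])  # "a photo of a dog"
```

With real CLIP weights the only change is where `text_emb` and `img_emb` come from; the classification step itself stays this simple.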

InfoNCE Loss

Training is driven by the InfoNCE (Noise-Contrastive Estimation) loss. Given a batch of N matched image–text pairs, the model computes all N×N pairwise cosine similarities. A softmax cross-entropy then pushes each of the N diagonal entries (true matches) up relative to the N−1 off-diagonal entries in its row, and CLIP applies the loss symmetrically over rows (image→text) and columns (text→image). With a learned temperature τ, the image→text direction is:

L = −(1/N) Σᵢ log[ exp(sim(iᵢ, tᵢ)/τ) / Σⱼ exp(sim(iᵢ, tⱼ)/τ) ]

CLIP used batch sizes up to 32,768, flooding the loss with hard negatives and sharpening the learned geometry considerably.
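A minimal numpy version of the loss, including the symmetric image→text / text→image averaging CLIP uses (batch size, dimensions, and embeddings are toy stand-ins):

```python
import numpy as np

# Toy InfoNCE over a small batch, averaging the image->text and text->image
# directions as CLIP does. Embeddings are random stand-ins.
rng = np.random.default_rng(0)
N, D, tau = 4, 32, 0.07

def l2norm(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

img = l2norm(rng.standard_normal((N, D)))
txt = l2norm(img + 0.05 * rng.standard_normal((N, D)))  # matched pairs lie nearby

def info_nce(logits):
    # Row-wise softmax cross-entropy with the diagonal as the target.
    m = logits.max(axis=1, keepdims=True)               # numerical stability
    logp = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(logp)))

logits = img @ txt.T / tau                              # (N, N) similarity matrix
loss = 0.5 * (info_nce(logits) + info_nce(logits.T))    # symmetric CLIP loss
print(round(loss, 4))
```

Because the matched pairs were constructed to lie near each other, the diagonal dominates each row's softmax and the loss is small; shuffling the pairing would make it large.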

The joint embedding space points toward Tab 3 →

Both paradigms here — text-to-text classification and text-to-image retrieval — treat the embedding space as a lookup: find the nearest neighbor and return a label. But what if you wanted to reason freely about the content of an image? Ask "is the dog in the foreground or background?" or "describe what's happening in this scene?" That requires a language model that can attend jointly over visual tokens and text. In Tab 3 you'll see how a single linear projection maps this embedding space into a language model's token vocabulary — making open-ended visual reasoning possible without changing the architectures you've already explored.

References
  1. Radford, A. et al. (2021). Learning transferable visual models from natural language supervision. ICML 2021, PMLR 139.
  2. Chen, T. et al. (2020). A simple framework for contrastive learning of visual representations. ICML 2020, PMLR 119.
  3. Dosovitskiy, A. et al. (2021). An image is worth 16×16 words: Transformers for image recognition at scale. ICLR 2021.

CLIP (2021) proved that a joint image–text embedding space was achievable. The next step — taken by models like LLaVA, BLIP-2, and GPT-4V (2022–2023) — was to ask: what if we fed visual information directly into a language model alongside text?

The key is understanding what a language model actually processes. It never sees discrete words: each text subword is looked up in an embedding table and converted to a continuous vector (say, 4,096-D). Those vectors are what flow through the transformer's attention layers. The linear projection (shown below) maps each ViT patch embedding into that same 4,096-D space — producing continuous vectors with similar statistical properties to text token embeddings.

The projection is the only component trained during multimodal adaptation: with both the ViT and the LLM frozen, its weights are learned via supervised next-word prediction on image–caption pairs. The LLM's generation loss flows back through the frozen LLM and updates the projection until the visual vectors it produces let the LLM generate accurate descriptions. The LLM then attends over the entire concatenated sequence — visual embedding vectors and text embedding vectors — using the same weights, with no awareness of which came from which source. No new architecture is required. Both input streams become the same kind of continuous vector before they reach the transformer.

1. Input image, divided into patches
2. Patch embedding vectors + [CLS] vector output by the ViT
3. Linear Projection — the only component trained during multimodal adaptation
4. Input text query
5. Integer token IDs produced by the tokenizer
6. Embedding Table — learned during LLM pretraining, frozen during multimodal adaptation
Trace the pipeline for a specific query

Select a scenario below to watch the full pipeline animate step by step — from raw image patches through the ViT encoder, linear projection, and joint reasoning in the LLM.

Select a scenario
Pipeline trace

What the linear projection actually does — and how it learns

It is a single learned matrix multiplication: each ViT patch embedding (e.g., 768-D) is multiplied by a (4,096 × 768) weight matrix, producing a 4,096-D vector. These are not "text tokens" in any vocabulary sense — they are continuous vectors in the same high-dimensional space that text token embeddings occupy. The LLM's transformer layers process any sequence of 4,096-D vectors; they cannot tell whether a vector came from the vocabulary lookup table or from the projection.
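The matrix multiply, and the contrast with the text side's vocabulary lookup, can be made concrete. Dimensions follow the text (768-D patches, 4,096-D LLM space); the sequence lengths, token IDs, and 1,000-entry toy vocabulary are illustrative:

```python
import numpy as np

# The projection as a single matrix multiply, followed by the concatenation
# the LLM actually sees. Weights, token IDs, and the 1,000-entry vocabulary
# are toy stand-ins; dimensions follow the text (768 -> 4,096).
rng = np.random.default_rng(0)
D_vit, D_llm = 768, 4096

patch_emb = rng.standard_normal((196, D_vit))        # frozen ViT patch embeddings
W_proj = rng.standard_normal((D_llm, D_vit)) * 0.01  # the trained projection
visual_tokens = patch_emb @ W_proj.T                 # (196, 4096)

# Text path: tokenizer IDs -> frozen embedding-table lookup.
embed_table = rng.standard_normal((1000, D_llm)) * 0.01
token_ids = np.array([5, 17, 391])                   # illustrative token IDs
text_tokens = embed_table[token_ids]                 # (3, 4096)

# One undifferentiated sequence of 4,096-D vectors enters the transformer.
llm_input = np.concatenate([visual_tokens, text_tokens])
print(llm_input.shape)                               # (199, 4096)
```

Nothing in `llm_input` marks which rows came from the projection and which from the lookup table — which is exactly the point.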

The projection is also the only component being trained during multimodal adaptation. The ViT (CLIP pretrained) is frozen. The entire LLM — including its embedding table — is frozen. Backpropagation on image–caption generation tasks sends gradients backward through the frozen LLM layers into the projected visual embeddings, and those gradients update the projection matrix. It learns to produce whatever continuous vectors make the frozen LLM generate accurate text about image content — not "find the nearest vocabulary token," but find the point in 4,096-D space that best serves the downstream task.
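A toy gradient step showing the "only the projection trains" setup. A squared-error loss stands in for the LLM's generation loss, and the 8/16 dimensions stand in for 768/4,096 — everything here is illustrative:

```python
import numpy as np

# One training loop in which only the projection receives updates. A squared-
# error loss is a stand-in for backpropagating through a frozen LLM; all
# dimensions and values are toy.
rng = np.random.default_rng(0)
D_vit, D_llm = 8, 16                  # tiny stand-ins for 768 / 4,096

vit_out = rng.standard_normal(D_vit)  # frozen ViT output: never updated
target = rng.standard_normal(D_llm)   # stand-in for "what the frozen LLM needs"
W = rng.standard_normal((D_llm, D_vit)) * 0.1  # the projection: trainable

for _ in range(200):
    err = W @ vit_out - target        # forward pass through the projection
    grad_W = np.outer(err, vit_out)   # dL/dW for L = 0.5 * ||err||^2
    W -= 0.05 * grad_W                # only W changes; encoders stay frozen

print(float(np.linalg.norm(W @ vit_out - target)))  # residual shrinks toward 0
```

The point of the sketch is the update rule: gradients reach `W` and stop there, so the projection alone learns to place visual vectors where the downstream loss wants them.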

References
  1. Radford, A. et al. (2021). Learning transferable visual models from natural language supervision. ICML 2021, PMLR 139.
  2. Li, J., Li, D., Savarese, S., & Hoi, S. (2023). BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ICML 2023, PMLR 202, 20351–20383.
  3. Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual instruction tuning (LLaVA). NeurIPS 2023 (oral).