
Autoencoder Explorer

Unsupervised representation learning · 2-D bottleneck demo · synthetic handwritten digits
© 2026 Theodore P. Pavlic · MIT License
The "Bowtie" Architecture
Layer Specification (this widget)
Role        Layer     Units  Activation
Input       —         784    —
Encoder     Dense 1   128    ReLU
Encoder     Dense 2   32     ReLU
Bottleneck  Dense 3   2      Linear
Decoder     Dense 4   32     ReLU
Decoder     Dense 5   128    ReLU
Output      Dense 6   784    Sigmoid
Instructional constraint: Forcing the bottleneck to 2 dimensions lets the encoding space be visualized directly as a scatter plot with no secondary dimensionality reduction. In practice autoencoders use much wider bottlenecks (16–512 units). This is analogous to truncating a PCA to exactly 2 principal components: interpretable and plottable, but lossy.
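The layer sizes above pin down the network's parameter count. A quick sanity check in plain Python, assuming standard dense layers (fan_in × fan_out weights plus fan_out biases, no parameters in the activations):

```python
# Parameter count for the 784-128-32-2-32-128-784 "bowtie" specified above.
layer_sizes = [784, 128, 32, 2, 32, 128, 784]

def dense_params(fan_in, fan_out):
    """Weights plus biases for one fully connected layer."""
    return fan_in * fan_out + fan_out

params = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(params)
print(params)  # per-layer counts, encoder first
print(total)   # total trainable parameters
```

The two 784↔128 layers hold nearly all of the roughly 210k parameters, while the bottleneck layer itself contributes only 66.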
Key Concepts

The encoder compresses each 28×28 digit image (784 pixel values) to just 2 numbers, (z₁, z₂) ∈ ℝ². The decoder reconstructs the original 784-pixel image from those two numbers alone. Mean binary cross-entropy between input and reconstruction is the training signal — no class labels are ever provided.
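A minimal numpy sketch of this data flow, using untrained random weights (the layer sizes follow the specification table; the 0.05 weight scale and the random input are arbitrary stand-ins, not the widget's actual initialization or data):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(a, 0.0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One weight matrix and bias vector per dense layer, randomly initialized.
sizes = [784, 128, 32, 2, 32, 128, 784]
Ws = [rng.normal(0, 0.05, (m, n)) for m, n in zip(sizes, sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]

def encode(x):
    h = relu(x @ Ws[0] + bs[0])        # Dense 1: 784 -> 128, ReLU
    h = relu(h @ Ws[1] + bs[1])        # Dense 2: 128 -> 32, ReLU
    return h @ Ws[2] + bs[2]           # Dense 3 (bottleneck): 32 -> 2, linear

def decode(z):
    h = relu(z @ Ws[3] + bs[3])        # Dense 4: 2 -> 32, ReLU
    h = relu(h @ Ws[4] + bs[4])        # Dense 5: 32 -> 128, ReLU
    return sigmoid(h @ Ws[5] + bs[5])  # Dense 6: 128 -> 784, sigmoid

x = rng.random(784)   # stand-in for a flattened 28x28 digit
z = encode(x)         # the whole image compressed to two numbers (z1, z2)
x_hat = decode(z)     # 784-pixel reconstruction from those two numbers alone
```

Training adjusts Ws and bs so that x_hat matches x; the two functions share no state other than those learned parameters.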

After training, the encoder alone is the useful product. The decoder is retained only to interpret what a given code "looks like." The 2-D encoding space can be colored by digit class (labels used only for coloring, not during training) to reveal whether the unsupervised compression captured class structure.

Unsupervised, not self-supervised. All training images are provided upfront; the reconstruction target is the input itself. Self-supervised learning synthesizes pseudo-labels from unlabeled data. Here "auto" in autoencoder means the input serves as its own supervision signal — not that data is generated synthetically.
Relationship to PCA. A linear autoencoder with an n-D bottleneck is mathematically equivalent to truncated PCA retaining n principal components. The nonlinear version here generalizes PCA to curved manifolds in the input space, though individual axes lose interpretability.
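The PCA connection can be checked numerically. A toy sketch (numpy, made-up 10-D data with a planted 2-D structure): encoding and decoding through the top two principal directions is at least as good as any other rank-2 linear bowtie, here compared against a random 2-D subspace.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 200 samples in 10-D whose variance lives mostly in a 2-D subspace.
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10)) \
    + 0.05 * rng.normal(size=(200, 10))
Xc = X - X.mean(axis=0)  # PCA assumes centered data

# PCA via SVD: top-2 right singular vectors span the best-fit 2-D subspace.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:2].T                 # 10x2 matrix: linear "encoder" and "decoder"
X_pca = (Xc @ W) @ W.T       # encode to 2-D, decode back to 10-D
pca_err = np.mean((Xc - X_pca) ** 2)

# A random orthonormal 2-D subspace plays the role of a different
# linear bottleneck; its reconstruction error cannot beat PCA's.
R, _ = np.linalg.qr(rng.normal(size=(10, 2)))
rand_err = np.mean((Xc - (Xc @ R) @ R.T) ** 2)
```

The nonlinear autoencoder in this widget replaces the single matrix W with the stacked ReLU layers, letting the "subspace" bend to follow curved structure.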
Loss Function: Binary Cross-Entropy (BCE)

The decoder's final layer is a sigmoid, so each reconstructed pixel x̂ᵢ ∈ (0,1) is a predicted probability that pixel i is "on." The input pixel xᵢ ∈ [0,1] is its normalized intensity (raw value ÷ 255). These synthetic digit images are mostly black (0) or white (1) with a small amount of added noise, so each xᵢ acts as a soft binary target. BCE measures how well each sigmoid prediction matches its target over all N = 784 pixels (the full 28×28 input and output):

L  =  −(1/N) ∑ᵢ [ xᵢ ln x̂ᵢ  +  (1−xᵢ) ln(1−x̂ᵢ) ]

This is the cross-entropy H(x, x̂) = −x ln x̂ − (1−x) ln(1−x̂) for Bernoulli variables, averaged over all pixels. Shannon entropy H(p) = −p ln p − (1−p) ln(1−p) is the special case where prediction equals target and gives the theoretical minimum loss.

Natural log, not log base 2. Machine learning conventionally uses the natural logarithm (ln, base e) in loss functions because its derivative 1/x simplifies gradient calculations. The unit is nats rather than bits. This is why the baseline loss for a uniform predictor (x̂ ≈ 0.5) is ln 2 ≈ 0.693 nats — not 1 bit. TensorFlow's binaryCrossentropy uses natural log internally.
Baseline & what to watch: A randomly initialized network outputs x̂ ≈ 0.5, giving L ≈ ln 2 ≈ 0.693. As training proceeds the loss should drop well below this. Because most digit pixels are black (≈0), even a network that outputs a small constant near the mean pixel intensity for every pixel achieves L ≈ 0.1 — so a low loss alone is not informative; watch the reconstruction images to judge quality.
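These baseline numbers are easy to verify with a few lines of pure Python (the ~10% "on" pixel fraction below is an assumed stand-in, not measured from the widget's data):

```python
import math

def bce(x, x_hat, eps=1e-12):
    """Mean binary cross-entropy in nats (natural log)."""
    total = 0.0
    for xi, xh in zip(x, x_hat):
        xh = min(max(xh, eps), 1.0 - eps)  # clip so ln(0) never occurs
        total += -(xi * math.log(xh) + (1.0 - xi) * math.log(1.0 - xh))
    return total / len(x)

# A mostly-black "digit": 784 pixels, roughly 10% of them on (assumed fraction).
x = [0.0] * 706 + [1.0] * 78

uniform = bce(x, [0.5] * 784)  # untrained baseline: exactly ln 2 nats
dark = bce(x, [0.1] * 784)     # constant prediction near the mean intensity
```

With every prediction at 0.5 the loss is exactly ln 2 regardless of the targets, while a constant dark prediction already lands well below that baseline — which is why the loss curve must be read alongside the reconstruction images.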
This loss is specific to this task. BCE is appropriate here because pixel values in [0,1] pair naturally with the sigmoid output and admit a Bernoulli probability interpretation. It is not a universal property of autoencoders. For continuous-valued inputs (e.g., sensor readings, spectral data) mean squared error (MSE) is the more natural reconstruction loss; for structured or categorical outputs other losses apply. The autoencoder architecture is agnostic to the choice — it is the nature of the data and the decoder’s output activation that determine which loss makes sense.
Wider Bottlenecks, t-SNE, and UMAP

The 2-D bottleneck in this widget is an instructional choice made so the encoding space can be plotted directly. In practice, autoencoders use bottlenecks of 32–512 dimensions — wide enough to reconstruct faithfully, but far too high-dimensional to visualize directly. This is where t-SNE (t-distributed Stochastic Neighbor Embedding) and UMAP (Uniform Manifold Approximation and Projection) enter. Both are nonlinear dimensionality reduction methods that project a high-dimensional cloud of points into 2-D while approximately preserving local neighborhood structure.

A standard pipeline is: train an autoencoder to a wide bottleneck, then apply t-SNE or UMAP to the encodings for visualization. This works better than applying t-SNE or UMAP directly to raw inputs (e.g., pixels) because the encoder has already removed low-level noise and compressed the data into a representation that emphasizes whatever structure was necessary for reconstruction. t-SNE and UMAP then operate on a much cleaner, lower-dimensional signal.

In this framing, t-SNE or UMAP is a task-specific head attached after the encoder — used at evaluation time to understand what the encoder learned, not during training. The encoder itself remains general-purpose and transferable: the same bottleneck representation can feed a classifier, a retrieval system, an anomaly detector, or a downstream generative model, whereas a direct t-SNE projection of raw data produces only a visualization with no reusable intermediate representation.

Why not just use t-SNE or UMAP directly? You can, and for pure visualization it is often fine. The argument for the autoencoder-first approach is: (1) the encoder is trained with a reconstruction objective that preserves all the information needed to recover the input, not just whatever structure happens to separate neighbors; (2) the learned representation is deterministic and fast to evaluate at inference time — t-SNE is neither; (3) the bottleneck can be reused for tasks beyond visualization without retraining. UMAP is faster and more scalable than t-SNE and does produce an out-of-sample transform, so the gap is narrower there — but the autoencoder's representation still tends to be more structured when the data has clear generative factors.
Architecture Reference: input digit (28×28 = 784 px) alongside its reconstruction (28×28 = 784 px)
Bowtie Bottleneck — 2-D Encoding Space
2 real values (z₁, z₂) · each dot = one encoded training sample · colored by digit class · labels withheld during training
Training Loss (binary cross-entropy)
Decoded from coordinate
What the decoder produces for the clicked (z₁, z₂) coordinate. Near cluster centers the output is a clean digit prototype; between clusters it blends features from neighboring classes — a direct consequence of the continuous compressed representation.