Lesson 3 · The classic makers

VAEs: compress, then recreate

A VAE (Variational Autoencoder) learns to squeeze an image down into a small list of numbers, a compact "idea" called a latent, and then rebuild the image from it. Sample a new idea, decode it, and you get a brand-new image. VAEs are stable to train, but their outputs tend to look a little soft.

An encoder and a decoder

A VAE has two halves. The encoder takes an image and squeezes it into a short list of numbers — the latent — that captures its essence (roughly: what's in it, its layout, its style). The decoder does the reverse: it takes that short list and rebuilds a full image. Train them together and the model learns a compact, meaningful summary of what images look like.
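
To make the two halves concrete, here is a minimal sketch of one pass through an encoder and a decoder. It is not a trained model: the weights are random, the "image" is just 16 made-up pixel values, and every name in it (encode, sample_latent, decode, the sizes) is an assumption chosen for illustration. One detail it adds: in a VAE the encoder usually predicts a centre and a spread for each latent number, and the latent is sampled from that range rather than read off as a single fixed value.

```python
import math
import random

random.seed(0)

IMAGE_SIZE = 16   # a flattened 4x4 grayscale "image" (illustrative size)
LATENT_SIZE = 2   # the short list of numbers: the latent


def random_matrix(rows, cols):
    """Small random weights; a real VAE would learn these during training."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Untrained, hypothetical weights for the two halves.
W_MEAN = random_matrix(LATENT_SIZE, IMAGE_SIZE)
W_LOGVAR = random_matrix(LATENT_SIZE, IMAGE_SIZE)
W_DECODER = random_matrix(IMAGE_SIZE, LATENT_SIZE)


def encode(image):
    """Encoder: image -> centre (mean) and spread (log-variance) of the latent."""
    return matvec(W_MEAN, image), matvec(W_LOGVAR, image)


def sample_latent(mean, logvar):
    """Pick a latent near the mean: z = mean + std * noise."""
    return [m + math.exp(0.5 * lv) * random.gauss(0.0, 1.0)
            for m, lv in zip(mean, logvar)]


def decode(latent):
    """Decoder: latent -> image, with a sigmoid so each pixel lands in [0, 1]."""
    return [1.0 / (1.0 + math.exp(-v)) for v in matvec(W_DECODER, latent)]


image = [random.random() for _ in range(IMAGE_SIZE)]
mean, logvar = encode(image)
latent = sample_latent(mean, logvar)
reconstruction = decode(latent)

print(len(image), "->", len(latent), "->", len(reconstruction))  # 16 -> 2 -> 16
```

Because nothing here is trained, the reconstruction will not resemble the input. Training is what pushes the decoder's output toward the original image while keeping the latents well organised.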

Like describing a face to a sketch artist

You can't send the whole photo, so you describe it in a few words — round face, curly hair, glasses. That short description is the latent. A good sketch artist (the decoder) turns those few words back into a face. Change the description a little — straight hair instead of curly — and you get a new but plausible face. That's how a VAE makes new images: tweak the idea, decode it.
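
The "tweak the idea, decode it" step can be sketched in a few lines. This is a toy under stated assumptions: the decoder weights are hand-picked rather than learned, the two starting latents are invented stand-ins for "curly hair" and "straight hair", and each "image" is only four pixels. It walks from one latent to the other in small steps and decodes each stop along the way.

```python
import math

# Hypothetical, hand-picked decoder weights: 2 latent numbers -> 4 pixels.
DECODER_WEIGHTS = [
    [1.5, -0.5],
    [-1.0, 2.0],
    [0.5, 0.5],
    [2.0, -1.5],
]


def decode(latent):
    """Toy decoder: weighted sum per pixel, squashed into [0, 1] by a sigmoid."""
    pixels = []
    for row in DECODER_WEIGHTS:
        activation = sum(w * z for w, z in zip(row, latent))
        pixels.append(1.0 / (1.0 + math.exp(-activation)))
    return pixels


def interpolate(start, end, t):
    """Blend two latents: t=0 gives start, t=1 gives end."""
    return [(1 - t) * a + t * b for a, b in zip(start, end)]


curly = [1.0, -1.0]      # invented latent standing in for "curly hair"
straight = [-1.0, 1.0]   # invented latent standing in for "straight hair"

steps = [interpolate(curly, straight, t / 4) for t in range(5)]
images = [decode(z) for z in steps]

for z, img in zip(steps, images):
    print([round(v, 2) for v in z], "->", [round(p, 2) for p in img])
```

Each small move in the latent produces a small change in the pixels, which is the behaviour a trained VAE aims for across its whole latent space.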

Why VAEs still matter

  • They give you a smooth, organised space of ideas — nearby latents make similar images.
  • They're stable to train (no two-network fight like GANs).
  • Their images can look a bit blurry — but that compress-to-a-latent trick is exactly what makes modern diffusion fast (next lessons).
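
To get a feel for why working in a latent is cheaper than working in pixels, here is some rough arithmetic. The sizes are an assumption taken from the figures commonly quoted for Stable Diffusion v1 (a 512x512 RGB image and a 64x64 latent with 4 channels); other models use different shapes. The helper name value_count is made up for this sketch.

```python
def value_count(height, width, channels):
    """How many numbers it takes to store a grid of this shape."""
    return height * width * channels


# Assumed sizes, as commonly quoted for Stable Diffusion v1.
pixel_values = value_count(512, 512, 3)   # full-resolution RGB image
latent_values = value_count(64, 64, 4)    # compressed latent

ratio = pixel_values / latent_values

print(pixel_values, "pixel values vs", latent_values, "latent values")
print("The latent is", ratio, "times smaller")  # 48.0 times smaller
```

Under those assumed sizes, every step the model takes in latent space touches about 48 times fewer numbers than the same step would in pixel space.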

Keep this in your pocket

The VAE's "work in a small latent space instead of full pixels" idea comes back in a big way when we reach Stable Diffusion. Now let's meet the technique behind today's best image tools: diffusion.

Encoder squeezes an image into a latent; decoder rebuilds it. New latent → new image.