What is a VAE? Compress-and-Recreate, Explained

An encoder and a decoder

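A variational autoencoder (VAE) has two halves. The encoder takes an image and squeezes it into a short list of numbers, the latent, that captures its essence (roughly: what's in it, its layout, its style). Strictly speaking, the encoder describes a small region of likely latents rather than one exact point; that is the "variational" part. The decoder does the reverse: it takes that short list and rebuilds a full image. Train them together, scoring the decoder on how closely its rebuilt image matches the original, and the model learns a compact, meaningful summary of what images look like.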
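
To make the two halves concrete, here is a minimal sketch in Python with NumPy. It is not a trained model: the weights are random stand-ins, the image size (28x28) and latent size (8) are assumptions chosen for illustration, and the names `encode`, `sample_latent` and `decode` are hypothetical. The only point is the shape of the data as it flows through: a long list of pixels goes in, a short latent comes out of the encoder, and the decoder expands it back to full size.

```python
import numpy as np

rng = np.random.default_rng(0)

IMAGE_SIZE = 28 * 28   # a flattened 28x28 grayscale image (assumed size)
LATENT_SIZE = 8        # the "short list of numbers" (assumed size)

# Untrained, random weights: stand-ins for what training would learn.
W_mu = rng.normal(0.0, 0.01, (LATENT_SIZE, IMAGE_SIZE))
W_logvar = rng.normal(0.0, 0.01, (LATENT_SIZE, IMAGE_SIZE))
W_dec = rng.normal(0.0, 0.01, (IMAGE_SIZE, LATENT_SIZE))

def encode(image):
    """Map an image to the mean and log-variance of its latent."""
    return W_mu @ image, W_logvar @ image

def sample_latent(mu, logvar):
    """Draw one latent near mu; the spread is set by the log-variance."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(latent):
    """Rebuild a full-size image; the sigmoid keeps pixels in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(W_dec @ latent)))

image = rng.random(IMAGE_SIZE)        # a fake image, just for shapes
mu, logvar = encode(image)
z = sample_latent(mu, logvar)
reconstruction = decode(z)

print(image.shape, z.shape, reconstruction.shape)
```

With these assumed sizes, 784 pixel values pass through a latent of 8 numbers. Because the weights are random, the reconstruction here is meaningless; training is what would make it resemble the input.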

Like describing a face to a sketch artist

You can't send the whole photo, so you describe it in a few words — round face, curly hair, glasses. That short description is the latent. A good sketch artist (the decoder) turns those few words back into a face. Change the description a little — straight hair instead of curly — and you get a new but plausible face. That's how a VAE makes new images: tweak the idea, decode it.
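
The "tweak the idea, decode it" step can be sketched in a few lines of Python with NumPy. Everything here is an assumption made for illustration: the decoder is a fixed random linear map followed by a sigmoid (a hypothetical stand-in for a trained one), and the sizes are toy values. The sketch nudges one number in the latent by a small and a large amount, then measures how far the decoded image moves in each case.

```python
import numpy as np

rng = np.random.default_rng(1)
LATENT_SIZE, IMAGE_SIZE = 8, 28 * 28   # assumed toy sizes

# Hypothetical stand-in for a trained decoder: a fixed random linear map.
W_dec = rng.normal(0.0, 1.0, (IMAGE_SIZE, LATENT_SIZE))

def decode(latent):
    """Turn a latent into an image with pixel values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(W_dec @ latent)))

z = rng.standard_normal(LATENT_SIZE)   # the original "description"

z_small = z.copy()
z_small[0] += 0.1                      # a small edit to one number

z_big = z.copy()
z_big[0] += 2.0                        # a much larger edit to the same number

base = decode(z)
small_change = np.abs(decode(z_small) - base).mean()
big_change = np.abs(decode(z_big) - base).mean()
print("mean pixel change, small edit:", small_change)
print("mean pixel change, large edit:", big_change)

# A brand-new image: draw a fresh latent and decode it.
new_image = decode(rng.standard_normal(LATENT_SIZE))
```

In this toy decoder a small edit moves the image only slightly and a large edit moves it further, which is the behaviour a smooth latent space is meant to give. A real VAE would be trained so that those small moves also stay plausible as images.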

Why VAEs still matter

They give you a smooth, organised space of ideas: nearby latents decode to similar images.
They're stable to train: the encoder and decoder work toward the same goal, unlike the two competing networks in a GAN.
Their images can look a bit blurry, but the compress-to-a-latent trick is exactly what makes modern diffusion models fast (next lessons).

Keep this in your pocket

The VAE's "work in a small latent space instead of full pixels" idea comes back in a big way when we reach Stable Diffusion. Now let's meet the technique behind today's best image tools: diffusion.
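
To get a feel for how much smaller the latent space is, here is a quick back-of-the-envelope calculation in Python. The figures are assumptions not stated in this lesson: they are the ones commonly quoted for Stable Diffusion v1, where a 512x512 RGB image is encoded to a latent that is 8 times smaller along each side and has 4 channels. The helper `element_count` is a hypothetical name used only for this sketch.

```python
def element_count(height, width, channels):
    """Number of values needed to store a height x width x channels grid."""
    return height * width * channels

# Assumed figures for Stable Diffusion v1 (not taken from this lesson).
IMAGE_SIDE = 512
DOWNSAMPLE = 8
LATENT_CHANNELS = 4

pixel_values = element_count(IMAGE_SIDE, IMAGE_SIDE, 3)
latent_values = element_count(
    IMAGE_SIDE // DOWNSAMPLE, IMAGE_SIDE // DOWNSAMPLE, LATENT_CHANNELS
)
ratio = pixel_values / latent_values

print(pixel_values, latent_values, ratio)  # 786432 16384 48.0
```

Under those assumptions, the diffusion model handles about 48 times fewer values per image than it would at full pixel resolution.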

Encoder squeezes an image into a latent; decoder rebuilds it. New latent → new image.