An encoder and a decoder
A VAE (variational autoencoder) has two halves. The encoder takes an image and squeezes it into a short list of numbers, called the latent, that captures its essence (roughly: what's in it, its layout, its style). The decoder does the reverse: it takes that short list and rebuilds a full image. Train them together, scoring the decoder on how closely its rebuilt image matches the original, and the model learns a compact, meaningful summary of what images look like.
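To make the two halves concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the sizes, the helper names, and the random untrained weights (a real VAE learns its weights and uses much deeper networks). It also shows one detail the description above skips: a VAE's encoder usually outputs a mean and a spread for each latent number, and the latent is sampled from that range.

```python
import math
import random

random.seed(0)

IMAGE_SIZE = 64   # assumed toy size: a flattened 8x8 grayscale image
LATENT_SIZE = 4   # the "short list of numbers"

def random_matrix(rows, cols):
    """Untrained stand-in weights; a real VAE learns these during training."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

W_mean = random_matrix(LATENT_SIZE, IMAGE_SIZE)
W_logvar = random_matrix(LATENT_SIZE, IMAGE_SIZE)
W_dec = random_matrix(IMAGE_SIZE, LATENT_SIZE)

def encode(image):
    """Squeeze an image into a mean and a log-variance per latent number."""
    return matvec(W_mean, image), matvec(W_logvar, image)

def sample_latent(mean, logvar):
    """Draw a latent near the mean (the 'variational' part of a VAE)."""
    return [m + math.exp(0.5 * lv) * random.gauss(0.0, 1.0)
            for m, lv in zip(mean, logvar)]

def decode(latent):
    """Rebuild a full-size image, squashing each pixel into the range (0, 1)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in matvec(W_dec, latent)]

image = [random.random() for _ in range(IMAGE_SIZE)]
mean, logvar = encode(image)
latent = sample_latent(mean, logvar)
rebuilt = decode(latent)

print(len(image), len(latent), len(rebuilt))  # 64 4 64
```

The point is the shapes: 64 numbers go in, 4 come out of the encoder, and the decoder expands those 4 back into 64. With untrained weights the rebuilt image is meaningless; training is what makes it resemble the input.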
Like describing a face to a sketch artist
You can't send the whole photo, so you describe it in a few words — round face, curly hair, glasses. That short description is the latent. A good sketch artist (the decoder) turns those few words back into a face. Change the description a little — straight hair instead of curly — and you get a new but plausible face. That's how a VAE makes new images: tweak the idea, decode it.
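The "tweak the idea, decode it" step can be sketched in a few lines. This is a toy under stated assumptions: the decoder below is a random, untrained stand-in, and the three latent numbers and the helper names (`tweak`, `image_distance`) are hypothetical. It only illustrates the mechanics: nudge one number in the description, decode again, and compare.

```python
import math
import random

random.seed(1)

LATENT_SIZE = 3   # hypothetical meanings: face shape, hair, glasses
IMAGE_SIZE = 16   # assumed toy image: 16 pixel values

# Untrained stand-in for a decoder; a real one is learned from data.
W_dec = [[random.gauss(0.0, 0.5) for _ in range(LATENT_SIZE)]
         for _ in range(IMAGE_SIZE)]

def decode(latent):
    """Turn a short description into pixel values in the range (0, 1)."""
    return [1.0 / (1.0 + math.exp(-sum(w * z for w, z in zip(row, latent))))
            for row in W_dec]

def tweak(latent, index, amount):
    """Return a copy of the latent with one number nudged by `amount`."""
    changed = list(latent)
    changed[index] += amount
    return changed

def image_distance(a, b):
    """Euclidean distance between two images (0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

description = [0.5, -1.0, 0.2]
original = decode(description)

# Nudge the "hair" number a little, then a lot, and decode each version.
small_change = image_distance(original, decode(tweak(description, 1, 0.1)))
large_change = image_distance(original, decode(tweak(description, 1, 2.0)))

print(small_change < large_change)
```

A small nudge to the description moves the decoded image only slightly, while a large nudge moves it further. That is the behaviour the sketch-artist analogy describes: similar descriptions give similar faces.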
Why VAEs still matter
- They give you a smooth, organised space of ideas — nearby latents make similar images.
- They're stable to train (no two-network fight like GANs).
- Their images can look a bit blurry, but the compress-to-a-latent trick is exactly what lets modern diffusion models run fast (covered in the next lessons).
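A quick back-of-the-envelope calculation shows why working in a latent is so much cheaper. The shapes below are an assumption based on the commonly cited Stable Diffusion v1 setup (a 512x512 RGB image encoded to a 64x64 latent with 4 channels); other models use different sizes.

```python
# Assumed shapes: Stable Diffusion v1 at its default 512x512 resolution.
image_values = 512 * 512 * 3    # height x width x RGB channels
latent_values = 64 * 64 * 4     # latent height x width x latent channels

ratio = image_values // latent_values

print(image_values)   # 786432
print(latent_values)  # 16384
print(ratio)          # 48
```

Under these assumptions the model handles about 48 times fewer numbers per image when it works in the latent instead of on raw pixels.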
Keep this in your pocket
The VAE's "work in a small latent space instead of full pixels" idea comes back in a big way when we reach Stable Diffusion. Now let's meet the technique behind today's best image tools: diffusion.