Generative AI, explained visually

Generative AI is software that makes brand-new images, text, and sound by learning the patterns in millions of examples. This free visual course teaches how it works from scratch — GANs, VAEs, diffusion (the tech behind Midjourney and DALL·E), multimodal AI, and text-to-image — in nine short animated lessons. No math, no jargon.

9 lessons · 34 min to read · 100% free
Lesson 1 · Foundations

What is a generative model?

A generative model is AI that creates brand-new things — images, text, music — instead of just sorting or labelling what already exists. It studies millions of examples until it learns their pattern well enough to make new ones that look like they belong.

Two kinds of AI: sorters and makers

Most AI you've met is a sorter: it looks at a photo and says "cat or dog?", or reads a review and says "happy or angry?". A generative model is a maker: give it the idea "a cat" and it produces a brand-new cat picture that never existed before. Same raw ingredient — data — but one labels it and the other creates from it.

Learning a style, not memorising

Think of an art student who studies thousands of Van Gogh paintings. They don't memorise one painting to copy it — they absorb the style: the swirls, the colours, the brushwork. Afterwards they can paint a new scene Van Gogh never painted, in his style. A generative model does the same with its training data: it learns the pattern, then makes new work that fits it.

The big idea

  • Generative models learn the underlying pattern of their data, then sample new examples from it.
  • They work for many kinds of data: images (Midjourney, DALL·E), text (ChatGPT), audio, and video.
  • This course covers the main families: GANs, VAEs, and — the one behind today's image tools — diffusion.
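
As a minimal sketch of "learn the pattern, then sample from it", the Python below fits the simplest possible pattern, a bell curve described by a mean and a spread, to a handful of made-up heights and then draws new ones. The data and the names `learn_pattern` and `sample_new` are assumptions for illustration; real generative models learn far richer patterns with neural networks.

```python
import random
import statistics

def learn_pattern(examples):
    """'Training': summarise the data as a mean and a standard deviation."""
    return statistics.mean(examples), statistics.stdev(examples)

def sample_new(pattern, count, seed=0):
    """'Generation': draw brand-new values that fit the learned pattern."""
    mean, spread = pattern
    rng = random.Random(seed)
    return [rng.gauss(mean, spread) for _ in range(count)]

# Made-up training data: five heights in centimetres (an assumption for illustration).
heights = [160.0, 165.0, 170.0, 175.0, 180.0]
pattern = learn_pattern(heights)
new_heights = sample_new(pattern, count=1000)
```

The new heights are not copies of the five training values, yet they cluster around the same centre. That is the sorter-versus-maker difference in miniature.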

Where we're headed

Over the next lessons we'll meet three ways to build a generative model — GANs, VAEs, and diffusion — then see how they team up with language to turn your words into pictures. Let's start with the most competitive one: GANs.

A sorter labels existing data; a generative model makes new data.
Lesson 2 · The classic makers

GANs: a forger vs a detective

A GAN (Generative Adversarial Network) trains two networks against each other: a generator (the forger) that makes fakes and a discriminator (the detective) that judges real from fake. As the detective gets sharper, the forger has to get better, until the fakes look real.

Two networks, one contest

Picture a forger trying to paint fake money and a detective trying to spot it. Every round, the forger makes a batch of fakes; the detective judges each one real or fake and gets told the truth. The forger uses that feedback to fool the detective next time; the detective uses it to catch sharper fakes. Round after round, both improve — and the forger ends up making fakes good enough to pass.

One training round

  1. The generator turns random noise into a fake image.
  2. The discriminator sees a mix of real and fake images and guesses which is which.
  3. Both learn from the score: the generator to fool better, the discriminator to catch better.
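
The three steps above can be sketched with a deliberately tiny GAN that forges single numbers instead of images. Everything here is an assumption made for illustration: the "real" data is numbers near 4, the generator is a straight line (`a * noise + b`), the discriminator is one logistic unit, and the gradients are written out by hand. It shows the shape of a training round, not how production GANs are built.

```python
import math
import random

def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def train_toy_gan(rounds=500, batch=16, lr=0.05, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator: fake = a * noise + b
    w, c = 0.0, 0.0   # discriminator: P(real) = sigmoid(w * x + c)
    for _ in range(rounds):
        noise = [rng.gauss(0, 1) for _ in range(batch)]
        fakes = [a * z + b for z in noise]                # step 1: noise -> fake
        reals = [rng.gauss(4, 1) for _ in range(batch)]   # "real" data: numbers near 4
        p_real = [sigmoid(w * x + c) for x in reals]      # step 2: the detective guesses
        p_fake = [sigmoid(w * x + c) for x in fakes]
        # Step 3a: discriminator gradients of -log D(real) - log(1 - D(fake)).
        grad_w = (sum((p - 1) * x for p, x in zip(p_real, reals))
                  + sum(p * x for p, x in zip(p_fake, fakes)))
        grad_c = sum(p - 1 for p in p_real) + sum(p_fake)
        # Step 3b: generator gradients of -log D(fake), using the pre-update w.
        grad_a = sum((p - 1) * w * z for p, z in zip(p_fake, noise))
        grad_b = sum((p - 1) * w for p in p_fake)
        w -= lr * grad_w / batch
        c -= lr * grad_c / batch
        a -= lr * grad_a / batch
        b -= lr * grad_b / batch
    return {"a": a, "b": b, "w": w, "c": c}

params = train_toy_gan()
```

Nothing guarantees this tug-of-war settles: the two sets of numbers can keep chasing each other, a small taste of why GAN training is considered delicate.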

Fast, sharp — but temperamental

GANs were the state of the art for years and produce crisp images fast: one pass through the generator, with no step-by-step refinement. But the two-network fight is delicate. If the generator finds one type of fake that always fools the detective, it makes only that, a failure called mode collapse (imagine the forger only ever painting the same banknote). Getting the balance right is finicky, which is one reason diffusion later took over.

Try it

Below, step the forger-vs-detective game and watch both scores climb as they learn from each other.

Interactive: step the game. The forger makes fakes, the detective judges; watch both improve.
Lesson 3 · The classic makers

VAEs: compress, then recreate

A VAE (Variational Autoencoder) learns to squeeze an image down into a small list of numbers, a compact "idea" called a latent, and then rebuild the image from it. Sample a new idea, decode it, and you get a brand-new image. VAEs are stable to train, but their outputs tend to look a little soft.

An encoder and a decoder

A VAE has two halves. The encoder takes an image and squeezes it into a short list of numbers — the latent — that captures its essence (roughly: what's in it, its layout, its style). The decoder does the reverse: it takes that short list and rebuilds a full image. Train them together and the model learns a compact, meaningful summary of what images look like.

Like describing a face to a sketch artist

You can't send the whole photo, so you describe it in a few words — round face, curly hair, glasses. That short description is the latent. A good sketch artist (the decoder) turns those few words back into a face. Change the description a little — straight hair instead of curly — and you get a new but plausible face. That's how a VAE makes new images: tweak the idea, decode it.

Why VAEs still matter

  • They give you a smooth, organised space of ideas — nearby latents make similar images.
  • They're stable to train (no two-network fight like GANs).
  • Their images can look a bit blurry — but that compress-to-a-latent trick is exactly what makes modern diffusion fast (next lessons).
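
To make "nearby latents make similar images" concrete, here is a toy encoder and decoder for a 4-pixel image. It is a loud simplification: the two "idea" directions are picked by hand rather than learned, and the maps are plain linear ones, so this is closer to an ordinary autoencoder than a real VAE. All names and numbers are assumptions for illustration.

```python
import math
import random

# Two hand-picked "idea" directions for a 4-pixel image (assumed, not learned).
BRIGHTNESS = [0.5, 0.5, 0.5, 0.5]    # overall brightness
CONTRAST = [0.5, 0.5, -0.5, -0.5]    # first two pixels versus the last two

def encode(image):
    """Squeeze 4 pixels into a 2-number latent."""
    return [sum(p * d for p, d in zip(image, direction))
            for direction in (BRIGHTNESS, CONTRAST)]

def decode(latent):
    """Rebuild 4 pixels from a 2-number latent."""
    return [latent[0] * b + latent[1] * c for b, c in zip(BRIGHTNESS, CONTRAST)]

def distance(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

image = [1.0, 1.0, 0.0, 0.0]
latent = encode(image)                           # [1.0, 1.0]
rebuilt = decode(latent)                         # matches the original here
nearby = decode([latent[0] + 0.1, latent[1]])    # small tweak to the idea

rng = random.Random(0)
brand_new = decode([rng.gauss(0, 1), rng.gauss(0, 1)])  # sample a fresh latent, decode it
```

The tweaked latent decodes to an image only 0.1 away from the original, while a freshly sampled latent decodes to something new.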

Keep this in your pocket

The VAE's "work in a small latent space instead of full pixels" idea comes back in a big way when we reach Stable Diffusion. Now let's meet the technique behind today's best image tools: diffusion.
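
A quick back-of-the-envelope calculation shows why the latent trick matters. The figures below are assumptions based on Stable Diffusion v1 at 512×512 resolution, where the VAE shrinks each side by a factor of 8 and keeps 4 latent channels; other models and resolutions differ.

```python
# Assumed figures for Stable Diffusion v1 at 512x512 resolution.
image_height, image_width, colour_channels = 512, 512, 3
downsample_factor = 8     # each latent cell summarises an 8x8 block of pixels
latent_channels = 4

pixel_values = image_height * image_width * colour_channels
latent_values = ((image_height // downsample_factor)
                 * (image_width // downsample_factor)
                 * latent_channels)
savings = pixel_values / latent_values

print(pixel_values, latent_values, savings)  # 786432 16384 48.0
```

Under these assumptions the model works on 48 times fewer numbers than it would in full pixel space.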

Encoder squeezes an image into a latent; decoder rebuilds it. New latent → new image.
Lesson 4 · Diffusion

How diffusion models work

Diffusion models learn by destroying, then rebuilding. In training, they add noise to a real image bit by bit until it's pure static — then learn to reverse each step. To make a new image, they start from pure noise and undo it, guided toward what you asked for.

Learn to reverse the mess

Diffusion has a clever training trick. Take a real photo and add a little random static, then a little more, then more — after enough steps it's pure noise, like an untuned TV. At each step the model is shown "here's the noisier version; what noise did I just add?" By learning to predict the noise, it learns how to remove it — how to walk the process backwards.

The two directions

  1. Forward (training only): start from a real image, add noise step by step until it's pure static.
  2. Reverse (generation): start from pure static, remove a little noise each step, and a new image appears.
  3. The model never memorises photos — it learns the general skill of turning noise into something realistic.
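
The forward direction is simple enough to sketch in a few lines. The code below assumes a constant noise amount per step (`BETA = 0.02`) and a made-up 4-pixel "photo"; real models usually vary the amount from step to step. It also builds one training example, the noisy image together with the noise that was added, and checks that knowing the noise is enough to recover the original.

```python
import math
import random

BETA = 0.02  # share of noise mixed in per step (assumed constant for simplicity)

def add_noise_step(x, rng):
    """One forward step: shrink the signal slightly and mix in fresh static."""
    return [math.sqrt(1 - BETA) * v + math.sqrt(BETA) * rng.gauss(0, 1) for v in x]

def signal_left(steps):
    """Fraction of the original image still present after `steps` forward steps."""
    return (1 - BETA) ** (steps / 2)

def training_example(image, steps, rng):
    """Jump straight to a given step; return the noisy image and the noise added."""
    keep = signal_left(steps)
    noise = [rng.gauss(0, 1) for _ in image]
    noisy = [keep * v + math.sqrt(1 - keep ** 2) * n for v, n in zip(image, noise)]
    return noisy, noise

rng = random.Random(0)
image = [0.9, 0.1, 0.5, 0.7]      # a made-up 4-pixel "photo"
x = image
for _ in range(500):              # forward process, one small step at a time
    x = add_noise_step(x, rng)

noisy, noise = training_example(image, steps=50, rng=rng)
keep = signal_left(50)
# Knowing the noise is enough to undo it, which is why the model learns to predict it.
recovered = [(v - math.sqrt(1 - keep ** 2) * n) / keep for v, n in zip(noisy, noise)]
```

After 500 steps at this setting, less than 1% of the original image remains in `x`; the rest is static.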

A sculptor and a block of marble

A sculptor starts with a rough block and chips away, a little at a time, until a statue emerges. Diffusion starts with a block of noise and "chips away" the randomness step by step until an image emerges. It never carves the same statue twice — start from different noise and you get a different picture.

One key detail

The model doesn't predict the finished image in one shot — it predicts the small bit of noise to remove right now, and repeats. That patience is why diffusion images look so good. Next: watch that denoising happen, step by step.

Forward adds noise (training); reverse removes it, step by step (generation).
Lesson 5 · Diffusion

Denoising, step by step

Generating an image is a series of small denoising steps. Starting from pure noise, the model predicts a little noise to remove, removes it, and repeats, anywhere from about 20 to 1,000 times depending on the sampler, until a clear, detailed image emerges. Fewer steps run faster; more steps tend to give a sharper result.

Many small steps beat one big guess

Instead of trying to paint the whole image at once, diffusion takes many gentle passes. Step one turns pure noise into a slightly-less-noisy blur. Step two cleans it a little more. By the last step the picture is sharp. Each step only has to make a small improvement, which is much easier than nailing the whole thing in one go — and it's why the results are so detailed.

What happens each step

  1. The model looks at the current noisy image and predicts the noise in it.
  2. It removes a portion of that predicted noise, revealing a slightly clearer image.
  3. Repeat: each pass sharpens the picture until the final step produces the finished image.

The speed–quality dial

The number of steps is a dial you can turn. A few dozen steps is fast and usually good enough; hundreds of steps squeeze out extra detail but take longer. Newer samplers get great results in far fewer steps — which is why image tools that once took a minute now feel almost instant.
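
Here is a toy version of that loop and its dial. The big assumption: instead of a trained network, the "model" is an oracle that already knows the clean image, so its noise prediction is always perfect. The function names and the 20% portion removed per step are also made up for illustration. Real samplers use more careful update rules, but the loop has the same shape.

```python
import random

def toy_denoise(clean, num_steps, portion=0.2, seed=0):
    """Start from pure noise and strip away a portion of the predicted noise each step."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in clean]    # pure static
    for _ in range(num_steps):
        # Stand-in for the trained network: an oracle that knows the clean image.
        predicted_noise = [xi - ci for xi, ci in zip(x, clean)]
        x = [xi - portion * ni for xi, ni in zip(x, predicted_noise)]
    return x

def worst_error(result, clean):
    """Largest per-pixel gap between the result and the clean image."""
    return max(abs(r - c) for r, c in zip(result, clean))

clean = [0.9, 0.1, 0.5, 0.7]    # a made-up 4-pixel target
few = worst_error(toy_denoise(clean, num_steps=5), clean)
many = worst_error(toy_denoise(clean, num_steps=50), clean)
```

In this toy, 5 steps still leave about a third of the starting noise behind, while 50 steps leave almost none: the speed-quality dial in miniature.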

Try it

Drag the slider below from pure noise to a finished image and watch the denoising happen — that's exactly what your prompt triggers behind the scenes.

Interactive: drag the timestep from pure noise (left) to a clear image (right).
Lesson 6 · Multimodal

What is multimodal AI?

Multimodal AI understands more than one kind of input at once — most often text and images together (sometimes audio and video too). It works by turning a picture into "image tokens" that a language model can read right alongside your words.

One model, many senses

A text-only model is like a brilliant friend on the phone — great with words, but blind to what you're pointing at. A multimodal model can also see. Show it a photo and ask a question about it, or hand it a chart and ask what it means, and it answers using both the picture and your words together — the way people naturally do.

How an image gets 'read'

  1. A vision encoder looks at the image and turns it into a sequence of image tokens, each one a list of numbers that typically describes a small patch of the picture.
  2. Those image tokens are placed next to your text tokens in the same input.
  3. The language model reasons over both at once — so it can answer questions that mix seeing and reading.
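
The first two steps can be sketched with a toy "vision encoder" that simply cuts a tiny grayscale image into patches and uses each patch's raw pixels as a token. This is a heavy simplification: real encoders pass patches through a neural network, and the text embeddings in `TEXT_EMBEDDINGS` are invented numbers, not learned ones. The point is only that both kinds of token end up in one sequence.

```python
def image_to_tokens(image, patch=2):
    """Toy 'vision encoder': cut a square grayscale image into patches, one token each."""
    size = len(image)
    tokens = []
    for top in range(0, size, patch):
        for left in range(0, size, patch):
            tokens.append([image[top + dy][left + dx]
                           for dy in range(patch) for dx in range(patch)])
    return tokens

# Hypothetical 4-number text embeddings; a real model learns these.
TEXT_EMBEDDINGS = {
    "what": [0.1, 0.3, 0.0, 0.2],
    "is": [0.0, 0.1, 0.4, 0.1],
    "this": [0.2, 0.0, 0.1, 0.3],
}

def text_to_tokens(prompt):
    """Look up one embedding per word of the prompt."""
    return [TEXT_EMBEDDINGS[word] for word in prompt.lower().split()]

image = [
    [0.0, 0.1, 0.9, 1.0],
    [0.1, 0.0, 1.0, 0.9],
    [0.5, 0.5, 0.2, 0.2],
    [0.5, 0.5, 0.2, 0.2],
]
# Image tokens sit next to text tokens in one input sequence.
sequence = image_to_tokens(image) + text_to_tokens("What is this")
```

The result is a single list of seven same-sized tokens, four from the picture and three from the question, which is what the language model then reasons over.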

Where you've already used it

Modern assistants (as of 2026, for example GPT-4o, Claude, and Gemini) are multimodal: paste a screenshot and ask for the code, photograph a receipt and ask for the total, or share a diagram and ask what's wrong. Under the hood it's the same trick: images become tokens the model can think about alongside language.

The piece that makes it click

For a model to connect "this picture" with "these words," both need to live in one shared space of meaning. The model that pioneered that is CLIP — next lesson.

A vision encoder turns the image into tokens the language model reads with your text.
Lesson 7 · Multimodal

CLIP: one shared space for words and pictures

CLIP is a model trained on hundreds of millions of image-and-caption pairs so that a picture and the words describing it land close together in one shared space. That lets AI match an image to text, search photos by description, and steer image generators, all with the same closeness idea behind semantic search.

Teach pictures and words to agree

CLIP (introduced by Radford and colleagues at OpenAI in 2021) is shown about 400 million image-caption pairs: a photo of a beach with the caption "a sunny beach," and so on. It's trained with a simple rule: pull each image and its true caption close together in a shared space, and push mismatched pairs apart. After enough pairs, a photo and its correct description end up as nearby points, while unrelated photos and captions sit far apart.
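
The "pull together, push apart" rule can be written as a score that training tries to make small. The sketch below is one common way to express it, a symmetric contrastive loss over cosine similarities; the 2-number embeddings and the `temperature=0.07` setting are assumptions for illustration, and this is not presented as CLIP's exact training code.

```python
import math

def cosine(u, v):
    """Closeness of two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cross_entropy(logits, correct_index):
    """-log softmax(logits)[correct_index], computed stably."""
    top = max(logits)
    log_total = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_total - logits[correct_index]

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Low when image i is closest to caption i and far from every other caption."""
    n = len(image_vecs)
    sims = [[cosine(img, txt) / temperature for txt in text_vecs] for img in image_vecs]
    image_to_text = sum(cross_entropy(sims[i], i) for i in range(n))
    text_to_image = sum(cross_entropy([sims[r][i] for r in range(n)], i) for i in range(n))
    return (image_to_text + text_to_image) / (2 * n)

images = [[1.0, 0.0], [0.0, 1.0]]               # made-up image embeddings
matching_captions = [[0.9, 0.1], [0.1, 0.9]]    # caption i sits near image i
swapped_captions = [[0.1, 0.9], [0.9, 0.1]]     # captions paired with the wrong image

good = contrastive_loss(images, matching_captions)
bad = contrastive_loss(images, swapped_captions)
```

Correctly paired embeddings give a loss close to zero, while swapped pairs give a much larger one, so lowering this number is what moves true pairs together.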

Why one shared space is so powerful

Once images and text live in the same space, closeness means "these match." So you can hand CLIP a photo and a few candidate captions and it picks the closest one; or type a description and it finds the nearest photos. It's the exact same measure-the-closeness trick from semantic search — just spanning pictures and words instead of only text.
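
Picking the closest caption takes only a few lines once the embeddings exist. In the sketch below the 3-number embeddings are invented by hand; a real system would get them from CLIP's image and text encoders.

```python
import math

def cosine(u, v):
    """Closeness of two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical embeddings in a shared image-text space.
photo = [0.9, 0.1, 0.2]
captions = {
    "a sunny beach": [0.8, 0.2, 0.1],
    "a snowy mountain": [0.1, 0.9, 0.3],
    "a city at night": [0.2, 0.3, 0.9],
}

scores = {text: cosine(photo, vector) for text, vector in captions.items()}
best_caption = max(scores, key=scores.get)
```

Searching photos by description is the same calculation turned around: embed the text once, then rank every photo by its score.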

What CLIP unlocks

  • Zero-shot labelling: name an image without training a special classifier for it.
  • Search images by a text description (or find captions for an image).
  • Steering image generators: a text prompt becomes a target in this space that diffusion aims for.

Try it

Pick an image and a caption below and watch the match score — high when they mean the same thing, low when they don't. Then we'll combine CLIP with diffusion to turn words into pictures.

Interactive: pick an image and a caption. The score is high when they mean the same thing.
Lesson 8 · Putting it together

How text-to-image works

Text-to-image joins the pieces from this course. Your prompt is turned into a target using a CLIP-like text encoder; a diffusion model then denoises pure noise step by step, steered at every step toward an image that matches your words.

Every piece, working together

Now the course clicks into place. Diffusion (lessons 4–5) knows how to turn noise into a realistic image. CLIP (lesson 7) knows how to turn your words into a point in a shared space. Text-to-image bolts them together: your prompt becomes a target, and the diffusion model denoises toward it — checking against your words at every step.

From prompt to picture

  1. A text encoder turns your prompt into a target in the shared image-text space.
  2. The diffusion model starts from pure noise and denoises step by step (lesson 5).
  3. At each step it's nudged toward the prompt's target, so the emerging image matches your words.
  4. For speed, tools like Stable Diffusion run this in a small VAE latent space (lesson 3), not full pixels.
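
One common way to implement the "nudge" in step 3 is classifier-free guidance, the approach used by tools such as Stable Diffusion. Treat that choice as an assumption here, since the steps above do not name a technique. The model makes two noise predictions at each step, one ignoring the prompt and one using it, and the sampler exaggerates the difference between them. The numbers below are made up, and 7.5 is only a frequently used default scale.

```python
def guided_noise(noise_without_prompt, noise_with_prompt, guidance_scale=7.5):
    """Push the noise prediction further in the direction the prompt suggests."""
    return [
        plain + guidance_scale * (prompted - plain)
        for plain, prompted in zip(noise_without_prompt, noise_with_prompt)
    ]

plain = [0.0, 1.0]       # made-up prediction that ignores the prompt
prompted = [1.0, 0.5]    # made-up prediction that uses the prompt

ignore_prompt = guided_noise(plain, prompted, guidance_scale=0.0)   # [0.0, 1.0]
follow_prompt = guided_noise(plain, prompted, guidance_scale=1.0)   # [1.0, 0.5]
exaggerate = guided_noise(plain, prompted, guidance_scale=2.0)      # [2.0, 0.0]
```

A scale of 0 ignores the prompt, 1 follows the prompted prediction as is, and larger values lean harder on the prompt, usually at some cost in variety.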

Why prompts sometimes miss

This also explains the quirks. Diffusion learns patterns statistically, not physics — so it can fumble hands, exact object counts, readable text in the image, or precise "left of / behind" placement. Clearer, more specific prompts steer it better, but these remain the known weak spots as of 2026.

Try it

Pick a prompt below and watch noise resolve into a matching image — the whole pipeline, in miniature.

Interactive: pick a prompt and watch pure noise denoise into a matching image.
Lesson 9 · Putting it together

Where generative AI is used

Generative AI now powers image tools (Midjourney, DALL·E, Imagen), design and marketing, video and audio creation, and multimodal assistants (GPT-4o, Claude, Gemini) that can see and read at once. The same building blocks from this course sit under all of them.

Real-world uses

  • Image creation — Midjourney, DALL·E, Stable Diffusion, Imagen turn prompts into art, mockups, and product shots.
  • Design & marketing — logos, ad variations, storyboards, and concept art in seconds.
  • Video & audio — text-to-video and voice/music generation build on the same diffusion ideas.
  • Multimodal assistants — GPT-4o, Claude, Gemini answer questions about images, charts, and screenshots.

The same blocks, everywhere

Behind this variety is a small set of ideas you now know: a generative model that learns a pattern; diffusion that denoises noise into images; CLIP-style shared spaces that link words and pictures; and multimodal models that read text and images together. New tools mostly recombine these blocks.

You've finished the course

You can now explain generative AI end to end: what a generative model is, how GANs and VAEs make images, how diffusion denoises noise into pictures, how multimodal AI and CLIP link words with images, and how text-to-image ties it together. Ready to go deeper? The links below continue the journey.

One toolkit — generative models, diffusion, CLIP — behind many everyday AI tools.
