Generative AI, explained visually
Generative AI is software that makes brand-new images, text, and sound by learning the patterns in millions of examples. This free visual course teaches how it works from scratch — GANs, VAEs, diffusion (the tech behind Midjourney and DALL·E), multimodal AI, and text-to-image — in nine short animated lessons. No math, no jargon.
What is a generative model?
A generative model is AI that creates brand-new things — images, text, music — instead of just sorting or labelling what already exists. It studies millions of examples until it learns their pattern well enough to make new ones that look like they belong.
Two kinds of AI: sorters and makers
Most AI you've met is a sorter: it looks at a photo and says "cat or dog?", or reads a review and says "happy or angry?". A generative model is a maker: give it the idea "a cat" and it produces a brand-new cat picture that never existed before. Same raw ingredient — data — but one labels it and the other creates from it.
Learning a style, not memorising
Think of an art student who studies thousands of Van Gogh paintings. They don't memorise one painting to copy it — they absorb the style: the swirls, the colours, the brushwork. Afterwards they can paint a new scene Van Gogh never painted, in his style. A generative model does the same with its training data: it learns the pattern, then makes new work that fits it.
The big idea
- Generative models learn the underlying pattern of their data, then sample new examples from it.
- They work for many kinds of data: images (Midjourney, DALL·E), text (ChatGPT), audio, and video.
- This course covers the main families: GANs, VAEs, and — the one behind today's image tools — diffusion.
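To make "learn the pattern, then sample new examples" concrete, here is a deliberately tiny sketch. It assumes the "pattern" is nothing more than an average and a spread, and the height values are invented for illustration. Real generative models learn far richer patterns, but the two-step shape is the same.

```python
import random
import statistics

def fit_pattern(examples):
    """'Learn the pattern': here, just the mean and spread of the data."""
    return statistics.mean(examples), statistics.stdev(examples)

def sample_new(pattern, count, rng):
    """'Make new examples': draw fresh values that fit the learned pattern."""
    mean, spread = pattern
    return [rng.gauss(mean, spread) for _ in range(count)]

rng = random.Random(0)
# Hypothetical training data (heights in cm), invented for this example.
training_heights = [162.0, 175.5, 168.2, 181.0, 159.4, 172.3, 166.8, 177.1]

pattern = fit_pattern(training_heights)
new_heights = sample_new(pattern, 5, rng)

print("learned pattern (mean, spread):", pattern)
print("new examples:", new_heights)
```

The new heights are not copies of the training values, yet they look like they belong with them, which is the whole idea in miniature.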
Where we're headed
Over the next lessons we'll meet three ways to build a generative model — GANs, VAEs, and diffusion — then see how they team up with language to turn your words into pictures. Let's start with the most competitive one: GANs.
GANs: a forger vs a detective
A GAN (Generative Adversarial Network) trains two networks against each other: a generator that makes fakes and a discriminator that judges real from fake. As the detective gets sharper, the forger gets better — until the fakes look real.
Two networks, one contest
Picture a forger trying to paint fake money and a detective trying to spot it. Every round, the forger makes a batch of fakes; the detective judges each one real or fake and gets told the truth. The forger uses that feedback to fool the detective next time; the detective uses it to catch sharper fakes. Round after round, both improve — and the forger ends up making fakes good enough to pass.
One training round
- The generator turns random noise into a fake image.
- The discriminator sees a mix of real and fake images and guesses which is which.
- Both learn from the score: the generator to fool better, the discriminator to catch better.
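The three steps above can be sketched on a toy problem where each "image" is a single number. Everything here is an assumption made for illustration: the generator is a straight line (a * noise + b), the discriminator is a single sigmoid unit, and the gradients are written out by hand. Real GANs use deep networks and an autodiff library, but the order of operations in a round is the same.

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def training_round(generator, discriminator, real_batch, noise_batch, lr=0.05):
    """Run one forger-vs-detective round; return the updated (generator, discriminator)."""
    a, b = generator        # forger: fake = a * noise + b
    w, c = discriminator    # detective: P(real) = sigmoid(w * x + c)

    # Step 1: the generator turns random noise into fakes.
    fakes = [a * z + b for z in noise_batch]

    # Step 2: the discriminator judges a mix of real and fake samples,
    # then takes one gradient step on its cross-entropy loss.
    grad_w = grad_c = 0.0
    for x in real_batch:                  # true label: real
        err = sigmoid(w * x + c) - 1.0
        grad_w += err * x
        grad_c += err
    for x in fakes:                       # true label: fake
        err = sigmoid(w * x + c)
        grad_w += err * x
        grad_c += err
    total = len(real_batch) + len(fakes)
    w -= lr * grad_w / total
    c -= lr * grad_c / total

    # Step 3: the generator takes one gradient step so that its fakes
    # score as "more real" against the updated discriminator.
    grad_a = grad_b = 0.0
    for z in noise_batch:
        err = sigmoid(w * (a * z + b) + c) - 1.0
        grad_a += err * w * z
        grad_b += err * w
    a -= lr * grad_a / len(noise_batch)
    b -= lr * grad_b / len(noise_batch)
    return (a, b), (w, c)

rng = random.Random(0)
real_batch = [rng.gauss(4.0, 1.0) for _ in range(64)]    # hypothetical "real" data
noise_batch = [rng.gauss(0.0, 1.0) for _ in range(64)]
generator, discriminator = training_round((1.0, 0.0), (0.0, 0.0), real_batch, noise_batch)

print("generator (a, b):", generator)
print("discriminator (w, c):", discriminator)
```

In this toy, the real numbers sit around 4 and the first fakes sit around 0, so after one round the detective has learned that larger numbers look more real (w turns positive) and the forger has nudged its fakes upward (b turns positive). Training is simply many such rounds in a row.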
Fast, sharp — but temperamental
GANs were the state of the art for years and produce crisp images fast (one pass, no slow steps). But the two-network fight is delicate: if the generator finds one type of fake that always fools the detective, it makes only that — a failure called mode collapse (imagine the forger only ever painting the same one banknote). Getting the balance right is finicky, which is one reason diffusion later took over.
Try it
Below, step the forger-vs-detective game and watch both scores climb as they learn from each other.
VAEs: compress, then recreate
A VAE (Variational Autoencoder) learns to squeeze an image down into a small list of numbers, a compact "idea" called a latent, and then rebuild the image from it. Sample a new idea, decode it, and you get a brand-new image. VAEs are stable to train, but their outputs tend to look a little soft.
An encoder and a decoder
A VAE has two halves. The encoder takes an image and squeezes it into a short list of numbers — the latent — that captures its essence (roughly: what's in it, its layout, its style). The decoder does the reverse: it takes that short list and rebuilds a full image. Train them together and the model learns a compact, meaningful summary of what images look like.
Like describing a face to a sketch artist
You can't send the whole photo, so you describe it in a few words — round face, curly hair, glasses. That short description is the latent. A good sketch artist (the decoder) turns those few words back into a face. Change the description a little — straight hair instead of curly — and you get a new but plausible face. That's how a VAE makes new images: tweak the idea, decode it.
Why VAEs still matter
- They give you a smooth, organised space of ideas — nearby latents make similar images.
- They're stable to train (no two-network fight like GANs).
- Their images can look a bit blurry — but that compress-to-a-latent trick is exactly what makes modern diffusion fast (next lessons).
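Here is a minimal sketch of the encode, sample, decode flow, assuming a 4-pixel "image" and a 2-number latent. The encoder and decoder below are hand-written stand-ins, not trained networks, and the fixed uncertainty value is an assumption. A real VAE learns all of this from data.

```python
import math
import random

def encode(image):
    """Squeeze 4 pixels into a 2-number latent: a mean and a log-variance per number."""
    mu = [(image[0] + image[1]) / 2, (image[2] + image[3]) / 2]
    log_var = [-4.0, -4.0]   # assumed fixed uncertainty; a trained encoder predicts this too
    return mu, log_var

def sample_latent(mu, log_var, rng):
    """Pick a latent near the mean: z = mu + sigma * random noise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def decode(z):
    """Rebuild 4 pixels from the 2-number latent."""
    return [z[0], z[0], z[1], z[1]]

rng = random.Random(0)
image = [1.0, 0.5, 0.25, 0.25]

mu, log_var = encode(image)
rebuilt = decode(mu)                          # reconstruction from the latent
nearby = decode([mu[0] + 0.05, mu[1]])        # tweak the idea a little, then decode
fresh = decode(sample_latent(mu, log_var, rng))

print("latent:", mu)
print("rebuilt:", rebuilt)
print("nearby latent gives:", nearby)
print("sampled latent gives:", fresh)
```

Two things from the list show up even in this toy. The rebuilt image loses the difference between the first two pixels (they both become the average), which is a small-scale version of the blur. And a small change to the latent produces only a small change in the output, which is what a smooth, organised space of ideas means.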
Keep this in your pocket
The VAE's "work in a small latent space instead of full pixels" idea comes back in a big way when we reach Stable Diffusion. Now let's meet the technique behind today's best image tools: diffusion.
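A quick bit of arithmetic shows why working in a latent space saves so much effort. The sizes below are the ones commonly quoted for Stable Diffusion v1 (a 512 x 512 colour image and a 64 x 64 latent with 4 channels). Treat them as illustrative, since other models and versions use different sizes.

```python
def count_values(height, width, channels):
    """How many numbers it takes to store a grid of this size."""
    return height * width * channels

pixel_values = count_values(512, 512, 3)    # full RGB image
latent_values = count_values(64, 64, 4)     # compressed latent grid
ratio = pixel_values / latent_values

print("numbers in the full image:", pixel_values)
print("numbers in the latent:", latent_values)
print("the latent is smaller by a factor of:", ratio)
```

Under these assumed sizes, every denoising step handles 48 times fewer numbers than it would on raw pixels.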
How diffusion models work
Diffusion models learn by destroying, then rebuilding. In training, they add noise to a real image bit by bit until it's pure static — then learn to reverse each step. To make a new image, they start from pure noise and undo it, guided toward what you asked for.
Learn to reverse the mess
Diffusion has a clever training trick. Take a real photo and add a little random static, then a little more, then more. After enough steps it's pure noise, like an untuned TV. At each step the model is shown the noisier version and asked: "what noise was just added?" By learning to predict the noise, it learns how to remove it, and so how to walk the process backwards.
The two directions
- Forward (training only): start from a real image, add noise step by step until it's pure static.
- Reverse (generation): start from pure static, remove a little noise each step, and a new image appears.
- The model never memorises photos — it learns the general skill of turning noise into something realistic.
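The forward direction is simple enough to sketch in a few lines. This version assumes a linear noise schedule with 1,000 steps (a common choice in the diffusion literature, but still an assumption here) and uses a shortcut that jumps straight to step t instead of adding noise one step at a time. The 4-number "image" is invented for illustration.

```python
import math
import random

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Return alpha_bar, where alpha_bar[t] is how much of the original
    image's signal survives after t noising steps (alpha_bar[0] == 1)."""
    alpha_bar = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        alpha_bar.append(alpha_bar[-1] * (1.0 - beta))
    return alpha_bar

def add_noise(image, t, alpha_bar, rng):
    """Jump straight to step t: blend the image with fresh random static."""
    noise = [rng.gauss(0.0, 1.0) for _ in image]
    keep = math.sqrt(alpha_bar[t])         # share of the image that remains
    mix = math.sqrt(1.0 - alpha_bar[t])    # share of static mixed in
    noisy = [keep * x + mix * n for x, n in zip(image, noise)]
    return noisy, noise

rng = random.Random(0)
alpha_bar = make_schedule()
image = [0.2, 0.8, 0.5, 0.1]    # hypothetical 4-pixel image

for t in (1, 250, 500, 1000):
    noisy, noise = add_noise(image, t, alpha_bar, rng)
    print(f"step {t}: signal left = {alpha_bar[t]:.5f}, noisy = {noisy}")

# A training example is the pair (noisy, t) as input and `noise` as the
# answer the model should learn to predict.
```

By the last step almost none of the original signal is left, which is the "pure static" end of the forward direction.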
A sculptor and a block of marble
A sculptor starts with a rough block and chips away, a little at a time, until a statue emerges. Diffusion starts with a block of noise and "chips away" the randomness step by step until an image emerges. It never carves the same statue twice — start from different noise and you get a different picture.
One key detail
The model doesn't predict the finished image in one shot — it predicts the small bit of noise to remove right now, and repeats. That patience is why diffusion images look so good. Next: watch that denoising happen, step by step.
Denoising, step by step
Generating an image is a series of small denoising steps. Starting from pure noise, the model predicts a little noise to remove, applies it, and repeats (often 20 to 1,000 times) until a clear, detailed image emerges. Fewer steps run faster; more steps usually give a sharper result, with diminishing returns.
Many small steps beat one big guess
Instead of trying to paint the whole image at once, diffusion takes many gentle passes. Step one turns pure noise into a slightly-less-noisy blur. Step two cleans it a little more. By the last step the picture is sharp. Each step only has to make a small improvement, which is much easier than nailing the whole thing in one go — and it's why the results are so detailed.
What happens each step
- The model looks at the current noisy image and predicts the noise in it.
- It removes a portion of that predicted noise, revealing a slightly clearer image.
- Repeat: each pass sharpens the picture until the final step produces the finished image.
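The loop above can be sketched as follows. Two loud assumptions: the update rule is a deterministic, DDIM-style step (one of several samplers in use), and the noise predictor is an "oracle" that cheats by knowing the finished image. A real model has no such answer key; it has learned to estimate the noise from training. The oracle is only here so the loop can run without a trained network.

```python
import math
import random

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar[t]: how much of the clean image's signal survives at step t."""
    alpha_bar = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        alpha_bar.append(alpha_bar[-1] * (1.0 - beta))
    return alpha_bar

def oracle_noise(noisy, t, alpha_bar, target):
    """Hypothetical stand-in for the trained network: it knows the target,
    so it can work out exactly what noise is in the current image."""
    keep, mix = math.sqrt(alpha_bar[t]), math.sqrt(1.0 - alpha_bar[t])
    return [(x - keep * c) / mix for x, c in zip(noisy, target)]

def generate(target, sampling_steps, rng, train_steps=1000):
    alpha_bar = make_schedule(train_steps)
    stride = max(train_steps // sampling_steps, 1)
    image = [rng.gauss(0.0, 1.0) for _ in target]      # start from pure noise
    for t in range(train_steps, 0, -stride):
        prev_t = max(t - stride, 0)
        # 1. Predict the noise in the current image.
        noise = oracle_noise(image, t, alpha_bar, target)
        # 2. Estimate the clean image, then step to a slightly less noisy
        #    version by mixing a smaller share of that noise back in.
        keep, mix = math.sqrt(alpha_bar[t]), math.sqrt(1.0 - alpha_bar[t])
        clean_guess = [(x - mix * n) / keep for x, n in zip(image, noise)]
        prev_keep = math.sqrt(alpha_bar[prev_t])
        prev_mix = math.sqrt(1.0 - alpha_bar[prev_t])
        # 3. Repeat from the clearer image.
        image = [prev_keep * c + prev_mix * n for c, n in zip(clean_guess, noise)]
    return image

target = [0.2, 0.8, 0.5, 0.1]    # hypothetical 4-pixel "finished image"
result = generate(target, sampling_steps=20, rng=random.Random(0))
print("final image:", result)
```

Because the oracle is perfect, the step count makes no difference to the result here. With a learned, imperfect predictor, the number of steps is exactly where speed is traded against quality.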
The speed–quality dial
The number of steps is a dial you can turn. A few dozen steps is fast and usually good enough; hundreds of steps squeeze out extra detail but take longer. Newer samplers get great results in far fewer steps — which is why image tools that once took a minute now feel almost instant.
Try it
Drag the slider below from pure noise to a finished image and watch the denoising happen — that's exactly what your prompt triggers behind the scenes.
What is multimodal AI?
Multimodal AI understands more than one kind of input at once — most often text and images together (sometimes audio and video too). It works by turning a picture into "image tokens" that a language model can read right alongside your words.
One model, many senses
A text-only model is like a brilliant friend on the phone — great with words, but blind to what you're pointing at. A multimodal model can also see. Show it a photo and ask a question about it, or hand it a chart and ask what it means, and it answers using both the picture and your words together — the way people naturally do.
How an image gets 'read'
- A vision encoder looks at the image and turns it into a sequence of numbers called image tokens.
- Those image tokens are placed next to your text tokens in the same input.
- The language model reasons over both at once — so it can answer questions that mix seeing and reading.
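One common way to get image tokens is to cut the picture into small square patches and treat each patch as one token. The sketch below assumes that approach and stops at the cutting step. A real vision encoder would then pass each patch through a learned network to get a vector, and the text would be turned into vectors as well. The tiny 4 x 4 image and the question are invented for illustration.

```python
def image_to_patches(image, patch_size):
    """Cut a grid of pixel values into square patches, each flattened to a list."""
    patches = []
    for top in range(0, len(image), patch_size):
        for left in range(0, len(image[0]), patch_size):
            patch = [image[top + r][left + c]
                     for r in range(patch_size)
                     for c in range(patch_size)]
            patches.append(patch)
    return patches

# Hypothetical 4 x 4 greyscale image.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [5, 5, 1, 1],
    [5, 5, 1, 1],
]
image_tokens = image_to_patches(image, patch_size=2)
text_tokens = ["what", "is", "in", "this", "picture", "?"]

# Image tokens and text tokens sit side by side in one input sequence.
model_input = [("image", tok) for tok in image_tokens] + \
              [("text", tok) for tok in text_tokens]

for kind, token in model_input:
    print(kind, token)
```

The language model then reads this single sequence from start to finish, which is why it can answer questions that depend on both the picture and the words.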
Where you've already used it
Modern assistants such as GPT-4o, Claude, and Gemini (as of 2026) are multimodal: paste a screenshot and ask for the code, photograph a receipt and ask for the total, or share a diagram and ask what's wrong. Under the hood it's the same trick: images become tokens the model can think about alongside language.
The piece that makes it click
For a model to connect "this picture" with "these words," both need to live in one shared space of meaning. The model that pioneered this is CLIP, covered in the next lesson.
How text-to-image works
Text-to-image joins the pieces from this course. Your prompt is turned into a target using a CLIP-like text encoder; a diffusion model then denoises pure noise step by step, steered at every step toward an image that matches your words.
Every piece, working together
Now the course clicks into place. Diffusion (lessons 4–5) knows how to turn noise into a realistic image. CLIP (lesson 7) knows how to turn your words into a point in a shared space. Text-to-image bolts them together: your prompt becomes a target, and the diffusion model denoises toward it — checking against your words at every step.
From prompt to picture
- A text encoder turns your prompt into a target in the shared image-text space.
- The diffusion model starts from pure noise and denoises step by step (lesson 5).
- At each step it's nudged toward the prompt's target, so the emerging image matches your words.
- For speed, tools like Stable Diffusion run this in a small VAE latent space (lesson 3), not full pixels.
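The "nudge" at each step can be done in more than one way, and the steps above don't commit to a particular method. One widely used option is classifier-free guidance, sketched here as an assumption rather than as what any specific tool does. The model makes two noise predictions, one ignoring the prompt and one using it, and the sampler pushes further in the direction the prompt suggests. The function name and the numbers are hypothetical.

```python
def guided_noise(noise_without_prompt, noise_with_prompt, guidance_scale):
    """Blend two noise predictions; a larger scale follows the prompt more strongly."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(noise_without_prompt, noise_with_prompt)]

# Hypothetical predictions for a 2-number "image".
without_prompt = [0.0, 1.0]
with_prompt = [1.0, 1.0]

print(guided_noise(without_prompt, with_prompt, 0.0))   # ignores the prompt
print(guided_noise(without_prompt, with_prompt, 1.0))   # plain prompt-aware prediction
print(guided_noise(without_prompt, with_prompt, 2.0))   # pushes past it, toward the prompt
```

The blended prediction then takes the place of the plain noise prediction in each denoising step. Where the two predictions agree (the second number), the prompt changes nothing; where they differ, the scale decides how hard the image is steered.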
Why prompts sometimes miss
This also explains the quirks. Diffusion learns statistical patterns, not the rules of physics or counting, so it can fumble hands, exact object counts, readable text in the image, or precise "left of / behind" placement. Clearer, more specific prompts steer it better, but these remain the known weak spots as of 2026.
Try it
Pick a prompt below and watch noise resolve into a matching image — the whole pipeline, in miniature.
Where generative AI is used
Generative AI now powers image tools (Midjourney, DALL·E, Imagen), design and marketing, video and audio creation, and multimodal assistants (GPT-4o, Claude, Gemini) that can see and read at once. The same building blocks from this course sit under all of them.
Real-world uses
- Image creation — Midjourney, DALL·E, Stable Diffusion, Imagen turn prompts into art, mockups, and product shots.
- Design & marketing — logos, ad variations, storyboards, and concept art in seconds.
- Video & audio — text-to-video and voice/music generation build on the same diffusion ideas.
- Multimodal assistants — GPT-4o, Claude, Gemini answer questions about images, charts, and screenshots.
The same blocks, everywhere
Behind this variety is a small set of ideas you now know: a generative model that learns a pattern; diffusion that denoises noise into images; CLIP-style shared spaces that link words and pictures; and multimodal models that read text and images together. New tools mostly recombine these blocks.
You've finished the course
You can now explain generative AI end to end: what a generative model is, how GANs and VAEs make images, how diffusion denoises noise into pictures, how multimodal AI and CLIP link words with images, and how text-to-image ties it together. Ready to go deeper? The links below continue the journey.