Every piece, working together
Now the course clicks into place. Diffusion (lessons 4–5) knows how to turn noise into a realistic image. CLIP (lesson 7) knows how to turn your words into a point in a shared image-text space. Text-to-image bolts them together: your prompt becomes a target, and the diffusion model denoises toward it, guided by your words at every step.
From prompt to picture
- A text encoder turns your prompt into a target in the shared image-text space.
- The diffusion model starts from pure noise and denoises step by step (lesson 5).
- At each step it's nudged toward the prompt's target, so the emerging image matches your words.
- For speed, tools like Stable Diffusion run this in a small VAE latent space (lesson 3) rather than on full-resolution pixels, then decode the final latent into the image.
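The steps above can be sketched as a toy loop. This is only an illustration of the control flow, not a real model: `encode_prompt`, `decode`, `generate`, `LATENT_DIM`, and the `nudge` parameter are all hypothetical names invented here, the "text encoder" is a seeded random vector, the "denoiser" is a plain move toward the target, and the "VAE decoder" is a fixed random projection.

```python
import numpy as np

LATENT_DIM = 16  # assumed toy size; real latent spaces are far larger


def encode_prompt(prompt: str) -> np.ndarray:
    """Hypothetical stand-in for a text encoder: prompt -> unit vector."""
    seed = sum(ord(ch) for ch in prompt)  # deterministic toy mapping
    rng = np.random.default_rng(seed)
    vec = rng.standard_normal(LATENT_DIM)
    return vec / np.linalg.norm(vec)


def decode(latent: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a VAE decoder: latent -> 8x8 'image'."""
    rng = np.random.default_rng(123)  # fixed weights, not learned
    projection = rng.standard_normal((64, LATENT_DIM))
    return (projection @ latent).reshape(8, 8)


def generate(prompt: str, steps: int = 50, nudge: float = 0.2, seed: int = 0):
    """Start from noise and move a little toward the prompt's target each step."""
    target = encode_prompt(prompt)
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(LATENT_DIM)  # pure noise
    for _ in range(steps):
        # A real model would predict and remove noise here, conditioned
        # on the prompt; this toy simply closes part of the gap.
        latent = latent + nudge * (target - latent)
    return latent, target


if __name__ == "__main__":
    latent, target = generate("a red fox in snow")
    image = decode(latent)
    print(image.shape)
```

Each pass through the loop shrinks the gap between the latent and the target by the same fraction, which is why the result drifts steadily from noise toward something that matches the prompt. In a real system, that single line is replaced by a trained neural network.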
Why prompts sometimes miss
This also explains the quirks. Diffusion learns statistical patterns from its training images, not physics or counting rules, so it can fumble hands, exact object counts, readable text in the image, or precise "left of / behind" placement. Clearer, more specific prompts steer it better, but these remain the known weak spots as of 2026.
Try it
Pick a prompt below and watch noise resolve into a matching image — the whole pipeline, in miniature.