Lesson 8 · Putting it together

How text-to-image works

Text-to-image joins the pieces from this course. Your prompt is turned into a target using a CLIP-like text encoder; a diffusion model then denoises pure noise step by step, steered at every step toward an image that matches your words.


Every piece, working together

Now the course clicks into place. Diffusion (lessons 4–5) knows how to turn noise into a realistic image. CLIP (lesson 7) knows how to turn your words into a point in a shared image-text space. Text-to-image bolts them together: your prompt becomes a target, and the diffusion model denoises toward it, conditioned on your words at every step.
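
One common way real systems apply this steering is classifier-free guidance. The formula below is a sketch in notation this lesson does not otherwise use, so treat every symbol as an assumption: x_t is the noisy image at step t, c is the prompt's embedding, the empty-set symbol stands for an empty prompt, epsilon_theta is the model's noise prediction, and w is the guidance scale.

```latex
\hat{\epsilon}_t = \epsilon_\theta(x_t, t, \emptyset) + w \left[ \epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset) \right]
```

With w = 1 this reduces to the plain prompt-conditioned prediction. Larger w pushes each step harder toward the prompt, usually at some cost in variety.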

From prompt to picture

  1. A text encoder turns your prompt into a target in the shared image-text space.
  2. The diffusion model starts from pure noise and denoises step by step (lesson 5).
  3. At each step it's nudged toward the prompt's target, so the emerging image matches your words.
  4. For speed, tools like Stable Diffusion run the denoising in a small VAE latent space (lesson 3) rather than on full-resolution pixels; the VAE decoder then turns the final latent into the image.
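
The steps above can be sketched as a toy loop. This is not a real diffusion model: `toy_text_encoder` and `toy_denoise` are hypothetical names, the "encoder" just derives a fixed random unit vector from the prompt, and the learned denoiser is replaced by a simple pull toward that target. It only illustrates the control flow of encoding the prompt, starting from noise, and nudging at every step.

```python
import numpy as np


def toy_text_encoder(prompt: str, dim: int = 16) -> np.ndarray:
    """Hypothetical stand-in for a CLIP-like text encoder.

    Maps a prompt to a deterministic unit vector. A real encoder is a
    trained network; this one only guarantees "same prompt, same target".
    """
    seed = sum(ord(ch) for ch in prompt)
    rng = np.random.default_rng(seed)
    vec = rng.standard_normal(dim)
    return vec / np.linalg.norm(vec)


def toy_denoise(prompt: str, steps: int = 50, nudge: float = 0.2,
                dim: int = 16, seed: int = 0):
    """Toy version of steps 1-3: encode, start from noise, nudge each step."""
    target = toy_text_encoder(prompt, dim)       # step 1: prompt -> target
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)                 # step 2: pure noise
    history = []
    for _ in range(steps):
        # Step 3 (assumption): a real model predicts and removes noise here;
        # this sketch just moves a fixed fraction of the way to the target.
        x = x + nudge * (target - x)
        history.append(float(np.linalg.norm(x - target)))
    return x, target, history


if __name__ == "__main__":
    x, target, history = toy_denoise("a red bicycle on a beach")
    print(f"distance after first step: {history[0]:.4f}")
    print(f"distance after last step:  {history[-1]:.6f}")
```

The vector `x` plays the role of the latent from step 4. In a real system it would be handed to the VAE decoder to produce pixels, which this sketch leaves out.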

Why prompts sometimes miss

This also explains the quirks. Diffusion learns statistical patterns from its training images, not physics, so it can fumble hands, exact object counts, readable text in the image, or precise "left of / behind" placement. Clearer, more specific prompts steer it better, but these remain the known weak spots as of 2026.

Try it

Pick a prompt below and watch noise resolve into a matching image — the whole pipeline, in miniature.
