Lesson 7 · Multimodal

CLIP: one shared space for words and pictures

CLIP is a model trained on roughly 400 million image-and-caption pairs so that a picture and the words describing it land close together in one shared embedding space. That lets AI match an image to text, search photos by description, and steer image generators, all with the same closeness idea behind semantic search.

Teach pictures and words to agree

CLIP (Contrastive Language-Image Pre-training, introduced by Radford and colleagues at OpenAI in 2021) is shown huge numbers of image-caption pairs, such as a photo of a beach with the caption "a sunny beach." It has two encoders, one for images and one for text, and it's trained with a simple contrastive rule: within each batch, pull each image and its true caption close together in a shared space, and push mismatched pairs apart. After enough pairs, a photo and its correct description end up much closer to each other than to unrelated images or captions.
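
The pull-together, push-apart rule can be sketched as a symmetric cross-entropy over a batch's similarity matrix. This is a minimal illustration in plain Python, not CLIP's actual training code: the two-dimensional vectors are made-up stand-ins for encoder outputs, the function names are hypothetical, and the fixed temperature of 0.07 is an assumption (in the real model the temperature is a learned parameter).

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy(logits, target):
    """Negative log-probability of the target index under a softmax over logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric contrastive loss: pair k's image should match pair k's caption."""
    images = [normalize(v) for v in image_vecs]
    texts = [normalize(v) for v in text_vecs]
    n = len(images)
    # logits[r][c] = scaled similarity between image r and caption c
    logits = [[dot(i, t) / temperature for t in texts] for i in images]
    # Each image should pick its own caption (rows) ...
    image_to_text = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # ... and each caption should pick its own image (columns).
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = sum(cross_entropy(columns[k], k) for k in range(n)) / n
    return (image_to_text + text_to_image) / 2

# Toy batch of two pairs (invented numbers, not real embeddings).
aligned_images = [[1.0, 0.1], [0.1, 1.0]]
aligned_texts = [[0.9, 0.2], [0.2, 0.9]]
swapped_texts = [aligned_texts[1], aligned_texts[0]]  # captions attached to the wrong images

loss_aligned = contrastive_loss(aligned_images, aligned_texts)
loss_swapped = contrastive_loss(aligned_images, swapped_texts)
print(loss_aligned, loss_swapped)
```

The loss is small when each image sits nearest its own caption and large when the captions are swapped, which is the signal training uses to move the points.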

Why one shared space is so powerful

Once images and text live in the same space, closeness means "these match." CLIP measures closeness with cosine similarity, which compares the directions of two vectors. So you can hand CLIP a photo and a few candidate captions and it picks the closest one; or type a description and it finds the nearest photos. It's the same measure-the-closeness trick from semantic search, just spanning pictures and words instead of only text.
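
As a rough sketch of picking the closest caption, the snippet below scores one image vector against a few caption vectors with cosine similarity and keeps the best. The three-dimensional vectors are invented for illustration; a real system would get them from CLIP's image and text encoders, and `best_caption` is a hypothetical helper name.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 for the same direction, near 0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_caption(image_vec, caption_vecs):
    """Return the caption whose vector is closest to the image, plus all scores."""
    scores = {caption: cosine(image_vec, vec) for caption, vec in caption_vecs.items()}
    return max(scores, key=scores.get), scores

# Made-up embeddings standing in for encoder outputs.
photo_of_beach = [0.9, 0.1, 0.2]
captions = {
    "a sunny beach": [0.8, 0.2, 0.1],
    "a snowy mountain": [0.1, 0.9, 0.3],
    "a city street at night": [0.2, 0.3, 0.9],
}

winner, scores = best_caption(photo_of_beach, captions)
print(winner)
for caption, score in scores.items():
    print(f"{score:.3f}  {caption}")
```

Swapping the roles (one caption vector, many image vectors) gives text-to-image search with the same function.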

What CLIP unlocks

  • Zero-shot labelling: name an image without training a special classifier for it.
  • Search images by a text description (or find captions for an image).
  • Steering image generators: a text prompt is encoded into this space, and the resulting embedding guides a diffusion model toward images that match it.
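
Searching images by description can be sketched as ranking a small index of image vectors against a query vector. Everything here is an assumption for illustration: the file names, the vectors, and the `search` helper are hypothetical, and a real deployment would precompute the image embeddings with CLIP and typically use a vector index rather than a full scan.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def search(query_vec, image_index, k=2):
    """Return the names of the k images most similar to the query, best first."""
    q = normalize(query_vec)
    scored = []
    for name, vec in image_index.items():
        similarity = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((similarity, name))
    scored.sort(reverse=True)  # highest cosine similarity first
    return [name for _, name in scored[:k]]

# Hypothetical precomputed image embeddings.
image_index = {
    "beach.jpg": [0.9, 0.1, 0.2],
    "harbour.jpg": [0.7, 0.2, 0.5],
    "mountain.jpg": [0.1, 0.9, 0.3],
    "street.jpg": [0.2, 0.3, 0.9],
}

# Hypothetical text embedding for the query "a sunny beach".
query = [0.8, 0.2, 0.1]

results = search(query, image_index, k=2)
print(results)
```

Zero-shot labelling is the mirror image: the index holds one text vector per candidate label, and the query is the image.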

Try it

Pick an image and a caption below and watch the match score — high when they mean the same thing, low when they don't. Then we'll combine CLIP with diffusion to turn words into pictures.

Interactive: pick an image and a caption. The score is high when they mean the same thing.