Teach pictures and words to agree
CLIP (introduced by Radford and colleagues at OpenAI in 2021) is shown roughly 400 million image-caption pairs, such as a photo of a beach with the caption "a sunny beach." It's trained with a simple rule: within each batch, pull each image and its true caption close together in a shared space, and push mismatched pairs apart. After enough pairs, a photo and its correct description end up much closer to each other than to anything that doesn't match.
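The pull-together, push-apart rule can be sketched as a symmetric contrastive loss. This is a minimal illustration in plain Python, not CLIP's actual training code: the 2-D vectors are made up, the function names are hypothetical, and the temperature is fixed at an assumed 0.07 (CLIP learns this scale during training).

```python
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Softmax cross-entropy for one row of scores, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric contrastive loss for a batch where image i matches caption i."""
    images = [normalize(v) for v in image_vecs]
    texts = [normalize(v) for v in text_vecs]
    n = len(images)
    # sims[i][j] = cosine similarity of image i and caption j, sharpened by temperature
    sims = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
            for img in images]
    # Each image should pick out its own caption...
    image_to_text = sum(cross_entropy_row(sims[i], i) for i in range(n)) / n
    # ...and each caption should pick out its own image.
    text_to_image = sum(
        cross_entropy_row([sims[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

# Toy embeddings, invented for illustration only.
toy_images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]   # caption i sits near image i
swapped_texts = [[0.1, 0.9], [0.9, 0.1]]   # captions attached to the wrong images

loss_aligned = contrastive_loss(toy_images, aligned_texts)
loss_swapped = contrastive_loss(toy_images, swapped_texts)
print(loss_aligned, loss_swapped)
```

The loss is small when each caption sits next to its own image and large when the captions are swapped, which is the signal training uses to move the points.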
Why one shared space is so powerful
Once images and text live in the same space, closeness (typically measured as the cosine similarity between two vectors) means "these match." So you can hand CLIP a photo and a few candidate captions and it picks the closest one; or type a description and it finds the nearest photos. It's the same measure-the-closeness trick from semantic search, just spanning pictures and words instead of only text.
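As a rough sketch of that matching step, the snippet below picks the closest caption for a photo and then the closest photo for a caption. The 3-D vectors, file names, and the `best_match` helper are all hypothetical stand-ins; real CLIP embeddings come from its image and text encoders and have hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_match(query_vec, candidates):
    """Return the name of the candidate whose vector is closest to query_vec."""
    return max(candidates, key=lambda name: cosine(query_vec, candidates[name]))

# Made-up embeddings standing in for encoder outputs.
caption_vecs = {
    "a sunny beach": [0.9, 0.0, 0.1],
    "a snowy mountain": [0.1, 0.9, 0.2],
    "a city at night": [0.0, 0.2, 0.9],
}
photo_vecs = {
    "beach.jpg": [0.8, 0.1, 0.2],
    "alps.jpg": [0.2, 0.8, 0.1],
}

# Image to text: which caption best describes the beach photo?
picked_caption = best_match(photo_vecs["beach.jpg"], caption_vecs)

# Text to image: which photo best fits "a snowy mountain"?
picked_photo = best_match(caption_vecs["a snowy mountain"], photo_vecs)

print(picked_caption, picked_photo)
```

The same `best_match` function serves both directions because images and captions share one space; only which side plays the query changes.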
What CLIP unlocks
- Zero-shot labelling: label an image by comparing it with text descriptions of the candidate classes, without training a special classifier for it.
- Search images by a text description (or find captions for an image).
- Steering image generators: a text prompt is encoded into this space, and a generator such as a diffusion model is guided toward an image that matches it.
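Zero-shot labelling, the first item above, can be sketched by turning similarities into label probabilities. This is an illustration under stated assumptions: the vectors are hand-made stand-ins for the embeddings of prompts like "a photo of a dog", `zero_shot_label` is a hypothetical helper, and the scale of 100 is an assumed value used to sharpen the softmax.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 (stable form)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_label(image_vec, prompt_vecs, scale=100.0):
    """Score an image against one text prompt per label; no classifier is trained."""
    labels = list(prompt_vecs)
    scores = [scale * cosine(image_vec, prompt_vecs[label]) for label in labels]
    return dict(zip(labels, softmax(scores)))

# Hypothetical embeddings of the prompts "a photo of a dog", "... cat", "... car".
prompt_vecs = {
    "dog": [0.9, 0.2, 0.1],
    "cat": [0.7, 0.6, 0.1],
    "car": [0.1, 0.1, 0.9],
}
image_vec = [0.85, 0.25, 0.1]  # made-up embedding of a dog photo

label_probs = zero_shot_label(image_vec, prompt_vecs)
top_label = max(label_probs, key=label_probs.get)
print(top_label, label_probs)
```

Adding a new label only means adding one more prompt vector to the dictionary, which is why no retraining is needed.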
Try it
Pick an image and a caption below and watch the match score — high when they mean the same thing, low when they don't. Then we'll combine CLIP with diffusion to turn words into pictures.