What is CLIP? Images + Text, Explained

Teach pictures and words to agree

CLIP, introduced by Radford and colleagues at OpenAI in 2021, is trained on a huge number of image-caption pairs, such as a photo of a beach with the caption "a sunny beach." The training rule is simple: pull each image and its true caption close together in a shared space, and push mismatched pairs apart. After enough pairs, a photo and its correct description land much closer to each other than to unrelated images or captions.
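The pull-together, push-apart rule can be sketched as a symmetric contrastive loss over a small batch. This is a minimal illustration in plain Python, not CLIP's actual training code: the two-dimensional vectors are made-up stand-ins for encoder outputs, and the function names and the temperature value of 0.07 are assumptions chosen for the example.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy(logits, target):
    """Negative log-probability of the target index under a softmax over logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric loss: image k should match caption k, and caption k image k."""
    images = [normalize(v) for v in image_vecs]
    texts = [normalize(v) for v in text_vecs]
    n = len(images)
    # sims[i][j] = scaled similarity between image i and caption j
    sims = [[dot(img, txt) / temperature for txt in texts] for img in images]
    # Each image picks among all captions (rows of the similarity matrix).
    loss_i2t = sum(cross_entropy(sims[k], k) for k in range(n)) / n
    # Each caption picks among all images (columns of the similarity matrix).
    loss_t2i = sum(
        cross_entropy([sims[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (loss_i2t + loss_t2i) / 2

# Toy embeddings (hypothetical): image 0 goes with caption 0, image 1 with caption 1.
aligned_images = [[1.0, 0.1], [0.1, 1.0]]
aligned_texts = [[0.9, 0.2], [0.2, 0.9]]
swapped_texts = [aligned_texts[1], aligned_texts[0]]

loss_aligned = contrastive_loss(aligned_images, aligned_texts)
loss_swapped = contrastive_loss(aligned_images, swapped_texts)
print(loss_aligned < loss_swapped)
```

The loss is small when true pairs are already the closest ones and large when captions are swapped, so lowering it is what moves matching pairs together during training.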

Why one shared space is so powerful

Once images and text live in the same space, closeness (typically measured as cosine similarity) means "these match." Hand CLIP a photo and a few candidate captions and it picks the closest one; type a description and it finds the nearest photos. It's the same measure-the-closeness trick used in semantic search, just spanning pictures and words instead of only text.
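Both directions of matching can be sketched with cosine similarity over toy vectors. The three-dimensional embeddings, file names, and helper functions below are hypothetical; a real system would get its vectors from CLIP's image and text encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_caption(image_vec, caption_vecs):
    """Return the caption whose embedding is closest to the image embedding."""
    return max(caption_vecs, key=lambda c: cosine_similarity(image_vec, caption_vecs[c]))

def nearest_images(text_vec, image_vecs):
    """Return image names ordered from closest to farthest from the text embedding."""
    return sorted(
        image_vecs,
        key=lambda name: cosine_similarity(text_vec, image_vecs[name]),
        reverse=True,
    )

# Made-up embeddings standing in for encoder outputs.
caption_vecs = {
    "a sunny beach": [0.8, 0.2, 0.1],
    "a snowy mountain": [0.1, 0.9, 0.3],
    "a city street at night": [0.2, 0.1, 0.9],
}
image_vecs = {
    "beach.jpg": [0.9, 0.1, 0.2],
    "mountain.jpg": [0.2, 0.8, 0.2],
    "street.jpg": [0.1, 0.2, 0.9],
}

# Image to text: which caption fits the beach photo?
chosen = best_caption(image_vecs["beach.jpg"], caption_vecs)
# Text to image: which photos are nearest to the description?
ranked = nearest_images(caption_vecs["a sunny beach"], image_vecs)
print(chosen, ranked)
```

The same similarity function serves both directions, which is the practical payoff of putting images and text in one space.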

What CLIP unlocks

Zero-shot labelling: name an image without training a special classifier for it.
Search images by a text description (or find captions for an image).
Steering image generators: a text prompt is encoded into this space, and the diffusion model uses that encoding as the target to aim for.
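Zero-shot labelling, the first item above, can be sketched as follows: each candidate label is wrapped in a prompt such as "a photo of a dog," the image is compared to every prompt, and a softmax turns the similarities into label probabilities. The vectors here are invented stand-ins for encoder outputs, and the scale factor of 100 is an illustrative assumption that sharpens the softmax.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_label(image_vec, prompt_vecs, scale=100.0):
    """Score an image against one prompt embedding per label; no classifier is trained."""
    labels = list(prompt_vecs)
    logits = [scale * cosine_similarity(image_vec, prompt_vecs[label]) for label in labels]
    return dict(zip(labels, softmax(logits)))

# Hypothetical embeddings of the prompts "a photo of a dog", "... cat", "... car".
prompt_vecs = {
    "dog": [0.9, 0.1, 0.1],
    "cat": [0.6, 0.7, 0.1],
    "car": [0.1, 0.1, 0.9],
}
image_vec = [0.8, 0.2, 0.1]  # made-up embedding of the photo to label

probs = zero_shot_label(image_vec, prompt_vecs)
predicted = max(probs, key=probs.get)
print(predicted)
```

Adding a new category only means adding one more prompt to the dictionary, which is why no special classifier has to be trained.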

Try it

Pick an image and a caption below and watch the match score — high when they mean the same thing, low when they don't. Then we'll combine CLIP with diffusion to turn words into pictures.

