What is Multimodal AI? Text + Images, Explained

One model, many senses

A text-only model is like a brilliant friend on the phone — great with words, but blind to what you're pointing at. A multimodal model can also see. Show it a photo and ask a question about it, or hand it a chart and ask what it means, and it answers using both the picture and your words together — the way people naturally do.

How an image gets 'read'

A vision encoder looks at the image and turns it into a sequence of image tokens, each one a list of numbers (a vector) that typically describes one small region of the image.
Those image tokens are placed next to your text tokens in the same input.
The language model reasons over both at once — so it can answer questions that mix seeing and reading.
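The first two steps above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not how any particular assistant is built: the patch size, the embedding width, the random weights, and the names `encode_image`, `embed_text`, and `model_input` are all hypothetical stand-ins for parts that a real system learns during training. The third step, the language model itself, is left out.

```python
import random

EMBED_DIM = 4   # hypothetical width shared by image and text tokens
PATCH = 2       # hypothetical patch size: 2x2 pixels

rng = random.Random(0)  # fixed seed so the toy weights are reproducible


def random_vector(size):
    return [rng.uniform(-1.0, 1.0) for _ in range(size)]


# Stand-in for a trained vision encoder: one linear projection applied to each patch.
patch_weights = [random_vector(EMBED_DIM) for _ in range(PATCH * PATCH)]

# Stand-in for a trained text embedding table.
vocab = {word: random_vector(EMBED_DIM) for word in ["what", "is", "this"]}


def encode_image(image):
    """Step 1: cut the image into patches and map each patch to one token."""
    tokens = []
    for top in range(0, len(image), PATCH):
        for left in range(0, len(image[0]), PATCH):
            # Flatten one PATCH x PATCH tile into a list of pixel values.
            pixels = [image[top + dy][left + dx]
                      for dy in range(PATCH) for dx in range(PATCH)]
            # Project the flattened tile to an EMBED_DIM-long vector.
            token = [sum(p * row[j] for p, row in zip(pixels, patch_weights))
                     for j in range(EMBED_DIM)]
            tokens.append(token)
    return tokens


def embed_text(words):
    """Look up one vector per word (real models use subword tokens)."""
    return [vocab[w] for w in words]


# A tiny 4x4 grayscale "image" with made-up pixel values.
image = [[0.0, 0.1, 0.9, 1.0],
         [0.2, 0.3, 0.8, 0.7],
         [0.5, 0.5, 0.1, 0.0],
         [0.6, 0.4, 0.2, 0.3]]

image_tokens = encode_image(image)
text_tokens = embed_text(["what", "is", "this"])

# Step 2: image tokens and text tokens sit side by side in one input sequence.
model_input = image_tokens + text_tokens

print(len(image_tokens), len(text_tokens), len(model_input))  # prints: 4 3 7
```

The point of the sketch is the shape of the data: once every image token has the same length as every text token, the language model can treat the whole thing as one sequence.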

Where you've already used it

Modern assistants — as of 2026, e.g. GPT-4o, Claude, and Gemini — are multimodal: paste a screenshot and ask for the code, photograph a receipt and ask for the total, or share a diagram and ask what's wrong. Under the hood it's the same trick: images become tokens the model can think about alongside language.

The piece that makes it click

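For a model to connect "this picture" with "these words," both need to live in one shared space of meaning, where an image and a sentence that describe the same thing end up close together. The model that popularized this approach is CLIP, covered in the next lesson.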
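To give a rough feel for what a shared space means, here is a minimal sketch that compares one image vector with a few caption vectors using cosine similarity. Every number below is made up by hand for illustration, and the names `image_vec`, `captions`, and `cosine_similarity` are hypothetical. A real model would produce these vectors with trained encoders and use many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return a score near 1.0 when two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-written stand-in for the vector an image encoder might give a dog photo.
image_vec = [0.9, 0.1, 0.0]

# Hand-written stand-ins for the vectors a text encoder might give each caption.
captions = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a car": [0.1, 0.9, 0.2],
    "a bar chart": [0.0, 0.1, 0.9],
}

# Because both kinds of vector live in the same space, they can be compared directly.
scores = {text: cosine_similarity(image_vec, vec) for text, vec in captions.items()}
best_caption = max(scores, key=scores.get)

print(best_caption)  # prints: a photo of a dog
```

The caption whose vector points in nearly the same direction as the image vector gets the highest score, which is the sense in which the picture and the words "live" in one space.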

A vision encoder turns the image into tokens the language model reads with your text.
