Lesson 6 · Multimodal

What is multimodal AI?

Multimodal AI understands more than one kind of input at once — most often text and images together (sometimes audio and video too). It works by turning a picture into "image tokens" that a language model can read right alongside your words.


One model, many senses

A text-only model is like a brilliant friend on the phone — great with words, but blind to what you're pointing at. A multimodal model can also see. Show it a photo and ask a question about it, or hand it a chart and ask what it means, and it answers using both the picture and your words together — the way people naturally do.

How an image gets 'read'

  1. A vision encoder looks at the image, typically by splitting it into small patches, and turns each patch into a vector of numbers. These vectors are called image tokens.
  2. Those image tokens are placed next to your text tokens in the same input.
  3. The language model reasons over both at once — so it can answer questions that mix seeing and reading.
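The first two steps above can be sketched in a few lines of plain Python. This is a toy illustration, not how any particular model is built: the patch size, the embedding width, the random projection standing in for the vision encoder, and the four-word vocabulary are all made-up assumptions. The only point is that image patches and words end up as same-sized vectors in one sequence.

```python
import random

PATCH = 2  # patch side length in pixels (hypothetical; real models use larger patches)
DIM = 4    # embedding width shared by image and text tokens (hypothetical)

rng = random.Random(0)  # fixed seed so the toy weights are reproducible


def random_matrix(rows, cols):
    """Toy weights: a rows x cols matrix of random numbers."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def split_into_patches(image, patch=PATCH):
    """Cut a 2D grayscale image (list of rows) into flattened patch vectors.

    Assumes the image height and width are multiples of `patch`.
    """
    patches = []
    for top in range(0, len(image), patch):
        for left in range(0, len(image[0]), patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def project(vector, matrix):
    """Multiply a vector by a matrix with len(vector) rows."""
    return [sum(v * matrix[i][j] for i, v in enumerate(vector))
            for j in range(len(matrix[0]))]


# Step 1: a stand-in "vision encoder" (one fixed linear projection per patch).
W_IMAGE = random_matrix(PATCH * PATCH, DIM)


def encode_image(image):
    return [project(p, W_IMAGE) for p in split_into_patches(image)]


# Text side: a toy vocabulary and embedding lookup table.
VOCAB = {"what": 0, "is": 1, "this": 2, "?": 3}
TEXT_TABLE = random_matrix(len(VOCAB), DIM)


def encode_text(words):
    return [TEXT_TABLE[VOCAB[w]] for w in words]


# Step 2: image tokens and text tokens sit side by side in one input sequence.
def build_input(image, words):
    return encode_image(image) + encode_text(words)


image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [1, 1, 0, 0],
         [1, 1, 0, 0]]
sequence = build_input(image, ["what", "is", "this", "?"])
print(len(sequence), len(sequence[0]))  # prints "8 4": 4 image tokens + 4 text tokens
```

Step 3 is the language model itself, which would take `sequence` as its input. It is left out here because it is far too large to sketch in a few lines.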

Where you've already used it

Modern assistants such as GPT-4o, Claude, and Gemini (as of 2026) are multimodal: paste a screenshot and ask for the code behind it, photograph a receipt and ask for the total, or share a diagram and ask what's wrong. Under the hood it's the same trick each time: the image becomes tokens the model can reason about alongside your words.

The piece that makes it click

For a model to connect "this picture" with "these words," both need to live in one shared space of meaning. The model that popularized this idea is CLIP, covered in the next lesson.
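To make "one shared space of meaning" a little more concrete, here is a minimal sketch. It assumes an image and some captions have already been turned into vectors of the same length. The three-number vectors below are hand-picked for illustration and do not come from CLIP or any real model. Closeness is measured with cosine similarity, a common choice for comparing embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-picked toy vectors (hypothetical, not produced by any real encoder).
image_embedding = [0.9, 0.1, 0.2]  # pretend this came from a photo of a dog
caption_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a car": [0.1, 0.9, 0.3],
}


def best_caption(image_vec, captions):
    """Return the caption whose vector points most nearly the same way as the image's."""
    return max(captions, key=lambda text: cosine_similarity(image_vec, captions[text]))


print(best_caption(image_embedding, caption_embeddings))  # prints "a photo of a dog"
```

In a shared space, matching pictures and words land close together, so a simple similarity score is enough to pair them up.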

A vision encoder turns the image into tokens the language model reads with your text.