RAG, explained visually — from zero
RAG (Retrieval-Augmented Generation) lets an AI look things up before it answers, so it stays accurate and current instead of guessing. This free visual course teaches it from scratch: what an LLM is, why it hallucinates, what vectors and embeddings are, and how Retrieve → Augment → Generate fits together. No math, no jargon — just eight short animated lessons.
What is an LLM?
A large language model (LLM) is a program trained on huge amounts of text to do one deceptively simple thing: predict the next word. Chatbots like ChatGPT are built on LLMs. They are powerful, but their knowledge is frozen at the moment they were trained, so they can't see anything new on their own.
It's autocomplete, scaled up enormously
You already know a tiny language model: the autocomplete on your phone. Type "see you" and it suggests "later." An LLM is that same idea trained on a very large slice of the internet, books, and code — so instead of finishing a text message, it can finish an essay, write code, or answer a question. Under the hood it is always doing the same move: given the words so far, what word most likely comes next?
The developer's version
Think of an LLM as a function: text in, text out. You don't call methods on it; you describe what you want in plain language (the "prompt") and it returns a best-guess continuation. There is no database query, no lookup — the answer is generated one token at a time from patterns it learned during training.
Three things to remember
- An LLM predicts the next token (a token is roughly a word or word-piece), over and over, to build a response.
- Its "knowledge" is baked in during training — like a snapshot taken on a certain date. It does not automatically know today's news or your private documents.
- It is confident by design. It will produce a fluent answer even when it is wrong — which is exactly the problem the next lesson is about.
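To make the "predict the next token, over and over" loop concrete, here is a minimal sketch of a toy autocomplete. It is not how a real LLM is built (those are neural networks trained on enormous amounts of text); it only counts which word follows which in a tiny made-up training text. The training text and the names train, predict_next, and complete are invented for this illustration.

```python
from collections import Counter, defaultdict

# Tiny made-up training text; a real LLM trains on vastly more data.
TRAINING_TEXT = (
    "see you later . see you soon . see you later . "
    "talk to you later . thank you very much ."
)

def train(text):
    """Count which word follows which. These counts are the model's entire 'knowledge'."""
    follows = defaultdict(Counter)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if the word was never seen."""
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]

def complete(model, prompt, max_words=5):
    """Extend the prompt one word at a time until a period or an unknown word."""
    words = prompt.split()
    for _ in range(max_words):
        nxt = predict_next(model, words[-1])
        if nxt is None or nxt == ".":
            break
        words.append(nxt)
    return " ".join(words)

model = train(TRAINING_TEXT)
print(complete(model, "see you"))    # see you later
print(complete(model, "thank you"))  # thank you later
```

The second line of output is the interesting one: the toy model answers "thank you later", which is fluent, confident, and wrong, because it only ever looks at the last word. Nothing in it can say "I don't know" as long as it has seen that word before.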
Why this matters for RAG
RAG exists to fix the last two points: knowledge that is frozen and blind to your private data, and confident guessing. To understand RAG, you first need to feel why an LLM alone is not enough. That's next.
Why AI hallucinates
An AI "hallucination" is when a language model states something false as if it were true. It happens because the model is always guessing the next word — and when it doesn't actually know, it guesses anyway, fluently and confidently. The cure isn't a smarter guess; it's letting the model look things up.
Confidence is not knowledge
An LLM has no built-in sense of "I don't know." Its whole job is to produce a plausible next word. Ask it about something it never saw during training — a niche topic, last week's event, your company's internal policy — and it will still generate a smooth, authoritative answer. Sometimes that answer is invented. That's a hallucination.
The exam analogy
Imagine a student taking a closed-book exam on a subject they only half-studied. They won't leave blanks — they'll write something confident to fill the space. Now hand the same student the textbook and let them look up the answer before writing. The confident-but-wrong answers mostly disappear. RAG is handing the model the textbook.
The three gaps that cause hallucination
- Frozen knowledge: the model can't know about anything that happened after its training cutoff.
- No private data: it never saw your documents, your product, or your customer records.
- No source of truth: it can't tell the difference between a fact it learned and a pattern it's improvising.
The fix, in one sentence
Instead of asking the model to remember everything, we let it retrieve the right information at the moment of the question — and that requires a way to find relevant text by meaning. To do that, we first turn words into numbers. That's the next lesson.
What are vectors and embeddings?
An embedding turns a piece of text into a list of numbers — a point in space — arranged so that things with similar meaning end up close together. That list of numbers is called a vector. Embeddings are how a computer can compare meaning instead of just matching exact words.
Turning meaning into coordinates
Computers are great with numbers and clumsy with meaning. An embedding model fixes that: you feed it a word or sentence, and it returns a list of numbers — say a few hundred of them. You can picture that list as coordinates for a point in space. The trick is that the model places points so that similar meanings land near each other: "dog" and "puppy" sit close, "dog" and "tax return" sit far apart.
Like a hash that keeps meaning
Developers know hashing: turn data into a fixed-size value. But a normal hash deliberately scatters similar inputs to totally different outputs. An embedding is the opposite kind of hash: similar inputs get similar outputs. Two sentences that mean nearly the same thing produce two nearly identical vectors. That "similar in, similar out" property is the whole magic.
The essentials
- A vector is just an ordered list of numbers. An embedding is a vector that represents meaning.
- The number of values in the list is its "dimensions" — often a few hundred to a couple thousand.
- The same embedding model must be used for everything you compare — mixing models is like measuring in inches and centimeters.
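Here is a small sketch of the "meaning as coordinates" idea. The three-number vectors below are hand-picked for illustration, with made-up axes; a real embedding model learns its own dimensions from data, and there are hundreds of them. The point is only that once text is a list of numbers, "close in meaning" becomes "close in space."

```python
import math

# Hand-picked toy "embeddings", purely illustrative.
# Made-up axes: (animal-ness, youth, paperwork)
TOY_EMBEDDINGS = {
    "dog":        [0.9, 0.3, 0.0],
    "puppy":      [0.9, 0.9, 0.0],
    "tax return": [0.0, 0.1, 0.9],
}

def distance(a, b):
    """Straight-line (Euclidean) distance between two points."""
    return math.dist(a, b)

dog = TOY_EMBEDDINGS["dog"]
puppy = TOY_EMBEDDINGS["puppy"]
tax_return = TOY_EMBEDDINGS["tax return"]

# Every vector has the same number of values: its dimensions.
print(len(dog))  # 3

# "dog" sits closer to "puppy" than to "tax return".
print(distance(dog, puppy) < distance(dog, tax_return))  # True
```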
You can't see 500 dimensions — and that's fine
We draw embeddings on a flat 2D plane so you can build intuition. Real embeddings live in hundreds of dimensions, but the idea is identical: closeness means similarity. Speaking of closeness — how exactly do we measure how close two points are? That's next.
Similarity = closeness
If similar meanings sit close together, then measuring similarity just means measuring closeness. The most common way is cosine similarity: it looks at the angle between two vectors. Point the same direction and the score is 1 (very similar); point at right angles and it's 0 (unrelated).
Angle, not just distance
There are a few ways to measure how close two points are, but the favorite for meaning is cosine similarity, which is based on the angle between the two vectors. Why angle? Because it ignores how long each arrow is and focuses on which direction it points. Two texts about the same topic point the same way, even if one arrow is longer than the other.
How to read a cosine score
- Score near 1: the vectors point almost the same direction — nearly the same meaning.
- Score near 0: the vectors are at right angles — unrelated.
- Score near -1: opposite directions — opposite meaning (rare in practice with text).
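The three cases above can be checked with a few lines of code. This is a minimal sketch of cosine similarity in plain Python: the dot product of the two vectors divided by the product of their lengths. The function name is our own, and the two-number vectors are just easy-to-picture arrows, not real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (ranges from -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [3, 0]))   # 1.0  (same direction, different length)
print(cosine_similarity([1, 0], [0, 2]))   # 0.0  (right angles)
print(cosine_similarity([1, 0], [-2, 0]))  # -1.0 (opposite directions)
```

Notice that the first pair scores 1.0 even though one arrow is three times longer: length is divided out, and only direction counts.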
Like comparing directions on a compass
Imagine two people pointing. You don't care how far their arms reach — you care whether they're pointing at the same thing. Cosine similarity is exactly that: are these two arrows aimed in the same direction? The closer the aim, the higher the score.
This is the engine of search
Once you can score how similar any two pieces of text are, you can do something powerful: take a question, and find the stored text that scores highest against it. That's semantic search — the next lesson.
Semantic search: finding by meaning
Semantic search finds text by meaning instead of exact keywords. You embed your documents once and store the vectors; then you embed the question and grab the stored vectors closest to it. That's why a search for "how do I get my money back" can find a page titled "Refund policy" — no shared words, but very close in meaning.
Keyword search vs. meaning search
Old-style search matches words: type "refund" and it finds pages containing "refund." Miss the exact word and you get nothing. Semantic search matches meaning: it turns your query into a vector and finds the stored vectors nearest to it, so "get my money back" and "refund policy" land together even though they share no words.
How semantic search works
- Ahead of time (indexing): split your documents into chunks, embed each chunk into a vector, and store them.
- At question time: embed the user's query into a vector.
- Compare the query vector to the stored vectors and return the closest few — those are your best matches.
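The three steps above can be sketched in a few dozen lines. One big assumption: the embed function here is a toy stand-in for a real embedding model. It maps words to hand-picked "concepts" using an invented lexicon, which is just enough to show a query matching a chunk it shares no words with. The chunk texts and all function names are made up for the example.

```python
import math
import re

# Toy stand-in for an embedding model: an invented word-to-concept lexicon.
# A real embedding model learns this kind of mapping from data.
CONCEPTS = ["refund", "shipping", "account"]
WORD_TO_CONCEPT = {
    "refund": "refund", "money": "refund", "back": "refund",
    "delivery": "shipping", "package": "shipping", "arrive": "shipping",
    "account": "account", "password": "account", "reset": "account",
}

def embed(text):
    """Turn text into a vector with one count per concept."""
    vec = [0.0] * len(CONCEPTS)
    for word in re.findall(r"[a-z]+", text.lower()):
        concept = WORD_TO_CONCEPT.get(word)
        if concept is not None:
            vec[CONCEPTS.index(concept)] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(chunks):
    """Ahead of time: embed every chunk once and store (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def search(index, query, k=1):
    """At question time: embed the query and return the k closest chunks."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

chunks = [
    "Refund policy",
    "Delivery times for your package",
    "Reset your account password",
]
index = build_index(chunks)
print(search(index, "how do I get my money back"))  # ['Refund policy']
```

The query and "Refund policy" share no words, yet they land on the same concept axis, so the match is found. With a real embedding model you would swap out embed and keep the rest of the flow the same.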
Where a vector database comes in
If you have thousands or millions of chunks, comparing the query against every single one is slow. A vector database is a specialized store built to find the nearest vectors fast, even among millions. Think of it as an index — but an index organized by meaning instead of by alphabetical keyword.
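To see what a vector database is speeding up, here is a sketch of the slow version: a hypothetical in-memory store (TinyVectorStore is our own name, not a real library) that checks the query against every stored vector. Production vector databases typically avoid this full scan by using approximate nearest-neighbor indexes; this sketch does not.

```python
import heapq
import math

class TinyVectorStore:
    """Hypothetical brute-force store: every search scans every vector."""

    def __init__(self):
        self._items = []  # list of (unit_vector, payload) pairs

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0:
            raise ValueError("cannot store or query a zero vector")
        return [x / norm for x in vector]

    def add(self, vector, payload):
        # Store unit-length vectors so cosine similarity is just a dot product.
        self._items.append((self._normalize(vector), payload))

    def nearest(self, query, k=3):
        """Return the k best (score, payload) pairs, highest score first."""
        q = self._normalize(query)
        scored = (
            (sum(a * b for a, b in zip(q, vec)), payload)
            for vec, payload in self._items
        )
        return heapq.nlargest(k, scored, key=lambda pair: pair[0])

store = TinyVectorStore()
store.add([1.0, 0.0], "refund policy")
store.add([0.0, 1.0], "shipping times")
store.add([0.8, 0.2], "returns and exchanges")

for score, payload in store.nearest([1.0, 0.1], k=2):
    print(round(score, 2), payload)
```

The work in nearest grows with the number of stored vectors, which is fine for three items and painful for millions. That cost is the problem a vector database exists to solve.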
We've built every piece of RAG
Retrieve-by-meaning is the "Retrieval" in Retrieval-Augmented Generation. You now have all the parts: an LLM that generates, and semantic search that retrieves. The next lesson snaps them together.
What is RAG? Retrieve, Augment, Generate
RAG (Retrieval-Augmented Generation) is a technique that lets a language model look up relevant information before it answers. It works in three steps: Retrieve the most relevant text, Augment the prompt by pasting that text in as context, and Generate an answer grounded in it. RAG gives a frozen model fresh, private, and citable knowledge — without retraining it.
Putting the pieces together
You've met the two halves. The LLM can generate fluent answers but hallucinates when it doesn't know. Semantic search can find relevant text by meaning. RAG connects them: before the model answers, we search for the most relevant text and hand it to the model as part of the prompt. Now the model isn't guessing from memory — it's answering from real, retrieved material.
The three steps of RAG
- Retrieve: take the user's question, embed it, and use semantic search to pull the most relevant chunks from your knowledge base.
- Augment: paste those chunks into the prompt as context, along with the original question and an instruction like "answer using the context above."
- Generate: the LLM writes an answer grounded in the retrieved text — and can even cite which chunk it used.
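The Augment step is the least mysterious of the three: it is string building. Here is a minimal sketch, assuming the chunks have already been retrieved. The function name build_prompt, the numbering scheme, and the sample chunk texts are all invented for illustration; real systems vary in how they word and lay out the prompt.

```python
def build_prompt(question, chunks):
    """Paste retrieved chunks into the prompt as numbered context."""
    context = "\n\n".join(
        f"[{number}] {chunk}" for number, chunk in enumerate(chunks, start=1)
    )
    return (
        "Context:\n"
        f"{context}\n\n"
        f"Question: {question}\n\n"
        "Answer using the context above and cite the chunk number you used. "
        "If the context does not contain the answer, say you don't know."
    )

# Made-up chunks standing in for whatever the Retrieve step returned.
retrieved = [
    "Refunds are issued within 14 days of a return request.",
    "Shipping takes 3 to 5 business days.",
]
print(build_prompt("How do I get my money back?", retrieved))
```

Numbering the chunks is what makes citation possible: the model can say "[1]" and you can map that back to the source. The final instruction gives the model permission to admit a gap instead of improvising.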
Two stages: prepare, then answer
There's a one-time preparation stage (called ingestion): clean your documents, split them into chunks, embed each chunk, and store the vectors. That happens before anyone asks anything. Then, at question time, the Retrieve → Augment → Generate loop runs for every query. Keeping these two stages straight is the single most useful mental model for RAG.
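The two stages can be sketched as two functions: ingest runs once, answer runs per question. Both toy_embed and toy_generate are hypothetical stand-ins so the example runs without any AI model: toy_embed just counts words (so it matches words, not meaning), and toy_generate simply repeats the top retrieved chunk. The sample documents are invented. In a real system you would pass in a real embedding model and a real LLM call instead.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in for an embedding model: plain word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def toy_generate(prompt):
    """Stand-in for an LLM call: echoes the first context chunk in the prompt."""
    lines = (line[2:] for line in prompt.splitlines() if line.startswith("- "))
    return "According to the documents: " + next(lines, "nothing relevant found")

def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a)  # Counter gives 0 for missing words
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def ingest(documents, embed):
    """Stage 1 (once): split into sentence chunks, embed each, store the pairs."""
    index = []
    for doc in documents:
        for chunk in re.split(r"(?<=[.!?])\s+", doc.strip()):
            if chunk:
                index.append((chunk, embed(chunk)))
    return index

def answer(question, index, embed, generate, k=2):
    """Stage 2 (every query): Retrieve, Augment, Generate."""
    query_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    top_chunks = [chunk for chunk, _ in ranked[:k]]              # Retrieve
    context = "\n".join(f"- {chunk}" for chunk in top_chunks)
    prompt = (                                                   # Augment
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer using the context above."
    )
    return generate(prompt)                                      # Generate

documents = [
    "Refunds are issued within 14 days. Shipping takes 3 to 5 business days.",
    "Passwords can be reset from the login page.",
]
index = ingest(documents, toy_embed)
print(answer("How long do refunds take?", index, toy_embed, toy_generate))
```

Passing embed and generate in as arguments keeps the two stages independent of any particular model. The one rule to respect is that ingest and answer receive the same embed function, since vectors from different models are not comparable.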
Why RAG is such a big deal
- Fresh: update the knowledge base and the answers update — no retraining.
- Private: it can answer from your internal documents, which the model never trained on.
- Trustworthy: because answers come from retrieved sources, you can show citations and cut hallucinations.
- Cheap and flexible: far less costly than retraining a model every time your data changes.
The origin
The term was coined in a 2020 paper by Patrick Lewis and colleagues at Facebook AI Research (now Meta AI), University College London, and New York University. The core insight: combine the model's built-in knowledge with a searchable, swappable external memory, getting the best of both.
RAG vs fine-tuning
Fine-tuning and RAG solve different problems. Fine-tuning changes how a model behaves — its style, tone, and format — by training it further on examples. RAG changes what a model knows right now by giving it information to retrieve. If you need current or private facts, reach for RAG; if you need a consistent voice or output format, reach for fine-tuning.
Knowledge vs. behavior
Here's the cleanest way to hold the difference: RAG is about knowledge, fine-tuning is about behavior. RAG hands the model facts at question time without changing the model. Fine-tuning permanently adjusts the model's weights so it responds in a certain style or format — but it does not reliably teach it new facts.
Side by side
| | RAG | Fine-tuning |
|---|---|---|
| Changes | What the model knows | How the model behaves |
| Update speed | Fast: edit the documents and re-index them | Slow: retrain on new examples |
| Best for | Current, private, citable facts | Consistent tone, style, or format |
| Hallucinations | Reduces them (grounded answers) | Doesn't fix factual gaps |
| Cost to change | Low | Higher |
You can use both
They aren't rivals. A common production setup fine-tunes a model for the right voice and output format, then uses RAG to feed it accurate, up-to-date facts. Behavior from fine-tuning, knowledge from retrieval.
Rule of thumb
"The model says the wrong facts" → RAG. "The model says it in the wrong way" → fine-tuning. When in doubt, start with RAG: it's cheaper, faster to change, and directly attacks hallucination.
Where RAG is used — and try it yourself
RAG powers most AI tools that seem to "know" specific documents: chatbots that answer from a company's help center, assistants that search internal wikis, research tools that cite sources, and customer support that pulls from product manuals. Below, a simulated RAG pipeline lets you watch Retrieve → Augment → Generate happen on a tiny example.
Common real-world uses
- Support chatbots that answer from a help center or product manual — with citations.
- "Chat with your documents" tools for contracts, research papers, or wikis.
- Internal knowledge assistants that search a company's private data.
- Research and analysis tools that must show where each claim came from.
Try it: a simulated RAG pipeline
Pick a question below. Watch the pipeline embed it, light up the matching chunks in the mini knowledge base, paste them into the prompt, and generate a grounded answer. Everything here runs in your browser on a fixed example — there's no real AI model and nothing leaves your device. It's a teaching model of the real thing.
You've finished the course
You now understand RAG end to end: an LLM predicts words and can hallucinate; embeddings turn meaning into vectors; similarity finds the closest ones; semantic search retrieves relevant text; and RAG feeds that text to the model to generate grounded answers. Ready to go deeper or build one? The links below continue the journey.
Explore the full question bank →