Putting the pieces together
You've met the two halves. The LLM can generate fluent answers but hallucinates when it doesn't know. Semantic search can find relevant text by meaning. RAG connects them: before the model answers, we search for the most relevant text and hand it to the model as part of the prompt. Now the model isn't guessing from memory — it's answering from real, retrieved material.
The three steps of RAG
- Retrieve: take the user's question, embed it, and use semantic search to pull the most relevant chunks from your knowledge base.
- Augment: paste those chunks into the prompt as context, along with the original question and an instruction like "answer using the context above."
- Generate: the LLM writes an answer grounded in the retrieved text — and can even cite which chunk it used.
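The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: `embed` is a toy word-count vector standing in for a real embedding model (so it matches on shared words rather than true meaning), and `generate` is a hypothetical placeholder where a call to an LLM would go. All function names and the sample chunks are made up for this example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, chunks, k=2):
    """Retrieve: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def augment(question, context_chunks):
    """Augment: build a prompt from numbered chunks, the question, and an instruction."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(context_chunks, 1))
    return (
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer using the context above and cite the chunk number you used."
    )

def generate(prompt):
    """Generate: hypothetical placeholder; a real system would send the prompt to an LLM."""
    return f"(an LLM answer would go here; the prompt was {len(prompt)} characters long)"

# Made-up knowledge base for illustration.
chunks = [
    "You can get a refund within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes three to five business days.",
]
question = "How many days do I have to get a refund?"

top = retrieve(question, chunks, k=1)
prompt = augment(question, top)
answer = generate(prompt)
print(prompt)
```

Numbering the chunks in the prompt is what makes citation possible: the model can refer to "[1]" and you can map that back to the source text.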
Two stages: prepare, then answer
There's a one-time preparation stage (called ingestion): clean your documents, split them into chunks, embed each chunk, and store the vectors. That happens before anyone asks anything. Then, at question time, the Retrieve → Augment → Generate loop runs for every query. Keeping these two stages straight is the single most useful mental model for RAG.
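As a rough sketch of the ingestion stage, the code below cleans a document, splits it into chunks, embeds each chunk, and stores the results in a plain list. Everything here is an assumption made for illustration: the fixed-size word splitter, the toy word-count `embed` function (a stand-in for a real embedding model), the in-memory list (a stand-in for a vector database), and the sample document.

```python
import re
from collections import Counter

def clean(text):
    """Clean: collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

def split_into_chunks(text, max_words=8):
    """Split: cut the text into consecutive pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text):
    """Embed: toy bag-of-words vector standing in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def ingest(documents, max_words=8):
    """Store: return one record per chunk, keeping the text next to its vector."""
    store = []
    for doc_id, raw in documents.items():
        for n, chunk in enumerate(split_into_chunks(clean(raw), max_words)):
            store.append({"doc": doc_id, "chunk": n, "text": chunk, "vector": embed(chunk)})
    return store

# Made-up document for illustration; note the messy whitespace that clean() removes.
documents = {
    "faq.txt": "You can get a refund   within 30 days.\n\nShipping takes three to five business days.",
}
store = ingest(documents)
for record in store:
    print(record["doc"], record["chunk"], record["text"])
```

Because the vectors are computed once and kept alongside their text, question time only needs to embed the question and compare it against what is already stored. Real pipelines usually split on sentence or token boundaries, often with some overlap between chunks, rather than on a fixed word count.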
Why RAG is such a big deal
- Fresh: update the knowledge base and the answers update — no retraining.
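- Private: it can answer from your internal documents, which the model was never trained on.
- Trustworthy: because answers come from retrieved sources, you can show citations, and hallucinations become less likely (though not impossible, since the model can still misread or ignore the context).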
- Cheap and flexible: far less costly than retraining a model every time your data changes.
The origin
The term was coined in a 2020 paper by Patrick Lewis and colleagues at Facebook AI Research (now Meta AI), University College London, and New York University. The core insight: combine the model's built-in knowledge with a searchable, swappable external memory, getting the best of both.