RAG — Give the Model Your Own Data

Retrieval-Augmented Generation: grounding LLMs in your own data with embeddings and vector search.

Basics · concept

What problem does RAG solve, in plain terms?

An LLM (the AI text model) only knows what it was trained on, up to a fixed 'cutoff' date, and it has never seen your private docs, your database, or your wiki. So if you ask it about your company's refund policy, it either says 'I don't know' or makes something up. RAG, which stands for Retrieval-Augmented Generation, fixes this by fetching the relevant pieces of YOUR data at question time and pasting them into the prompt, so the model answers from real facts. Think of it like giving the model an open-book exam instead of relying on memory.
#rag #fundamentals #knowledge

Basics · concept

What does 'the model's knowledge cutoff' mean, and why does it matter for my app?

An LLM is trained once on a giant snapshot of text, frozen at a certain date called the cutoff. After that, it knows nothing new: not yesterday's news, not a product you launched last week, not a row you just inserted in your database. It's like a search index you built months ago and never re-crawled. This matters because users expect current, app-specific answers. RAG (retrieving fresh data at query time) and re-running your data pipeline are how you keep answers current without retraining the whole model.
#cutoff #staleness #knowledge

Basics · decision

Why not just paste my entire documentation into the prompt every time?

Two reasons: limits and cost. Every model has a 'context window', the maximum amount of text (measured in tokens, roughly word-pieces) it can read at once. Even large windows can't hold a whole knowledge base, and you pay per token on every single request, so stuffing megabytes of docs into each call is slow and expensive. RAG sends only the handful of chunks actually relevant to the question, like running a WHERE query instead of SELECT *. You get cheaper, faster, more focused answers, and each request stays comfortably inside the window limit.
#context-window #cost #tokens

Basics · concept

Walk me through the full RAG flow end to end.

Two phases. Offline (indexing), done ahead of time: split your docs into chunks (small passages), turn each chunk into an embedding (a numeric vector that captures meaning), and store those vectors in a vector database. Online (at query time): take the user's question, embed it the same way, ask the vector DB for the chunks whose vectors are closest to the question's vector, paste those top chunks into the prompt, and let the LLM write an answer grounded in them. Indexing is like building a search index; querying is like hitting that index then handing results to the model.
#rag #pipeline #architecture

Basics · concept

What is an 'embedding' and why does RAG depend on it?

An embedding is a list of numbers (a vector) that represents the MEANING of a piece of text. A model converts your text into, say, 1024 numbers, positioned so that texts with similar meaning end up close together in that numeric space. 'How do I get a refund?' and 'What's your return policy?' land near each other even with no shared words. RAG uses this to find relevant chunks by meaning, not keywords. Practically, you call an embeddings API and get back a JSON array of floats you store and compare. It's the index that makes semantic search possible.
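
To make 'close together' concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The 3-number vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and the function and variable names are illustrative, not from any particular library.

```js
// Cosine similarity: close to 1 means similar direction (similar meaning), close to 0 means unrelated.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Made-up toy vectors, NOT output from a real embedding model.
const refundQuestion = [0.9, 0.1, 0.0];
const returnPolicy = [0.8, 0.2, 0.1];
const pizzaRecipe = [0.0, 0.1, 0.9];

const related = cosineSimilarity(refundQuestion, returnPolicy);
const unrelated = cosineSimilarity(refundQuestion, pizzaRecipe);
console.log(related > unrelated); // true
```

In a real app the vectors come back from your embeddings API, and the vector store usually does this comparison for you.
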
#embeddings #vectors #semantic-search

Basics · concept

What is RAG, and why can't I just paste all my company docs into the prompt every time?

RAG (Retrieval-Augmented Generation) means: before you ask the model a question, you go fetch the few most relevant snippets from your own documents and paste only those into the prompt. The model then answers using that context. Why not paste everything? Models have a context limit (a max number of tokens, like a hard request-body size cap), and stuffing 500 pages would blow past it, cost a fortune per call, and bury the answer in noise. RAG is like a SQL WHERE clause for your knowledge: send only the rows that matter, not the whole table.
#rag #context-window #retrieval #basics

Basics · concept

Walk me slowly through the whole RAG loop once — what are the steps from raw docs to a final answer?

Two phases. First, a one-time prep (the 'ingest', like seeding a database): chunk your docs into small passages, embed each chunk (turn its meaning into a list of numbers), and store those vectors. Then, per user question: embed the question the same way, retrieve the top-k most similar chunks from the store, stuff those chunks into the prompt as context, and let the model generate the answer. So: chunk → embed → store (once), then embed-query → retrieve top-k → stuff into prompt → answer (every request). Prep is your migration; retrieve-and-answer is your hot request path.
#rag #pipeline #embeddings #retrieval

Core idea · concept

What is a vector database and why can't I just use my regular SQL database?

A vector database stores embeddings and is built to answer 'which vectors are closest to this one?' fast, even across millions of rows. A normal SQL WHERE matches exact values; it has no notion of 'closest in meaning'. A vector DB uses a special similarity index (called approximate nearest neighbor, or ANN) to do that quickly. Good news: you may not need a new system. As of 2026, the pgvector extension adds vector search right inside Postgres, so you can keep one database. Dedicated options like Pinecone, Qdrant, Weaviate, or Milvus scale further when you outgrow it.
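
Conceptually, the question a vector database answers is the brute-force search sketched here; an ANN index returns approximately the same result without scanning every row. The in-memory `rows` array, the toy unit vectors, and the function names are assumptions for illustration, not any database's API.

```js
// Dot product equals cosine similarity when vectors are unit-length (assumed here).
function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Exact nearest-neighbor search: score every row, sort by score, keep the k best.
function nearest(rows, queryVector, k) {
  return rows
    .map((row) => ({ id: row.id, score: dot(row.vector, queryVector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy unit vectors standing in for stored chunk embeddings.
const rows = [
  { id: "refunds", vector: [1, 0] },
  { id: "shipping", vector: [0.6, 0.8] },
  { id: "careers", vector: [0, 1] },
];

console.log(nearest(rows, [1, 0], 2).map((r) => r.id)); // [ 'refunds', 'shipping' ]
```

Scanning every row like this is fine for a few thousand chunks; the ANN index is what keeps lookups fast at millions.
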
#vector-db #pgvector #pinecone

Core idea · concept

What is 'chunking' and why do I split documents at all?

Chunking means cutting your documents into smaller passages (a few paragraphs each) before embedding them. You do it for two reasons. First, you retrieve and inject only the relevant pieces, not whole 50-page PDFs, keeping prompts small and cheap. Second, embeddings capture meaning better for a focused passage than for an entire document, where the 'average meaning' gets muddy. So instead of one vector for a giant manual, you have many vectors, each for a tight, searchable snippet. It's similar to indexing rows instead of one giant blob.
#chunking #preprocessing #retrieval

Core idea · how-to

How big should my chunks be, and what's chunk overlap?

A common starting point is roughly 200-800 tokens per chunk (a token is about three-quarters of a word). Too small and a chunk lacks enough context to be useful; too big and it dilutes meaning and wastes prompt space. Overlap means letting consecutive chunks share some text at their edges, say the last 50-100 tokens of one chunk repeated at the start of the next, so a sentence split across a boundary isn't lost. Start around 500 tokens with 10-15% overlap, then tune by testing real questions against your data.
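
Size and overlap can be sketched as a sliding window. This version approximates tokens with whitespace-separated words, which is an assumption made to keep it dependency-free; a real pipeline would count tokens with the tokenizer that matches its embedding model. The function name and defaults are illustrative.

```js
// Sliding-window chunker: each chunk holds `size` words and shares `overlap` words with the previous one.
function splitIntoChunks(text, { size = 500, overlap = 75 } = {}) {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  const step = size - overlap; // how far the window advances each time
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break; // this window already reached the end
  }
  return chunks;
}

const demo = splitIntoChunks("a b c d e f g h i j", { size: 4, overlap: 1 });
console.log(demo); // [ 'a b c d', 'd e f g', 'g h i j' ]
```

Notice that 'd' and 'g' each appear in two chunks: that shared edge is the overlap.
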
#chunk-size #overlap #tuning

Core idea · decision

Should I chunk by fixed size or by document structure?

Prefer structure when you have it. Splitting on natural boundaries (headings, sections, paragraphs, markdown blocks) keeps each chunk about one coherent idea, which retrieves far better than blindly cutting every N characters mid-sentence. Fixed-size splitting is fine as a fallback for plain text with no structure. A good practical recipe: split on headings/paragraphs first, then if a section is still too long, fall back to fixed-size with overlap. For code or tables, respect their boundaries too; a chunk that ends halfway through a function or row is nearly useless.
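
A minimal sketch of the structure-first recipe, assuming markdown input where headings start with `#`. Words stand in for tokens, and the function names and the tiny `maxWords` in the demo are illustrative choices.

```js
// Fallback: fixed-size word windows with overlap (words approximate tokens here).
function fixedSize(words, size, overlap) {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const out = [];
  for (let start = 0; start < words.length; start += size - overlap) {
    out.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break;
  }
  return out;
}

// Split markdown at headings; only sections that are still too long get cut by size.
function chunkByStructure(markdown, { maxWords = 500, overlap = 75 } = {}) {
  const sections = markdown.split(/\n(?=#{1,6} )/); // each heading stays with its body
  return sections.flatMap((section) => {
    const words = section.split(/\s+/).filter(Boolean);
    if (words.length === 0) return [];
    return words.length <= maxWords ? [section.trim()] : fixedSize(words, maxWords, overlap);
  });
}

const doc = "# Refunds\nWithin 30 days.\n\n# Shipping\none two three four five six";
console.log(chunkByStructure(doc, { maxWords: 5, overlap: 1 }));
// [ '# Refunds\nWithin 30 days.', '# Shipping one two three', 'three four five six' ]
```

The short Refunds section survives intact as one chunk; only the oversized Shipping section falls back to fixed-size windows.
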
#chunking #structure #strategy

Core idea · code

Show me the shape of an indexing pipeline in pseudo-code.

Roughly:

```js
// splitIntoChunks, embed, and vectorDB stand in for your own chunker and API clients.
for (const doc of docs) {
  const chunks = splitIntoChunks(doc.text, { size: 500, overlap: 75 });
  for (const [i, text] of chunks.entries()) {
    const { vector } = await embed(text); // embeddings API -> float[]
    await vectorDB.upsert({
      id: `${doc.id}:${i}`, // stable ID, so re-running overwrites instead of duplicating
      vector,
      metadata: { source: doc.url, title: doc.title, chunk: text },
    });
  }
}
```

The key habit: store the original chunk text and a `source` in metadata alongside the vector. At query time you'll need that text to put in the prompt and that source to show a citation. You run this offline, then re-run it whenever docs change.
#indexing #upsert #metadata

Core idea · code

Show me the shape of the query-time RAG code.

Roughly:

```js
// embed, vectorDB, and llm stand in for your own API clients.
const { vector } = await embed(userQuestion);
const hits = await vectorDB.query({ vector, topK: 5 });
// Label each chunk with its source so the model can cite it.
const context = hits.map((h) => `[${h.metadata.source}] ${h.metadata.chunk}`).join("\n\n");
const answer = await llm.chat([
  { role: "system", content: "Answer ONLY from the context. If it's not there, say you don't know. Cite sources." },
  { role: "user", content: `Context:\n${context}\n\nQuestion: ${userQuestion}` },
]);
```

`topK` is how many chunks you retrieve. You embed the question, pull the closest chunks, label each with its source, and instruct the model to stay within that context. That instruction is what turns retrieval into a grounded answer.
#retrieval #topk #prompt

Core idea · how-to

What does 'grounding' mean and how do I add citations users can trust?

Grounding means the answer is based on the retrieved chunks rather than the model's loose memory, so it's verifiable. Because you stored each chunk's source in metadata, you can show it. Two patterns: pass numbered sources in the prompt ('[1] refund.md … [2] shipping.md …') and ask the model to cite '[1]' inline, then render those as links; or skip model-generated citations entirely and just display the source list of the chunks you retrieved. The second is safer because it can't be fabricated. Citations turn 'trust me' into 'click and verify', which users and auditors love.
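
Both citation patterns can be sketched in a few lines. The `hits` shape mirrors the metadata stored at indexing time; the sample chunks, the sample answer string, and the function names are made up for illustration.

```js
// Pattern 1: number the sources in the prompt so the model can cite [1], [2], ...
// Pattern 2: keep the source list yourself; it comes from retrieval, so it can't be fabricated.
function buildCitedContext(hits) {
  const context = hits
    .map((h, i) => `[${i + 1}] ${h.metadata.source}\n${h.metadata.chunk}`)
    .join("\n\n");
  const sources = hits.map((h, i) => ({ n: i + 1, source: h.metadata.source }));
  return { context, sources };
}

// Keep only the citation numbers in the answer that match a real retrieved source.
function validCitations(answer, sources) {
  const cited = [...answer.matchAll(/\[(\d+)\]/g)].map((m) => Number(m[1]));
  return [...new Set(cited)].filter((n) => sources.some((s) => s.n === n));
}

// Made-up retrieval results.
const hits = [
  { metadata: { source: "refund.md", chunk: "Refunds are accepted within 30 days of purchase." } },
  { metadata: { source: "shipping.md", chunk: "Example shipping text." } },
];
const { context, sources } = buildCitedContext(hits);

console.log(validCitations("You have 30 days [1]. See also [7].", sources)); // [ 1 ]
```

Dropping the invented '[7]' before rendering links is a cheap guard for the first pattern; the second pattern needs no guard at all.
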
#grounding #citations #trust

Core idea · decision

RAG vs fine-tuning: which do I use, and when?

Use RAG to give the model KNOWLEDGE: facts, docs, policies, anything that changes or is private. Use fine-tuning to teach the model a BEHAVIOR or STYLE: always answer in a strict JSON shape, adopt a brand voice, follow a niche format. A memory trick: RAG changes what the model KNOWS, fine-tuning changes how it ACTS. For most app features, especially 'answer questions about our content', RAG is the right and far cheaper first move: when facts change, you just update the data. Fine-tuning needs labeled examples and a retrain whenever facts change, which is expensive and slow for knowledge.
#fine-tuning #rag #decision

Core idea · gotcha

Why does 'bad retrieval' wreck the whole answer, even with a great model?

The LLM can only reason over the chunks you hand it. If retrieval surfaces the wrong passages, or misses the one chunk that holds the answer, the model has nothing to recover with; it will hedge, make something up, or answer from irrelevant text. Upgrading to a smarter model or polishing your prompt won't fix an upstream retrieval miss. Retrieval sets the ceiling; the model only operates beneath it. This is why most RAG quality work is actually retrieval work: better chunking, a better embedding model, hybrid search, and reranking. Garbage in, confident garbage out.
#retrieval #quality #ceiling

Core idea · concept

What is semantic search and how is it different from keyword search?

Keyword search (like SQL LIKE or a classic full-text index) matches the actual words. Semantic search matches MEANING using embeddings: it can find 'how do I cancel my plan' when the doc says 'terminate your subscription', no shared words required. The trade-off is the reverse weakness: semantic search can miss exact tokens, like a part number 'SKU-4471-X' or a rare acronym, because those carry little 'meaning' to embed. Keyword search nails those. That complementary strength is exactly why hybrid search exists.
#semantic-search #keyword-search #embeddings

Core idea · decision

What is hybrid search and why would I use it?

Hybrid search runs both keyword search and semantic (vector) search, then merges the results. You get the best of both: semantic catches paraphrases and concepts, keyword nails exact strings like IDs, error codes, product names, and rare jargon that embeddings blur. In practice you score each candidate by both methods and combine them (a common merge, Reciprocal Rank Fusion, just blends the two ranked lists, no math you need to write yourself). Many vector databases offer hybrid search as a built-in flag. If your domain has lots of exact identifiers or proper nouns, hybrid usually beats pure vector search noticeably.
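
If you're curious what the merge does under the hood, here is a minimal sketch of Reciprocal Rank Fusion. Each ranked list adds 1 / (k + rank) to a document's score; k = 60 is the constant commonly used with RRF, treated here as a tunable default. The document IDs are made up.

```js
// Reciprocal Rank Fusion: documents ranked high in either list float to the top.
function reciprocalRankFusion(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks start at 1
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}

// Made-up result lists from the two searches, best match first.
const keywordHits = ["sku-doc", "faq", "pricing"];
const vectorHits = ["faq", "returns", "sku-doc"];

console.log(reciprocalRankFusion([keywordHits, vectorHits]));
// [ 'faq', 'sku-doc', 'returns', 'pricing' ]
```

RRF only looks at positions, not raw scores, so you never have to reconcile a keyword score with a cosine similarity.
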
#hybrid-search #keyword #fusion

Core idea · concept

What is reranking, in one line, and when is it worth adding?

Reranking is a second, smarter pass: you retrieve a generous batch (say top 30) cheaply, then a reranker model re-scores each chunk against the question for true relevance and you keep the best few (say top 5) for the prompt. The first pass is fast but rough; the reranker is slower but far more accurate because it reads the query and chunk together. Add it when answers feel 'close but not quite', or relevant chunks exist but rank too low to make the cut. As of 2026, hosted rerankers (e.g. Cohere, Voyage, Pinecone) make it a quick API add.
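
The two-stage shape looks roughly like this. `scoreFn` stands in for a call to a reranker model and is hypothetical; the word-overlap scorer is only a toy so the sketch runs without any API, and the candidate texts are made up.

```js
// Second pass: re-score every first-pass candidate against the question, keep the best few.
function rerank(question, candidates, scoreFn, keep = 5) {
  return candidates
    .map((c) => ({ ...c, rerankScore: scoreFn(question, c.text) }))
    .sort((a, b) => b.rerankScore - a.rerankScore)
    .slice(0, keep);
}

// Toy scorer: counts words the chunk shares with the question.
// A real reranker model reads the query and the chunk together instead.
function wordOverlap(question, text) {
  const queryWords = new Set(question.toLowerCase().split(/\W+/).filter(Boolean));
  return text.toLowerCase().split(/\W+/).filter((w) => queryWords.has(w)).length;
}

// Made-up first-pass results, in the order the vector search returned them.
const candidates = [
  { id: "a", text: "Shipping times vary by region" },
  { id: "b", text: "Refund requests are accepted within 30 days" },
];

const best = rerank("refund within how many days", candidates, wordOverlap, 1);
console.log(best[0].id); // b
```

In production you would retrieve something like the top 30, send them to a hosted reranker in one batch, and keep the top 5.
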
#reranking #retrieval #precision

Core idea · gotcha

What are the most common ways a RAG system fails?

The usual suspects: retrieval misses the right chunk (bad chunking or a weak embedding model); chunks are too big or small so meaning is muddy or context is lost; the model ignores the context and answers from memory anyway (fight this with a firm 'answer only from context' instruction); stale data because you never re-indexed after docs changed; the answer exists but is split across two chunks, neither of which is complete; and 'silent confidence', the model invents an answer when retrieval returns nothing relevant. Notice most are retrieval and data problems, not model problems.
#failure-modes #debugging #retrieval

Core idea · how-to

How do I stop my RAG bot from confidently making things up when it has no good context?

Three layers. First, instruct firmly in the system prompt: 'Answer only using the provided context. If the answer isn't there, say you don't know.' Second, set a relevance threshold: if the top chunk's similarity score is below a cutoff, treat it as 'no good match' and short-circuit to a safe fallback before even calling the model. Third, show the citations so a wrong answer is easy to catch. The threshold check is the powerful one: it's plain code you control, not a hope that the model behaves. Fail closed, not confidently.
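
The threshold layer is small enough to sketch in full. The 0.75 cutoff, the fallback message, and the `score` field are assumptions; calibrate the cutoff against your own embedding model and data, since score ranges differ between models.

```js
const NO_MATCH_REPLY = "I couldn't find that in our docs.";

// Returns the chunks worth sending to the model, or null if nothing clears the cutoff.
function selectContext(hits, minScore = 0.75) {
  const good = hits.filter((h) => h.score >= minScore);
  return good.length > 0 ? good : null;
}

// Decide BEFORE calling the model: no good context means no model call at all.
function planAnswer(hits, minScore = 0.75) {
  const context = selectContext(hits, minScore);
  return context
    ? { callModel: true, context }
    : { callModel: false, reply: NO_MATCH_REPLY };
}

console.log(planAnswer([{ score: 0.41, chunk: "unrelated text" }]).callModel); // false
console.log(planAnswer([{ score: 0.88, chunk: "refund policy" }]).callModel); // true
```

This assumes a higher score means more similar. If your store returns a distance instead, flip the comparison.
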
#hallucination #grounding #threshold

Core idea · how-to

How do I keep my RAG data fresh as docs change?

Treat indexing as a pipeline you re-run, like re-crawling a search index. When a doc is created, edited, or deleted, re-chunk and re-embed just that doc and upsert the new vectors, using stable IDs like docId:chunkIndex so updates overwrite the old chunks. If an edit leaves a doc with fewer chunks than before, delete the leftover higher-numbered IDs too; when a doc is deleted, delete all of its chunks. Trigger it from a webhook on save, a queue job, or a scheduled batch (nightly is common). The trap: forgetting to DELETE vectors for removed content, so the bot keeps citing a policy you retired. Stale or orphaned chunks are a top cause of 'why did it say that?'.
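
One simple way to avoid orphans is delete-then-insert per doc, sketched here against an in-memory `Map` standing in for the vector store. Embedding is left out to keep the sketch runnable, the helper names are illustrative, and doc IDs are assumed not to contain a colon.

```js
// In-memory stand-in for a vector store, keyed by "docId:chunkIndex".
const store = new Map();

// Remove every chunk that belongs to one doc (assumes doc IDs contain no ":").
function deleteDoc(docId) {
  for (const id of [...store.keys()]) {
    if (id.startsWith(`${docId}:`)) store.delete(id);
  }
}

// Replace all of a doc's chunks: drop the old ones first so a shorter edit leaves no orphans.
function reindexDoc(docId, chunks) {
  deleteDoc(docId);
  chunks.forEach((text, i) => store.set(`${docId}:${i}`, { text }));
}

reindexDoc("policy", ["v1 part A", "v1 part B", "v1 part C"]);
reindexDoc("policy", ["v2, now one chunk"]); // the edit shrank the doc
console.log([...store.keys()]); // [ 'policy:0' ]
```

With plain upserts alone, `policy:1` and `policy:2` would still be in the store after the second call, and the bot could keep citing the old text.
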
#freshness #reindex #pipeline

Core idea · decision

How do I pick an embedding model, and does the choice really matter?

It matters a lot: it sets your retrieval ceiling. Pick based on: your domain (general vs code/legal/finance), languages, and whether you want a hosted API or to self-host. As of 2026, common API picks are OpenAI text-embedding-3-large, Cohere embed-v4 (strong multilingual, very long docs), and Voyage's voyage-3-large (great for code/domain text); for self-hosting, open models like BGE-M3. One hard rule: embed your stored chunks AND your queries with the SAME model. If you ever switch models, you must re-embed everything, because vectors from different models aren't comparable.
#embeddings #model-choice #2026

Core idea · decision

Do I have to build all this plumbing myself, or are there frameworks?

You can hand-roll it: the pieces are just an embeddings API, a vector store, and a chat call. That is genuinely fine for a simple feature and keeps you in control. Or use a framework: as of 2026, LangChain and LlamaIndex (both in JS and Python) give you ready-made chunkers, loaders for PDFs/HTML, vector-store connectors, and retrieval chains. Managed 'RAG-in-a-box' services also exist that ingest, chunk, embed, and serve search behind one API. Start hand-rolled to understand the flow; reach for a framework when document loading, many sources, or evaluation start eating your time.
#langchain #llamaindex #tooling

Core idea · how-to

What metadata should I store with each chunk, and why does it pay off?

Store more than the vector: the original chunk text (you need it to build the prompt), a source identifier (URL/file/title for citations), and useful filters like tenant_id, product, language, updated_at, or access level. Why: vector DBs let you filter by metadata WHILE doing the similarity search, so you can restrict retrieval to one customer's docs (multi-tenancy) or only published content, just like a WHERE tenant_id = ? alongside the semantic match. Skipping metadata is the mistake that forces a painful re-index later when you suddenly need per-user isolation or citations.
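
Here is a sketch of what filtered retrieval does, written as a brute-force search so it runs on its own. A real vector DB applies the filter inside its index; the row shape, tenant names, and `filteredSearch` are assumptions for illustration, and vectors are assumed to be unit-length.

```js
// Keep only rows whose metadata matches every filter key, then rank survivors by similarity.
function filteredSearch(rows, queryVector, filter, k) {
  const matches = (meta) =>
    Object.entries(filter).every(([key, value]) => meta[key] === value);
  return rows
    .filter((row) => matches(row.metadata))
    .map((row) => ({
      id: row.id,
      // Dot product, which equals cosine similarity for unit vectors.
      score: row.vector.reduce((sum, x, i) => sum + x * queryVector[i], 0),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Made-up rows from two tenants.
const rows = [
  { id: "a1", vector: [1, 0], metadata: { tenant_id: "acme", published: true } },
  { id: "b1", vector: [1, 0], metadata: { tenant_id: "globex", published: true } },
  { id: "a2", vector: [0, 1], metadata: { tenant_id: "acme", published: false } },
];

const results = filteredSearch(rows, [1, 0], { tenant_id: "acme", published: true }, 5);
console.log(results.map((r) => r.id)); // [ 'a1' ]
```

Row `b1` is just as similar as `a1`, but the filter keeps another tenant's content out of the prompt.
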
#metadata #filtering #multi-tenancy

Core idea · how-to

How do I know if my RAG system is actually good, before users complain?

Don't eyeball it. Build a small evaluation set: 20-50 real questions paired with the correct source/answer. Then measure two things separately. Retrieval quality: did the right chunk show up in the top-K results? (If not, no prompt tweak will help; fix chunking, embeddings, or add reranking.) Answer quality: is the final response correct, grounded, and properly cited? You can even use a strong LLM as a judge to score answers against the expected source. Re-run this set after every change, like a test suite for your AI feature, so you catch regressions instead of shipping them.
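
The retrieval half of that check fits in one function. This sketch assumes a synchronous `retrieve(question, k)` that returns source IDs, and uses a fake retriever so it runs standalone; in a real app that call would be async and hit your vector store. The eval questions are made up.

```js
// Hit rate at k: the share of eval questions whose expected source appears in the top-k results.
function hitRateAtK(evalSet, retrieve, k = 5) {
  if (evalSet.length === 0) throw new Error("evalSet is empty");
  const misses = [];
  for (const { question, expectedSource } of evalSet) {
    const sources = retrieve(question, k);
    if (!sources.includes(expectedSource)) misses.push(question);
  }
  return { hitRate: (evalSet.length - misses.length) / evalSet.length, misses };
}

// Made-up eval set and a fake retriever standing in for the real pipeline.
const evalSet = [
  { question: "How long is the refund window?", expectedSource: "policy.md" },
  { question: "Do you ship overseas?", expectedSource: "shipping.md" },
];
const fakeRetrieve = (question) =>
  question.includes("refund") ? ["policy.md", "faq.md"] : ["faq.md"];

console.log(hitRateAtK(evalSet, fakeRetrieve));
// { hitRate: 0.5, misses: [ 'Do you ship overseas?' ] }
```

The `misses` list is the useful part: it tells you exactly which questions to debug at the retrieval layer before touching the prompt.
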
#evaluation #testing #quality

Core idea · code

Can you make the RAG loop concrete with a tiny worked example — say a user asks about my refund window?

Sure. Earlier you ingested policy.md; one chunk was "Refunds are accepted within 30 days of purchase." That chunk got embedded and stored. Now a user asks "how long do I have to return something?" You embed that question, search your store, and the refund chunk comes back as the closest match (top-k=1 here). You build the prompt: `Context: "Refunds...within 30 days..."\n\nQuestion: how long do I have to return something?\nAnswer using only the context.` The model replies "You have 30 days." Notice the user never typed "refund": matching happened on meaning, not keywords.
#rag #example #embeddings #prompt

Core idea · concept

In that RAG flow, what's the difference between 'embed' and 'retrieve', and what is 'top-k'?

Embedding is the conversion step: you turn a piece of text into a vector — a fixed-length list of numbers that captures its meaning, so texts about similar ideas land near each other. You embed both your chunks (at ingest) and the question (at query time). Retrieval is the lookup: you take the question's vector and find the stored chunk vectors closest to it, the way a database index finds matching rows fast. 'Top-k' is just how many you grab — top-k=4 means the 4 nearest chunks. Bigger k = more context but more tokens and more noise; it's a knob you tune.
#rag #embeddings #top-k #vector-search