Question 1

What problem does RAG solve, in plain terms?

Accepted Answer

An LLM (the AI text model) only knows what it was trained on, up to a fixed 'cutoff' date, and it has never seen your private docs, your database, or your wiki. So if you ask it about your company's refund policy, it either says 'I don't know' or makes something up. RAG, which stands for Retrieval-Augmented Generation, fixes this by fetching the relevant pieces of YOUR data at question time and pasting them into the prompt, so the model answers from real facts. Think of it like giving the model an open-book exam instead of relying on memory.

Question 2

What does 'the model's knowledge cutoff' mean, and why does it matter for my app?

Accepted Answer

An LLM is trained once on a giant snapshot of text, frozen at a certain date called the cutoff. After that, it knows nothing new: not yesterday's news, not a product you launched last week, not a row you just inserted in your database. It's like a search index you built months ago and never re-crawled. This matters because users expect current, app-specific answers. RAG (retrieving fresh data at query time) and re-running your data pipeline are how you keep answers current without retraining the whole model.

Question 3

Why not just paste my entire documentation into the prompt every time?

Accepted Answer

Two reasons: limits and cost. Every model has a 'context window', the maximum amount of text (measured in tokens, roughly word-pieces) it can read at once. Even large windows can't hold a whole knowledge base, and you pay per token on every single request, so stuffing megabytes of docs into each call is slow and expensive. RAG sends only the handful of chunks actually relevant to the question, like running a WHERE query instead of SELECT *. You get cheaper, faster, more focused answers and you stay comfortably inside the window limit.

Question 4

Walk me through the full RAG flow end to end.

Accepted Answer

Two phases. Offline (indexing), done ahead of time: split your docs into chunks (small passages), turn each chunk into an embedding (a numeric vector that captures meaning), and store those vectors in a vector database. Online (at query time): take the user's question, embed it the same way, ask the vector DB for the chunks whose vectors are closest to the question's vector, paste those top chunks into the prompt, and let the LLM write an answer grounded in them. Indexing is like building a search index; querying is like hitting that index then handing results to the model.

Question 5

What is an 'embedding' and why does RAG depend on it?

Accepted Answer

An embedding is a list of numbers (a vector) that represents the MEANING of a piece of text. A model converts your text into, say, 1024 numbers, positioned so that texts with similar meaning end up close together in that numeric space. 'How do I get a refund?' and 'What's your return policy?' land near each other even with no shared words. RAG uses this to find relevant chunks by meaning, not keywords. Practically, you call an embeddings API and get back a JSON array of floats you store and compare. It's the index that makes semantic search possible.
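
To make 'close together' concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The three-number vectors are made-up values for illustration; real embeddings come back from an embeddings API and have hundreds or thousands of dimensions.

```javascript
// Cosine similarity: 1 means same direction, 0 means unrelated (orthogonal).
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors (hypothetical values, not real model output).
const refundQuestion = [0.9, 0.1, 0.0];
const returnPolicy = [0.8, 0.2, 0.1];
const pizzaRecipe = [0.0, 0.1, 0.9];

// The two refund-related texts score far higher than the unrelated one.
console.log(
  cosineSimilarity(refundQuestion, returnPolicy) >
  cosineSimilarity(refundQuestion, pizzaRecipe)
); // true
```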

Question 6

What is RAG, and why can't I just paste all my company docs into the prompt every time?

Accepted Answer

RAG (Retrieval-Augmented Generation) means: before you ask the model a question, you go fetch the few most relevant snippets from your own documents and paste only those into the prompt. The model then answers using that context. Why not paste everything? Models have a context limit (a max number of tokens, like a hard request-body size cap), and stuffing 500 pages would blow past it, cost a fortune per call, and bury the answer in noise. RAG is like a SQL WHERE clause for your knowledge: send only the rows that matter, not the whole table.

Question 7

Walk me slowly through the whole RAG loop once — what are the steps from raw docs to a final answer?

Accepted Answer

Two phases. First, a one-time prep (the 'ingest', like seeding a database): chunk your docs into small passages, embed each chunk (turn its meaning into a list of numbers), and store those vectors. Then, per user question: embed the question the same way, retrieve the top-k most similar chunks from the store, stuff those chunks into the prompt as context, and let the model generate the answer. So: chunk → embed → store (once), then embed-query → retrieve top-k → stuff into prompt → answer (every request). Prep is your migration; retrieve-and-answer is your hot request path.

Question 8

What is a vector database and why can't I just use my regular SQL database?

Accepted Answer

A vector database stores embeddings and is built to answer 'which vectors are closest to this one?' fast, even across millions of rows. A normal SQL WHERE matches exact values; it has no notion of 'closest in meaning'. A vector DB uses a special similarity index (called approximate nearest neighbor, or ANN) to do that quickly. Good news: you may not need a new system. As of 2026, the pgvector extension adds vector search right inside Postgres, so you can keep one database. Dedicated options like Pinecone, Qdrant, Weaviate, or Milvus scale further when you outgrow it.
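
Conceptually, the question a vector database answers looks like the sketch below. This version does it by brute force, scoring every stored vector, which is exactly the linear scan an ANN index is designed to avoid. The `store` rows, their IDs, and their vectors are hypothetical.

```javascript
// Dot product; for unit-length vectors this equals cosine similarity.
function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Exact nearest-neighbor search: score every row, sort, keep the best k.
// Cost grows linearly with the number of rows.
function topK(store, queryVector, k) {
  return store
    .map(row => ({ id: row.id, score: dot(row.vector, queryVector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Toy store with unit-length 2-d vectors.
const store = [
  { id: "refund:0", vector: [1, 0] },
  { id: "shipping:0", vector: [0, 1] },
  { id: "returns:0", vector: [0.8, 0.6] },
];

const nearest = topK(store, [1, 0], 2);
console.log(nearest.map(hit => hit.id)); // [ 'refund:0', 'returns:0' ]
```

This is fine for a few thousand chunks; the index matters once the scan itself becomes the slow part of the request.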

Question 9

What is 'chunking' and why do I split documents at all?

Accepted Answer

Chunking means cutting your documents into smaller passages (a few paragraphs each) before embedding them. You do it for two reasons. First, you retrieve and inject only the relevant pieces, not whole 50-page PDFs, keeping prompts small and cheap. Second, embeddings capture meaning better for a focused passage than for an entire document, where the 'average meaning' gets muddy. So instead of one vector for a giant manual, you have many vectors, each for a tight, searchable snippet. It's similar to indexing rows instead of one giant blob.

Question 10

How big should my chunks be, and what's chunk overlap?

Accepted Answer

A common starting point is roughly 200-800 tokens per chunk (a token is about three-quarters of a word). Too small and a chunk lacks enough context to be useful; too big and it dilutes meaning and wastes prompt space. Overlap means letting consecutive chunks share some text at their edges, say the last 50-100 tokens of one chunk repeated at the start of the next, so a sentence split across a boundary isn't lost. Start around 500 tokens with 10-15% overlap, then tune by testing real questions against your data.
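
Size and overlap can be sketched as a sliding window. This is one possible shape for a chunker, not a reference implementation, and it assumes 'tokens' can be approximated by whitespace-separated words; a real pipeline would count with the embedding model's tokenizer.

```javascript
// Sliding-window chunker. Each window holds `size` words and starts
// `size - overlap` words after the previous one.
function splitIntoChunks(text, { size, overlap }) {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const words = text.split(/\s+/).filter(Boolean);
  const step = size - overlap; // how far the window advances each time
  const chunks = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break; // this window reached the end
  }
  return chunks;
}

const chunks = splitIntoChunks("a b c d e f g h i j", { size: 4, overlap: 1 });
console.log(chunks); // [ 'a b c d', 'd e f g', 'g h i j' ]
```

Note how 'd' and 'g' appear in two chunks each: that shared edge is the overlap.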

Question 11

Should I chunk by fixed size or by document structure?

Accepted Answer

Prefer structure when you have it. Splitting on natural boundaries (headings, sections, paragraphs, markdown blocks) keeps each chunk about one coherent idea, which retrieves far better than blindly cutting every N characters mid-sentence. Fixed-size splitting is fine as a fallback for plain text with no structure. A good practical recipe: split on headings/paragraphs first, then if a section is still too long, fall back to fixed-size with overlap. For code or tables, respect their boundaries too; a chunk that ends halfway through a function or row is nearly useless.
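
The 'structure first, fixed-size as fallback' recipe might look like this for markdown. It is a rough sketch: `chunkByStructure` is a hypothetical helper, it only understands `#`-style headings, and it counts whitespace-separated words rather than real tokens.

```javascript
// Split markdown on heading lines first; only sections longer than
// maxWords are cut further into fixed-size windows with overlap.
function chunkByStructure(markdown, { maxWords, overlap }) {
  if (overlap >= maxWords) throw new Error("overlap must be smaller than maxWords");
  // The lookahead keeps each heading attached to the section it introduces.
  const sections = markdown
    .split(/\n(?=#{1,6} )/)
    .map(section => section.trim())
    .filter(Boolean);

  const chunks = [];
  for (const section of sections) {
    const words = section.split(/\s+/);
    if (words.length <= maxWords) {
      chunks.push(section); // section fits: keep it whole
      continue;
    }
    // Fallback: fixed-size windows with overlap.
    const step = maxWords - overlap;
    for (let start = 0; start < words.length; start += step) {
      chunks.push(words.slice(start, start + maxWords).join(" "));
      if (start + maxWords >= words.length) break;
    }
  }
  return chunks;
}

const doc =
  "# Refunds\nRefunds are accepted within 30 days.\n\n" +
  "# Shipping\none two three four five six seven eight";
const structured = chunkByStructure(doc, { maxWords: 8, overlap: 2 });
console.log(structured.length); // 3: Refunds kept whole, Shipping split in two
```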

Question 12

Show me the shape of an indexing pipeline in pseudo-code.

Accepted Answer

Roughly:
js
for (const doc of docs) {
  // Split each document into ~500-token chunks with 75 tokens of overlap.
  const chunks = splitIntoChunks(doc.text, { size: 500, overlap: 75 });
  for (const [i, text] of chunks.entries()) {
    const { vector } = await embed(text); // embeddings API -> float[]
    await vectorDB.upsert({
      id: `${doc.id}:${i}`, // stable ID, so re-indexing overwrites in place
      vector,
      metadata: { source: doc.url, title: doc.title, chunk: text }
    });
  }
}

The key habit: store the original chunk text and a source in metadata alongside the vector. At query time you'll need that text to put in the prompt and that source to show a citation. You run this offline, then re-run it whenever docs change.

Question 13

Show me the shape of the query-time RAG code.

Accepted Answer

Roughly:
js
// Embed the question with the same model used at indexing time.
const { vector } = await embed(userQuestion);
const hits = await vectorDB.query({ vector, topK: 5 });
// Label each chunk with its source so the model can cite it.
const context = hits
  .map(h => `[${h.metadata.source}] ${h.metadata.chunk}`)
  .join("\n\n");
const answer = await llm.chat([
  { role: "system", content:
    "Answer ONLY from the context. If it's not there, say you don't know. Cite sources." },
  { role: "user", content: `Context:\n${context}\n\nQuestion: ${userQuestion}` }
]);

topK is how many chunks you retrieve. You embed the question, pull the closest chunks, label each with its source, and instruct the model to stay within that context. That instruction is what turns retrieval into a grounded answer.

Question 14

What does 'grounding' mean and how do I add citations users can trust?

Accepted Answer

Grounding means the answer is based on the retrieved chunks rather than the model's loose memory, so it's verifiable. Because you stored each chunk's source in metadata, you can show it. Two patterns: pass numbered sources in the prompt ('[1] refund.md … [2] shipping.md …') and ask the model to cite '[1]' inline, then render those as links; or skip model-generated citations entirely and just display the source list of the chunks you retrieved. The second is safer because it can't be fabricated. Citations turn 'trust me' into 'click and verify', which users and auditors love.
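
Here is a small sketch of the first pattern, with a guard borrowed from the second: number the sources in the prompt, then only render citation numbers that map to a chunk you actually retrieved. The function names and the `hits` shape (`metadata.source`, `metadata.chunk`) are assumptions for illustration.

```javascript
// Build a numbered context block plus a lookup table for rendering links.
function buildCitedContext(hits) {
  const sources = [];
  const blocks = hits.map((hit, i) => {
    const n = i + 1;
    sources.push({ n, source: hit.metadata.source });
    return `[${n}] ${hit.metadata.source}\n${hit.metadata.chunk}`;
  });
  return { context: blocks.join("\n\n"), sources };
}

// Keep only citation numbers that point at a retrieved chunk,
// so a made-up "[7]" never becomes a link.
function extractValidCitations(answer, sources) {
  const known = new Set(sources.map(s => s.n));
  const cited = [...answer.matchAll(/\[(\d+)\]/g)].map(m => Number(m[1]));
  return [...new Set(cited)].filter(n => known.has(n));
}

const hits = [
  { metadata: { source: "refund.md", chunk: "Refunds are accepted within 30 days of purchase." } },
  { metadata: { source: "shipping.md", chunk: "Orders ship within 2 business days." } },
];
const built = buildCitedContext(hits);
console.log(extractValidCitations("You have 30 days [1]. See also [7].", built.sources)); // [ 1 ]
```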

Question 15

RAG vs fine-tuning: which do I use, and when?

Accepted Answer

Use RAG to give the model KNOWLEDGE: facts, docs, policies, anything that changes or is private. Use fine-tuning to teach the model a BEHAVIOR or STYLE: always answer in a strict JSON shape, adopt a brand voice, follow a niche format. A memory trick: RAG changes what the model KNOWS, fine-tuning changes how it ACTS. For most app features, especially 'answer questions about our content', RAG is the right and far cheaper first move: when facts change, you just update the data. Fine-tuning needs labeled examples and a retrain whenever facts change, which is expensive and slow for knowledge.

Question 16

Why does 'bad retrieval' wreck the whole answer, even with a great model?

Accepted Answer

The LLM can only reason over the chunks you hand it. If retrieval surfaces the wrong passages, or misses the one chunk that holds the answer, the model has nothing to recover with; it will hedge, make something up, or answer from irrelevant text. Upgrading to a smarter model or polishing your prompt won't fix an upstream retrieval miss. Retrieval sets the ceiling; the model only operates beneath it. This is why most RAG quality work is actually retrieval work: better chunking, a better embedding model, hybrid search, and reranking. Garbage in, confident garbage out.

Question 17

What is semantic search and how is it different from keyword search?

Accepted Answer

Keyword search (like SQL LIKE or a classic full-text index) matches the actual words. Semantic search matches MEANING using embeddings: it can find 'how do I cancel my plan' when the doc says 'terminate your subscription', no shared words required. The trade-off is the reverse weakness: semantic search can miss exact tokens, like a part number 'SKU-4471-X' or a rare acronym, because those carry little 'meaning' to embed. Keyword search nails those. That complementary strength is exactly why hybrid search exists.

Question 18

What is hybrid search and why would I use it?

Accepted Answer

Hybrid search runs both keyword search and semantic (vector) search, then merges the results. You get the best of both: semantic catches paraphrases and concepts, keyword nails exact strings like IDs, error codes, product names, and rare jargon that embeddings blur. In practice you score each candidate by both methods and combine them (a common merge, Reciprocal Rank Fusion, just blends the two ranked lists, no math you need to write yourself). Many vector databases offer hybrid search as a built-in flag. If your domain has lots of exact identifiers or proper nouns, hybrid usually beats pure vector search noticeably.
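
If your store has no built-in hybrid mode, the merge step is small enough to sketch by hand. Below is a minimal Reciprocal Rank Fusion over two ranked lists of IDs; the document IDs are made up, and k = 60 is a commonly used default rather than a tuned value.

```javascript
// Reciprocal Rank Fusion: every list adds 1 / (k + rank) to an item's score,
// with rank starting at 1. Items ranked well in both lists rise to the top.
function reciprocalRankFusion(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

const keywordHits = ["sku-doc", "refund-doc", "faq-doc"];
const vectorHits = ["refund-doc", "returns-doc", "sku-doc"];
const fused = reciprocalRankFusion([keywordHits, vectorHits]);
console.log(fused); // [ 'refund-doc', 'sku-doc', 'returns-doc', 'faq-doc' ]
```

RRF only looks at ranks, never raw scores, which is why it can blend a keyword score and a vector similarity that live on completely different scales.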

Question 19

What is reranking, in one line, and when is it worth adding?

Accepted Answer

Reranking is a second, smarter pass: you retrieve a generous batch (say top 30) cheaply, then a reranker model re-scores each chunk against the question for true relevance and you keep the best few (say top 5) for the prompt. The first pass is fast but rough; the reranker is slower but far more accurate because it reads the query and chunk together. Add it when answers feel 'close but not quite', or relevant chunks exist but rank too low to make the cut. As of 2026, hosted rerankers (e.g. Cohere, Voyage, Pinecone) make it a quick API add.
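
The two-stage shape can be sketched as follows. `scorePair` stands in for a real reranker model call; here it is a toy word-overlap score, chosen only so the example runs on its own, and it is far cruder than what a hosted reranker does.

```javascript
// Toy relevance score: fraction of the chunk's words that appear in the question.
// Assumption: in production this would be a call to a reranker model.
function scorePair(question, chunk) {
  const questionWords = new Set(question.toLowerCase().split(/\W+/).filter(Boolean));
  const chunkWords = chunk.toLowerCase().split(/\W+/).filter(Boolean);
  if (chunkWords.length === 0) return 0;
  return chunkWords.filter(w => questionWords.has(w)).length / chunkWords.length;
}

// Second pass: re-score every first-pass candidate against the question
// and keep only the best `keep` chunks for the prompt.
function rerank(question, candidates, keep) {
  return candidates
    .map(chunk => ({ chunk, score: scorePair(question, chunk) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, keep)
    .map(entry => entry.chunk);
}

const candidates = [
  "Shipping takes 5 days",
  "Refund window is 30 days",
  "Our office is in Berlin",
];
const best = rerank("refund window days", candidates, 2);
console.log(best); // [ 'Refund window is 30 days', 'Shipping takes 5 days' ]
```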

Question 20

What are the most common ways a RAG system fails?

Accepted Answer

The usual suspects: retrieval misses the right chunk (bad chunking or a weak embedding model); chunks are too big or small so meaning is muddy or context is lost; the model ignores the context and answers from memory anyway (fight this with a firm 'answer only from context' instruction); stale data because you never re-indexed after docs changed; the answer exists but is split across two chunks, neither of which is complete; and 'silent confidence', the model invents an answer when retrieval returns nothing relevant. Notice most are retrieval and data problems, not model problems.

Question 21

How do I stop my RAG bot from confidently making things up when it has no good context?

Accepted Answer

Three layers. First, instruct firmly in the system prompt: 'Answer only using the provided context. If the answer isn't there, say you don't know.' Second, set a relevance threshold: if the top chunk's similarity score is below a cutoff, treat it as 'no good match' and short-circuit to a safe fallback before even calling the model. Third, show the citations so a wrong answer is easy to catch. The threshold check is the powerful one: it's plain code you control, not a hope that the model behaves. Fail closed, not confidently.
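
The threshold layer is only a few lines. In this sketch the 0.75 cutoff and the fallback wording are placeholders, and `hits` is assumed to be sorted best-first with a `score` field where higher means more similar; check how your own store reports scores before copying the comparison.

```javascript
const FALLBACK_MESSAGE = "I couldn't find that in our docs.";

// Decide whether retrieval is good enough to call the model at all.
// Assumes hits are sorted best-first and that a higher score is better.
function decide(hits, minScore = 0.75) {
  const hasGoodMatch = hits.length > 0 && hits[0].score >= minScore;
  if (!hasGoodMatch) {
    return { action: "fallback", message: FALLBACK_MESSAGE }; // fail closed
  }
  return { action: "ask_model", hits };
}

console.log(decide([{ id: "refund:0", score: 0.91 }]).action); // ask_model
console.log(decide([{ id: "faq:3", score: 0.42 }]).action);    // fallback
console.log(decide([]).action);                                // fallback
```

The right cutoff depends on the embedding model and your data, so pick it by looking at scores for real questions you know are answerable and ones you know are not.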

Question 22

How do I keep my RAG data fresh as docs change?

Accepted Answer

Treat indexing as a pipeline you re-run, like re-crawling a search index. When a doc is created, edited, or deleted, re-chunk and re-embed just that doc and upsert the new vectors. Use stable IDs like docId:chunkIndex so updates overwrite in place, and delete any IDs that no longer exist, for example when an edited doc shrinks from ten chunks to six or a doc is removed. Trigger it from a webhook on save, a queue job, or a scheduled batch (nightly is common). The trap: forgetting to DELETE vectors for removed content, so the bot keeps citing a policy you retired. Stale or orphaned chunks are a top cause of 'why did it say that?'.
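
One way to avoid orphaned chunks is to compute, on every re-index, which IDs to write and which leftovers to delete. `planReindex` below is a hypothetical helper; it assumes you can list the chunk IDs currently stored for a doc, which not every vector store exposes directly.

```javascript
// Given the chunk IDs currently stored for a doc and the number of chunks
// the new version produces, work out what to upsert and what to delete.
function planReindex(docId, existingIds, newChunkCount) {
  const upsertIds = Array.from({ length: newChunkCount }, (_, i) => `${docId}:${i}`);
  const keep = new Set(upsertIds);
  const deleteIds = existingIds.filter(id => !keep.has(id));
  return { upsertIds, deleteIds };
}

// The doc shrank from three chunks to two: chunk 2 is now an orphan.
const plan = planReindex("policy", ["policy:0", "policy:1", "policy:2"], 2);
console.log(plan.upsertIds); // [ 'policy:0', 'policy:1' ]
console.log(plan.deleteIds); // [ 'policy:2' ]

// A deleted doc is the same case with zero new chunks.
console.log(planReindex("policy", ["policy:0", "policy:1"], 0).deleteIds);
// [ 'policy:0', 'policy:1' ]
```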

Question 23

How do I pick an embedding model, and does the choice really matter?

Accepted Answer

It matters a lot: it sets your retrieval ceiling. Pick based on: your domain (general vs code/legal/finance), languages, and whether you want a hosted API or to self-host. As of 2026, common API picks are OpenAI text-embedding-3-large, Cohere embed-v4 (strong multilingual, very long docs), and Voyage's voyage-3-large (great for code/domain text); for self-hosting, open models like BGE-M3. One hard rule: embed your stored chunks AND your queries with the SAME model. If you ever switch models, you must re-embed everything, because vectors aren't comparable across models.

Question 24

Do I have to build all this plumbing myself, or are there frameworks?

Accepted Answer

You can hand-roll it: the pieces are just an embeddings API, a vector store, and a chat call. That is genuinely fine for a simple feature and keeps you in control. Or use a framework: as of 2026, LangChain and LlamaIndex (both in JS and Python) give you ready-made chunkers, loaders for PDFs/HTML, vector-store connectors, and retrieval chains. Managed 'RAG-in-a-box' services also exist that ingest, chunk, embed, and serve search behind one API. Start hand-rolled to understand the flow; reach for a framework when document loading, many sources, or evaluation start eating your time.

Question 25

What metadata should I store with each chunk, and why does it pay off?

Accepted Answer

Store more than the vector: the original chunk text (you need it to build the prompt), a source identifier (URL/file/title for citations), and useful filters like tenant_id, product, language, updated_at, or access level. Why: vector DBs let you filter by metadata WHILE doing the similarity search, so you can restrict retrieval to one customer's docs (multi-tenancy) or only published content, just like a WHERE tenant_id = ? alongside the semantic match. Skipping metadata is the mistake that forces a painful re-index later when you suddenly need per-user isolation or citations.
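
Filtered retrieval can be pictured like this. The sketch filters in plain JavaScript before scoring, purely to show the semantics; a real vector database applies the filter inside its index, and the `store` rows, tenant names, and field names here are invented.

```javascript
// Dot product; for unit-length vectors this equals cosine similarity.
function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Only rows whose metadata matches every key in `filter` are scored,
// the equivalent of WHERE tenant_id = ? alongside the semantic match.
function filteredSearch(store, queryVector, filter, k) {
  const matchesFilter = row =>
    Object.entries(filter).every(([key, value]) => row.metadata[key] === value);
  return store
    .filter(matchesFilter)
    .map(row => ({ id: row.id, score: dot(row.vector, queryVector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

const store = [
  { id: "a:0", vector: [1, 0], metadata: { tenant_id: "acme", language: "en" } },
  { id: "b:0", vector: [1, 0], metadata: { tenant_id: "globex", language: "en" } },
  { id: "a:1", vector: [0.6, 0.8], metadata: { tenant_id: "acme", language: "en" } },
];

const tenantHits = filteredSearch(store, [1, 0], { tenant_id: "acme" }, 5);
console.log(tenantHits.map(h => h.id)); // [ 'a:0', 'a:1' ]
```

The globex row is an equally good semantic match, but it never reaches the prompt because the filter removes it first.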

Question 26

How do I know if my RAG system is actually good, before users complain?

Accepted Answer

Don't eyeball it. Build a small evaluation set of 20-50 real questions, each paired with the correct source/answer. Then measure two things separately. Retrieval quality: did the right chunk show up in the top-K results? (If not, no prompt tweak will help; fix chunking or embeddings, or add reranking.) Answer quality: is the final response correct, grounded, and properly cited? You can even use a strong LLM as a judge to score answers against the expected source. Re-run this set after every change, like a test suite for your AI feature, so you catch regressions instead of shipping them.
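
The retrieval half of that evaluation fits in a few lines. This sketch computes a hit rate at K: the fraction of questions whose expected source appears in the top K results. The eval set and the canned `retrieve` function are made up, and `retrieve` is synchronous here only to keep the example self-contained; a real one would call your embedding API and vector store.

```javascript
// evalSet: [{ question, expectedSource }]
// retrieve: question -> array of source IDs, best first
function hitRateAtK(evalSet, retrieve, k) {
  let hits = 0;
  for (const { question, expectedSource } of evalSet) {
    const topSources = retrieve(question).slice(0, k);
    if (topSources.includes(expectedSource)) hits++;
  }
  return hits / evalSet.length;
}

const evalSet = [
  { question: "how long to return an item?", expectedSource: "refund.md" },
  { question: "do you ship abroad?", expectedSource: "shipping.md" },
  { question: "how do I cancel?", expectedSource: "billing.md" },
];

// Canned results standing in for a real retriever.
const cannedResults = {
  "how long to return an item?": ["refund.md", "faq.md"],
  "do you ship abroad?": ["faq.md", "shipping.md"],
  "how do I cancel?": ["faq.md", "refund.md"],
};
const retrieve = question => cannedResults[question] ?? [];

console.log(hitRateAtK(evalSet, retrieve, 1)); // 1 of 3 questions hit
console.log(hitRateAtK(evalSet, retrieve, 2)); // 2 of 3 questions hit
```

The cancel question misses at every K, which points at a retrieval or indexing gap (billing.md never surfaces) rather than anything a prompt change could fix.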

Question 27

Can you make the RAG loop concrete with a tiny worked example — say a user asks about my refund window?

Accepted Answer

Sure. Earlier you ingested policy.md; one chunk was "Refunds are accepted within 30 days of purchase." That chunk got embedded and stored. Now a user asks "how long do I have to return something?" You embed that question, search your store, and the refund chunk comes back as the closest match (top-k=1 here). You build the prompt:

Context: "Refunds...within 30 days..."

Question: how long do I have to return something?
Answer using only the context.

The model replies "You have 30 days." Notice the user never typed "refund": matching happened on meaning, not keywords.

Question 28

In that RAG flow, what's the difference between 'embed' and 'retrieve', and what is 'top-k'?

Accepted Answer

Embedding is the conversion step: you turn a piece of text into a vector — a fixed-length list of numbers that captures its meaning, so texts about similar ideas land near each other. You embed both your chunks (at ingest) and the question (at query time). Retrieval is the lookup: you take the question's vector and find the stored chunk vectors closest to it, the way a database index finds matching rows fast. 'Top-k' is just how many you grab — top-k=4 means the 4 nearest chunks. Bigger k = more context but more tokens and more noise; it's a knob you tune.

RAG — Give the Model Your Own Data