What problem does Retrieval-Augmented Generation (RAG) solve that a plain LLM call does not?
A plain LLM only knows what was in its training data up to its cutoff, held in frozen weights — it can't access private, proprietary, or post-cutoff information and tends to confabulate when asked about it. RAG fetches relevant documents at query time and injects them into the prompt as grounding context, so answers reflect current, private, or domain-specific knowledge. It reduces hallucination, enables citations, lets you update knowledge by re-indexing (no retraining), and keeps sensitive data out of the weights. The model becomes a reasoner over supplied evidence rather than the sole knowledge store.
#rag#grounding#hallucination#motivation
Foundational · concept
Explain the role of an embedding model in a RAG pipeline and what property makes an embedding 'good' for retrieval.
An embedding model maps text (a chunk or a query) to a dense vector in $\mathbb{R}^d$ such that semantically similar texts land close together under a distance metric (usually cosine or dot product). In RAG it powers semantic retrieval: chunks are embedded offline into the index, the query is embedded online, and nearest neighbors are returned. A good retrieval embedding aligns query intent with answer-bearing passages — often trained with contrastive objectives on query-passage pairs, and frequently asymmetric (a query encoder vs a document encoder, or instruction-prefixed). Dimensionality, normalization, and domain match all affect recall.
#embeddings#semantic-search#contrastive#cosine
Intermediate · math
Compare cosine similarity, dot product, and Euclidean distance for vector search. When are they equivalent?
Cosine measures angle (magnitude-invariant); dot product measures angle and magnitude; Euclidean ($L_2$) measures straight-line distance. For unit-normalized vectors they collapse to monotone equivalents: cosine equals dot product, and $\|a-b\|^2 = 2-2\,a\cdot b$, so ranking by any of the three gives identical neighbor order. They diverge only when magnitudes vary — then dot product rewards longer vectors (useful if magnitude encodes importance/confidence), while cosine ignores it. Most embedding models are trained for one metric; mismatching it (e.g., using $L_2$ on dot-product-trained vectors) degrades recall. Normalize and use the model's intended metric.
#cosine#dot-product#euclidean#normalization
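A minimal Python sketch of the equivalence, using made-up 2-D vectors: once everything is unit-normalized, cosine, dot product, and $L_2$ return the same neighbor order, while the raw (unnormalized) dot product ranks differently because it rewards magnitude. The helper names (`normalize`, `rank`) and the vectors are illustrative assumptions, not part of any library.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rank(query, docs, score, descending=True):
    """Return document indices ordered best-first under the given score."""
    return sorted(range(len(docs)), key=lambda i: score(query, docs[i]),
                  reverse=descending)

# Toy vectors (illustrative values only).
raw_docs = [[3.0, 4.0], [1.0, 0.2], [-2.0, 5.0], [0.5, 0.5]]
raw_query = [2.0, 1.0]

docs = [normalize(d) for d in raw_docs]
query = normalize(raw_query)

by_cosine = rank(query, docs, cosine)
by_dot = rank(query, docs, dot)
by_l2 = rank(query, docs, l2_squared, descending=False)  # smaller = closer

# Without normalization, the dot product favors the longest vector.
by_raw_dot = rank(raw_query, raw_docs, dot)

print(by_cosine, by_dot, by_l2, by_raw_dot)
```

Here the three normalized rankings agree, and the raw dot product promotes document 0 simply because it is the longest vector.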
Intermediate · concept
Walk through chunking strategies for RAG and the tradeoffs of chunk size and overlap.
Options: fixed-size token windows (simple, cheap), sentence/paragraph splits, recursive structural splitting (headings → paragraphs → sentences), and semantic chunking (split where embedding similarity drops). Small chunks give precise retrieval and tight grounding but fragment context and may omit surrounding info needed to answer; large chunks preserve context but dilute the embedding (averaging many topics) and waste context-window budget on irrelevant text. Overlap (e.g., 10-20%) prevents answers from being severed at a boundary, at the cost of duplication and index bloat. Best practice: chunk on natural structure, tune size to your embedding model's effective length and the answer granularity, and store metadata for filtering.
#chunking#overlap#preprocessing#context
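As a rough illustration of the simplest option above, fixed-size windows with overlap, here is a small Python sketch. Whitespace splitting stands in for a real tokenizer, and `chunk_tokens` with its parameters is a hypothetical helper, not a library function.

```python
def chunk_tokens(tokens, size, overlap):
    """Split a token list into fixed-size windows sharing `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks

text = "a b c d e f g h i j"
tokens = text.split()  # assumption: whitespace split instead of a tokenizer
chunks = chunk_tokens(tokens, size=4, overlap=1)
for c in chunks:
    print(c)
```

Each window repeats the last token of the previous one, so a fact straddling a boundary appears whole in at least one chunk, and the duplicated tokens are the index bloat the answer mentions.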
Intermediate · concept
When should you choose RAG over fine-tuning, and when do you need both?
RAG injects knowledge — use it when facts are dynamic, large/long-tail, must be cited or access-controlled, or change faster than you can retrain; updating means re-indexing, cheaply. Fine-tuning shapes behavior — use it to teach form, style, format, domain jargon, structured-output adherence, or a task pattern the base model handles poorly; it bakes capability into weights but is expensive to update and prone to forgetting/staleness for facts. They're orthogonal: a common pattern is fine-tune for output format/tone and tool-use skill, while RAG supplies the current facts. Fine-tuning is a poor way to memorize a knowledge base (it hallucinates and goes stale); RAG is a poor way to teach a new skill.
#fine-tuning#rag#knowledge-vs-skill#tradeoffs
Intermediate · system-design
Scenario: You need vector search inside an existing transactional Postgres app with ~2M vectors, strong metadata filtering, and minimal new infra. Which option fits, and what's the main caveat?
Use pgvector. The Postgres extension adds a vector column type and ANN indexes (HNSW and IVFFlat) directly in your existing database, so you get transactional consistency, JOINs, and SQL WHERE metadata filtering with no separate service to operate. At 2M vectors this scales fine. Main caveats: build and tune the HNSW index (memory, plus the m, ef_construction, and hnsw.ef_search settings), and watch how filtering interacts with the ANN index. With a selective WHERE clause, the index scan returns a fixed number of nearest candidates first and the filter then discards most of them, so the query under-returns and recall drops. Raise the candidate count (hnsw.ef_search) or enable pgvector's iterative index scans (0.8+), which keep scanning until enough rows pass the filter. A dedicated store (Pinecone/Weaviate/Qdrant) only earns its keep at much larger scale or for managed ops you don't want to run.
#pgvector#postgres#metadata-filter#scaling
Advanced · concept
How does HNSW work, and what do its key parameters (M, efConstruction, efSearch) control?
HNSW (Hierarchical Navigable Small World) is a graph ANN index: nodes are vectors connected to nearby neighbors, organized in layers where upper layers are sparse 'express lanes'. Search starts at the top-layer entry point, greedily descends toward the query, and does a best-first beam search at the bottom layer. M is the max neighbors per node (graph degree): higher M means better recall and more memory. efConstruction is the candidate list size during build: higher gives a better-quality graph at the cost of slower builds. efSearch is the runtime beam width and the main recall/latency knob: raise it for accuracy, lower it for speed. Overall, HNSW gives roughly logarithmic search with high recall but is memory-heavy and costly to update.
#hnsw#ann#graph-index#recall
Advanced · concept
Contrast IVF (inverted file) with HNSW for ANN, and explain the IVF nprobe tradeoff.
IVF partitions the vector space into nlist Voronoi cells via k-means; at query time you compute the query's nearest centroids and search only the nprobe closest cells. Higher nprobe searches more cells → higher recall, slower. Versus HNSW: IVF is cluster/scan-based (cheaper memory, faster to build, easy to add vectors, and pairs naturally with PQ compression for billion-scale), but recall depends on cell boundaries — a true neighbor in an unprobed cell is missed. HNSW is graph-based with generally higher recall at low latency but heavier RAM and slow incremental updates. IVF+PQ wins on huge, memory-constrained corpora; HNSW wins on quality at moderate scale. FAISS supports both.
#ivf#faiss#hnsw#product-quantization
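A toy Python sketch of the nprobe tradeoff. To stay short it assumes hand-picked centroids instead of running k-means, and uses 2-D points chosen so that a true neighbor sits just across a cell boundary; `build_ivf` and `ivf_search` are hypothetical helpers, not the FAISS API.

```python
import math

def build_ivf(vectors, centroids):
    """Assign every vector ID to the cell of its nearest centroid."""
    cells = {c: [] for c in range(len(centroids))}
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: math.dist(v, centroids[c]))
        cells[nearest].append(vid)
    return cells

def ivf_search(query, vectors, centroids, cells, nprobe, k):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: math.dist(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probe for vid in cells[c]]
    return sorted(candidates, key=lambda vid: math.dist(query, vectors[vid]))[:k]

# Assumption: centroids are fixed by hand rather than learned with k-means.
centroids = [(0.0, 0.0), (10.0, 0.0)]
vectors = [(1.0, 0.0), (4.9, 0.0), (5.1, 0.0), (9.0, 0.0)]
cells = build_ivf(vectors, centroids)

query = (4.8, 0.0)
narrow = ivf_search(query, vectors, centroids, cells, nprobe=1, k=2)
wide = ivf_search(query, vectors, centroids, cells, nprobe=2, k=2)
print(narrow, wide)
```

With `nprobe=1` the search never looks in the second cell, so vector 2 (the true second-nearest neighbor) is missed. With `nprobe=2` it is found, at the price of scanning every cell.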
Advanced · concept
Why is hybrid search (BM25 + dense) often better than dense-only, and how are the scores combined?
Dense retrieval captures semantics/paraphrase but can miss exact lexical matches — rare tokens, codes, product SKUs, names, acronyms — where BM25 (sparse term-frequency) excels. They fail in complementary ways, so combining recovers both. Fusion methods: weighted score combination (requires normalization since BM25 and cosine are on different scales) or Reciprocal Rank Fusion (RRF), $\text{score}(d)=\sum_i \frac{1}{k+\text{rank}_i(d)}$, which is rank-based and scale-free, hence robust and the common default. Hybrid notably improves recall on out-of-domain queries and keyword-heavy enterprise corpora where pure embeddings underperform on exact-match needs.
#bm25#hybrid#rrf#sparse-dense
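The RRF formula above is short enough to sketch directly in Python. The document IDs and the two input rankings are invented for illustration, and `k=60` is only a commonly used constant, so treat it as an assumption to tune.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse ranked lists of doc IDs: score(d) = sum_i 1 / (k + rank_i(d))."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists from a BM25 index and a dense retriever.
bm25_ranking = ["sku-123", "doc-a", "doc-b"]
dense_ranking = ["doc-a", "doc-c", "sku-123"]

fused = rrf([bm25_ranking, dense_ranking])
print(fused)
```

Documents that both retrievers surface rise to the top, and no score normalization is needed because only ranks enter the sum.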
Advanced · concept
What is a cross-encoder reranker, why is it more accurate than the bi-encoder retriever, and why not use it for the whole corpus?
A bi-encoder embeds query and document independently, so similarity is a cheap dot product over precomputed vectors — fast and indexable but the two never interact. A cross-encoder concatenates query and document and runs them jointly through a transformer, producing a single relevance score with full token-level cross-attention — far more accurate because it models fine-grained interactions. The catch: it can't precompute document vectors, so it must run one forward pass per (query, doc) pair at query time — $O(N)$ over the corpus, infeasible at scale. So you use the cheap bi-encoder/ANN to retrieve top-k (e.g., 100), then rerank just those k with the cross-encoder to pick the final few.
#cross-encoder#reranker#bi-encoder#two-stage
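The two-stage shape can be sketched in Python with stand-in scorers. Both scoring functions below are toy word-overlap heuristics, assumed only to make the example runnable; a real system would use an ANN lookup for stage 1 and a cross-encoder model for stage 2. The point is the call count: the expensive scorer runs k times, not once per corpus document.

```python
def retrieve_then_rerank(query, corpus, cheap_score, pair_score, k=100, final_n=3):
    # Stage 1: cheap, independent scoring (stands in for bi-encoder + ANN).
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d),
                        reverse=True)[:k]
    # Stage 2: expensive joint scoring, applied to the k candidates only.
    reranked = sorted(candidates, key=lambda d: pair_score(query, d),
                      reverse=True)
    return reranked[:final_n]

def words(text):
    return set(text.lower().split())

def cheap_score(query, doc):
    """Toy stand-in for a bi-encoder similarity: shared-word count."""
    return len(words(query) & words(doc))

pair_calls = []

def pair_score(query, doc):
    """Toy stand-in for a cross-encoder: shared words per distinct doc word."""
    pair_calls.append(doc)  # record every expensive call
    return len(words(query) & words(doc)) / len(words(doc))

corpus = [
    "reset your password from the login page",
    "password policy requires at least twelve characters",
    "the login page shows a reset link for your password",
    "billing questions go to finance",
]
top = retrieve_then_rerank("how do I reset my password", corpus,
                           cheap_score, pair_score, k=3, final_n=2)
print(top, len(pair_calls))
```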
Advanced · math
Define recall@k, MRR, and nDCG for retrieval evaluation. Which best captures reranker quality?
Recall@k = fraction of relevant docs appearing in the top k (order-agnostic); it measures whether candidates surface at all. MRR (Mean Reciprocal Rank) = average over queries of $1/\text{rank}$ of the first relevant result; it rewards putting one right answer high and ignores the rest. nDCG (normalized Discounted Cumulative Gain) = sum of graded relevance discounted by $\log_2(\text{rank}+1)$, normalized by the ideal ordering; it is order-sensitive and handles multi-level relevance. For a reranker, whose entire job is to reorder a fixed candidate set, recall at the candidate-set size is invariant (same set), so nDCG (or MRR) is the right metric since it scores the ordering, not mere presence. Recall at a smaller cutoff can still move, but it only registers whether a relevant doc crossed the cutoff, not where it landed.
#recall#mrr#ndcg#metrics
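A small Python sketch of the three metrics for a single query, using the linear-gain form of DCG described above. The document IDs and relevance grades are invented. MRR is the mean of `reciprocal_rank` over a query set.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant set that appears in the top k."""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none is present."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def dcg(gains):
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(ranked, grades, k):
    """DCG of the ranking divided by DCG of the ideal ordering."""
    gains = [grades.get(d, 0) for d in ranked[:k]]
    ideal = dcg(sorted(grades.values(), reverse=True)[:k])
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Hypothetical graded judgments: "a" is highly relevant, "b" marginally.
grades = {"a": 3, "b": 1}
relevant = set(grades)

before = ["x", "b", "a"]  # retriever order
after = ["a", "b", "x"]   # same candidates after reranking

for name, ranking in (("before", before), ("after", after)):
    print(name,
          recall_at_k(ranking, relevant, 3),
          reciprocal_rank(ranking, relevant),
          round(ndcg_at_k(ranking, grades, 3), 3))
```

Recall@3 is identical for both orderings because the candidate set is the same. Only the reciprocal rank and nDCG register the improvement from reranking.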
Advanced · concept
What is query rewriting / query transformation in RAG, and name three concrete techniques with their purpose.
The raw user query is often a poor retrieval probe (too terse, conversational, multi-hop, or full of pronouns). Query transformation rewrites it before embedding. Three techniques: (1) Multi-query / query expansion — generate several paraphrases and union their retrieved sets to boost recall. (2) HyDE (Hypothetical Document Embeddings) — have an LLM draft a hypothetical answer and embed that, since an answer sits closer in vector space to real answer passages than the question does. (3) Decomposition — split a complex multi-hop question into sub-questions retrieved independently. Also: conversational rewriting that resolves coreference using chat history into a standalone query. All trade extra latency/LLM calls for higher recall.
#query-rewriting#hyde#multi-query#decomposition
Advanced · concept
You're evaluating a retriever and report recall@10 = 0.92 but downstream answer faithfulness is poor. Explain the difference between retrieval metrics (recall@k, MRR, NDCG@k) and RAG-quality metrics (context precision, context recall, faithfulness), and which gap recall@10 fails to capture.
recall@k measures whether relevant chunks appear in the top-k; MRR rewards the rank of the first relevant hit; NDCG@k adds graded relevance with logarithmic position discounting. These are set/ranking metrics over a labeled corpus. RAG-quality metrics are answer-conditioned: context precision = fraction of retrieved context that is actually relevant (signal-to-noise), context recall = fraction of the ground-truth answer's claims supported by retrieved context, and faithfulness = fraction of generated claims grounded in the context (no hallucination). High recall@10 with poor faithfulness means relevant chunks are present but buried among distractors (low context precision), so the LLM is distracted or ungrounded — recall@k is blind to precision and to whether the generator used the context.
#recall#ndcg#faithfulness#context-precision#ragas
Advanced · concept
Explain GraphRAG and Anthropic's 'contextual retrieval' (and parent-document retrieval). What specific retrieval failure does each fix that naive chunk-embedding cannot, and what's the cost?
Naive chunk embedding loses two things: cross-document/global structure and per-chunk context. GraphRAG builds an entity-relationship knowledge graph plus community summaries over the corpus, so it answers global/aggregative questions ('what are the main themes?') that no local chunk contains, at high indexing cost (LLM-driven entity extraction + summarization over the whole corpus). Parent-document retrieval embeds small chunks for precise matching but returns the larger parent passage to the LLM, fixing the precision-vs-context tradeoff at the cost of more context tokens per hit. Contextual retrieval (Anthropic) prepends an LLM-generated chunk-specific context blurb (situating the chunk in its document) before embedding/BM25-indexing, fixing the 'orphaned chunk' problem where 'the company grew 3%' loses which company/quarter. Anthropic reports roughly 35% fewer retrieval failures with contextual embeddings alone and roughly 49% with contextual embeddings plus contextual BM25, paid for with a one-time per-chunk LLM pass at index time (made cheaper by prompt caching).
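Parent-document retrieval is mostly bookkeeping, which a short Python sketch can show. It assumes the retriever has already returned small-chunk IDs in relevance order; the mapping tables and the `expand_to_parents` helper are hypothetical.

```python
def expand_to_parents(hit_chunk_ids, chunk_to_parent, parent_text):
    """Map matched small chunks to their parent passages, deduplicated,
    keeping the order in which each parent was first hit."""
    parent_ids = dict.fromkeys(chunk_to_parent[c] for c in hit_chunk_ids)
    return [parent_text[p] for p in parent_ids]

# Hypothetical index: three small chunks cut from two parent passages.
chunk_to_parent = {"c1": "p1", "c2": "p1", "c3": "p2"}
parent_text = {
    "p1": "Full section on Q3 revenue ...",
    "p2": "Full section on churn by region ...",
}

hits = ["c2", "c3", "c1"]  # small chunks returned by the retriever
context = expand_to_parents(hits, chunk_to_parent, parent_text)
print(context)
```

Two hits from the same parent collapse into one passage, so the LLM sees each parent once even when several of its chunks match.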
Your RAG answers are wrong even though the correct passage is in the index. Diagnose the failure stages.
Trace the pipeline. Retrieval miss: the chunk wasn't in top-k — check embedding/query mismatch (asymmetric model, missing instruction prefix), wrong distance metric, chunk too large diluting the signal, or low ANN recall (raise efSearch/nprobe). Ranking miss: it was retrieved but buried below junk — add a reranker. Packing miss: it was in context but the model ignored it — 'lost in the middle' positional bias, too many distractor chunks, or context overflow truncating it. Generation miss: model has it but overrides with parametric knowledge — strengthen grounding instructions, lower temperature, demand citations. Evaluate each stage separately (recall@k, then end-to-end) rather than guessing.
Explain 'lost in the middle' and give concrete context-packing strategies to mitigate it.
LLMs attend most reliably to information at the very start and end of the context window; relevant content buried in the middle of a long prompt is recalled markedly worse — a U-shaped accuracy curve. Mitigations: (1) retrieve fewer, higher-quality chunks (rerank, dedupe) rather than stuffing many; (2) reorder so the most relevant passages sit at the beginning and end (sandwich packing); (3) keep context tight — every irrelevant token both costs budget and acts as a distractor; (4) compress or summarize retrieved chunks; (5) for very long needs, use map-reduce or iterative retrieval instead of one giant prompt. More context is not strictly better — signal-to-noise dominates.
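Strategy (2), sandwich packing, can be sketched in a few lines of Python. The input is assumed to be sorted most-relevant first (for example, reranker output), and `sandwich_order` is an illustrative helper name.

```python
def sandwich_order(chunks_by_relevance):
    """Alternate chunks between the front and the back of the prompt so the
    strongest sit at the edges and the weakest end up in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ranked = ["r1", "r2", "r3", "r4", "r5"]  # r1 = most relevant
packed = sandwich_order(ranked)
print(packed)
```

The two best chunks land first and last, and the least relevant one sits in the middle, where recall is weakest.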
How do you make a RAG system produce trustworthy citations, and what failure mode must server-side code defend against?
Attach a stable ID to every chunk, carry it through retrieval, and instruct the model to emit claim-level citations referencing those IDs. The critical failure: the model may fabricate or mis-attribute a citation, or cite a real doc that doesn't actually support the claim. So don't trust the model's citation — verify it server-side: confirm each cited ID was genuinely in the retrieved set (allow-list), and ideally check the cited span's overlap/entailment with the claim (string match or an NLI/faithfulness check). Render only validated citations; flag or drop unsupported claims. This 'model proposes, code disposes' pattern turns citations from decoration into an auditable grounding contract.
#citations#grounding#faithfulness#validation
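A minimal Python sketch of the allow-list half of that check. It assumes the model was instructed to cite as `[chunk-id]`, and that bracket format is an assumption of this example. The entailment or span-overlap check is deliberately left out.

```python
import re

# Assumed citation syntax: an ID in square brackets, e.g. "[c1]".
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")

def validate_citations(answer, retrieved):
    """Split cited IDs into those present in the retrieved set and those not.

    `retrieved` maps chunk ID -> chunk text for this request only.
    """
    valid, rejected = [], []
    for cid in dict.fromkeys(CITATION.findall(answer)):  # dedupe, keep order
        (valid if cid in retrieved else rejected).append(cid)
    return valid, rejected

retrieved = {"c1": "Revenue grew 3% in Q2.", "c2": "Headcount was flat."}
answer = "Revenue grew 3% [c1]. Margins fell sharply [c9]. Growth held [c1]."

valid, rejected = validate_citations(answer, retrieved)
print(valid, rejected)
```

Anything in `rejected` was never shown to the model in this request, so the claim it decorates should be flagged or dropped before rendering.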
Expert · system-design
At the limit, why does naive RAG fail on questions requiring synthesis across many documents, and what architectures address it?
Top-k similarity retrieval is local: it surfaces the k chunks most similar to the query, but global/aggregative questions ('what are the main themes', 'how many X', multi-hop reasoning) need information spread thinly across the whole corpus, where no single chunk closely resembles the query. So the right evidence never enters context. Fixes: (1) iterative/agentic RAG, where the model retrieves, reasons, then issues follow-up queries (decomposition, self-RAG, ReAct loops); (2) hierarchical summarization (RAPTOR-style trees) so higher-level nodes hold aggregated context; (3) GraphRAG, which builds an entity/relationship knowledge graph plus community summaries, enabling traversal and global queries; (4) structured/SQL retrieval for countable facts. The unifying idea: retrieval must become multi-step and structure-aware, not a single similarity lookup.
#graphrag#multi-hop#agentic-rag#raptor
Expert · concept
Compare HyDE (Hypothetical Document Embeddings) and query2doc-style query expansion against vanilla dense retrieval. Why does generating a fake document before embedding improve recall, and when does it backfire?
HyDE asks an LLM to hallucinate a plausible answer document for the query, then embeds that document (not the query) and retrieves nearest neighbors. query2doc is a close relative: it also generates a pseudo-document but appends it to the original query instead of replacing it, so the original wording still anchors the search. Generating a fake document helps because the embedding space is trained on document-document similarity, so a synthetic document lives closer to real relevant documents than a short, lexically sparse query does; it bridges the query-document asymmetry that vanilla dense retrieval leaves to the encoder. It backfires on out-of-distribution or fact-specific queries the LLM gets wrong: a confidently hallucinated answer steers the embedding to the wrong neighborhood, hurting recall. It also adds latency and cost (an extra generation per query) and degrades when the corpus contains content the base model never saw.
#hyde#query-expansion#dense-retrieval#rag
Expert · system-design
Design a multi-hop / agentic RAG system for queries like 'Which of our customers in the region with the highest churn also bought product X?' Contrast it with single-shot RAG and name the failure modes you must guard against.
Single-shot RAG embeds the whole query once and retrieves — it fails here because no single chunk answers it; the query decomposes into dependent sub-questions. Build an agentic loop: a planner LLM decomposes into sub-queries ('highest-churn region' → then 'customers there who bought X'), each sub-query triggers a retrieval (or a SQL/tool call), and results feed the next hop; a controller decides when enough evidence is gathered, then a synthesis step answers with citations. Guard against: error cascades (a wrong hop-1 answer poisons hop-2), unbounded loops (cap hops, add a budget), retrieval drift (re-ground each hop in the original question), context-window bloat from accumulated hops (summarize/prune), and non-determinism (temperature 0 on the planner, deterministic tool gates). Add per-hop verification before committing.
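The control flow of such a loop, with a hard hop cap, might look like the Python sketch below. The planner and tool are scripted stand-ins (a real system would call an LLM and a retriever or SQL engine), and every name and value here is a made-up assumption for illustration.

```python
def agentic_answer(question, plan_next, run_query, max_hops=4):
    """Run plan -> retrieve hops until the planner stops or the budget is hit."""
    evidence = []
    for _ in range(max_hops):  # hard cap guards against unbounded loops
        # The planner always sees the original question, which limits drift.
        sub_query = plan_next(question, evidence)
        if sub_query is None:  # planner signals it has enough evidence
            break
        evidence.append((sub_query, run_query(sub_query)))
    return evidence

def toy_planner(question, evidence):
    """Scripted stand-in for a planner LLM."""
    if not evidence:
        return "region with highest churn"
    if len(evidence) == 1:
        region = evidence[0][1]  # hop 2 depends on the hop 1 result
        return f"customers in {region} who bought product X"
    return None

def toy_tool(sub_query):
    """Scripted stand-in for retrieval or a SQL call (invented data)."""
    table = {
        "region with highest churn": "EMEA",
        "customers in EMEA who bought product X": ["Acme", "Globex"],
    }
    return table[sub_query]

question = ("Which of our customers in the region with the highest churn "
            "also bought product X?")
trace = agentic_answer(question, toy_planner, toy_tool)
for hop in trace:
    print(hop)

# A planner that never stops is still cut off by the hop budget.
runaway = agentic_answer(question, lambda q, e: "again", lambda s: "?", max_hops=3)
```

Per-hop verification and pruning of accumulated evidence would slot in just before the `append`. They are omitted to keep the sketch short.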