Question 1

What is an "embedding" in AI, in terms a web developer already understands?

Accepted Answer

An embedding is just an array of numbers that represents the meaning of some text. You send a string like "red running shoes" to an embedding model and get back something like [0.013, -0.42, 0.88, ...] — typically hundreds or thousands of floats. Think of it as a fingerprint for meaning: text that means similar things produces similar-looking arrays. You store these arrays and compare them later. That's the whole trick — turning fuzzy human meaning into a fixed-size numeric array your code can compare, like hashing, except similar inputs give similar (not wildly different) outputs.

Question 2

Why do similar meanings end up as 'nearby' arrays of numbers, and how does code measure that?

Accepted Answer

The embedding model is built so that text with similar meaning produces number-arrays that land close together — like points near each other on a map. To measure closeness, the most common tool is cosine similarity: a single helper that compares two arrays and returns roughly 1.0 for "basically the same meaning" down toward 0 for "unrelated." You don't need the math — just treat it like a built-in compare() that returns a score where higher means more alike. So "how do I reset my password" and "I forgot my login" score high together even with zero shared words. In your backend it's one function call, or it's handled for you by a vector database.
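
If you do want to see what that compare() helper looks like, here is a minimal sketch. The three-number arrays are made up for illustration (real embeddings come from a model and are much longer), and the function name is my own, not from any particular library.

```typescript
// Cosine similarity between two equal-length embeddings.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Length mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // An all-zero array has no direction, so treat it as unrelated.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy "embeddings" (hypothetical values, not real model output).
const resetPassword = [0.9, 0.1, 0.0];
const forgotLogin = [0.8, 0.2, 0.1];
const pizzaRecipe = [0.0, 0.1, 0.9];

const related = cosineSimilarity(resetPassword, forgotLogin); // roughly 0.98
const unrelated = cosineSimilarity(resetPassword, pizzaRecipe); // roughly 0.01
```

In a real app you would rarely call this yourself; it is mostly useful for debugging or for small in-memory datasets.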

Question 3

What's the difference between an embedding model and a chat model like GPT or Claude?

Accepted Answer

They're different jobs. A chat model (as of 2026, e.g. GPT-5.x or Claude) takes text and generates new text — it talks back. An embedding model (e.g. OpenAI's text-embedding-3-small) takes text and returns numbers — it never writes prose, it just measures meaning. Analogy: a chat model is an endpoint that returns a written answer; an embedding model is an endpoint that returns a lookup key you use to find things. They're often used together: embeddings find the right documents, then you hand those documents to a chat model to write the answer. Different endpoints, different pricing, different purposes.

Question 4

How is semantic search different from a SQL LIKE or keyword search?

Accepted Answer

Keyword search (WHERE title LIKE '%laptop%', or a full-text index) matches characters and words. If the user types "notebook computer" but your data says "laptop," you get zero hits, because it doesn't know they mean the same thing. Semantic search compares meaning via embeddings, so "notebook computer" finds "laptop" because their number-arrays are close. Think of LIKE as literal substring matching and semantic search as "find things that mean roughly this." The trade-off: keyword search is precise and essentially free in your existing DB; semantic search handles synonyms and paraphrases (and is often more forgiving of typos) but needs embeddings and a vector index. Many real apps combine both (called hybrid search).
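
One way to combine the two result lists is reciprocal rank fusion (RRF). This is my choice of technique for illustration, not something the answer above prescribes; the product ids are hypothetical and the constant 60 is just a commonly used default.

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per item,
// so items ranked well by either search method rise to the top.
function reciprocalRankFusion(rankedLists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, position) => {
      const rank = position + 1; // ranks start at 1
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Hypothetical results, best match first.
const keywordHits = ["sku-laptop-15", "sku-laptop-bag"];
const semanticHits = ["sku-notebook-pro", "sku-laptop-15", "sku-tablet"];

// "sku-laptop-15" appears in both lists, so it ranks first.
const fused = reciprocalRankFusion([keywordHits, semanticHits]);
```

RRF only needs rank positions, so you don't have to make keyword scores and similarity scores comparable with each other.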

Question 5

What actually IS an embedding? I keep hearing the word but have no mental model.

Accepted Answer

An embedding is a piece of text (a word, sentence, or paragraph) turned into an array of numbers, like [0.12, -0.84, 0.05, ...], usually a few hundred to a couple thousand numbers long. You send text to an embedding model (an API call, much like any REST endpoint) and get back that array. The magic part: the numbers are arranged so that text with similar MEANING produces similar arrays. Think of it as a fingerprint for meaning. "How do I reset my password?" and "I forgot my login" come back as nearby fingerprints, even though they share almost no words.

Question 6

Why do people say "similar meaning = nearby vectors"? What does "nearby" even mean for an array of numbers?

Accepted Answer

Picture each embedding as a point in space. With 2 numbers it's a dot on a graph (x, y); real embeddings just have hundreds of coordinates instead of 2, which you can't picture but the math handles fine. "Nearby" means the points sit close together, and closeness tracks meaning: "dog" and "puppy" land near each other, while "dog" and "tax software" land far apart. So semantic search becomes a geometry question ("find the points closest to my query point") rather than a keyword-matching question. That's the whole intuition; no formula needed.
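
For anyone who prefers code to pictures, here is the 2-coordinate version as a small sketch. The coordinates are invented for the example; a real model decides where each text lands.

```typescript
type Point2D = { label: string; x: number; y: number };

// Straight-line distance between two dots on the graph.
function distance(a: Point2D, b: Point2D): number {
  const dx = a.x - b.x;
  const dy = a.y - b.y;
  return Math.sqrt(dx * dx + dy * dy);
}

// Returns the candidate that sits closest to the target point.
function nearest(target: Point2D, candidates: Point2D[]): Point2D {
  if (candidates.length === 0) throw new Error("No candidates to compare");
  return candidates.reduce((best, candidate) =>
    distance(target, candidate) < distance(target, best) ? candidate : best
  );
}

// Made-up coordinates, chosen so related words sit close together.
const dog: Point2D = { label: "dog", x: 1.0, y: 1.0 };
const others: Point2D[] = [
  { label: "puppy", x: 1.1, y: 0.9 },
  { label: "tax software", x: -3.0, y: 4.0 },
];

const closestToDog = nearest(dog, others); // the "puppy" point
```

Real embeddings work the same way, just with hundreds of coordinates per point instead of two.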

Question 7

What does 'dimensions' loosely mean when people say an embedding is '1536-dimensional'?

Accepted Answer

Dimensions = how long the number array is. A 1536-dimensional embedding is literally an array of 1536 floats. Loosely, each slot captures some aspect of meaning the model learned — you can't read them individually; they aren't labeled "topic" or "tone." More dimensions can capture finer nuance but cost more to store and compare. As of 2026, common sizes are 768, 1024, 1536, and 3072. Practical takeaway: pick one model, and all your arrays must be the same length to compare them — like every row in a table needing the same columns. You can't compare a 768-long array to a 1536-long one.
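
To make the storage trade-off concrete, here is a rough sketch. It assumes each number is stored as a 4-byte float (a common choice, but check your database) and ignores index and metadata overhead; the function names are my own.

```typescript
const BYTES_PER_FLOAT32 = 4; // assumption: vectors stored as 32-bit floats

// Raw storage for the arrays alone (no index overhead, no metadata).
function vectorStorageBytes(vectorCount: number, dimensions: number): number {
  return vectorCount * dimensions * BYTES_PER_FLOAT32;
}

// Guard to run before storing or comparing an array.
function assertDimensions(vector: number[], expected: number): void {
  if (vector.length !== expected) {
    throw new Error(`Expected ${expected} dimensions, got ${vector.length}`);
  }
}

const oneMillionAt1536 = vectorStorageBytes(1_000_000, 1536); // 6,144,000,000 bytes, about 6.1 GB
const oneMillionAt768 = vectorStorageBytes(1_000_000, 768); // half of that
```

A length check like assertDimensions is a cheap way to catch a mixed-up model before it reaches your index.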

Question 8

What is a vector database, and how is it like (and unlike) a search index I already use?

Accepted Answer

A vector database stores embeddings and quickly finds the ones closest in meaning to a query — even over millions of rows. It's like Elasticsearch or a normal DB index, but instead of indexing words it indexes meaning: you ask "which stored arrays are nearly the same as this one?" Under the hood it uses an approximate-nearest-neighbor index (you'll see the name HNSW) so it doesn't compare against every row one by one — same idea as a B-tree speeding up a WHERE, just for closeness in meaning. Examples as of 2026: pgvector (a Postgres extension), Pinecone, Weaviate, Qdrant, and Milvus.
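
It can help to see the slow version that an index like HNSW is designed to avoid. The sketch below is plain brute force: it scores every stored row against the query. The type and function names are hypothetical, and it assumes no array is all zeros.

```typescript
type StoredVector = { id: string; embedding: number[] };
type Scored = { id: string; score: number };

function dotProduct(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Cosine similarity; assumes neither array is all zeros.
function cosine(a: number[], b: number[]): number {
  return dotProduct(a, b) / Math.sqrt(dotProduct(a, a) * dotProduct(b, b));
}

// Exact search: compares the query against every row, one by one.
function bruteForceTopK(queryVector: number[], rows: StoredVector[], topK: number): Scored[] {
  return rows
    .map((row) => ({ id: row.id, score: cosine(queryVector, row.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy rows with 2-number arrays (made up for illustration).
const rows: StoredVector[] = [
  { id: "a", embedding: [1, 0] },
  { id: "b", embedding: [0.9, 0.1] },
  { id: "c", embedding: [0, 1] },
];
const topTwo = bruteForceTopK([1, 0], rows, 2); // ids "a" then "b"
```

This is fine for a few thousand rows. A vector database exists for the point where scanning everything per query gets too slow, and it trades a little exactness for speed.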

Question 9

When should a fullstack dev actually reach for embeddings instead of plain SQL or full-text search?

Accepted Answer

Reach for embeddings when meaning matters more than exact words: search that understands synonyms and phrasing ("my package never arrived" finding "missing delivery" docs), "related articles / similar products" suggestions, deduping near-identical support tickets, grouping feedback by theme, or — the big one — RAG (Retrieval-Augmented Generation): finding the right docs to feed a chatbot so it answers from your data. Stick with SQL or keyword search when users search by exact IDs, codes, names, or tags, or when latency and cost must be near-zero. Rule of thumb: exact lookups go to SQL; "find me things like this" goes to embeddings.

Question 10

Roughly what does embedding cost, and how does that compare to chat models?

Accepted Answer

Embeddings are cheap. They're priced per million input tokens (a token is roughly three-quarters of a word) with no output cost, since you only get numbers back. As of 2026, OpenAI's text-embedding-3-small is about $0.02 per 1M tokens and text-embedding-3-large about $0.13 per 1M; rivals like Cohere Embed v4 and Voyage's voyage-3.5 are in a similar low range. Chat models cost dollars per 1M tokens, so embeddings are often far cheaper. The real cost surprise isn't the embedding call; it's re-embedding everything when you change models, plus the storage and RAM the arrays take in your vector DB. Open-source models (e.g. bge-m3) cost only your own compute.
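
Here is a back-of-the-envelope estimate as a sketch. The prices are the approximate per-million-token figures quoted in this answer, the corpus size is hypothetical, and the words-to-tokens ratio is only a rule of thumb, so treat the result as a ballpark.

```typescript
// Approximate USD prices per 1M input tokens, taken from the answer above.
// Check the provider's current pricing before relying on them.
const PRICE_PER_MILLION_TOKENS: Record<string, number> = {
  "text-embedding-3-small": 0.02,
  "text-embedding-3-large": 0.13,
};

// Rule of thumb: one token is roughly three-quarters of a word.
function estimateTokens(wordCount: number): number {
  return Math.ceil(wordCount / 0.75);
}

function estimateEmbeddingCostUsd(totalTokens: number, model: string): number {
  const price = PRICE_PER_MILLION_TOKENS[model];
  if (price === undefined) throw new Error(`No price listed for ${model}`);
  return (totalTokens / 1_000_000) * price;
}

// Hypothetical corpus: 10,000 documents of about 750 words each.
const tokensPerDoc = estimateTokens(750); // 1000
const corpusCostUsd = estimateEmbeddingCostUsd(10_000 * tokensPerDoc, "text-embedding-3-small"); // about 0.20
```

Remember that this is a one-time cost per model: switching models later means paying it again for the whole corpus.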

Question 11

How is semantic search related to RAG and AI chatbots-over-my-data?

Accepted Answer

RAG (Retrieval-Augmented Generation) is the most common reason web devs adopt embeddings, and semantic search is its retrieval half. The flow: embed your knowledge base once; when a user asks a question, embed the question, semantic-search for the top few relevant chunks, then paste those chunks into the chat model's prompt as context and ask it to answer using only that. Embeddings find the facts; the chat model phrases the answer. This is how you make a chatbot answer from your private docs without retraining any model — it's basically "look it up, then write it up," with semantic search doing the lookup.
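
The "paste those chunks into the prompt" step is plain string building. A minimal sketch, assuming the retrieval step has already returned the chunks; the prompt wording, type names, and file path are my own illustrative choices.

```typescript
type RetrievedChunk = { source: string; text: string };

// Builds the text sent to the chat model: retrieved chunks first, then the question.
function buildRagPrompt(question: string, chunks: RetrievedChunk[]): string {
  const context = chunks
    .map((chunk, i) => `[${i + 1}] (${chunk.source}) ${chunk.text}`)
    .join("\n");
  return [
    "Answer the question using only the context below.",
    "If the context does not contain the answer, say you don't know.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}

// Hypothetical chunk, as if returned by semantic search.
const ragPrompt = buildRagPrompt("How do I reset my password?", [
  { source: "help/account.md", text: "Go to Settings > Security and choose Reset password." },
]);
```

Numbering the chunks and including their source makes it easier to ask the chat model to cite which chunk it used.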

Question 12

People mention "cosine similarity" for comparing embeddings. Can you explain it without math?

Accepted Answer

Cosine similarity is just a score for "how alike are these two embeddings?" It looks at the DIRECTION each array points rather than how big the numbers are. Same direction means a score near 1 (very similar); unrelated means near 0; opposite means near -1. The practical upshot: you embed the user's query, then ask your vector store "which stored embeddings have the highest cosine similarity to this?" and it hands back the closest matches, ranked. You'll almost never compute it yourself; the database does it. Scores for real text rarely go far below 0, so in practice you can read it as a roughly 0-to-1 "how related" knob.
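
If you're curious what "direction, not size" means in code, this sketch scales each array to length 1 and then multiplies them together, which is one standard way to get the cosine score. The example arrays are made up so the arithmetic is easy to follow.

```typescript
function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Scales an array to length 1, keeping its direction.
function normalize(v: number[]): number[] {
  const magnitude = Math.sqrt(dot(v, v));
  if (magnitude === 0) throw new Error("Cannot normalize an all-zero array");
  return v.map((x) => x / magnitude);
}

// Once both arrays have length 1, their dot product is the cosine similarity.
function cosineViaUnitVectors(a: number[], b: number[]): number {
  return dot(normalize(a), normalize(b));
}

const base = [3, 4];
const sameDirectionBigger = [30, 40]; // ten times larger, same direction
const opposite = [-3, -4];
const perpendicular = [-4, 3];

const sameScore = cosineViaUnitVectors(base, sameDirectionBigger); // about 1
const oppositeScore = cosineViaUnitVectors(base, opposite); // about -1
const unrelatedScore = cosineViaUnitVectors(base, perpendicular); // about 0
```

The tenfold bigger array scores the same as the original, which is the whole point: size is ignored, only direction counts.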

Question 13

What's the difference between an embedding model and a chat model like Claude or GPT? Aren't they both "AI"?

Accepted Answer

They're different tools with different jobs, like two separate REST endpoints. A chat model (e.g. Claude or GPT, as of 2026) takes text and WRITES text back: answers, summaries, conversation. An embedding model (e.g. OpenAI's text-embedding-3-small, as of 2026) takes text and returns that array of numbers, with no human-readable reply at all. You use them together in RAG: the embedding model finds the relevant documents by meaning, then you paste those docs into the chat model's prompt so it can answer using them. Embedding models are also far cheaper and faster, since they do less.

Question 14

Show me the end-to-end shape of building semantic search — what calls do I actually make?

Accepted Answer

Two phases. Indexing (once, ahead of time): for each document, call the embedding API and store the returned array next to the row, e.g. const v = await embed(doc.text); await db.insert({ id: doc.id, text: doc.text, embedding: v }). Querying (per search): embed the user's query the same way, then ask the vector DB for the nearest stored arrays, e.g. const qv = await embed(userQuery); const hits = await db.query({ vector: qv, topK: 5 }). That's it — embed at write time, embed at read time, compare. The vector DB does the "find closest" part. Use the same embedding model for both sides.
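
Here is the same two-phase shape as a runnable in-memory sketch. Big caveat: fakeEmbed is a stand-in I made up that just counts a few vocabulary words, so it is NOT semantic. A real embed call would be an async request to an embedding API (you would await it), and a vector DB would replace the sorting.

```typescript
type EmbedFn = (text: string) => number[];
type IndexedDoc = { id: string; text: string; embedding: number[] };

// Stand-in for a real embedding model (hypothetical; counts known words only).
const VOCAB = ["password", "login", "reset", "shipping", "delivery", "refund"];
const fakeEmbed: EmbedFn = (text) => {
  const words = text.toLowerCase().split(/\W+/);
  return VOCAB.map((term) => words.filter((w) => w === term).length);
};

function similarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Phase 1, indexing: embed each document once and keep the array next to it.
function indexDocs(docs: { id: string; text: string }[], embed: EmbedFn): IndexedDoc[] {
  return docs.map((doc) => ({ ...doc, embedding: embed(doc.text) }));
}

// Phase 2, querying: embed the query the same way, then rank by closeness.
function searchDocs(userQuery: string, indexed: IndexedDoc[], embed: EmbedFn, topK: number): IndexedDoc[] {
  const qv = embed(userQuery);
  return [...indexed]
    .sort((x, y) => similarity(qv, y.embedding) - similarity(qv, x.embedding))
    .slice(0, topK);
}

const indexed = indexDocs(
  [
    { id: "doc-password", text: "How to reset your password" },
    { id: "doc-delivery", text: "Track your delivery and shipping status" },
  ],
  fakeEmbed
);
const hits = searchDocs("I need a password reset", indexed, fakeEmbed, 1); // doc-password
```

Passing the embed function in as a parameter keeps both phases on the same model by construction, and makes it easy to swap the fake for a real API client.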

Question 15

What does an embedding API request and response actually look like?

Accepted Answer

It's a plain REST call, like any other API. With OpenAI's SDK (as of 2026): const res = await openai.embeddings.create({ model: "text-embedding-3-small", input: "How do I cancel my subscription?" }); const vector = res.data[0].embedding — an array of 1536 floats. You can pass an array of strings as input to embed many texts in one call (cheaper and faster — batch them). The response is just numbers; there's no "answer" to parse. Store vector wherever you keep embeddings. The same text gives effectively the same array every time, so you can safely cache results (like caching any pure function) to avoid re-paying.
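
The batching and caching advice can be combined: look up what you already have, then send only the missing texts in one batched call. Below is a sketch of the bookkeeping side, with the actual API call left out. EmbeddingCache is a hypothetical helper of my own, not part of any SDK.

```typescript
// In-memory cache keyed by model id plus text, so switching models never reuses old arrays.
class EmbeddingCache {
  private readonly store = new Map<string, number[]>();
  private readonly modelId: string;

  constructor(modelId: string) {
    this.modelId = modelId;
  }

  private key(text: string): string {
    return JSON.stringify([this.modelId, text]);
  }

  get(text: string): number[] | undefined {
    return this.store.get(this.key(text));
  }

  set(text: string, embedding: number[]): void {
    this.store.set(this.key(text), embedding);
  }

  // Texts that still need an API call: de-duplicated, in first-seen order.
  missing(texts: string[]): string[] {
    const seen = new Set<string>();
    const result: string[] = [];
    for (const text of texts) {
      if (this.store.has(this.key(text)) || seen.has(text)) continue;
      seen.add(text);
      result.push(text);
    }
    return result;
  }
}

const embeddingCache = new EmbeddingCache("text-embedding-3-small");
embeddingCache.set("How do I cancel my subscription?", [0.1, 0.2, 0.3]); // shortened toy array

// Only the new text needs embedding; pass this list as `input` in one batched call.
const toEmbed = embeddingCache.missing([
  "How do I cancel my subscription?",
  "Where is my invoice?",
  "Where is my invoice?",
]);
```

In production you would likely back this with your database or a key-value store rather than process memory, so the cache survives restarts.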

Question 16

pgvector vs a dedicated vector DB like Pinecone — how do I choose?

Accepted Answer

pgvector is a Postgres extension: you run CREATE EXTENSION vector, add a vector column, and search with SQL. If you already run Postgres, it's the path of least resistance: one database, your existing backups, transactions, and JOINs with normal columns. It works well up to roughly the low millions of vectors. Pinecone (and Weaviate, Qdrant, Milvus) are purpose-built vector databases, typically available as managed services: they scale to billions of vectors and very high query rates with less hand-tuning. Choose pgvector to start and keep your stack simple; graduate to a dedicated vector DB when scale, query volume, or Postgres ops overhead start hurting. As of 2026, starting with pgvector and switching only when you must is a common path.
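
To show what "search with SQL" looks like from a Node backend, here is a sketch of the pgvector statements as strings plus a small formatting helper. The table and column names (docs, body, embedding) are hypothetical, and running the SQL through your Postgres client is left out.

```typescript
// One-time setup. vector(1536) must match your embedding model's dimensions.
const SETUP_SQL = `
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE docs (
  id bigserial PRIMARY KEY,
  body text NOT NULL,
  embedding vector(1536)
);
-- Approximate-nearest-neighbor index using cosine distance.
CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);
`;

// <=> is pgvector's cosine-distance operator: smaller means closer.
// $1 is the query embedding, passed as a parameter.
const SEARCH_SQL = `
SELECT id, body, 1 - (embedding <=> $1::vector) AS similarity
FROM docs
ORDER BY embedding <=> $1::vector
LIMIT 5;
`;

// pgvector accepts a vector as text in the form '[0.1,0.2,0.3]'.
function toPgVector(embedding: number[]): string {
  if (!embedding.every((x) => Number.isFinite(x))) {
    throw new Error("Embedding contains a non-finite number");
  }
  return `[${embedding.join(",")}]`;
}

const queryParam = toPgVector([0.1, -0.5, 2]); // "[0.1,-0.5,2]"
```

Because it is ordinary SQL, you can add a WHERE clause or JOIN against your other tables in the same query.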

Question 17

Why must I use the same embedding model for stored documents and for the query — and what happens if I don't?

Accepted Answer

An embedding only makes sense relative to the model that produced it. Model A's array for "dog" and model B's array for "dog" live in totally different number spaces — comparing them is like comparing prices in different currencies without converting. If you embed your documents with text-embedding-3-small but embed queries with a different model (or even a different version), the similarity scores come out as garbage and search silently gets worse — no error, just bad results. So: pick one model, record which one in your schema, and if you ever switch, you must re-embed your entire dataset with the new model. Treat the model id like a database migration version.
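
Since a mismatch fails silently, it's worth making it fail loudly yourself. A minimal sketch, assuming you store the model id next to each array; the field and function names are hypothetical, and "some-older-model" is a placeholder.

```typescript
type StoredEmbedding = { docId: string; modelId: string; embedding: number[] };

const CURRENT_MODEL_ID = "text-embedding-3-small";

// Rows embedded with a different model need re-embedding before they can be searched.
function findStaleRows(rows: StoredEmbedding[], currentModelId: string): StoredEmbedding[] {
  return rows.filter((row) => row.modelId !== currentModelId);
}

// Turns the silent failure into an error at comparison time.
function assertComparable(queryModelId: string, row: StoredEmbedding): void {
  if (row.modelId !== queryModelId) {
    throw new Error(
      `Doc ${row.docId} was embedded with ${row.modelId}, but the query used ${queryModelId}`
    );
  }
}

// Toy rows with shortened arrays.
const storedRows: StoredEmbedding[] = [
  { docId: "d1", modelId: "text-embedding-3-small", embedding: [0.1, 0.2] },
  { docId: "d2", modelId: "some-older-model", embedding: [0.3, 0.4] },
];
const staleRows = findStaleRows(storedRows, CURRENT_MODEL_ID); // only d2
```

Running a check like findStaleRows during a model migration tells you exactly which rows still need re-embedding.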

Question 18

My documents are long — can I just embed a whole 50-page PDF as one vector?

Accepted Answer

You can, but you shouldn't, and past the model's input limit (around 8,000 tokens for many models) you literally can't. Cramming a huge document into one array blurs its meaning into mush: a search for one specific paragraph won't surface it because the single array averages everything together. The standard fix is chunking: split the document into smaller pieces (a few hundred tokens each, often with a little overlap so you don't cut a sentence's context in half), embed each chunk, and store chunk-level arrays that point back to the source doc. Then search returns the most relevant chunk. Chunking strategy quietly makes or breaks search quality, so it's worth tuning.
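
A simple chunker can be sketched in a few lines. This one counts words as a rough stand-in for tokens (an assumption; a real tokenizer would be more accurate) and uses a sliding window with overlap. The names are hypothetical.

```typescript
type Chunk = { sourceId: string; index: number; text: string };

// Splits text into windows of `chunkSize` words, each sharing `overlap` words with the previous one.
function chunkByWords(sourceId: string, text: string, chunkSize: number, overlap: number): Chunk[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("Need chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = chunkSize - overlap;
  const chunks: Chunk[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push({
      sourceId,
      index: chunks.length,
      text: words.slice(start, start + chunkSize).join(" "),
    });
    // Stop once a window has reached the end, to avoid a tiny duplicate tail chunk.
    if (start + chunkSize >= words.length) break;
  }
  return chunks;
}

// Tiny example: windows of 4 words, overlapping by 1.
const demoChunks = chunkByWords("doc-1", "a b c d e f g h i j", 4, 1);
// "a b c d", "d e f g", "g h i j"
```

Each chunk keeps its sourceId and position, so a search hit can link back to the right place in the original document. Splitting on paragraph or heading boundaries instead of fixed word counts is a common refinement.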

Question 19

What's a metadata filter in a vector search, and why would I want one?

Accepted Answer

Pure vector search finds the most similar items globally — but you often need to scope it, e.g. "only this user's documents" or "only in-stock products." Vector DBs let you attach plain metadata (JSON-like key/values) to each stored array and filter on it during the search, like a WHERE clause running alongside the closeness ranking: db.query({ vector: qv, topK: 5, filter: { userId: "u_42", inStock: true } }). This keeps results both relevant and authorized. It matters most for security — without a userId filter, semantic search could surface another tenant's data. Treat metadata filtering as your access-control and business-rules layer on top of meaning.
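
Conceptually, the filter narrows the candidates and the similarity ranking orders what is left. Here is an in-memory sketch of that idea. It assumes the arrays are already scaled to length 1 (so a plain dot product equals cosine similarity), and the metadata fields mirror the example above; everything else is hypothetical.

```typescript
type Metadata = { userId: string; inStock: boolean };
type Item = { id: string; embedding: number[]; metadata: Metadata };

// True when every key present in the filter matches the item's metadata.
function matchesFilter(metadata: Metadata, filter: Partial<Metadata>): boolean {
  return (Object.keys(filter) as (keyof Metadata)[]).every(
    (key) => metadata[key] === filter[key]
  );
}

// Assumes unit-length arrays, so the dot product is the cosine similarity.
function dotScore(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

function filteredTopK(queryVector: number[], items: Item[], filter: Partial<Metadata>, topK: number): string[] {
  return items
    .filter((item) => matchesFilter(item.metadata, filter)) // access control and business rules
    .map((item) => ({ id: item.id, score: dotScore(queryVector, item.embedding) }))
    .sort((x, y) => y.score - x.score) // closeness ranking
    .slice(0, topK)
    .map((scored) => scored.id);
}

// Toy items: "b" is a perfect match by meaning but belongs to another tenant.
const items: Item[] = [
  { id: "a", embedding: [1, 0], metadata: { userId: "u_42", inStock: true } },
  { id: "b", embedding: [1, 0], metadata: { userId: "u_7", inStock: true } },
  { id: "c", embedding: [0.6, 0.8], metadata: { userId: "u_42", inStock: true } },
  { id: "d", embedding: [1, 0], metadata: { userId: "u_42", inStock: false } },
];
const visibleIds = filteredTopK([1, 0], items, { userId: "u_42", inStock: true }, 5); // ["a", "c"]
```

Build the filter on the server from the authenticated session, never from values the client sends, or the tenant boundary is only as strong as the request body.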

Question 20

Does an embedding 'understand' my text, or actually store it? What's the common beginner misconception?

Accepted Answer

Neither, really. The array is a lossy numeric summary of meaning — not a copy of the text and not true understanding. You can't reverse an array back into the original sentence, so always store the original text alongside it. The misconception is treating embeddings like an encrypted or zipped version of your data you can decode later; you can't. Another trap: thinking a high similarity score means "correct" — it only means "close in meaning," which can still be the wrong answer. Embeddings are a retrieval tool that surfaces likely-relevant candidates; you still validate, filter, and (in RAG) let a chat model reason over them.

Embeddings & Semantic Search