Question 1

What is the difference between static and contextual word embeddings? Give one example model of each.

Accepted Answer

Static embeddings assign each word type a single fixed vector regardless of context — word2vec, GloVe, and fastText are examples, so bank has one vector whether river or money. Contextual embeddings produce a different vector per occurrence based on surrounding tokens, computed from a model's hidden states — ELMo (biLSTM) and BERT/transformer encoders are examples, so bank in 'river bank' versus 'bank loan' gets distinct vectors. Contextual embeddings resolve polysemy and capture syntax/coreference that static vectors collapse, at the cost of a full forward pass per input rather than a lookup table.

Question 2

Why is cosine similarity preferred over Euclidean distance for comparing text embeddings? When can they be equivalent?

Accepted Answer

Cosine similarity, \cos\theta=\frac{a\cdot b}{\lVert a\rVert\lVert b\rVert}, measures direction not magnitude, so it ignores vector length. That matters because embedding norm often correlates with token frequency or document length rather than semantic content, and we want 'cat' close to 'kitten' regardless of magnitude. Euclidean distance conflates direction and magnitude. They become rank-equivalent when vectors are L2-normalized to unit length, since then \lVert a-b\rVert^2 = 2-2\cos\theta, so minimizing Euclidean distance exactly maximizes cosine similarity. Hence many systems L2-normalize embeddings and can then use either metric interchangeably.
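The identity above can be checked numerically. The following is a minimal sketch using only the standard library; the two vectors are hypothetical toy values, not real embeddings.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def l2_normalize(a):
    n = norm(a)
    return [x / n for x in a]

def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical toy vectors standing in for embeddings.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

# Cosine ignores magnitude: rescaling b leaves it unchanged.
cos_ab = cosine(a, b)
cos_scaled = cosine(a, [10.0 * x for x in b])

# After L2 normalization, squared distance equals 2 - 2*cos(theta).
lhs = sq_euclidean(l2_normalize(a), l2_normalize(b))
rhs = 2.0 - 2.0 * cos_ab
print(cos_ab, cos_scaled, lhs, rhs)
```

Because the squared distance is a decreasing function of the cosine on unit vectors, sorting neighbors by either quantity gives the same ranking.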

Question 3

Scenario: a managed cloud NLP service must extract organization names, dates, and locations from contracts. Which task is this, and how does it differ from POS tagging?

Accepted Answer

This is Named Entity Recognition (NER): a sequence-labeling task that identifies and classifies spans of text into entity types (ORG, DATE, LOC/GPE, PERSON, etc.), typically encoded with a BIO/IOB scheme (B-ORG, I-ORG, O) so multi-token entities like 'Wolters Kluwer' are captured as one span. POS tagging is also sequence labeling but assigns each token a grammatical category (noun, verb, adjective, determiner); it operates per-token with no notion of multi-token spans or semantic type. NER answers 'what real-world entity is this?'; POS answers 'what syntactic role does this word play?'. Managed services exposing this include Amazon Comprehend, Azure AI Language, and Google Cloud Natural Language.
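To make the BIO scheme concrete, here is a small sketch that turns per-token tags back into entity spans. The function name bio_to_spans, the example sentence, and its tags are hypothetical; a real tagger would produce the tags.

```python
def bio_to_spans(tokens, tags):
    """Collect (entity_type, text) spans from parallel token and BIO tag lists."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span, closing any open one first.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open span.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

# Hypothetical tagged sentence.
tokens = ["Wolters", "Kluwer", "signed", "in", "Amsterdam", "on", "Monday"]
tags = ["B-ORG", "I-ORG", "O", "O", "B-LOC", "O", "B-DATE"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

A POS tagger would instead emit exactly one label per token, so no span-merging step like this is needed.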

Question 4

Explain Byte-Pair Encoding (BPE) tokenization. How is the vocabulary learned and how does it tokenize a new word?

Accepted Answer

BPE starts from a base vocabulary (characters or bytes) and greedily learns merge rules: it counts adjacent symbol-pair frequencies across the corpus and repeatedly merges the most frequent pair into a new symbol, recording each merge, until reaching a target vocab size. To tokenize new text, it splits into base units and applies the recorded merges in the order they were learned, producing subword tokens. This keeps common words as single tokens while decomposing rare words into known subwords, eliminating true OOV (anything decomposes to characters/bytes). GPT-2's byte-level BPE operates on UTF-8 bytes so any Unicode string is representable with a 256-symbol base.
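The learning and tokenization steps can be sketched as follows. This is a character-level toy, assuming a hypothetical word-frequency table and a fixed number of merges instead of a target vocabulary size; production tokenizers add pre-tokenization, end-of-word markers, and faster data structures.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_counts, num_merges):
    """Learn an ordered list of merges from a {word: count} table."""
    # Each word starts as a tuple of single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Ties go to the pair seen first (dict insertion order).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        corpus = {merge_pair(s, best): c for s, c in corpus.items()}
    return merges

def bpe_tokenize(word, merges):
    """Split into characters, then replay the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Hypothetical word frequencies.
word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(word_counts, num_merges=4)
print(merges)
print(bpe_tokenize("lowest", merges))
```

Note that "lowest" never appears in the toy corpus, yet it is still segmented into known subwords.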

Question 5

Contrast the CBOW and skip-gram training objectives in word2vec. Which is better for rare words and why?

Accepted Answer

Both learn embeddings by predicting context. CBOW predicts the center word from the average of its surrounding context vectors — one prediction per window, so it trains faster and smooths over the context. Skip-gram does the inverse: predict each context word from the center word, generating multiple (center, context) training pairs per window. Skip-gram is better for rare words and small corpora because each rare word produces several independent gradient updates as a center word rather than being averaged away inside a context bag, giving it more learning signal. CBOW is faster and tends to do slightly better on frequent words.
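The difference in training signal is easiest to see by generating the examples each objective would train on. This is a sketch with a hypothetical four-token sentence and a window of 1; it builds training examples only and does not train a model.

```python
def skipgram_pairs(tokens, window):
    """One (center, context) pair per context word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """One (context_bag, center) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, center))
    return examples

# Hypothetical sentence; "quokka" plays the rare word.
tokens = ["the", "quokka", "sat", "down"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)

# Predictions made with "quokka" as the center word under each objective.
sg_quokka = [p for p in sg if p[0] == "quokka"]
cb_quokka = [e for e in cb if e[1] == "quokka"]
print(len(sg), len(cb), len(sg_quokka), len(cb_quokka))
```

With a larger window the gap widens: skip-gram yields up to 2 × window pairs per occurrence of the rare word, while CBOW still yields one.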

Question 6

Derive the TF-IDF weight for a term and explain why the IDF term uses a logarithm.

Accepted Answer

\text{tfidf}(t,d)=\text{tf}(t,d)\cdot\text{idf}(t) where \text{tf} is the term's count (or normalized count) in document d, and \text{idf}(t)=\log\frac{N}{df(t)} with N documents and df(t) the number containing t (often \log\frac{N}{1+df} to avoid division by zero). The log dampens IDF's dynamic range: raw N/df explodes for rare terms, so a term in 1 of 10^6 docs would dominate linearly; the log turns this multiplicative rarity into additive bits of information (an entropy-like surprise measure), giving diminishing returns so a moderately rare term isn't 1000× a common one. The product rewards locally frequent but globally rare terms.
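A small worked example may help. The sketch below uses raw counts for tf and the unsmoothed \log\frac{N}{df(t)} form; the four documents are hypothetical, and the natural log is an arbitrary choice of base.

```python
import math

def tfidf(term, doc, docs):
    """Raw-count tf times log(N / df); returns 0.0 for unseen terms."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

# Hypothetical tokenized corpus.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "chased", "the", "dog"],
    ["the", "quokka", "smiled"],
]

w_the = tfidf("the", docs[2], docs)        # in every doc, so idf = log(1) = 0
w_cat = tfidf("cat", docs[0], docs)        # in 2 of 4 docs
w_quokka = tfidf("quokka", docs[3], docs)  # in 1 of 4 docs

# Dampening: with N = 10^6, a term with df = 1 has a raw N/df ratio
# 1000 times that of a term with df = 1000, but only twice the log-idf.
raw_ratio = (1e6 / 1) / (1e6 / 1000)
log_ratio = math.log(1e6 / 1) / math.log(1e6 / 1000)
print(w_the, w_cat, w_quokka, raw_ratio, log_ratio)
```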

Question 7

What does SentencePiece do differently from BPE/WordPiece, and why is treating whitespace as a symbol important?

Accepted Answer

SentencePiece operates directly on raw text without language-specific pre-tokenization (no assumption that spaces separate words), making it language-agnostic and crucial for languages like Japanese/Chinese that don't use spaces. It escapes whitespace as a meta-symbol ▁ (U+2581) and includes it in the vocabulary, so detokenization is fully reversible and lossless — you can reconstruct the exact original string including spacing, which space-splitting tokenizers cannot. It can run either a BPE or a Unigram LM algorithm under the hood. The Unigram model probabilistically prunes a large seed vocabulary to maximize corpus likelihood and supports subword regularization (sampling alternate segmentations) for robustness.
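The reversibility argument can be illustrated without the library itself. The sketch below is a toy: it escapes spaces as ▁ and segments with a greedy longest-match over a hypothetical vocabulary, which is a stand-in for the BPE or Unigram segmentation SentencePiece actually runs. It also assumes the input text does not already contain the ▁ character.

```python
META = "\u2581"  # the whitespace meta-symbol

def escape(text):
    """Turn every space into the meta-symbol so it becomes ordinary vocabulary."""
    return text.replace(" ", META)

def toy_segment(escaped, vocab):
    """Greedy longest-match segmentation; falls back to single characters."""
    pieces, i = [], 0
    while i < len(escaped):
        for j in range(len(escaped), i, -1):
            if escaped[i:j] in vocab or j == i + 1:
                pieces.append(escaped[i:j])
                i = j
                break
    return pieces

def detokenize(pieces):
    """Concatenate pieces and map the meta-symbol back to a space."""
    return "".join(pieces).replace(META, " ")

# Hypothetical vocabulary; note the double space and trailing space in the text.
vocab = {"hello", META + "world", META}
text = "hello  world "
pieces = toy_segment(escape(text), vocab)
restored = detokenize(pieces)
print(pieces, repr(restored))
```

A tokenizer that first splits on whitespace would discard the double space and the trailing space, so it could not restore this string exactly.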

Question 8

How does WordPiece differ from BPE in choosing which pair to merge, and what is the '##' convention?

Accepted Answer

BPE merges the pair with highest raw co-occurrence frequency. WordPiece (used by BERT) instead merges the pair that most increases the training-corpus likelihood under a unigram language model; equivalently it maximizes \frac{\text{count}(xy)}{\text{count}(x)\,\text{count}(y)}, a pointwise-mutual-information-like score, so it favors pairs that occur together more than chance rather than merely often. The ## prefix marks a subword that continues a word (e.g. play, ##ing), distinguishing word-internal pieces from word-initial ones so the original spacing can be reconstructed and word boundaries stay encoded. Both yield subword vocabularies that eliminate hard OOV.
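The two selection rules can disagree on the same statistics, as the following sketch shows. All counts and the three-piece vocabulary are hypothetical. The tokenizer is the greedy longest-match-first procedure used at inference time; because this toy vocabulary has no single-character pieces, it falls back to [UNK].

```python
# Hypothetical corpus statistics.
pair_counts = {("t", "h"): 50, ("q", "u"): 10}
symbol_counts = {"t": 500, "h": 400, "q": 10, "u": 200}

def wordpiece_score(pair):
    """count(xy) / (count(x) * count(y)), the PMI-like merge score."""
    x, y = pair
    return pair_counts[pair] / (symbol_counts[x] * symbol_counts[y])

bpe_choice = max(pair_counts, key=pair_counts.get)  # most frequent pair
wp_choice = max(pair_counts, key=wordpiece_score)   # most "surprising" pair

def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first; non-initial pieces carry the ## prefix."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches at this position
        pieces.append(piece)
        start = end
    return pieces

vocab = {"play", "##ing", "##ed"}
print(bpe_choice, wp_choice)
print(wordpiece_tokenize("playing", vocab), wordpiece_tokenize("plays", vocab))
```

Here "q" almost always precedes "u", so the normalized score prefers that pair even though "th" is five times more frequent.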

Question 9

GloVe optimizes a weighted least-squares objective over the co-occurrence matrix. Write the objective and explain the weighting function f.

Accepted Answer

GloVe minimizes J=\sum_{i,j} f(X_{ij})\,(w_i^\top \tilde w_j + b_i + \tilde b_j - \log X_{ij})^2, where X_{ij} is how often word j appears in word i's context, w,\tilde w are word and context vectors, and b are biases. The model fits dot products to log co-occurrence counts, so ratios of co-occurrence probabilities become vector differences, which captures analogies. f(x)=\min((x/x_{max})^\alpha,1) (typically \alpha=0.75) down-weights rare, noisy pairs and, critically, caps the weight of extremely frequent pairs so stopword co-occurrences don't dominate, and f(0)=0 skips the zero entries that \log can't handle.
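The weighting function and one term of the sum can be written out directly. In this sketch x_max = 100 is an assumed default (the answer above only gives \alpha), and the vectors are hypothetical values chosen so the fit is exact.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x) = min((x / x_max)^alpha, 1); x_max=100 is an assumed default."""
    return min((x / x_max) ** alpha, 1.0)

def glove_pair_loss(w_i, w_tilde_j, b_i, b_tilde_j, x_ij):
    """One term of J for a single (i, j) cell of the co-occurrence matrix."""
    if x_ij == 0:
        return 0.0  # f(0) = 0, so log(0) is never evaluated
    prediction = sum(a * b for a, b in zip(w_i, w_tilde_j)) + b_i + b_tilde_j
    return glove_weight(x_ij) * (prediction - math.log(x_ij)) ** 2

weights = {x: glove_weight(x) for x in (0, 1, 100, 1_000_000)}

# Hypothetical vectors whose dot product equals log(5) exactly.
exact_fit = glove_pair_loss([1.0, 0.0], [math.log(5.0), 0.0], 0.0, 0.0, 5.0)
zero_cell = glove_pair_loss([1.0, 0.0], [1.0, 0.0], 0.0, 0.0, 0)
print(weights, exact_fit, zero_cell)
```

A pair seen a million times gets the same weight as one seen x_max times, which is the cap described above.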

Question 10

How do subword tokenizers eliminate OOV, and what failure modes replace the classic OOV problem?

Accepted Answer

Subword/byte-level tokenizers (BPE, WordPiece, byte-level BPE, SentencePiece-Unigram) guarantee any string decomposes into in-vocabulary units down to characters or the 256 bytes, so no token is truly unknown — the classic <UNK> problem disappears. New failure modes: (1) over-fragmentation — rare words, code, math, or low-resource languages shatter into many tokens, inflating sequence length, cost, and degrading modeling of those inputs; (2) tokenization artifacts — numbers, whitespace, and morphology split inconsistently, hurting arithmetic and reasoning; (3) glitch/under-trained tokens (e.g. SolidGoldMagikarp) that map to near-random embeddings; (4) cross-lingual token-budget inequity. So OOV becomes a quality/efficiency problem, not a coverage gap.
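Both the coverage guarantee and the fragmentation cost show up in the pure byte fallback, before any merges are applied. This is a sketch of that base case only, with hypothetical example strings; a trained tokenizer would merge frequent byte sequences into longer tokens.

```python
def byte_tokens(text):
    """Base-level tokenization: every string maps to IDs in range(256)."""
    return list(text.encode("utf-8"))

def from_byte_tokens(ids):
    return bytes(ids).decode("utf-8")

# Hypothetical inputs: ASCII, accented Latin, and Japanese.
samples = ["cat", "naïve", "東京"]
encoded = {s: byte_tokens(s) for s in samples}
lengths = {s: len(ids) for s, ids in encoded.items()}
print(lengths)
```

Nothing is ever unknown, but the two-character Japanese string costs twice as many base tokens as the three-letter English word, which is the token-budget inequity noted in (4).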

Question 11

Why does naive mean-pooling of BERT's token embeddings produce poor sentence embeddings, and what does Sentence-BERT change?

Accepted Answer

Off-the-shelf BERT was trained for masked-LM and next-sentence prediction, not to make sentence vectors whose geometry reflects semantic similarity; its token-vector space is anisotropic (squeezed into a narrow cone), so mean-pooled vectors give high cosine similarity even for unrelated sentences and underperform averaged GloVe. Sentence-BERT (SBERT) fine-tunes BERT in a siamese/triplet setup on NLI/STS pairs with a pooling layer, optimizing so that cosine similarity of pooled outputs matches labeled semantic similarity. This yields a metric space where cosine is meaningful and lets you precompute embeddings and compare with a dot product — turning an O(n^2) cross-encoder comparison into fast nearest-neighbor search.

Question 12

What is the negative-sampling objective in skip-gram, and why is it used instead of the full softmax?

Accepted Answer

The full softmax normalizes over the entire vocabulary V, costing O(|V|) per training example, which is prohibitive for millions of words. Negative sampling reframes it as binary classification: for each true (word, context) pair, maximize \log\sigma(w\cdot c), and for k sampled 'negative' words drawn from a noise distribution, maximize \sum\log\sigma(-w\cdot c_{neg}). This costs O(k) instead of O(|V|). Negatives are sampled from the unigram distribution raised to the 3/4 power, which boosts rarer words relative to frequent ones. The objective no longer estimates a normalized softmax distribution, but it yields embeddings of comparable quality (it implicitly factorizes a PMI matrix shifted by \log k) at a fraction of the cost; hierarchical softmax is the alternative.
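The per-example loss and the smoothed noise distribution can be sketched as below. The vectors and word counts are hypothetical, and the loss is written as a quantity to minimize (the negative of the objective in the answer).

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sgns_loss(w, c_pos, c_negs):
    """Negative of: log sigma(w.c) + sum over negatives of log sigma(-w.c_neg)."""
    positive = log_sigmoid(dot(w, c_pos))
    negative = sum(log_sigmoid(-dot(w, c)) for c in c_negs)
    return -(positive + negative)

def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power`, renormalized to sum to 1."""
    weights = {word: count ** power for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}

# With a zero center vector every sigmoid is 0.5, so the loss is (k + 1) * log 2.
loss = sgns_loss([0.0, 0.0], [1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]])

# Hypothetical counts: the rare word's sampling probability rises roughly tenfold.
counts = {"the": 10000, "quokka": 1}
unigram_quokka = counts["quokka"] / sum(counts.values())
noise = noise_distribution(counts)
print(loss, unigram_quokka, noise["quokka"])
```

The cost per example depends on the number of negatives k (two here), not on the vocabulary size.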

Question 13

You observe high token-level cosine similarity between embeddings of unrelated frequent words in a transformer. Explain the anisotropy/representation-degeneration cause and one mitigation.

Accepted Answer

Transformer hidden states concentrate in a narrow cone of the embedding space (anisotropy): the expected cosine between random token vectors is far above zero, so even unrelated words look similar and similarity loses discriminative power. A driver is representation degeneration from tied softmax/output embeddings — rare tokens get pushed in a shared 'common' direction during training to minimize their logits, and the dominant singular direction crowds the space. Mitigations: whitening / all-but-the-top (remove the top principal components and standardize), spectral or contrastive regularization (e.g. SimCSE) that spreads vectors over the sphere, or task fine-tuning like SBERT. These flatten the singular-value spectrum and restore isotropy so cosine becomes meaningful.
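The effect of a shared dominant direction, and of removing it, can be reproduced on synthetic data. The sketch below is an illustration under strong assumptions: the vectors are Gaussian noise plus a hypothetical common offset, and the mitigation shown is only mean-centering, the first step of all-but-the-top, not full whitening or principal-component removal.

```python
import math
import random

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def mean_pairwise_cosine(vectors):
    total, pairs = 0.0, 0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            total += cosine(vectors[i], vectors[j])
            pairs += 1
    return total / pairs

def center(vectors):
    """Subtract the mean vector, removing the shared 'common' direction."""
    dim = len(vectors[0])
    mean = [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]
    return [[v[k] - mean[k] for k in range(dim)] for v in vectors]

random.seed(0)
dim, n = 16, 50
# Synthetic "token vectors": unit-variance noise around a large shared offset.
vectors = [[5.0 + random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

before = mean_pairwise_cosine(vectors)
after = mean_pairwise_cosine(center(vectors))
print(before, after)
```

Before centering, unrelated random vectors look nearly identical by cosine; afterwards their average similarity falls close to zero, which is what an isotropic space should give.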

Question 14

Design an embedding-based semantic retrieval system over 50M documents. Address chunking, model choice, ANN indexing, and the bi-encoder-vs-cross-encoder tradeoff.

Accepted Answer

Use a bi-encoder (dual-encoder) sentence model to precompute one vector per chunk: split docs into overlapping ~200–500 token chunks on semantic/sentence boundaries so a single embedding stays coherent. Embed query and chunks into the same space, L2-normalize, and rank by cosine/dot product. At 50M vectors, exact search is too slow, so use an ANN index, either HNSW (graph) or IVF-PQ (quantized, lower memory), in FAISS or a vector DB, trading a little recall for sublinear latency, with PQ compressing vectors to fit RAM. Bi-encoders are cheap (document vectors are precomputed, so a query costs one encoder forward pass plus a sublinear ANN lookup) but less accurate; add a two-stage pipeline: retrieve top-k with the bi-encoder, then re-rank that small set with a cross-encoder that jointly attends over query+doc for precision. Tune chunk overlap, k, and HNSW efSearch; evaluate with recall@k and nDCG.
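The chunking and two-stage steps can be sketched in a few lines. Everything here is a simplified stand-in: chunks are cut at fixed token offsets rather than sentence boundaries, the first stage is an exact brute-force scan in place of an ANN index, and rerank_score is a hypothetical callable standing in for a cross-encoder.

```python
def chunk_tokens(tokens, size, overlap):
    """Fixed-size windows that share `overlap` tokens with their neighbor."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve_then_rerank(query_vec, chunk_vecs, k, rerank_score):
    """Stage 1: top-k chunk indices by dot product. Stage 2: re-rank those k."""
    by_similarity = sorted(
        range(len(chunk_vecs)),
        key=lambda i: dot(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    candidates = by_similarity[:k]
    return sorted(candidates, key=rerank_score, reverse=True)

chunks = chunk_tokens(list(range(10)), size=4, overlap=1)

# Hypothetical unit-scale vectors and hypothetical cross-encoder scores.
chunk_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
cross_scores = {0: 0.2, 1: 0.9, 2: 1.0}
ranked = retrieve_then_rerank([1.0, 0.0], chunk_vecs, k=2, rerank_score=cross_scores.get)
print(chunks, ranked)
```

Chunk 2 has the highest re-ranker score but never reaches the second stage, which is why first-stage recall@k bounds the quality of the whole pipeline.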

NLP, Tokenization & Embeddings