ML System Design (RecSys, Ranking, Search, Ads)

Designing recommendation, ranking, search, and ads systems for ML interviews.

Study these interactively →
Foundationalconcept

What is the standard two-stage architecture of a large-scale recommendation system, and why is it used instead of scoring every item with one model?

The funnel splits into candidate generation (retrieval) and ranking. Retrieval cheaply narrows millions of items to hundreds using lightweight methods (ANN over embeddings, co-occurrence, popularity), optimizing recall. Ranking applies a heavy, feature-rich model to those few hundred, optimizing precision/ordering. A single model can't score the full catalog under a tens-of-ms latency budget. The split spends compute where it matters: cheap-and-broad first, expensive-and-precise last. Often a third re-ranking stage adds diversity and business rules.
#recsys#funnel#candidate-generation#ranking
Foundationalconcept

Explain collaborative filtering and how it differs from content-based filtering.

Collaborative filtering (CF) recommends items from the behavior of many users — patterns in the user-item interaction matrix — without needing item attributes. User-based CF: 'people like you liked X'; item-based CF: 'items co-liked with what you liked.' Content-based filtering uses item features (genre, text, tags) and a user's own history, recommending items similar in feature space. CF captures latent taste and serendipity but suffers cold start (new users/items lack interactions); content-based handles cold start better but overspecializes and rarely surfaces unexpected items.
#recsys#collaborative-filtering#content-based#cold-start
Intermediatesystem-design

How would you frame the first five minutes of an ML system design interview (e.g. 'design YouTube recommendations')?

Clarify objective and scope before architecture: (1) business goal and ML objective — what to optimize (watch time, CTR, retention) and how it maps to a label; (2) constraints — QPS, latency budget, catalog size, user count; (3) surface (homepage feed vs related videos), which changes candidates; (4) available data and feedback loop. Then state assumptions, propose the two-stage funnel, name online and offline metrics, and only then dive into retrieval, features, model, training, serving, and evaluation. Drive the conversation; restate the problem as an ML problem.
#interview#framing#system-design#objective
Intermediateconcept

Describe the two-tower model for retrieval. What constraint does it impose and why does that enable scale?

Two separate networks — a user/query tower and an item tower — each produce an embedding; relevance is their dot product (or cosine). The hard constraint: no early interaction between user and item features; they meet only at the final dot product. That enables scale because item embeddings are query-independent — precompute all item vectors offline, build an ANN index, and at serve time run the user tower once then do a fast nearest-neighbor lookup. A cross-feature model (user×item concatenation) would require re-scoring every item per request: infeasible for retrieval, fine for the small ranking set.
#two-tower#retrieval#embeddings#ann
Intermediateconcept

Distinguish online (real-time) from batch features in a ranking system. Give an example of each and explain the freshness/cost tradeoff.

Batch features are precomputed on a schedule (hourly/daily) from historical data — e.g. a user's 30-day average watch time, an item's lifetime CTR. Cheap to serve (a lookup) but stale. Online features come from very recent events at low latency — e.g. items clicked in the last 5 minutes this session, or request context (device, time, query). They capture recency/intent but need streaming infra (Kafka/Flink) and add serving latency and skew risk. Real-time features lift relevance for fast-changing intent (session, trending); batch features are stable and cheap. Most systems blend both.
#feature-store#online-features#batch-features#streaming
Intermediatesystem-design

Design the metrics for an A/B test of a new feed-ranking model. Why can't you ship on offline AUC alone, and what guardrails do you set?

Offline metrics (AUC, NDCG) measure rank quality on logged data but carry feedback-loop bias (the log reflects the old model's choices), miss system effects (latency, diversity, novelty), and may not track the true business objective. Run an online A/B on the north-star metric (long-term engagement, sessions, revenue) with power analysis and a fixed horizon. Set guardrails: p99 latency, error rate, and counter-metrics (dwell-time quality, complaint/unfollow rate, diversity) so you don't win CTR while harming retention. Watch novelty effects (run long enough), interference/network effects, and use sequential corrections to avoid peeking.
#ab-testing#metrics#guardrails#ndcg
Intermediateconcept

Compare pointwise, pairwise, and listwise learning-to-rank. When does each matter for a ranking model?

Pointwise scores each item independently (regression/classification) — simple and scalable but ignores relative order. Pairwise (RankNet, LambdaRank) learns from pairs, penalizing a less-relevant item outranking a more-relevant one — directly models order, good for top-k. Listwise (ListNet, LambdaMART optimizing NDCG) optimizes a loss over the whole list, aligning best with position-weighted metrics. In practice pointwise dominates large-scale CTR (calibration + scale), while pairwise/listwise win when relative ordering and position-weighted metrics are the explicit objective and lists are short.
#learning-to-rank#ltr#lambdamart#ndcg
Intermediatesystem-design

Walk through how you'd handle the three cold-start problems: new user, new item, new system. Give a concrete tactic for each.

New user: no history → use signup context (geo, device, declared interests) plus popular/trending items, and explore quickly (bandit) to learn taste; a context-only two-tower user tower still yields an embedding. New item: no interactions → embed from metadata/text/image so it lands near similar items in retrieval space, and boost via exploration to gather feedback. New system: bootstrap with content-based logic or rules/heuristics, import or buy data, and log heavily from day one so CF becomes viable. Across all three, exploration (epsilon-greedy/Thompson) is the unifying tool.
#cold-start#exploration#bandits#content-features
Intermediateconcept

How does search ranking differ from recommendation ranking, and what does the explicit query change about features and evaluation?

Search has an explicit query, so query relevance is dominant — you need query-document matching features (BM25/lexical, semantic/embedding similarity) and query understanding (intent classification, entity recognition, spelling). Recommendation has no query; intent is inferred from history/context, so personalization and behavioral signals dominate. Search blends lexical + semantic (hybrid retrieval) and is judged on query-relevance metrics (NDCG with human judgments, MRR) plus engagement; recsys leans on engagement/conversion since there's no query-relevance ground truth. Search must also handle tail/zero-result queries and exact-match expectations that recsys never faces.
#search#query-understanding#bm25#hybrid-retrieval
Advancedmath

Derive the matrix factorization objective for implicit feedback and explain why explicit-rating loss is wrong here.

Factor the interaction matrix $R \approx U V^\top$ with latent vectors $u_i, v_j \in \mathbb{R}^k$. For implicit feedback (clicks/views) we observe only presence/absence, and 'no interaction' is not 'dislike' (it may be unseen). Hu-Koren-Volinsky WMF sets preference $p_{ij}=\mathbb{1}[r_{ij}>0]$ and confidence $c_{ij}=1+\alpha r_{ij}$, minimizing $\sum_{ij} c_{ij}(p_{ij}-u_i^\top v_j)^2 + \lambda(\|u_i\|^2+\|v_j\|^2)$ over ALL pairs. Plain MSE on observed ratings is wrong: it treats missing entries as missing-at-random and ignores that unobserved entries carry weak negative signal, confidence-weighted.
#matrix-factorization#implicit-feedback#wmf#loss
Advancedconcept

Why are in-batch negatives standard for training two-tower retrieval models, and what bias do they introduce that you must correct?

Sampling explicit negatives per positive is expensive; instead, within a minibatch the other examples' items serve as negatives (a softmax over the batch) — cheap, with many negatives per step. The bias: items appear as negatives proportional to their frequency, so popular items are over-penalized. Fix with the logQ correction — subtract $\log Q(j)$ (estimated sampling probability of item $j$) from each logit, sampled-softmax style. Without it, popular items are unfairly suppressed in retrieval. Teams also add mined hard negatives for sharper decision boundaries.
#in-batch-negatives#two-tower#sampled-softmax#logq
Advancedconcept

You're designing CTR prediction for ads. Why is calibration critical here in a way it isn't for a pure ranking feed, and how do you achieve and measure it?

Ad auctions rank by eCPM = bid × pCTR and bill against predicted value, so the absolute probability must be accurate, not just the ordering — a miscalibrated pCTR mis-ranks across advertisers with different bids and mis-prices. Calibrate via Platt scaling or isotonic regression on a holdout, and correct negative downsampling by rescaling: $p = \frac{p_s}{p_s + (1-p_s)/w}$, where $w$ is the negative downsampling rate. Measure with reliability/calibration plots (predicted vs observed CTR per bucket), ECE, and PCOC (predicted-to-actual click ratio ≈ 1). Pure relevance feeds need only correct order, so calibration matters less.
#ctr#calibration#ads#auction
Advancedconcept

What is training-serving skew, name three concrete causes in a RecSys, and how do feature stores mitigate it?

Training-serving skew is when a feature's value differs between training and inference, degrading online performance despite good offline metrics. Causes: (1) different code paths — a batch SQL job for training vs Java/Python at serving, with subtle logic drift; (2) time-travel leakage — training uses feature values from after the label event that aren't available at serve time; (3) stale/missing online features (fresh in training, hours old in production). A feature store mitigates by serving one feature definition both offline (point-in-time-correct historical joins) and online (low-latency lookup) from the same transformation, guaranteeing consistency.
#training-serving-skew#feature-store#leakage#mlops
Advancedconcept

Compare HNSW and IVF-PQ as ANN indexes, and explain when you'd pick each for serving embeddings at scale.

HNSW (hierarchical navigable small world) is a graph index: very high recall and low latency, but memory-heavy (full vectors + graph) and costlier to build/update — best when the index fits in RAM and you need top accuracy. IVF-PQ (inverted file + product quantization) clusters vectors into cells (search only nprobe cells) and compresses each vector into quantized codes, slashing memory at some recall cost — best for billion-scale catalogs where RAM is the constraint. Pick HNSW for up to tens of millions of vectors needing high recall/low latency; IVF-PQ or hybrids (HNSW+PQ) for very large catalogs or tight memory. Tune nprobe/efSearch for the recall-latency tradeoff.
#ann#hnsw#ivf-pq#faiss
Advancedsystem-design

Given a 50ms end-to-end latency budget for a ranking request, how do you allocate it across the funnel and what techniques keep ranking within budget?

Rough split: gateway/network ~5ms, feature fetch ~10-15ms (parallel online+batch lookups, the usual bottleneck), retrieval/ANN ~5-10ms, ranking model ~10-15ms, re-rank + serialization ~5ms. Techniques: cap candidates entering ranking (e.g. 500, not 5000); score them in one batched forward pass; precompute item-side features and embeddings; cache user features per request; quantize/distill the ranking model; fetch features in parallel with per-call timeouts and graceful degradation (skip a slow feature rather than miss SLA). Optimize for p99, not mean — tail latency drives experience and triggers fallbacks.
#latency#serving#p99#feature-fetch
Advancedconcept

In a feature store, explain precisely how a naive offline join between a labels table and a feature table introduces label leakage, and how point-in-time-correct ("as-of") joins prevent it. What columns must the feature table carry to make this possible?

A naive join matches each label row to feature values by entity key only, so it grabs whatever the feature value is *now* (or the latest available), which may have been computed *after* the label's event timestamp — leaking future information the model won't have at serving time. A point-in-time join instead matches each label's event_timestamp to the most recent feature row whose own event_timestamp is $\le$ the label timestamp (often within a max-age window). This requires every feature row to carry an entity key plus an event/observation timestamp (and ideally a created/arrival timestamp to model ingestion delay). Without those timestamps the as-of join is impossible and skew/leakage is inevitable.
#feature-store#leakage#point-in-time-join#training-serving-skew
Expertconcept

What is position bias in ranking, why does naively training on click logs entrench it, and how do you correct it?

Users click higher-ranked items more regardless of relevance, so clicks confound relevance with position. Training on raw clicks teaches 'top items get clicked,' reinforcing whatever the old model surfaced — a self-fulfilling loop that buries good items shown low. Corrections: (1) Inverse Propensity Scoring — weight each click by $1/p(\text{examine at position }k)$, with propensities from result randomization or intervention harvesting; (2) position-as-feature — include position at train time, then zero/neutralize it at serve time; (3) explicit exploration for unbiased data. Under the examination hypothesis, IPS yields an unbiased relevance estimator.
#position-bias#ips#counterfactual#unbiased-ltr
Expertsystem-design

Your offline NDCG improved 4% but the online A/B was flat or negative. List the most likely causes and how you'd diagnose each.

Likely causes: (1) training-serving skew — log served feature vectors and compare to training. (2) Off-policy/feedback bias — offline eval scored on data biased to the old policy; the new ranker reorders into regions with no logged labels; use IPS/off-policy estimators or interleaving. (3) Metric misalignment — NDCG on judgments, not the real business objective. (4) System effects — added latency hurts engagement; check p99 deltas. (5) Position/selection bias in offline labels. (6) Novelty/learning effects with too short a horizon. (7) Diversity collapse harming sessions despite higher per-item relevance. Interleaving is far more sensitive than A/B for ranking and a good first diagnostic.
#offline-online-gap#off-policy#interleaving#debugging
Expertsystem-design

Design a multi-objective ranking system balancing CTR, watch time, and a 'don't recommend clickbait' constraint. How do you combine objectives without one dominating?

Use a multi-task model (shared-bottom or MMoE/PLE) with per-objective heads — pClick, pWatchTime, pSatisfaction/pComplaint — so sparse tasks borrow signal. Combine at scoring time, often multiplicatively to respect scale differences: $\text{score} = pClick \cdot pWatch^{\alpha} \cdot \text{quality}^{\beta} \cdots$, with weights tuned via online A/B (or Bayesian/Pareto search) against the north-star and guardrails, not hand-set. Keep clickbait suppression as a hard penalty or negative term so high pCTR can't override it. MMoE/PLE limits task conflict in the shared layers. Calibrate each head so the combiner weights are meaningful.
#multi-task#mmoe#multi-objective#value-model
Expertconcept

Why do exploration and the explore/exploit tradeoff matter structurally in RecSys, and what's the risk of a purely greedy 'best estimated reward' policy?

A recsys only observes feedback on items it shows, so a greedy policy is a closed loop: it keeps surfacing items it already rates highly, never gathering data on under-shown items, whose estimates stay stale — good-but-unexplored items get permanently buried (rich-get-richer, filter bubbles, no recovery from early mis-estimates). Exploration (epsilon-greedy, Thompson sampling, UCB, contextual bandits) deliberately shows uncertain items to gather unbiased signal — essential for cold-start items and to keep the training distribution from collapsing. The cost is short-term regret traded for long-term catalog coverage, fresher models, and unbiased data.
#exploration#bandits#thompson-sampling#feedback-loop
Expertconcept

Why does a single-vector dot-product two-tower model retrieve worse than expected at the head/tail, and how does it fail relative to lexical retrieval?

Dot-product retrieval optimizes average similarity, so it suffers folding (distant items collapse to similar embeddings, creating false neighbors), popularity bias from in-batch negatives, and weak handling of rare/exact terms — semantic embeddings blur specific tokens, so an exact SKU, name, or numeric query that BM25 nails can be missed. A single vector also represents fine-grained conjunctions poorly. Mitigations: hybrid retrieval (dense + lexical via reciprocal-rank fusion), multi-vector late-interaction models (ColBERT) for term-level matching, hard-negative mining, and logQ popularity correction. The single-vector bottleneck is the structural limit; lexical complements it on exactness and tail.
#dense-retrieval#hybrid#colbert#embeddings
Expertsystem-design

You're adding an LLM-powered conversational 'help me choose' assistant over a product catalog. Design it for correctness, latency, and safety.

Don't let the LLM free-associate over the catalog: retrieve candidates deterministically (the two-stage funnel or a SQL filter) and pass only that grounded set into the prompt — the model selects and explains, code disposes. Validate every output server-side against an allow-list of real SKUs, recompute prices/facts from the catalog (never trust LLM numbers), and HTML-escape output. Wrap untrusted user text in a delimiter, treated as data not instructions (prompt-injection defense). For latency: cache, stream tokens, keep the model off the hot retrieval path. Fail closed — never emit an empty/200 or hallucinated item; degrade to the standard recommender. Force a reasoning field then structured JSON at temperature 0, and A/B test against the non-LLM baseline on conversion + complaint guardrails.
#llm#grounding#prompt-injection#recsys
Expertsystem-design

A ranking model is trained from the offline (warehouse) store but serves from the online (key-value) store. List the distinct mechanisms by which training-serving skew can creep in even when the same feature *definitions* are used in both, and how you'd detect each.

Skew sources: (1) different code paths — SQL transform offline vs streaming/Python online — causing logic drift; fix by sharing one transformation definition (e.g. compiled from the same spec). (2) Time-travel mismatch — offline uses as-of joins, online reads latest materialized value that may be staler or fresher than training implied. (3) Backfill vs streaming differences in aggregation windows or late-arriving data. (4) Type/encoding/null-handling differences (float64 vs float32, default imputation). (5) Materialization lag making online values older than the training distribution. Detect via logging the *actual* served feature vectors and comparing their distribution (PSI/KL) against the training set, plus replaying logged requests through the offline pipeline and asserting byte-equality.
#feature-store#training-serving-skew#online-offline-consistency#monitoring
Expertsystem-design

You have a feature aggregated over a 30-day window (e.g. user purchase count) serving a high-QPS recommender. Discuss the freshness-vs-cost-vs-correctness tradeoffs across materialization strategies: batch backfill, incremental/streaming, and on-demand (request-time) computation. When is each correct?

Batch backfill (recompute the full window on a schedule) is cheap and simple but stale — the window can be up to one schedule-interval old, and recomputing 30 days repeatedly wastes compute; correct for slow-moving features where hours of staleness is fine. Streaming/incremental maintains the aggregate as events arrive (sliding/tumbling windows or decay), giving seconds-level freshness at higher infra and correctness cost (windowing edge cases, exactly-once, out-of-order/late events, state size). On-demand computes at request time from raw recent events — freshest and storage-light but adds serving latency and can't easily reproduce the training-time value unless the same logic ran offline. Choose by freshness SLA, write volume, and whether point-in-time reproducibility is required; often a hybrid (batch base + streaming delta) wins.
#feature-store#materialization#feature-freshness#streaming#cost