Model Evaluation & Validation

Model Evaluation & Validation — interview questions and answers with clear explanations.

Foundational concept

What is the bias-variance tradeoff, and how does it relate to a model's expected test error?

Expected test error at a point decomposes as $\text{Bias}^2 + \text{Variance} + \sigma^2$ (irreducible noise). Bias is error from wrong assumptions — the model is too simple to capture the signal (underfitting). Variance is sensitivity to the particular training sample — the model fits noise (overfitting). Increasing complexity lowers bias but raises variance; the tradeoff is the sweet spot minimizing their sum. Irreducible noise sets a floor you cannot beat regardless of model choice.
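Written out, the decomposition above looks as follows. This is a sketch that assumes squared-error loss and data generated as $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$; the symbols $f$, $\hat f_D$ (the model fitted on training set $D$), and $\varepsilon$ are notation introduced here for illustration.

$$\mathbb{E}_{D,\varepsilon}\big[(y - \hat f_D(x))^2\big] = \underbrace{\big(\mathbb{E}_D[\hat f_D(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_D\big[(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)])^2\big]}_{\text{Variance}} + \sigma^2$$

The cross terms vanish because the noise is independent of the training set and has zero mean.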
#bias-variance #overfitting #test-error #generalization
Foundational concept

How do you distinguish overfitting from underfitting using train and validation curves?

Underfitting: both training and validation error are high and close together — the model lacks capacity to fit the signal. Overfitting: training error is low but validation error is much higher — a large generalization gap that widens with more training or capacity. Fixes differ: underfitting needs more capacity, features, or training; overfitting needs regularization, more data, simpler models, or early stopping. Plot error vs complexity or epochs and watch the gap and the validation minimum.
#overfitting #underfitting #learning-curves #regularization
Foundational math

Define precision, recall, and F1, and write the formulas in terms of TP, FP, FN.

Precision $=TP/(TP+FP)$ — of items predicted positive, the fraction truly positive (penalizes false alarms). Recall $=TP/(TP+FN)$ — of actual positives, the fraction caught (penalizes misses). F1 is their harmonic mean: $F_1 = 2\cdot\frac{P\cdot R}{P+R}$. The harmonic mean punishes imbalance between P and R, so F1 is high only when both are high. F1 ignores true negatives entirely, which suits imbalanced positive-detection tasks where TN dominate.
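The formulas above translate directly into a few lines of Python. This is a minimal sketch: the function name `precision_recall_f1`, the example counts, and the convention of returning 0.0 for a zero denominator are assumptions made for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 80 hits, 20 false alarms, 120 misses.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=120)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With these counts precision is 0.8 and recall is 0.4, and F1 lands nearer the weaker of the two than their arithmetic mean of 0.6 would.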
#precision #recall #f1 #confusion-matrix
Foundational concept

What does each cell of a binary confusion matrix represent, and which metrics are derived from only a subset of cells?

With actual as rows and predictions as columns: TP (pred+, actual+), FP (pred+, actual−, type I error), FN (pred−, actual+, type II error), TN (pred−, actual−). Accuracy uses all four: $(TP+TN)/N$. Precision and recall ignore TN. Specificity (TNR) $=TN/(TN+FP)$ ignores positives. Key insight: under heavy imbalance TN dominates and inflates accuracy, so metrics that exclude TN (precision/recall/F1) reveal true positive-detection performance.
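The imbalance point can be checked with a small sketch. The helper `confusion_counts` and the data (10 positives in 1000 rows, a model that always predicts negative) are hypothetical and chosen only to illustrate which cells each metric reads.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels encoded as 0/1."""
    tp = fp = fn = tn = 0
    for actual, pred in zip(y_true, y_pred):
        if pred == 1 and actual == 1:
            tp += 1
        elif pred == 1 and actual == 0:
            fp += 1
        elif pred == 0 and actual == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical data: 10 positives among 1000 rows, model always predicts 0.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)                   # uses all four cells
recall = tp / (tp + fn) if (tp + fn) else 0.0        # ignores TN
specificity = tn / (tn + fp) if (tn + fp) else 0.0   # ignores actual positives
print(accuracy, recall, specificity)
```

Accuracy and specificity both look excellent here while recall is zero, which is the inflation the answer describes.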
#confusion-matrix #accuracy #type-i-error #type-ii-error
Intermediate concept

Why is k-fold cross-validation preferred over a single train/validation split, and what does k control?

A single split gives a high-variance estimate that depends on which points land in validation. k-fold partitions data into k folds, trains on k−1 and validates on the held-out fold, rotating k times, then averages — every point serves as both train and validation, lowering estimate variance. k trades off bias and variance of the estimate: small k (e.g. 5) uses less training data per fold (slight pessimistic bias, lower variance, cheaper); large k / LOOCV uses nearly all data (low bias, higher variance, expensive). k=5 or 10 is the usual compromise.
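The rotation described above can be sketched with the standard library alone. The generator name `k_fold_indices` and the fixed seed are assumptions for illustration; in practice a library splitter would normally be used instead.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when k does not divide n_samples.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))
```

Every index lands in exactly one validation fold, so averaging the k fold scores uses each point once for validation and k−1 times for training.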
#cross-validation #k-fold #loocv #variance
Intermediate concept

When should you use stratified k-fold, grouped k-fold, and time-series (forward-chaining) cross-validation?

Stratified k-fold preserves the class distribution in each fold — essential for imbalanced classification so no fold loses a rare class. Grouped k-fold keeps all records sharing a group key (same patient, user, session) in one fold, preventing correlated rows from straddling the split. Time-series CV trains on past and validates on future (expanding or rolling window), never shuffling, because random folds would leak future information into the past.
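As one illustration, forward-chaining splits with an expanding window might look like the sketch below. It assumes rows are already sorted by time; the function name `expanding_window_splits` and the `min_train` parameter are hypothetical.

```python
def expanding_window_splits(n_samples, n_splits, min_train):
    """Yield (train_idx, val_idx) where training rows always precede validation rows."""
    remaining = n_samples - min_train
    if n_splits < 1 or min_train < 1 or remaining < n_splits:
        raise ValueError("not enough samples after min_train for the requested splits")
    val_size = remaining // n_splits
    for i in range(n_splits):
        train_end = min_train + i * val_size   # training window grows each split
        val_end = train_end + val_size
        yield list(range(train_end)), list(range(train_end, val_end))


splits = list(expanding_window_splits(n_samples=10, n_splits=3, min_train=4))
for train_idx, val_idx in splits:
    print(train_idx, val_idx)
```

No shuffling happens anywhere, so the latest training index is always earlier than the earliest validation index.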
#stratified #grouped-cv #time-series-cv #leakage
Intermediate concept

Contrast ROC-AUC with PR-AUC and explain why PR-AUC is preferred under heavy class imbalance.

ROC-AUC plots TPR vs FPR; PR-AUC plots precision vs recall. ROC-AUC is largely insensitive to class balance because FPR $=FP/(FP+TN)$ uses the large TN pool as denominator — under 1:1000 imbalance, even thousands of false positives barely move FPR, so ROC-AUC stays optimistically high. PR-AUC's precision $=TP/(TP+FP)$ directly exposes that FP flood. So when positives are rare and false positives costly, PR-AUC reflects practical performance; ROC-AUC can look great while the model is unusable. PR-AUC's baseline equals the positive prevalence.
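A quick arithmetic check makes the denominator argument concrete. The counts below are hypothetical, picked to give roughly 1:1000 imbalance at a single threshold.

```python
# Hypothetical confusion counts at one threshold under roughly 1:1000 imbalance.
positives, negatives = 1_000, 1_000_000
tp, fp = 800, 5_000
fn, tn = positives - tp, negatives - fp

tpr = tp / (tp + fn)          # recall, the ROC y-axis
fpr = fp / (fp + tn)          # ROC x-axis: diluted by the huge TN pool
precision = tp / (tp + fp)    # PR y-axis: directly exposed to FP
prevalence = positives / (positives + negatives)  # PR-AUC no-skill baseline

print(f"TPR={tpr:.3f} FPR={fpr:.4f} precision={precision:.3f} prevalence={prevalence:.5f}")
```

An FPR of 0.005 looks excellent on a ROC curve, yet fewer than one in seven flagged items is a true positive.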
#roc-auc #pr-auc #imbalance #threshold
Advanced concept

What is model calibration, how do you measure it, and why can a high-AUC model still be poorly calibrated?

Calibration means predicted probabilities match empirical frequencies: of all events scored 0.7, about 70% should be positive. Measure with a reliability diagram (binned predicted vs observed rate), Expected Calibration Error (weighted mean gap across bins), or Brier score (which mixes calibration with discrimination, so read it alongside the other two). AUC measures only ranking/discrimination, that is, whether positives score above negatives, and is invariant to any strictly increasing rescaling of scores. A model can rank perfectly (AUC 1.0) yet output systematically inflated probabilities, so AUC says nothing about calibration. Fix with Platt scaling or isotonic regression on a held-out set.
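The gap between ranking and calibration shows up in a small sketch. The helper names, the equal-width binning, and the toy scores are assumptions for illustration; the example is built so that every positive outscores every negative while all probabilities are inflated.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean |observed rate - mean predicted probability| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += (len(members) / len(y_true)) * abs(observed - predicted)
    return ece


def pairwise_auc(y_true, y_prob):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores: perfect ranking, but negatives still receive 0.85.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.95] * 5 + [0.85] * 5

auc = pairwise_auc(y_true, y_prob)
ece = expected_calibration_error(y_true, y_prob)
brier = brier_score(y_true, y_prob)
print(auc, ece, brier)
```

AUC comes out at 1.0 while ECE is about 0.45, because the negatives scored 0.85 are never positive.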
#calibration #reliability-diagram #ece #platt-scaling #brier
Advanced math

Compare RMSE, MAE, and R² for regression. When does each mislead, and what does R² actually represent?

MAE is mean absolute error: robust, in target units, and linear in the size of each error. RMSE squares errors, so it penalizes large errors more and is sensitive to outliers; RMSE≥MAE always, and a big gap signals heavy-tailed residuals. R² $=1-SS_{res}/SS_{tot}$ is the fraction of variance explained relative to predicting the mean; it can go negative for models worse than the mean, and in-sample R² never decreases when features are added to a least-squares fit, even useless ones (use adjusted R² or held-out R²). R² misleads when comparing datasets with different target variance, and on low-variance targets it can look poor despite small absolute error.
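The three metrics can be compared side by side in a short sketch. The function name `regression_metrics` and the toy predictions are hypothetical; the two prediction sets are built to share the same MAE so the effect of a single large error on RMSE and R² is visible.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2). R^2 is undefined when the target is constant."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in residuals) / n
    ss_res = sum(e * e for e in residuals)
    rmse = math.sqrt(ss_res / n)
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2


y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
even_errors = [1.5, 2.5, 3.5, 4.5, 5.5]   # five errors of 0.5
one_outlier = [1.0, 2.0, 3.0, 4.0, 7.5]   # one error of 2.5, same MAE
constant_10 = [10.0] * 5                  # worse than predicting the mean

for name, pred in [("even", even_errors), ("outlier", one_outlier), ("constant", constant_10)]:
    mae, rmse, r2 = regression_metrics(y_true, pred)
    print(f"{name}: MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```

The first two prediction sets share an MAE of 0.5, but the single large error more than doubles RMSE and drops R² from 0.875 to 0.375; the constant predictor drives R² negative.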
#rmse #mae #r-squared #adjusted-r2 #outliers
Advanced concept

Give three concrete forms of data leakage and explain how to prevent each.

Target leakage: a feature encodes the outcome (e.g. 'days_until_payment' for default prediction, or a field populated after the label) — drop features unavailable at prediction time. Train-test contamination: fitting scalers, imputers, encoders, or feature selection on the full dataset before splitting leaks test statistics — fit all preprocessing inside the CV fold (pipeline). Temporal/group leakage: random splitting time-ordered or grouped data lets correlated/future rows appear in both sets — use time-aware or grouped splits. Symptom: validation looks great, production collapses.
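The train-test contamination case can be sketched with a hand-rolled standardizer. The helpers `fit_standardizer` and `transform` and the toy feature column are hypothetical; the same fit-on-train-only rule applies to imputers, encoders, and feature selection.

```python
import statistics


def fit_standardizer(values):
    """Learn mean and standard deviation from the given values only."""
    return statistics.fmean(values), statistics.pstdev(values)


def transform(values, mean, std):
    """Apply previously learned parameters; never refit on the data being transformed."""
    return [(v - mean) / std for v in values]


# Hypothetical feature column; the last two rows play the role of the test set.
feature = [1.0, 2.0, 3.0, 4.0, 100.0, 120.0]
train, test = feature[:4], feature[4:]

# Leaky: statistics computed on train + test before splitting.
leaky_mean, leaky_std = fit_standardizer(feature)

# Correct: fit on the training rows, then reuse those parameters on the test rows.
mean, std = fit_standardizer(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

print(f"leaky mean={leaky_mean:.2f}, train-only mean={mean:.2f}")
```

The leaky mean is pulled far toward the test rows, which is information the model would not have at training time.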
#data-leakage #target-leakage #pipeline #preprocessing
Advanced concept

Why must hyperparameter tuning use a separate validation set (or nested CV) rather than the test set, and what bias arises if you don't?

Repeatedly selecting models by test performance optimizes to that specific set, so the test score becomes an optimistically biased estimate of generalization: you've effectively trained on it via model selection (the multiple-comparisons / 'leaderboard overfitting' problem). Fix: a train/val/test three-way split, where you tune on val and report once on the untouched test set. For small data use nested CV: an inner loop tunes hyperparameters, and an outer loop gives a nearly unbiased estimate of generalization. The outer test fold never informs any modeling decision.
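The inner/outer structure of nested CV can be sketched with a deliberately trivial model. Everything here is hypothetical: the toy "model" predicts a shrunken training mean, `shrink` stands in for a real hyperparameter, and folds are contiguous rather than shuffled to keep the code short.

```python
def folds(n, k):
    """Contiguous k-fold index split (no shuffling, for brevity)."""
    bounds = [round(i * n / k) for i in range(k + 1)]
    for i in range(k):
        val = list(range(bounds[i], bounds[i + 1]))
        train = list(range(0, bounds[i])) + list(range(bounds[i + 1], n))
        yield train, val


def fit(ys, shrink):
    """Toy model: a constant prediction equal to the shrunken training mean."""
    return (1 - shrink) * sum(ys) / len(ys)


def mse(prediction, ys):
    return sum((y - prediction) ** 2 for y in ys) / len(ys)


def nested_cv(y, candidates, outer_k=3, inner_k=2):
    outer_scores = []
    for outer_train, outer_test in folds(len(y), outer_k):
        y_train = [y[i] for i in outer_train]

        # Inner loop: pick the hyperparameter using only the outer-training rows.
        def inner_score(shrink):
            scores = []
            for tr, va in folds(len(y_train), inner_k):
                model = fit([y_train[i] for i in tr], shrink)
                scores.append(mse(model, [y_train[i] for i in va]))
            return sum(scores) / len(scores)

        best = min(candidates, key=inner_score)

        # Outer loop: refit on all outer-training rows, score once on the held-out fold.
        model = fit(y_train, best)
        outer_scores.append(mse(model, [y[i] for i in outer_test]))
    return outer_scores


y = [2.0, 2.2, 1.9, 2.1, 2.0, 1.8, 2.3, 2.1, 1.9]
scores = nested_cv(y, candidates=[0.0, 0.1, 0.5])
print(scores, sum(scores) / len(scores))
```

The outer test rows are only touched in the final scoring line, after the hyperparameter has already been chosen.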
#nested-cv #hyperparameter-tuning #test-set #selection-bias
Advanced system-design

You have 1% positives. A model has 99% accuracy. Walk through which metrics to trust and how to pick a decision threshold.

99% accuracy is the no-skill baseline (predict all negative), so accuracy is useless here. Use precision, recall, F1, and PR-AUC, which ignore the dominating TN. Pick the operating threshold from business costs, not the default 0.5: if missing positives is expensive (fraud, disease) bias toward recall; if false alarms are costly bias toward precision; the precision-recall curve or an $F_\beta$ ($\beta>1$ weights recall) makes the tradeoff explicit. Calibrate probabilities if downstream decisions use them, and consider cost-sensitive learning or resampling.
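A threshold sweep that maximizes $F_\beta$ is one way to make the tradeoff explicit. The sketch below uses hypothetical scores and helper names, and assumes the rule "predict positive when score ≥ threshold".

```python
def f_beta(tp, fp, fn, beta):
    """F-beta from counts; beta > 1 weights recall more heavily than precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def best_threshold(y_true, y_score, beta):
    """Try each distinct score as a threshold and keep the one with the highest F-beta."""
    best_score, best_t = -1.0, None
    for t in sorted(set(y_score)):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        fn = sum(1 for y, s in zip(y_true, y_score) if s < t and y == 1)
        score = f_beta(tp, fp, fn, beta)
        if score > best_score:
            best_score, best_t = score, t
    return best_t, best_score


# Hypothetical validation scores; positives sit at 0.4, 0.7, and 0.9.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.05, 0.1, 0.15, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9]

recall_leaning = best_threshold(y_true, y_score, beta=2.0)
precision_leaning = best_threshold(y_true, y_score, beta=0.5)
print(recall_leaning, precision_leaning)
```

With $\beta=2$ the sweep settles on the lower threshold 0.4, accepting one false alarm to catch every positive; with $\beta=0.5$ it moves up to 0.7 and gives up one positive to avoid that false alarm.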
#imbalance #threshold #f-beta #pr-curve #accuracy-paradox
Expert math

Derive why the F1 score (harmonic mean) is more conservative than the arithmetic mean of precision and recall, and state the consequence.

For positive P and R, the AM-HM inequality gives $\frac{2PR}{P+R}\le\frac{P+R}{2}$, with equality iff $P=R$. The harmonic mean is dominated by the smaller value: if $P=0.9,R=0.1$, AM=0.5 but F1$=2(0.09)/1.0=0.18$. Consequence: F1 cannot be high unless both precision and recall are reasonably high, so it resists trivially maximizing one (e.g. predict everything positive → recall 1, precision equal to the prevalence → F1 still low). Together with ignoring TN, this is why F1 is more informative than accuracy for imbalanced detection.
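The inequality itself follows from one line of algebra, assuming only $P, R > 0$:

$$\frac{P+R}{2} - \frac{2PR}{P+R} = \frac{(P+R)^2 - 4PR}{2(P+R)} = \frac{(P-R)^2}{2(P+R)} \ge 0$$

The numerator is a square and the denominator is positive, so the arithmetic mean is never below F1, and the gap is zero exactly when $P=R$. For $P=0.9, R=0.1$ the gap is $0.64/2.0 = 0.32$, matching $0.5 - 0.18$.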
#f1 #harmonic-mean #am-hm #derivation
Expert concept

Explain the 'double descent' phenomenon and why it complicates the classical bias-variance picture.

Classical theory predicts test error is U-shaped in model complexity. Double descent shows that test error peaks near the interpolation threshold, where the model has just enough parameters to fit the training data exactly (variance explodes as the model is forced through every point), and then descends again as you keep over-parameterizing. In the overparameterized regime many interpolating solutions exist, and gradient descent's implicit regularization tends to select low-norm ones (exactly the minimum-norm solution for linear least squares started from zero), which often generalize well. This is one leading explanation for why huge neural nets and modern LLMs can generalize despite having the capacity to memorize training data, breaking the naive 'more parameters = overfit' intuition.
#double-descent #overparameterization #interpolation #implicit-regularization
Expert system-design

Your offline validation metrics are strong but the model degrades in production. Enumerate the diagnostic causes a principal engineer would check.

Train-serving skew: features computed differently or with different freshness online vs offline. Subtle leakage inflating offline scores (post-label features, contaminated preprocessing). Distribution shift, whether covariate shift (input distribution moves), label/prior shift, or concept drift (P(y|x) changes), making the static test set stale. Sampling bias: validation set not representative of live traffic. Feedback loops where the model alters the data it later sees. Metric mismatch: the offline metric (AUC) doesn't track the business KPI. Monitor live calibration and the population stability index (PSI) of features and scores, and run shadow/A-B tests rather than trusting frozen offline numbers.
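PSI, mentioned above as a monitoring signal, can be computed from binned counts. This is a sketch: the function name `psi`, the `eps` guard for empty bins, and the bin counts are assumptions, and both distributions must be binned with the same edges.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population stability index between a reference and a live binned distribution."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # eps avoids log(0) for empty bins; its value is an assumption.
        e_frac = max(e / e_total, eps)
        a_frac = max(a / a_total, eps)
        total += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return total


# Hypothetical bin counts for one feature: training data vs two live windows.
train_bins = [100, 300, 400, 200]
live_same = [50, 150, 200, 100]       # same proportions, half the volume
live_shifted = [300, 400, 200, 100]   # mass moved toward the low bins

psi_same = psi(train_bins, live_same)
psi_shifted = psi(train_bins, live_shifted)
print(psi_same, psi_shifted)
```

Identical proportions give a PSI of zero regardless of volume, while the shifted window produces a clearly positive value; what counts as an alarming level is a policy choice for the team.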
#distribution-shift #train-serving-skew #concept-drift #monitoring #feedback-loop
Expert system-design

For a cloud AutoML fraud-detection scenario where fraud is 0.5% of transactions and missed fraud costs far more than reviewing a false alarm, which evaluation metric should you optimize, and why are accuracy and ROC-AUC poor choices?

Optimize recall subject to a minimum acceptable precision or review budget (or an $F_\beta$ with $\beta>1$), and use PR-AUC to compare models before a threshold is fixed, because catching fraud matters most and false alarms are cheap. Unconstrained recall is not a usable target on its own, since flagging every transaction reaches recall 1. Accuracy fails: predicting 'all legitimate' yields 99.5% accuracy with zero fraud caught. ROC-AUC misleads because its FPR denominator is the huge legitimate pool, so it stays optimistically high even when precision is poor under extreme imbalance. PR-AUC and recall surface the rare-positive performance the business actually cares about.
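When the two error costs can be estimated, the threshold can be chosen by minimizing total cost directly. The sketch below is illustrative only: the cost values, scores, and function names are hypothetical.

```python
def total_cost(y_true, y_score, threshold, cost_fn, cost_fp):
    """Cost of missed positives plus cost of false alarms at a given threshold."""
    fn = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= threshold)
    return cost_fn * fn + cost_fp * fp


def cheapest_threshold(y_true, y_score, cost_fn, cost_fp):
    """Return the candidate threshold (a distinct score) with the lowest total cost."""
    candidates = sorted(set(y_score))
    return min(candidates, key=lambda t: total_cost(y_true, y_score, t, cost_fn, cost_fp))


# Hypothetical validation scores; fraud cases sit at 0.25, 0.5, 0.8, and 0.9.
y_true = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]
y_score = [0.05, 0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9]

fraud_like = cheapest_threshold(y_true, y_score, cost_fn=100, cost_fp=1)
review_limited = cheapest_threshold(y_true, y_score, cost_fn=1, cost_fp=100)
print(fraud_like, review_limited)
```

When a miss costs 100 times a false alarm, the chosen threshold drops to 0.25 so that no fraud is missed; reversing the costs pushes it up to 0.8.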
#fraud-detection #recall #pr-auc #imbalance #cost-sensitive
Expert concept

What is the difference between covariate shift, prior probability shift, and concept drift, and how does each affect whether your test-set evaluation remains valid?

Covariate shift: P(x) changes but P(y|x) is stable — the model is still correct where it has support, but a fixed test set may under-represent new input regions; reweight by importance or retrain. Prior (label) shift: P(y) changes (e.g. fraud rate rises) while P(x|y) holds — calibration and threshold-dependent metrics drift; recalibrate priors. Concept drift: P(y|x) itself changes — the learned mapping is now wrong, no reweighting fixes it, and fresh labels are needed. A frozen test set silently loses validity under all three; only concept drift requires relearning the function.
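For the prior-shift case, recalibrating to a new base rate can be done in closed form with Bayes' rule. This sketch assumes the model's outputs were calibrated under the training prior and that only P(y) moved; the function name and the example priors are hypothetical.

```python
def adjust_for_prior_shift(p, train_prior, new_prior):
    """Re-scale a calibrated P(y=1|x) from the training prior to a new prior.

    Assumes pure prior shift: P(x|y) is unchanged and only P(y) has moved.
    """
    pos = p * new_prior / train_prior
    neg = (1 - p) * (1 - new_prior) / (1 - train_prior)
    return pos / (pos + neg)


# Hypothetical scenario: fraud rate rises from 1% in training to 5% in production.
unchanged = adjust_for_prior_shift(0.30, train_prior=0.01, new_prior=0.01)
raised = adjust_for_prior_shift(0.50, train_prior=0.01, new_prior=0.05)
print(unchanged, raised)
```

The same score of 0.5 corresponds to a much higher posterior once positives are five times more common, which is why thresholds tuned on the old prior drift. This correction does nothing for concept drift, where P(y|x) itself has changed.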
#covariate-shift #concept-drift #prior-shift #test-validity #recalibration