Feature Engineering & Data Prep

Feature Engineering & Data Prep — interview questions and answers with clear explanations.

Foundational · concept

What is the difference between one-hot encoding and ordinal (label) encoding, and when would you choose each?

One-hot encoding creates a binary indicator column per category, treating categories as unordered and equidistant — correct for nominal features fed to linear models, SVMs, and neural nets. Ordinal encoding maps categories to integers, implying a rank and magnitude. Use ordinal only when a true order exists (low<medium<high) or for tree models that split on thresholds and don't assume linearity. Applying integer labels to nominal data in a linear model fabricates a false ordinal relationship; one-hot on high-cardinality features explodes dimensionality, where target or hashing encoding is preferable.
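As a rough illustration of the two encodings, here is a minimal NumPy sketch. The `colors` and `sizes` arrays and the `size_rank` mapping are made-up examples, not part of any particular library's API.

```python
import numpy as np

colors = np.array(["red", "green", "blue", "green"])   # nominal: no natural order
sizes = np.array(["low", "high", "medium", "low"])     # ordinal: low < medium < high

# One-hot: one indicator column per category (columns follow sorted category order).
color_levels = np.unique(colors)
one_hot = (colors[:, None] == color_levels[None, :]).astype(int)

# Ordinal: an explicit mapping so the integers follow the real order,
# rather than whatever order an automatic label encoder would pick.
size_rank = {"low": 0, "medium": 1, "high": 2}
ordinal = np.array([size_rank[s] for s in sizes])
```

Each one-hot row has exactly one 1, so no category is "larger" than another, while the ordinal integers deliberately carry the rank.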
#encoding #one-hot #ordinal #categorical
Intermediate · concept

Why must scaling/normalization parameters (mean, std, min, max) be fit on the training set only, and what bug occurs if you fit on the full dataset?

Fitting a scaler on train+test lets test-set statistics leak into training, inflating validation scores and producing optimistic, non-reproducible generalization estimates. The correct pipeline calls fit on train, then transform on train, validation, and test with those stored parameters — and in cross-validation the scaler is re-fit inside each fold. The bug is data leakage: at inference time future/test data is unavailable, so any statistic computed over it can't legitimately inform the transform. Use Pipeline/ColumnTransformer so the fit is fold-scoped automatically.
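The fit-on-train-only pattern might look like the sketch below, assuming scikit-learn is available. The synthetic data and variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))          # synthetic features
y = (X[:, 0] + rng.normal(scale=5.0, size=200) > 50.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Manual version: statistics come from the training rows only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # reuses the stored train mean/std

# Pipeline version: the scaler is re-fit on the training part of every fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)
```

Passing the whole pipeline to `cross_val_score` is what keeps the scaler's fit scoped to each fold; scaling `X_train` once up front and then cross-validating would reintroduce the leak.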
#scaling #normalization #leakage #cross-validation
Intermediate · concept

Compare standardization (z-score), min-max scaling, and robust scaling. Which is appropriate in the presence of outliers, and why?

Standardization subtracts the mean and divides by the standard deviation, centering at 0 with unit variance, but both statistics are outlier-sensitive, so extremes distort the scale. Min-max maps to a fixed range like $[0,1]$ using the min and max, and is even more outlier-fragile, since a single extreme point compresses all other values into a narrow band. Robust scaling subtracts the median and divides by the IQR ($Q_3-Q_1$), using order statistics that resist outliers; it is the preferred choice with heavy tails or anomalies. Min-max suits bounded inputs (image pixels); standardization suits roughly Gaussian features and distance/gradient-based models.
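To make the outlier effect concrete, the sketch below applies the three formulas by hand to a toy vector with one extreme value. The numbers are invented for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])   # last value is the outlier

# Standardization: mean and std are both dragged by the outlier.
z = (x - x.mean()) / x.std()

# Min-max: the outlier defines the max, squeezing the rest toward 0.
mm = (x - x.min()) / (x.max() - x.min())

# Robust scaling: median and IQR barely notice the outlier.
q1, med, q3 = np.percentile(x, [25, 50, 75])
rb = (x - med) / (q3 - q1)

# Spread of the five ordinary points under each scaling.
spread_z = z[:5].max() - z[:5].min()
spread_mm = mm[:5].max() - mm[:5].min()
spread_rb = rb[:5].max() - rb[:5].min()
```

Under standardization and min-max the five ordinary points end up almost indistinguishable, while robust scaling keeps them spread over a usable range.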
#scaling #outliers #robust-scaler #standardization
Intermediate · concept

How should a raw datetime field be feature-engineered for a model with no native time awareness? Address cyclicity.

Decompose the timestamp into usable components: hour, day-of-week, day-of-month, month, quarter, year, is_weekend, is_holiday, and deltas (days since signup, time-to-event). Critically, cyclic features like hour or month aren't linear — hour 23 and hour 0 are adjacent, but as integers they're maximally apart. Encode them with sine/cosine pairs: $\sin(2\pi t/T)$ and $\cos(2\pi t/T)$ where $T$ is the period (24, 7, 12), placing each value on a circle so the model sees 23↔0 proximity. For tree models the raw integer split often suffices; the sin/cos trick matters most for linear and neural models. Avoid leaking future-derived aggregates.
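The sine/cosine encoding can be sketched in a few lines of NumPy. The helper name `cyclical_encode` is hypothetical.

```python
import numpy as np

def cyclical_encode(values, period):
    """Map values with the given period onto the unit circle as (sin, cos)."""
    angle = 2.0 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

hours = np.array([0, 6, 12, 23])
enc = cyclical_encode(hours, period=24)

# Euclidean distances in the encoded space.
dist_23_0 = np.linalg.norm(enc[3] - enc[0])   # neighbours on the clock
dist_12_0 = np.linalg.norm(enc[2] - enc[0])   # opposite sides of the clock
```

In the encoded space, hour 23 sits close to hour 0 and hour 12 sits farthest from it, which matches the clock rather than the integer line. Both columns are needed: sine alone cannot distinguish, for example, hour 3 from hour 9.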
#datetime #cyclical #feature-engineering #encoding
Intermediate · concept

Compare filter, wrapper, and embedded feature-selection methods. Give a failure mode of univariate filter selection.

Filter methods score features independently of any model (correlation, mutual information, chi-squared, variance threshold) — fast and model-agnostic but blind to feature interactions. Wrapper methods (recursive feature elimination, forward/backward selection) train a model on feature subsets and search by validation score — accurate but expensive and prone to overfitting the search. Embedded methods select during training (L1/Lasso zeroing coefficients, tree feature importances, gradient-boosting gains) — a good cost/quality balance. The classic filter failure: a feature with zero univariate correlation can be highly predictive in combination (XOR), so univariate filtering discards jointly-informative features and keeps redundant correlated ones.
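The XOR failure mode is easy to reproduce on a tiny synthetic dataset. In this sketch, Pearson correlation stands in for a generic univariate filter score.

```python
import numpy as np

# Balanced dataset covering all four (x1, x2) combinations equally often.
x1 = np.tile([0, 0, 1, 1], 25)
x2 = np.tile([0, 1, 0, 1], 25)
y = x1 ^ x2                                   # XOR target

# Univariate scores: each feature alone looks useless.
corr_x1 = np.corrcoef(x1, y)[0, 1]
corr_x2 = np.corrcoef(x2, y)[0, 1]

# Jointly the two features determine the target exactly.
differs = (x1 != x2).astype(int)
corr_joint = np.corrcoef(differs, y)[0, 1]
```

A filter that thresholds on `corr_x1` and `corr_x2` would drop both features, even though together they predict `y` perfectly.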
#feature-selection #filter #wrapper #embedded
Advanced · concept

Explain target (mean) encoding and the leakage it introduces. How do smoothing and out-of-fold encoding mitigate it?

Target encoding replaces each category with the mean target value for that category — powerful for high-cardinality features but it leaks the label: a category seen once is encoded as that row's own target, memorizing it. Smoothing blends the category mean with the global prior, $\hat{y}_c = \frac{n_c\bar{y}_c + m\bar{y}}{n_c + m}$, shrinking rare categories toward the prior to cut variance. Out-of-fold (leave-one-out or K-fold) encoding computes each row's value from data excluding that row/fold, breaking the direct target-to-feature path. Add noise and always fit the encoder inside CV folds to prevent optimistic bias.
#target-encoding #leakage #smoothing #high-cardinality
Advanced · concept

Distinguish MCAR, MAR, and MNAR missingness. Why does the mechanism determine whether simple imputation is valid?

MCAR (missing completely at random): missingness is independent of all data, so listwise deletion is unbiased, just less efficient; mean imputation keeps the mean unbiased but still understates variance and attenuates correlations. MAR (missing at random): missingness depends on observed variables but not on the missing value itself, so conditional/model-based imputation (MICE, regression on observed covariates) is unbiased. MNAR (missing not at random): missingness depends on the unobserved value (e.g. high earners hiding income), so no method using only observed data is unbiased; you need an explicit missingness model, and for prediction a 'missing' indicator can at least capture the signal. The mechanism matters because mean/median imputation is only defensible under MCAR; under MAR/MNAR it also biases the estimates themselves.
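A small simulation can show why the mechanism matters. The sketch below uses a synthetic log-normal "income" column and hides values either completely at random or whenever they are high. The distribution, the 30% rate, and the helper name are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
income = rng.lognormal(mean=10.0, sigma=0.5, size=20_000)   # synthetic ground truth

# MCAR: every value has the same 30% chance of being hidden.
mcar_mask = rng.random(income.size) < 0.3
# MNAR: the top 30% of earners hide their income.
mnar_mask = income > np.quantile(income, 0.7)

def mean_impute(x, missing):
    """Fill missing entries with the mean of the observed entries."""
    filled = x.copy()
    filled[missing] = x[~missing].mean()
    return filled

mcar_filled = mean_impute(income, mcar_mask)
mnar_filled = mean_impute(income, mnar_mask)

true_mean, true_std = income.mean(), income.std()
```

Comparing `mcar_filled.mean()` and `mnar_filled.mean()` against `true_mean` shows the MCAR case staying close while the MNAR case is pulled well below it. In both cases the imputed column's standard deviation falls below `true_std`.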
#missing-data #mcar #mar #mnar #imputation
Advanced · concept

You add a binary 'was-missing' indicator column alongside imputing the value. What does this buy you, and what is the risk?

The indicator lets the model learn signal from the fact of missingness itself — often informative under MAR/MNAR (a blank 'income' may correlate with the target). It decouples 'value present and equals X' from 'value was absent and imputed to X', so the imputed constant isn't confused with genuine observations, preserving information a plain mean-impute destroys. Risks: it doubles columns for high-missingness data, can overfit if missingness patterns differ between train and serving, and if missingness is an artifact of leakage (a field only filled post-outcome) the indicator imports that leakage. Validate that the missingness mechanism is stable across train/serve.
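A minimal sketch of impute-plus-indicator, with the fill value learned from training data only. The toy arrays and the helper name `impute_with_indicator` are hypothetical.

```python
import numpy as np

x_train = np.array([3.0, np.nan, 5.0, 7.0, np.nan])
x_serve = np.array([np.nan, 4.0])

# Learn the fill value from the training rows only.
fill_value = np.nanmedian(x_train)

def impute_with_indicator(x, fill):
    """Return two columns: the imputed value and a 0/1 was-missing flag."""
    was_missing = np.isnan(x)
    filled = np.where(was_missing, fill, x)
    return np.column_stack([filled, was_missing.astype(float)])

train_features = impute_with_indicator(x_train, fill_value)
serve_features = impute_with_indicator(x_serve, fill_value)
```

The second and third training rows both carry the value 5.0, but only the second has the flag set, so the model can tell an imputed 5.0 from an observed one.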
#missing-data #imputation #indicator #leakage
Advanced · concept

Define data leakage from feature engineering and give three concrete, easy-to-miss examples in a tabular ML pipeline.

Leakage is when information unavailable at prediction time, or derived from the target/test set, contaminates training features — inflating offline metrics and collapsing in production. Subtle cases: (1) fitting scalers/imputers/encoders on the full dataset before the train/test split, so test statistics leak in; (2) target leakage from features computed after or as a function of the outcome ('number_of_late_payments' when predicting default, or an ID assigned post-decision); (3) temporal leakage — using future aggregates, or random K-fold on time-series so the model trains on data chronologically after the validation rows. Also: oversampling/SMOTE before the split, and target encoding without out-of-fold.
#leakage #feature-engineering #temporal #preprocessing
Advanced · concept

Why is SMOTE applied AFTER the train/test split and INSIDE cross-validation folds, and what failure occurs if you SMOTE first?

SMOTE synthesizes minority samples by interpolating between a point and its k-nearest minority neighbors. Applied before the split, a synthetic point interpolated from a real minority example can land in train while that example (or its neighbor) lands in test, so the model effectively sees near-copies of test rows, leaking and yielding wildly optimistic recall/AUC that won't reproduce. In addition, the validation fold in CV must reflect the real, imbalanced distribution to estimate deployment performance honestly; resampling it distorts the metric. Correct order: split first, then within each training fold run SMOTE on train only and evaluate on the untouched, naturally-imbalanced validation fold (use an imblearn Pipeline).
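The ordering can be illustrated without any extra dependency. The sketch below uses a hand-rolled, simplified SMOTE-style interpolation (`smote_like` is a hypothetical helper, not imblearn's implementation) on synthetic two-class data.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Simplified SMOTE: interpolate between a minority point and one of
    its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                      # a point is not its own neighbour
    neighbors = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                     # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

rng = np.random.default_rng(1)
X_maj = rng.normal(0.0, 1.0, size=(180, 2))
X_min = rng.normal(2.0, 1.0, size=(20, 2))

# 1) Split first (rows are i.i.d., so slicing gives a simple stratified split).
X_train = np.vstack([X_maj[:135], X_min[:15]])
y_train = np.array([0] * 135 + [1] * 15)
X_test = np.vstack([X_maj[135:], X_min[15:]])
y_test = np.array([0] * 45 + [1] * 5)

# 2) Oversample the TRAINING minority only; the test set keeps its natural ratio.
synthetic = smote_like(X_train[y_train == 1], n_new=120, k=3)
X_train_bal = np.vstack([X_train, synthetic])
y_train_bal = np.concatenate([y_train, np.ones(120, dtype=int)])
```

Because every synthetic row is built from training minority points alone, no test row has a near-copy in the training data. Inside cross-validation the same two steps would be repeated per fold.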
#smote #imbalanced #leakage #cross-validation
Advanced · concept

Contrast class weighting, oversampling (SMOTE), and undersampling for imbalanced classification. When is simply not resampling the right call?

Class weighting reweights the loss to penalize minority errors more (class_weight='balanced'), changing nothing about the data and avoiding synthetic artifacts — usually the first thing to try. Oversampling/SMOTE adds minority examples, raising recall but risking overfit and noisy synthetic points near class boundaries. Undersampling discards majority data — fast but throws away information. Often the best move is no resampling: keep the natural distribution, optimize a threshold on calibrated probabilities for your cost matrix, and use ranking metrics (PR-AUC, recall@k). Resampling distorts predicted probabilities (needs recalibration) and rarely beats good thresholding plus a proper scoring rule.
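The "optimize a threshold for your cost matrix" step can be sketched as follows. It assumes calibrated probabilities, zero cost for correct decisions, and illustrative costs of 1 per false positive and 9 per false negative. The function names are hypothetical.

```python
import numpy as np

def cost_optimal_threshold(cost_fp, cost_fn):
    # Predict positive when p * cost_fn > (1 - p) * cost_fp,
    # i.e. when p exceeds cost_fp / (cost_fp + cost_fn).
    return cost_fp / (cost_fp + cost_fn)

def expected_cost(p, threshold, cost_fp, cost_fn):
    """Average expected cost per row if p is the true positive probability."""
    predict_pos = p >= threshold
    per_row = np.where(predict_pos, (1.0 - p) * cost_fp, p * cost_fn)
    return float(per_row.mean())

probs = np.linspace(0.01, 0.99, 99)        # stand-in for calibrated model outputs
t_star = cost_optimal_threshold(cost_fp=1.0, cost_fn=9.0)
cost_at_t_star = expected_cost(probs, t_star, 1.0, 9.0)
cost_at_half = expected_cost(probs, 0.5, 1.0, 9.0)
```

With false negatives nine times as costly, the break-even threshold drops well below 0.5, and the default 0.5 cutoff incurs a higher expected cost on the same scores. No resampling is involved.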
#imbalanced #class-weights #smote #threshold-tuning
Advanced · coding

Implement target encoding with additive smoothing in pseudo-code, computed out-of-fold to avoid leakage.

global_mean = mean(y_train); for each of K folds, fit on the other folds: for category c compute n_c, sum_c over the fit-folds, then enc[c] = (sum_c + m*global_mean) / (n_c + m) where m is the smoothing strength; assign enc[c] to the held-out fold's rows; unseen categories map to global_mean. For the test set, fit the encoder on ALL training rows once and apply. Invariants: a row is never encoded using its own target, rare categories shrink toward global_mean via m, and the same fitted map (not refit on test) transforms test data. Optionally add Gaussian noise to training encodings to reduce overfit.
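One way the pseudo-code above could be turned into runnable Python is sketched below. The function names, the random fold assignment, and the toy data are assumptions; the smoothing formula and the global-mean fallback follow the description.

```python
import numpy as np

def smoothed_map(cats, y, m, prior):
    """Map each category c to (sum_c + m * prior) / (n_c + m)."""
    levels, inverse = np.unique(cats, return_inverse=True)
    n_c = np.bincount(inverse)
    sum_c = np.bincount(inverse, weights=y)
    return dict(zip(levels, (sum_c + m * prior) / (n_c + m)))

def oof_target_encode(cats_train, y_train, cats_test, k=5, m=10.0, seed=0):
    cats_train = np.asarray(cats_train)
    y_train = np.asarray(y_train, dtype=float)
    global_mean = y_train.mean()

    # Randomly assign each training row to one of k folds.
    rng = np.random.default_rng(seed)
    fold_of_row = rng.permutation(len(y_train)) % k

    enc_train = np.empty(len(y_train))
    for fold in range(k):
        held_out = fold_of_row == fold
        # Category statistics come from the OTHER folds only.
        mapping = smoothed_map(cats_train[~held_out], y_train[~held_out], m, global_mean)
        enc_train[held_out] = [mapping.get(c, global_mean) for c in cats_train[held_out]]

    # Test rows use a single map fit once on all training rows.
    full_map = smoothed_map(cats_train, y_train, m, global_mean)
    enc_test = np.array([full_map.get(c, global_mean) for c in cats_test])
    return enc_train, enc_test

cats = np.array(["a"] * 6 + ["b"] * 3 + ["rare"])
y = np.array([1, 1, 1, 0, 1, 1, 0, 0, 1, 1])
enc_train, enc_test = oof_target_encode(cats, y, ["a", "rare", "unseen"], k=5, m=4.0)
```

In this toy run the single `"rare"` training row is encoded as the global mean, because no other fold contains that category, so its own label never enters its category statistics.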
#target-encoding #coding #smoothing #out-of-fold
Advanced · concept

What are feature crosses, why do they help linear models, and what is the scaling/sparsity problem with crossing high-cardinality categoricals?

A feature cross is a synthetic feature combining two or more features (country×device, or binned-latitude×binned-longitude), letting a linear model learn interaction effects it otherwise can't represent — each combination gets its own weight, effectively memorizing region-specific behavior. The problem: crossing categoricals of cardinality $a$ and $b$ yields up to $a\times b$ columns, exploding dimensionality and producing extremely sparse, rarely-observed combinations that overfit and bloat memory. Mitigations: the hashing trick to bound output dimension, frequency thresholds to drop rare crosses, embeddings to learn dense low-rank interactions, or letting tree/DNN models learn interactions implicitly instead of hand-crossing.
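As a sketch of the hashing-trick mitigation, the code below crosses two categoricals and hashes each pair into a fixed number of buckets. The bucket count, the `_x_` separator, and the helper names are arbitrary choices for illustration. `zlib.crc32` is used because Python's built-in `hash` for strings is not stable across processes.

```python
import zlib
import numpy as np

def hashed_cross(a, b, n_buckets):
    """Deterministically map the pair (a, b) to a bucket index."""
    key = f"{a}_x_{b}".encode("utf-8")
    return zlib.crc32(key) % n_buckets

def encode_crosses(pairs, n_buckets):
    """One-hot encode crossed pairs into a fixed-width matrix."""
    X = np.zeros((len(pairs), n_buckets), dtype=np.float32)
    for row, (a, b) in enumerate(pairs):
        X[row, hashed_cross(a, b, n_buckets)] = 1.0
    return X

pairs = [("US", "mobile"), ("US", "desktop"), ("DE", "mobile"), ("US", "mobile")]
X_cross = encode_crosses(pairs, n_buckets=16)
```

The output width stays at `n_buckets` no matter how many distinct pairs appear. The trade-off is that unrelated pairs can collide in the same bucket.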
#feature-crosses #interactions #sparsity #hashing
Expert · system-design

When and how do you use a learned embedding (from a neural net or a pretrained text/graph model) as a feature in a downstream model? Name the leakage and dimensionality risks.

Embeddings map high-cardinality or unstructured inputs (text, IDs, items, graph nodes) into a dense low-dimensional vector capturing similarity, then feed that vector into a downstream model (e.g. gradient boosting) — useful when one-hot is too sparse or when transferring from a pretrained model (sentence transformers, item2vec). Risks: (1) leakage if the embedding is trained on the same labels/rows used for downstream evaluation, or on future interactions — train embeddings only on train-split, pre-outcome data; (2) the dense $d$-dim vector can dominate or add noise, so consider dimensionality reduction (PCA) and regularization; (3) cold-start/unseen entities need a fallback (mean vector). Freeze pretrained embeddings unless you have enough data to fine-tune.
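A rough sketch of the lookup, fallback, and reduction steps follows. Random vectors stand in for a frozen pretrained embedding table, and the PCA is done with a plain SVD fit on training rows only. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a frozen, pretrained table: 50 known items, 8 dimensions each.
item_vectors = {f"item_{i}": rng.normal(size=8) for i in range(50)}
fallback = np.mean(list(item_vectors.values()), axis=0)   # cold-start vector

def embed(ids):
    """Look up each id, using the mean vector for unseen entities."""
    return np.vstack([item_vectors.get(i, fallback) for i in ids])

train_ids = [f"item_{i}" for i in range(40)]
serve_ids = ["item_3", "item_never_seen"]

# Fit the PCA projection on training embeddings only, then reuse it at serve time.
E_train = embed(train_ids)
center = E_train.mean(axis=0)
_, _, vt = np.linalg.svd(E_train - center, full_matrices=False)
components = vt[:3]                                       # keep 3 principal directions

def reduce_dim(E):
    return (E - center) @ components.T

Z_train = reduce_dim(E_train)
Z_serve = reduce_dim(embed(serve_ids))
```

`Z_train` and `Z_serve` are the low-dimensional columns that would be appended to the downstream model's feature matrix.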
#embeddings #transfer-learning #leakage #dimensionality
Expert · system-design

A model with strong offline AUC degrades sharply in production. Walk through how feature-level training-serving skew causes this and how you'd detect it.

Training-serving skew arises when a feature is computed differently, or from different data, at train vs serve time: an aggregate over the full historical window offline but a partial window online; a time-zone or unit mismatch; a differing imputation default; or a feature present in the labeled backfill but late-arriving (or absent) at request time, so production sees nulls the model never trained on. The model relies on a distribution it won't see live. Prevent it with a feature store enforcing a single transform shared by both paths and point-in-time-correct joins (which also prevent label leakage). Detect it with continuous monitoring of per-feature distributions (PSI/KL train-vs-serving), plus logging served feature values and replaying them offline to compare against the training-time computation.
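Per-feature PSI monitoring could be sketched as below. Bins are taken from training quantiles, and the bin count, the epsilon guard, and the synthetic samples are assumptions.

```python
import numpy as np

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a training and a serving sample."""
    # Interior bin edges from training quantiles; the outer bins are open-ended.
    edges = np.quantile(expected, np.linspace(0.0, 1.0, n_bins + 1))[1:-1]
    e_frac = np.bincount(np.searchsorted(edges, expected), minlength=n_bins) / len(expected)
    a_frac = np.bincount(np.searchsorted(edges, actual), minlength=n_bins) / len(actual)
    # Guard against empty bins before taking the log.
    e_frac = np.clip(e_frac, eps, None)
    a_frac = np.clip(a_frac, eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train_values = rng.normal(0.0, 1.0, 5000)
serve_stable = rng.normal(0.0, 1.0, 5000)     # same distribution as training
serve_shifted = rng.normal(1.0, 1.0, 5000)    # e.g. a unit or default mismatch

psi_stable = psi(train_values, serve_stable)
psi_shifted = psi(train_values, serve_shifted)
```

A value near zero means the served feature still looks like training data. The shifted sample produces a much larger value and would be a candidate for an alert.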
#training-serving-skew #feature-store #monitoring #leakage
Expert · concept

Why can naive outlier removal harm a model, and what are principled alternatives to deleting extreme values?

Deleting outliers assumes they're errors, but extremes are often the most informative (fraud, churn, equipment failure) — removing them discards exactly the signal you care about and biases the model toward the bulk distribution; removing based on the target also leaks. And fitting an outlier rule on the full dataset mirrors preprocessing leakage. Principled alternatives: winsorize/clip to robust percentiles (fit on train), apply variance-stabilizing transforms (log, Box-Cox, Yeo-Johnson) to compress tails, use robust scalers (median/IQR), choose outlier-robust losses/models (Huber loss, tree ensembles), or add an indicator flag rather than dropping rows — letting the model decide the extreme's importance.
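Clipping to train-fitted percentiles while keeping a flag might look like the sketch below. The 1st/99th percentile bounds and the synthetic data are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.concatenate([
    rng.normal(100.0, 10.0, 995),                 # bulk of the data
    [1e4, 2e4, -5e3, 3e4, 5e4],                   # a few extreme values
])
x_test = np.array([95.0, 1e6, -1e6])

# Fit the clipping bounds on the training data only.
lo, hi = np.percentile(x_train, [1, 99])

# Clip instead of dropping rows, and keep a flag so the model can still
# learn from the fact that a value was extreme.
was_clipped = (x_test < lo) | (x_test > hi)
x_test_clipped = np.clip(x_test, lo, hi)
```

No rows are deleted: the ordinary test value passes through unchanged, and the two extreme ones are pulled to the train-derived bounds with `was_clipped` set.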
#outliers #winsorize #robust-loss #transforms