Classical ML (Supervised & Unsupervised)

Classical ML (Supervised & Unsupervised) — interview questions and answers with clear explanations.

Study these interactively →
Foundationalconcept

What distinguishes a generative classifier from a discriminative one? Give one example of each.

A generative model learns the joint $p(x,y)$ — typically $p(x\mid y)$ and prior $p(y)$ — then applies Bayes' rule for $p(y\mid x)$; examples: Naive Bayes, GMM/LDA classifiers. A discriminative model learns $p(y\mid x)$ (or just a decision boundary) directly without modeling how $x$ is generated; examples: logistic regression, SVM. Generative models can sample data, handle missing features, and often win with little data; discriminative models usually reach lower asymptotic error because they optimize exactly the quantity used for prediction.
#generative#discriminative#naive-bayes#logistic-regression
Foundationalconcept

Why is logistic regression trained by minimizing log-loss (cross-entropy) rather than squared error?

Log-loss is the negative log-likelihood under the Bernoulli model implied by the sigmoid, so minimizing it is maximum likelihood. It is also convex in the weights, giving a unique global optimum, whereas squared error on the sigmoid output is non-convex with flat regions where gradient descent stalls. Cross-entropy's gradient $(\hat{y}-y)x$ stays large when predictions are confidently wrong, giving strong signal; squared-error gradients vanish in the sigmoid's saturated tails — killing learning exactly where correction is most needed.
#logistic-regression#cross-entropy#convexity#mle
Foundationalconcept

In one sentence each, contrast bagging and boosting by which error component they primarily reduce.

Bagging (e.g. random forests) trains many high-variance, low-bias learners in parallel on bootstrap samples and averages them, reducing variance while leaving bias roughly unchanged. Boosting (e.g. gradient boosting, AdaBoost) trains weak, high-bias learners sequentially, each fitting the residual errors of the ensemble so far, reducing bias (and some variance), which is why shallow trees suffice as base learners. Hence bagging is robust to overfitting and parallelizable, whereas boosting can overfit if run too long and is inherently sequential.
#bagging#boosting#bias-variance#ensembles
Intermediateconcept

What is the kernel trick, and why does it let an SVM separate non-linearly separable data without explicit feature maps?

The SVM dual and decision function depend on training points only through inner products $\langle x_i,x_j\rangle$. A kernel $K(x_i,x_j)=\langle\phi(x_i),\phi(x_j)\rangle$ computes that inner product in a (possibly infinite-dimensional) feature space directly from inputs, so you never form $\phi(x)$. Mercer's condition (PSD $K$) guarantees such a $\phi$ exists. A linear separator in feature space is a non-linear boundary in input space, so an RBF kernel — an infinite-dimensional map — can carve arbitrarily curved boundaries while all computation stays in input dimensions.
#svm#kernel-trick#rbf#mercer
Intermediateconcept

Why does kNN suffer badly from the curse of dimensionality, and what breaks specifically?

In high dimensions distances concentrate: the ratio of nearest to farthest neighbor distance approaches 1, so 'nearest' loses meaning and every point looks roughly equidistant — the locality assumption underpinning kNN collapses. To keep a fixed fraction of data in a neighborhood, the neighborhood's edge length must grow toward the full range as $d$ rises, so neighbors are no longer local, and populating the space needs exponentially more samples. Accuracy degrades and irrelevant features dilute the metric. Mitigations: dimensionality reduction, learned/weighted metrics, feature selection.
#knn#curse-of-dimensionality#distance-metrics
Intermediateconcept

Decision trees use Gini impurity or entropy for splits. How do they differ, and does the choice usually matter?

For class proportions $p_k$, Gini is $1-\sum_k p_k^2$ and entropy is $-\sum_k p_k\log p_k$. Both are maximized at a uniform distribution and zero at purity, and the split criterion maximizes impurity reduction (information gain for entropy). Entropy penalizes uncertain nodes slightly more (sharper peak) and is marginally costlier due to the log. In practice they produce very similar trees, and the choice rarely changes performance — regularization (depth, min-samples, pruning) matters far more than Gini-vs-entropy.
#decision-trees#gini#entropy#information-gain
Intermediateconcept

Naive Bayes makes a 'naive' conditional independence assumption that is usually false, yet it often classifies well. Why?

Naive Bayes assumes features are conditionally independent given the class, so $p(x\mid y)=\prod_j p(x_j\mid y)$ — rarely literally true, which makes its probability estimates poorly calibrated (often pushed toward 0/1). But classification needs only the argmax of the posterior, not accurate probabilities. As long as the true class still gets the highest score — i.e. dependence errors don't flip the ranking — the decision is correct. Correlated features cause double-counting that biases the magnitude of $\log p(y\mid x)$ but often in a rank-preserving way. So NB can have low classification error despite high probability error; it's a strong, fast baseline for high-dimensional sparse text.
#naive-bayes#conditional-independence#calibration#argmax
Intermediateconcept

For a binary SVM, what is the role of the slack penalty C, and how does varying it trade off bias and variance via the margin?

The soft-margin SVM minimizes $\tfrac12\lVert w\rVert^2 + C\sum_i \xi_i$, where $\xi_i$ are slacks allowing margin violations; $C$ weights those violations (inverse regularization). Large $C$ heavily penalizes violations, shrinking the margin to fit training data tightly — low bias, high variance, prone to overfitting and outlier-sensitive. Small $C$ tolerates violations for a wider margin — higher bias, lower variance, smoother boundary, more robust to noise but may underfit. $C$ is tuned by cross-validation jointly with the RBF kernel's $\gamma$, since the two interact strongly in controlling effective complexity.
#svm#soft-margin#regularization#bias-variance
Intermediatecert

Cert-style: a managed AutoML/tabular service offers Linear, XGBoost, and a deep tabular net. Your data is 12k rows, 40 mixed numeric/categorical features, heavy non-linear interactions, some missing values, and the stakeholder wants quick feature importance. Which baseline and why?

Pick gradient-boosted trees (XGBoost). For small-to-medium tabular data (12k rows) with non-linear interactions and mixed types, GBDTs are the consistent state of the art and typically beat deep nets, which need far more data and tuning on tabular. XGBoost handles missing values natively via default-direction learning at splits, captures interactions without manual engineering, trains fast, and provides built-in feature importance (gain/SHAP) the stakeholder wants. Linear would underfit the interactions; the deep tabular net is overkill and data-hungry here. Canonical answer: tabular + moderate size + interactions → boosted-trees baseline.
#xgboost#tabular#automl#feature-importance
Advancedconcept

Random forests inject two sources of randomness. What are they, and why is feature subsampling at each split the more important one?

The two are bootstrap sampling of rows (bagging) and random subsampling of $m$ candidate features at each split. Bagging alone leaves trees highly correlated because a few dominant features get chosen first in nearly every tree, and averaging correlated estimators barely reduces variance — the variance of the mean is $\rho\sigma^2+\frac{1-\rho}{B}\sigma^2$, floored by correlation $\rho$. Feature subsampling forces trees to use different features, decorrelating them (lowering $\rho$), which is what actually drives variance reduction. Smaller $m$ means more decorrelation but higher individual-tree bias.
#random-forest#decorrelation#variance#feature-subsampling
Advancedmath

Derive the update in gradient boosting and explain in what sense it is gradient descent in function space.

Gradient boosting minimizes $L(y,F(x))$ over the function $F$. At iteration $m$, treating $F$ as the variable, the negative functional gradient at each point is $r_{im}=-[\partial L(y_i,F(x_i))/\partial F(x_i)]_{F=F_{m-1}}$ — the pseudo-residuals. A weak learner $h_m$ is fit to these by least squares, then $F_m=F_{m-1}+\nu\gamma_m h_m$, where $\gamma_m$ solves a line search and $\nu$ is the learning rate. Each tree approximates the steepest-descent direction in function space evaluated at the data, so the ensemble sums small steps down the loss surface. For squared loss the residuals are simply $y_i-F_{m-1}(x_i)$.
#gradient-boosting#functional-gradient#pseudo-residuals#derivation
Advancedmath

XGBoost uses a second-order Taylor expansion of the loss. Write the optimal leaf weight and the split gain.

XGBoost approximates per-instance loss by $g_i f(x_i)+\tfrac12 h_i f(x_i)^2$ where $g_i,h_i$ are the first and second derivatives w.r.t. the current prediction, plus $\Omega(f)=\gamma T+\tfrac12\lambda\sum_j w_j^2$. For leaf $j$ with instances $I_j$, the optimal weight is $w_j^*=-\frac{\sum_{i\in I_j} g_i}{\sum_{i\in I_j} h_i+\lambda}$ and the structure score is $-\tfrac12\sum_j \frac{(\sum g_i)^2}{\sum h_i+\lambda}+\gamma T$. A split's gain is $\tfrac12[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}]-\gamma$; negative gain stops the split.
#xgboost#newton-boosting#regularization#split-gain
Advancedconcept

How does LightGBM differ from XGBoost in tree growth and feature handling? Name its two signature techniques and the tradeoff of leaf-wise growth.

LightGBM uses GOSS (Gradient-based One-Side Sampling) — keep all large-gradient instances and randomly subsample small-gradient ones, reweighting to stay unbiased — and EFB (Exclusive Feature Bundling), which bundles mutually-exclusive sparse features to shrink the feature count. It grows trees leaf-wise (best-first: split the leaf with max loss reduction) rather than XGBoost's level-wise growth. Leaf-wise yields lower loss for the same number of leaves and trains faster, but goes deeper on informative paths and overfits more readily on small data — so it needs num_leaves/max_depth/min_data_in_leaf limits. Both use histogram-binned features.
#lightgbm#goss#efb#leaf-wise
Advancedmath

k-means minimizes within-cluster sum of squares. Prove it monotonically decreases the objective, and explain why it can still reach a bad local optimum.

The objective is $J=\sum_i \lVert x_i-\mu_{c_i}\rVert^2$. The assignment step puts each $x_i$ at its nearest centroid, which can only lower or keep $J$ for fixed centroids. The update step sets each $\mu_k$ to its members' mean — the unique minimizer of $\sum_{i\in k}\lVert x_i-\mu_k\rVert^2$ — which can only lower or keep $J$ for fixed assignments. Both steps are non-increasing and $J\ge0$, so it converges. But $J$ is non-convex over the combinatorial assignments, so it reaches only a local optimum; bad initialization can trap it. Mitigate with k-means++ seeding and multiple restarts.
#k-means#convergence#local-optima#coordinate-descent
Advancedmath

PCA can be framed as maximizing variance or minimizing reconstruction error. Show these are equivalent and tie it to the covariance eigendecomposition.

For centered data, projecting onto unit vector $w$ gives variance $w^\top \Sigma w$ with $\Sigma=\frac1n X^\top X$. The reconstruction error of projecting onto $w$ and back is $\mathbb{E}\lVert x-ww^\top x\rVert^2=\mathbb{E}\lVert x\rVert^2-w^\top\Sigma w$. Since $\mathbb{E}\lVert x\rVert^2$ is fixed, minimizing reconstruction error is exactly maximizing projected variance. Maximizing $w^\top\Sigma w$ s.t. $\lVert w\rVert=1$ via Lagrange multipliers gives $\Sigma w=\lambda w$: the optimal directions are eigenvectors of $\Sigma$ and the captured variance equals $\lambda$. So the top-$k$ principal components are the $k$ largest-eigenvalue eigenvectors.
#pca#variance#reconstruction#eigendecomposition
Advancedconcept

Compare single-linkage and complete-linkage agglomerative clustering: what failure mode does each have?

Single linkage uses the minimum pairwise distance, merging via the nearest bridge — it can recover elongated, non-convex shapes but suffers 'chaining', where a thin string of points links two otherwise-distinct clusters, and is sensitive to noise/outliers acting as bridges. Complete linkage uses the maximum pairwise distance, producing compact, roughly equal-diameter clusters, but is biased toward spherical clusters, breaks up large or elongated true clusters, and is sensitive to outliers that inflate the max. Average linkage (UPGMA) and Ward's (minimize variance increase) sit between; Ward tends to give balanced, compact clusters and is a common default.
#hierarchical-clustering#single-linkage#complete-linkage#chaining
Advancedsystem-design

You have 5,000 examples, ~1.2M sparse binary text features, and need calibrated probabilities and interpretable coefficients fast. Justify a model choice over a gradient-boosted tree ensemble.

L1/L2-regularized logistic regression is the right call. It handles ultra-high-dimensional sparse input cheaply (cost scales with non-zeros), gives directly interpretable per-feature coefficients, and outputs well-calibrated probabilities since it optimizes log-loss directly. With only 5,000 rows against 1.2M features, GBDTs would overfit, are slow to find good axis-aligned splits in extreme sparsity, fragment signal across many weak features, and yield less interpretable, often miscalibrated scores needing post-hoc calibration. L1 also produces a sparse, feature-selected model. If strong non-linear interactions matter and data grows, revisit with a linear SVM or hashed-feature GBDT.
#logistic-regression#sparse#high-dimensional#model-selection
Expertmath

Show that k-means is the hard-assignment, equal-spherical-covariance limit of a GMM fit by EM.

A GMM with isotropic covariances $\Sigma_k=\sigma^2 I$ and equal mixing weights has E-step responsibility $\gamma_{ik}\propto\exp(-\lVert x_i-\mu_k\rVert^2/2\sigma^2)$. As $\sigma^2\to0$ the softmax becomes a hard argmax — the nearest centroid gets responsibility 1 and the rest 0, exactly k-means' assignment step. The M-step sets $\mu_k$ to the responsibility-weighted mean, which collapses to the plain mean of assigned points — k-means' update. So k-means is EM on a shared-isotropic-covariance GMM shrunk to zero variance, doing winner-take-all assignments. This is why k-means assumes spherical, equal-size clusters and fails on elongated or unequal-variance ones.
#k-means#gmm#em#hard-assignment
Expertmath

Why does EM for GMMs never decrease the data log-likelihood, and what does the ELBO have to do with it?

EM maximizes a lower bound on $\log p(X\mid\theta)$. With a distribution $q$ over latent assignments, Jensen gives $\log p(X\mid\theta)\ge \mathbb{E}_q[\log p(X,Z\mid\theta)]-\mathbb{E}_q[\log q]$ — the ELBO — with gap exactly $\mathrm{KL}(q\,\|\,p(Z\mid X,\theta))$. The E-step sets $q=p(Z\mid X,\theta^{old})$, zeroing the KL so the bound touches the likelihood. The M-step maximizes the ELBO over $\theta$, raising it. Since after the E-step the bound equals the likelihood, raising it raises the likelihood, so each iteration is non-decreasing. It converges to a local maximum (likelihood is unbounded if a component collapses onto one point).
#em#elbo#jensen#kl-divergence
Expertconcept

A candidate uses t-SNE coordinates to argue cluster A is 'twice as far' from B as from C, and that A's tightness means low intrinsic variance. What's wrong?

t-SNE preserves local neighborhoods, not global geometry, so inter-cluster distances and the gaps between clusters aren't meaningful — relative cluster distances and sizes are largely artifacts of the perplexity and the crowding correction (Student-t in the map). Cluster density/size in the embedding doesn't reflect true variance; t-SNE expands dense regions and contracts sparse ones. The layout also shifts with perplexity, seed, and iterations, and can manufacture apparent clusters from uniform data. Use it for qualitative neighborhood inspection only; for distance claims use the original space, PCA, or metric MDS.
#t-sne#visualization#crowding-problem#misinterpretation
Expertconcept

How does UMAP differ from t-SNE in its objective and what it preserves, and why is it usually faster?

UMAP models the data as a fuzzy topological structure (a weighted k-NN graph from local connectivity) and optimizes a cross-entropy between high- and low-dimensional fuzzy simplicial sets, whereas t-SNE minimizes KL divergence between Gaussian/Student-t neighbor probabilities. UMAP's cross-entropy balances attractive (keep neighbors close) and repulsive (push non-neighbors apart) terms more symmetrically, so it tends to retain more global structure while remaining local-first. It's faster via approximate nearest-neighbor search (NN-descent), negative sampling for the repulsion, and SGD on the graph. Like t-SNE, absolute distances and cluster sizes stay only loosely meaningful.
#umap#t-sne#cross-entropy#manifold