Q: For a binary SVM, what is the role of the slack penalty C, and how does varying it trade off bias and variance via the margin?

The soft-margin SVM minimizes \tfrac12\lVert w\rVert^2 + C\sum_i \xi_i, where \xi_i are slacks allowing margin violations; C weights those violations (inverse regularization). Large C heavily penalizes violations, shrinking the margin to fit training data tightly — low bias, high variance, prone to overfitting and outlier-sensitive. Small C tolerates violations for a wider margin — higher bias, lower variance, smoother boundary, more robust to noise but may underfit. C is tuned by cross-validation jointly with the RBF kernel's \gamma, since the two interact strongly in controlling effective complexity.

Question 1

What distinguishes a generative classifier from a discriminative one? Give one example of each.

Accepted Answer

A generative model learns the joint p(x,y) — typically p(x\mid y) and prior p(y) — then applies Bayes' rule for p(y\mid x); examples: Naive Bayes, GMM/LDA classifiers. A discriminative model learns p(y\mid x) (or just a decision boundary) directly without modeling how x is generated; examples: logistic regression, SVM. Generative models can sample data, handle missing features, and often win with little data; discriminative models usually reach lower asymptotic error because they optimize exactly the quantity used for prediction.

Question 2

Why is logistic regression trained by minimizing log-loss (cross-entropy) rather than squared error?

Accepted Answer

Log-loss is the negative log-likelihood under the Bernoulli model implied by the sigmoid, so minimizing it is maximum likelihood. It is also convex in the weights, giving a unique global optimum, whereas squared error on the sigmoid output is non-convex with flat regions where gradient descent stalls. Cross-entropy's gradient (\hat{y}-y)x stays large when predictions are confidently wrong, giving strong signal; squared-error gradients vanish in the sigmoid's saturated tails — killing learning exactly where correction is most needed.

Question 3

In one sentence each, contrast bagging and boosting by which error component they primarily reduce.

Accepted Answer

Bagging (e.g. random forests) trains many high-variance, low-bias learners in parallel on bootstrap samples and averages them, reducing variance while leaving bias roughly unchanged. Boosting (e.g. gradient boosting, AdaBoost) trains weak, high-bias learners sequentially, each fitting the residual errors of the ensemble so far, reducing bias (and some variance), which is why shallow trees suffice as base learners. Hence bagging is robust to overfitting and parallelizable, whereas boosting can overfit if run too long and is inherently sequential.

Question 4

What is the kernel trick, and why does it let an SVM separate non-linearly separable data without explicit feature maps?

Accepted Answer

The SVM dual and decision function depend on training points only through inner products \langle x_i,x_j\rangle. A kernel K(x_i,x_j)=\langle\phi(x_i),\phi(x_j)\rangle computes that inner product in a (possibly infinite-dimensional) feature space directly from inputs, so you never form \phi(x). Mercer's condition (PSD K) guarantees such a \phi exists. A linear separator in feature space is a non-linear boundary in input space, so an RBF kernel — an infinite-dimensional map — can carve arbitrarily curved boundaries while all computation stays in input dimensions.

Question 5

Why does kNN suffer badly from the curse of dimensionality, and what breaks specifically?

Accepted Answer

In high dimensions distances concentrate: the ratio of nearest to farthest neighbor distance approaches 1, so 'nearest' loses meaning and every point looks roughly equidistant — the locality assumption underpinning kNN collapses. To keep a fixed fraction of data in a neighborhood, the neighborhood's edge length must grow toward the full range as d rises, so neighbors are no longer local, and populating the space needs exponentially more samples. Accuracy degrades and irrelevant features dilute the metric. Mitigations: dimensionality reduction, learned/weighted metrics, feature selection.

Question 6

Decision trees use Gini impurity or entropy for splits. How do they differ, and does the choice usually matter?

Accepted Answer

For class proportions p_k, Gini is 1-\sum_k p_k^2 and entropy is -\sum_k p_k\log p_k. Both are maximized at a uniform distribution and zero at purity, and the split criterion maximizes impurity reduction (information gain for entropy). Entropy penalizes uncertain nodes slightly more (sharper peak) and is marginally costlier due to the log. In practice they produce very similar trees, and the choice rarely changes performance — regularization (depth, min-samples, pruning) matters far more than Gini-vs-entropy.

Question 7

Naive Bayes makes a 'naive' conditional independence assumption that is usually false, yet it often classifies well. Why?

Accepted Answer

Naive Bayes assumes features are conditionally independent given the class, so p(x\mid y)=\prod_j p(x_j\mid y) — rarely literally true, which makes its probability estimates poorly calibrated (often pushed toward 0/1). But classification needs only the argmax of the posterior, not accurate probabilities. As long as the true class still gets the highest score — i.e. dependence errors don't flip the ranking — the decision is correct. Correlated features cause double-counting that biases the magnitude of \log p(y\mid x) but often in a rank-preserving way. So NB can have low classification error despite high probability error; it's a strong, fast baseline for high-dimensional sparse text.

Question 8

For a binary SVM, what is the role of the slack penalty C, and how does varying it trade off bias and variance via the margin?

Accepted Answer

The soft-margin SVM minimizes 	frac12\lVert wVert^2 + C\sum_i \xi_i, where \xi_i are slacks allowing margin violations; C weights those violations (inverse regularization). Large C heavily penalizes violations, shrinking the margin to fit training data tightly — low bias, high variance, prone to overfitting and outlier-sensitive. Small C tolerates violations for a wider margin — higher bias, lower variance, smoother boundary, more robust to noise but may underfit. C is tuned by cross-validation jointly with the RBF kernel's \gamma, since the two interact strongly in controlling effective complexity.

Question 9

Cert-style: a managed AutoML/tabular service offers Linear, XGBoost, and a deep tabular net. Your data is 12k rows, 40 mixed numeric/categorical features, heavy non-linear interactions, some missing values, and the stakeholder wants quick feature importance. Which baseline and why?

Accepted Answer

Pick gradient-boosted trees (XGBoost). For small-to-medium tabular data (12k rows) with non-linear interactions and mixed types, GBDTs are the consistent state of the art and typically beat deep nets, which need far more data and tuning on tabular. XGBoost handles missing values natively via default-direction learning at splits, captures interactions without manual engineering, trains fast, and provides built-in feature importance (gain/SHAP) the stakeholder wants. Linear would underfit the interactions; the deep tabular net is overkill and data-hungry here. Canonical answer: tabular + moderate size + interactions → boosted-trees baseline.

Question 10

Random forests inject two sources of randomness. What are they, and why is feature subsampling at each split the more important one?

Accepted Answer

The two are bootstrap sampling of rows (bagging) and random subsampling of m candidate features at each split. Bagging alone leaves trees highly correlated because a few dominant features get chosen first in nearly every tree, and averaging correlated estimators barely reduces variance — the variance of the mean is \rho\sigma^2+\frac{1-\rho}{B}\sigma^2, floored by correlation \rho. Feature subsampling forces trees to use different features, decorrelating them (lowering \rho), which is what actually drives variance reduction. Smaller m means more decorrelation but higher individual-tree bias.

Question 11

Derive the update in gradient boosting and explain in what sense it is gradient descent in function space.

Accepted Answer

Gradient boosting minimizes L(y,F(x)) over the function F. At iteration m, treating F as the variable, the negative functional gradient at each point is r_{im}=-[\partial L(y_i,F(x_i))/\partial F(x_i)]_{F=F_{m-1}} — the pseudo-residuals. A weak learner h_m is fit to these by least squares, then F_m=F_{m-1}+
u\gamma_m h_m, where \gamma_m solves a line search and 
u is the learning rate. Each tree approximates the steepest-descent direction in function space evaluated at the data, so the ensemble sums small steps down the loss surface. For squared loss the residuals are simply y_i-F_{m-1}(x_i).

Question 12

XGBoost uses a second-order Taylor expansion of the loss. Write the optimal leaf weight and the split gain.

Accepted Answer

XGBoost approximates per-instance loss by g_i f(x_i)+	frac12 h_i f(x_i)^2 where g_i,h_i are the first and second derivatives w.r.t. the current prediction, plus \Omega(f)=\gamma T+	frac12\lambda\sum_j w_j^2. For leaf j with instances I_j, the optimal weight is w_j^*=-\frac{\sum_{i\in I_j} g_i}{\sum_{i\in I_j} h_i+\lambda} and the structure score is -	frac12\sum_j \frac{(\sum g_i)^2}{\sum h_i+\lambda}+\gamma T. A split's gain is 	frac12[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}]-\gamma; negative gain stops the split.

Question 13

How does LightGBM differ from XGBoost in tree growth and feature handling? Name its two signature techniques and the tradeoff of leaf-wise growth.

Accepted Answer

LightGBM uses GOSS (Gradient-based One-Side Sampling) — keep all large-gradient instances and randomly subsample small-gradient ones, reweighting to stay unbiased — and EFB (Exclusive Feature Bundling), which bundles mutually-exclusive sparse features to shrink the feature count. It grows trees leaf-wise (best-first: split the leaf with max loss reduction) rather than XGBoost's level-wise growth. Leaf-wise yields lower loss for the same number of leaves and trains faster, but goes deeper on informative paths and overfits more readily on small data — so it needs num_leaves/max_depth/min_data_in_leaf limits. Both use histogram-binned features.

Question 14

k-means minimizes within-cluster sum of squares. Prove it monotonically decreases the objective, and explain why it can still reach a bad local optimum.

Accepted Answer

The objective is J=\sum_i \lVert x_i-\mu_{c_i}\rVert^2. The assignment step puts each x_i at its nearest centroid, which can only lower or keep J for fixed centroids. The update step sets each \mu_k to its members' mean — the unique minimizer of \sum_{i\in k}\lVert x_i-\mu_k\rVert^2 — which can only lower or keep J for fixed assignments. Both steps are non-increasing and J\ge0, so it converges. But J is non-convex over the combinatorial assignments, so it reaches only a local optimum; bad initialization can trap it. Mitigate with k-means++ seeding and multiple restarts.

Question 15

PCA can be framed as maximizing variance or minimizing reconstruction error. Show these are equivalent and tie it to the covariance eigendecomposition.

Accepted Answer

For centered data, projecting onto unit vector w gives variance w^	op \Sigma w with \Sigma=\frac1n X^	op X. The reconstruction error of projecting onto w and back is \mathbb{E}\lVert x-ww^	op xVert^2=\mathbb{E}\lVert xVert^2-w^	op\Sigma w. Since \mathbb{E}\lVert xVert^2 is fixed, minimizing reconstruction error is exactly maximizing projected variance. Maximizing w^	op\Sigma w s.t. \lVert wVert=1 via Lagrange multipliers gives \Sigma w=\lambda w: the optimal directions are eigenvectors of \Sigma and the captured variance equals \lambda. So the top-k principal components are the k largest-eigenvalue eigenvectors.

Question 16

Compare single-linkage and complete-linkage agglomerative clustering: what failure mode does each have?

Accepted Answer

Single linkage uses the minimum pairwise distance, merging via the nearest bridge — it can recover elongated, non-convex shapes but suffers 'chaining', where a thin string of points links two otherwise-distinct clusters, and is sensitive to noise/outliers acting as bridges. Complete linkage uses the maximum pairwise distance, producing compact, roughly equal-diameter clusters, but is biased toward spherical clusters, breaks up large or elongated true clusters, and is sensitive to outliers that inflate the max. Average linkage (UPGMA) and Ward's (minimize variance increase) sit between; Ward tends to give balanced, compact clusters and is a common default.

Question 17

You have 5,000 examples, ~1.2M sparse binary text features, and need calibrated probabilities and interpretable coefficients fast. Justify a model choice over a gradient-boosted tree ensemble.

Accepted Answer

L1/L2-regularized logistic regression is the right call. It handles ultra-high-dimensional sparse input cheaply (cost scales with non-zeros), gives directly interpretable per-feature coefficients, and outputs well-calibrated probabilities since it optimizes log-loss directly. With only 5,000 rows against 1.2M features, GBDTs would overfit, are slow to find good axis-aligned splits in extreme sparsity, fragment signal across many weak features, and yield less interpretable, often miscalibrated scores needing post-hoc calibration. L1 also produces a sparse, feature-selected model. If strong non-linear interactions matter and data grows, revisit with a linear SVM or hashed-feature GBDT.

Question 18

Show that k-means is the hard-assignment, equal-spherical-covariance limit of a GMM fit by EM.

Accepted Answer

A GMM with isotropic covariances \Sigma_k=\sigma^2 I and equal mixing weights has E-step responsibility \gamma_{ik}\propto\exp(-\lVert x_i-\mu_kVert^2/2\sigma^2). As \sigma^2	o0 the softmax becomes a hard argmax — the nearest centroid gets responsibility 1 and the rest 0, exactly k-means' assignment step. The M-step sets \mu_k to the responsibility-weighted mean, which collapses to the plain mean of assigned points — k-means' update. So k-means is EM on a shared-isotropic-covariance GMM shrunk to zero variance, doing winner-take-all assignments. This is why k-means assumes spherical, equal-size clusters and fails on elongated or unequal-variance ones.

Question 19

Why does EM for GMMs never decrease the data log-likelihood, and what does the ELBO have to do with it?

Accepted Answer

EM maximizes a lower bound on \log p(X\mid	heta). With a distribution q over latent assignments, Jensen gives \log p(X\mid	heta)\ge \mathbb{E}_q[\log p(X,Z\mid	heta)]-\mathbb{E}_q[\log q] — the ELBO — with gap exactly \mathrm{KL}(q\,\|\,p(Z\mid X,	heta)). The E-step sets q=p(Z\mid X,	heta^{old}), zeroing the KL so the bound touches the likelihood. The M-step maximizes the ELBO over 	heta, raising it. Since after the E-step the bound equals the likelihood, raising it raises the likelihood, so each iteration is non-decreasing. It converges to a local maximum (likelihood is unbounded if a component collapses onto one point).

Question 20

A candidate uses t-SNE coordinates to argue cluster A is 'twice as far' from B as from C, and that A's tightness means low intrinsic variance. What's wrong?

Accepted Answer

t-SNE preserves local neighborhoods, not global geometry, so inter-cluster distances and the gaps between clusters aren't meaningful — relative cluster distances and sizes are largely artifacts of the perplexity and the crowding correction (Student-t in the map). Cluster density/size in the embedding doesn't reflect true variance; t-SNE expands dense regions and contracts sparse ones. The layout also shifts with perplexity, seed, and iterations, and can manufacture apparent clusters from uniform data. Use it for qualitative neighborhood inspection only; for distance claims use the original space, PCA, or metric MDS.

Question 21

How does UMAP differ from t-SNE in its objective and what it preserves, and why is it usually faster?

Accepted Answer

UMAP models the data as a fuzzy topological structure (a weighted k-NN graph from local connectivity) and optimizes a cross-entropy between high- and low-dimensional fuzzy simplicial sets, whereas t-SNE minimizes KL divergence between Gaussian/Student-t neighbor probabilities. UMAP's cross-entropy balances attractive (keep neighbors close) and repulsive (push non-neighbors apart) terms more symmetrically, so it tends to retain more global structure while remaining local-first. It's faster via approximate nearest-neighbor search (NN-descent), negative sampling for the repulsion, and SGD on the graph. Like t-SNE, absolute distances and cluster sizes stay only loosely meaningful.

Classical ML (Supervised & Unsupervised)