Question 1

What is the difference between a covariance matrix and a correlation matrix, and why is the covariance matrix always positive semi-definite?

Accepted Answer

Covariance \Sigma_{ij}=\mathrm{Cov}(X_i,X_j) captures linear co-variation in the variables' original units; the correlation matrix normalizes each entry by \sigma_i\sigma_j, giving unit-free values in [-1,1] with a diagonal of 1. The covariance matrix is PSD because for any vector a, a^\top\Sigma a = \mathrm{Var}(a^\top X)\ge 0: a variance can never be negative. Symmetry makes the eigenvalues real and PSD makes them non-negative, which is what makes PCA well-defined; a Gaussian density additionally needs \Sigma to be strictly positive definite (invertible).
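
A quick numerical check of the claims above, as a minimal sketch assuming NumPy is available. The toy data, the scales, and the vector `a` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Toy data: 500 samples of 3 features on very different scales (illustrative only).
X = rng.normal(size=(500, 3)) * np.array([1.0, 10.0, 100.0])
X[:, 1] += 5.0 * X[:, 0]  # make features 0 and 1 co-vary

cov = np.cov(X, rowvar=False)      # covariance, in the original units
std = np.sqrt(np.diag(cov))
corr = cov / np.outer(std, std)    # normalize each entry by sigma_i * sigma_j

# PSD check 1: eigenvalues of the symmetric matrix are >= 0 (up to round-off).
eigvals = np.linalg.eigvalsh(cov)

# PSD check 2: a^T Sigma a equals the sample variance of the projection a^T X.
a = rng.normal(size=3)
quad_form = a @ cov @ a
proj_var = np.var(X @ a, ddof=1)   # ddof=1 matches np.cov's default normalization
```

Rescaling a feature changes `cov` but leaves `corr` untouched, which is the practical reason to prefer correlation when units differ.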

Question 2

Define entropy, cross-entropy, and KL divergence, and state the exact relationship among them.

Accepted Answer

Entropy H(p)=-\sum_x p(x)\log p(x) is the average number of bits (with base-2 logs) needed to encode samples from p under an optimal code. Cross-entropy H(p,q)=-\sum_x p(x)\log q(x) is the cost of coding p's samples with a code optimized for q. KL divergence D_{KL}(p\|q)=\sum_x p(x)\log\frac{p(x)}{q(x)} is the excess cost. They relate as H(p,q)=H(p)+D_{KL}(p\|q). Since H(p) is fixed w.r.t. model q, minimizing cross-entropy is equivalent to minimizing KL to the data distribution.
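
The identity can be checked on a small discrete example. This is a sketch using only the standard library; the distributions `p` and `q` are made-up numbers, and the helper names are illustrative.

```python
import math

def entropy(p):
    """H(p) in bits; terms with p(x) = 0 contribute 0 by convention."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p, q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]   # "true" distribution (made-up numbers)
q = [0.25, 0.25, 0.5]   # model distribution (made-up numbers)

# H(p) = 1.5 bits, H(p, q) = 1.75 bits, so the excess cost D_KL(p || q) is 0.25 bits.
gap = cross_entropy(p, q) - entropy(p)
```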

Question 3

A test for a disease has 99% sensitivity and 95% specificity. Prevalence is 0.5%. Using Bayes' rule, what is the probability a person who tests positive actually has the disease?

Accepted Answer

P(D|+)=\frac{P(+|D)P(D)}{P(+|D)P(D)+P(+|\neg D)P(\neg D)}. Numerator =0.99\times0.005=0.00495; false-positive term =0.05\times0.995=0.04975. So P(D|+)=0.00495/0.0547\approx 0.0905, about 9%. Despite a seemingly accurate test, the low base rate means most positives are false positives: the base-rate fallacy. This is why screening rare conditions requires confirmatory testing.
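
The arithmetic can be reproduced in a few lines. The function name and signature below are illustrative, not from any library.

```python
def posterior_given_positive(sensitivity, specificity, prevalence):
    """P(D | +) via Bayes' rule."""
    true_pos = sensitivity * prevalence                    # P(+|D) P(D)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # P(+|not D) P(not D)
    return true_pos / (true_pos + false_pos)

ppv = posterior_given_positive(0.99, 0.95, 0.005)           # about 0.0905
ppv_common = posterior_given_positive(0.99, 0.95, 0.10)     # same test, 10% prevalence
```

Raising the prevalence to 10% with the same test lifts the posterior to about 0.69, which shows how strongly the base rate drives the answer.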

Question 4

Explain the curse of dimensionality and give two concrete ways it degrades ML models.

Accepted Answer

As dimensionality d grows, volume grows exponentially, so a fixed sample covers a vanishing fraction of the space — data becomes sparse. Concretely: (1) distance concentration — the ratio of nearest to farthest neighbor distances approaches 1, so k-NN and clustering lose discriminative power; (2) sample complexity — maintaining density needs exponentially more points, so density estimation and nonparametric methods fail. It also inflates variance and overfitting risk. Mitigations: dimensionality reduction (PCA), feature selection, and strong inductive biases/regularization.
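
Distance concentration is easy to see in simulation. This is a rough sketch assuming NumPy and points drawn uniformly from the unit hypercube; the sample size and the dimensions tried are arbitrary choices.

```python
import numpy as np

def nearest_to_farthest_ratio(d, n_points=500, seed=0):
    """Ratio of nearest to farthest distance from one query to uniform points in [0, 1]^d."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, d))
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()

# The ratio climbs toward 1 as d grows, so "nearest" carries less and less information.
ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 10, 100, 1000)}
```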

Question 5

Derive the maximum likelihood estimator for the mean and variance of a Gaussian, and explain why the MLE variance is biased.

Accepted Answer

Log-likelihood: \ell=-\frac{n}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum(x_i-\mu)^2. Setting \partial\ell/\partial\mu=0 gives \hat\mu=\frac1n\sum x_i; setting \partial\ell/\partial\sigma^2=0 gives \hat\sigma^2=\frac1n\sum(x_i-\hat\mu)^2. It is biased because using \hat\mu (estimated from the same data) rather than the true \mu minimizes and thus understates the residual sum of squares; one degree of freedom is consumed. E[\hat\sigma^2]=\frac{n-1}{n}\sigma^2, so the unbiased estimator divides by n-1 (Bessel's correction).
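
The bias factor (n-1)/n shows up clearly in a Monte Carlo check. This sketch assumes NumPy; the sample size, true variance, and number of trials are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, trials = 5, 4.0, 200_000   # illustrative values

samples = rng.normal(loc=1.0, scale=np.sqrt(sigma2), size=(trials, n))
var_mle = samples.var(axis=1, ddof=0)        # MLE: divide by n
var_unbiased = samples.var(axis=1, ddof=1)   # Bessel's correction: divide by n - 1

mean_mle = var_mle.mean()            # should sit near (n - 1) / n * sigma2 = 3.2
mean_unbiased = var_unbiased.mean()  # should sit near sigma2 = 4.0
```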

Question 6

How does MAP estimation relate to MLE, and what does adding a Gaussian prior on the weights correspond to in a regression loss?

Accepted Answer

MAP maximizes the posterior \propto likelihood \times prior: \arg\max_\theta \log p(D|\theta)+\log p(\theta). MLE drops the prior term (equivalently assumes a flat/improper prior). A zero-mean Gaussian prior \theta\sim N(0,\tau^2 I) contributes -\frac{1}{2\tau^2}\|\theta\|^2 to the log-posterior, which for a Gaussian likelihood with noise variance \sigma^2 is exactly L2 (ridge) regularization with \lambda=\sigma^2/\tau^2. A Laplace prior gives L1/lasso. As data grows, the likelihood dominates and MAP converges to MLE.
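
For linear regression the correspondence can be verified directly. This is a minimal sketch assuming NumPy, a linear-Gaussian model, and made-up values for the noise and prior scales; it checks that the ridge closed form with \lambda=\sigma^2/\tau^2 is a stationary point of the negative log-posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
sigma, tau = 0.5, 2.0                 # noise std and prior std (illustrative values)

X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + sigma * rng.normal(size=n)

lam = sigma**2 / tau**2               # ridge strength implied by the prior
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
theta_mle = np.linalg.lstsq(X, y, rcond=None)[0]   # ordinary least squares

def neg_log_posterior_grad(theta):
    """Gradient of ||y - X theta||^2 / (2 sigma^2) + ||theta||^2 / (2 tau^2)."""
    return -X.T @ (y - X @ theta) / sigma**2 + theta / tau**2
```

The MAP solution is never longer than the MLE solution in Euclidean norm, which is the shrinkage effect of the prior.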

Question 7

What is the geometric and algebraic meaning of the SVD, and how does it connect to eigendecomposition?

Accepted Answer

SVD factors any m\times n matrix as A=U\Sigma V^\top, where U and V are orthogonal matrices (orthonormal columns) and \Sigma holds non-negative singular values. Geometrically it decomposes the linear map into a rotation/reflection (V^\top), an axis-aligned scaling (\Sigma), and another rotation/reflection (U). The right singular vectors V are eigenvectors of A^\top A, the left U are eigenvectors of AA^\top, and the nonzero singular values are \sqrt{\lambda_i}, the square roots of the shared nonzero eigenvalues of those Gram matrices. Unlike eigendecomposition, SVD exists for any matrix, including rectangular and rank-deficient ones.
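
These relationships can be confirmed numerically. This sketch assumes NumPy and uses an arbitrary random 5x3 matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))                        # rectangular matrix (arbitrary example)

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # s is sorted in descending order
A_rebuilt = U @ np.diag(s) @ Vt

# Eigenvalues of the Gram matrix A^T A, re-sorted to descending order.
gram = A.T @ A
gram_eigvals = np.linalg.eigvalsh(gram)[::-1]

# Each column of V should satisfy (A^T A) v_i = s_i^2 v_i.
V = Vt.T
eig_residual = gram @ V - V * s**2
```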

Question 8

Precisely state what a p-value is and what a 95% confidence interval is, correcting the two most common misinterpretations.

Accepted Answer

A p-value is P(\text{data at least as extreme}\mid H_0\text{ true}). It is NOT the probability H_0 is true, and not the probability the result occurred by chance. A 95% CI means: if the experiment were repeated many times, 95% of the constructed intervals would contain the true parameter. It does NOT mean there is a 95% probability the true value lies in this particular interval: the parameter is fixed; the interval is random. Both are frequentist statements about long-run procedure behavior, not about a single hypothesis or interval.

Question 9

Why is KL divergence not a distance metric, and when would you use forward KL $D(p\|q)$ versus reverse KL $D(q\|p)$ in variational inference?

Accepted Answer

KL is not a metric: it is asymmetric (D(p\|q)\ne D(q\|p)) and violates the triangle inequality, though it is non-negative and zero iff p=q. Forward KL D(p\|q) is mass-covering/mean-seeking: q must put mass wherever p does, so it over-spreads (this is what MLE/cross-entropy minimize). Reverse KL D(q\|p) is mode-seeking/zero-forcing: q avoids regions where p\approx0, locking onto one mode. Mean-field variational inference minimizes reverse KL, which is why it tends to underestimate posterior variance.
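
The two behaviors can be seen by fitting a single Gaussian to a bimodal target under each divergence. This is a brute-force sketch assuming NumPy, a discretized grid, and a made-up mixture with modes at -3 and +3. Real variational inference would optimize with gradients; the grid search here is only for illustration.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 801)

def normalized_gaussian(mean, std):
    """Gaussian density evaluated on the grid x and renormalized to sum to 1."""
    w = np.exp(-0.5 * ((x - mean) / std) ** 2)
    return w / w.sum()

def kl(a, b, eps=1e-300):
    """Discrete D_KL(a || b); eps guards against log(0) on the grid."""
    return float(np.sum(a * (np.log(a + eps) - np.log(b + eps))))

# Target p: a bimodal mixture (made-up example).
p = 0.5 * normalized_gaussian(-3.0, 0.5) + 0.5 * normalized_gaussian(3.0, 0.5)

# Candidate q: single Gaussians over a coarse grid of means and stds.
candidates = [(m, s) for m in np.linspace(-4, 4, 33) for s in np.linspace(0.3, 4.0, 38)]
forward_best = min(candidates, key=lambda ms: kl(p, normalized_gaussian(*ms)))
reverse_best = min(candidates, key=lambda ms: kl(normalized_gaussian(*ms), p))
```

Forward KL picks a wide Gaussian centered between the modes, while reverse KL picks a narrow one sitting on a single mode.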

Question 10

Derive the gradient of the softmax cross-entropy loss with respect to the logits and explain why the result is so clean.

Accepted Answer

Let p_i=\frac{e^{z_i}}{\sum_j e^{z_j}} and loss L=-\sum_i y_i\log p_i for one-hot y. The softmax Jacobian is \partial p_i/\partial z_k=p_i(\delta_{ik}-p_k). Chaining with \partial L/\partial p_i=-y_i/p_i and using \sum_i y_i=1, the sum collapses to \partial L/\partial z_k=-y_k+p_k\sum_i y_i=p_k-y_k. It is clean because softmax is the canonical link of the categorical exponential family: for any exponential-family GLM with its canonical link and matching (negative log-likelihood) loss, the logit gradient is predicted minus observed.
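
A finite-difference check is a simple way to confirm the closed form. This sketch assumes NumPy; the logits and target are arbitrary.

```python
import numpy as np

def softmax(z):
    shifted = z - np.max(z)          # shift for numerical stability
    e = np.exp(shifted)
    return e / e.sum()

def cross_entropy_loss(z, y):
    """L = -sum_i y_i log p_i for a one-hot target y."""
    return -float(np.sum(y * np.log(softmax(z))))

def analytic_grad(z, y):
    return softmax(z) - y            # the closed form p - y

def numeric_grad(z, y, h=1e-6):
    """Central finite differences, one logit at a time."""
    g = np.zeros_like(z)
    for k in range(len(z)):
        step = np.zeros_like(z)
        step[k] = h
        g[k] = (cross_entropy_loss(z + step, y) - cross_entropy_loss(z - step, y)) / (2 * h)
    return g

z = np.array([2.0, -1.0, 0.5])       # arbitrary logits
y = np.array([0.0, 1.0, 0.0])        # one-hot target
```

The gradient components sum to zero, reflecting that adding a constant to every logit leaves the loss unchanged.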

Question 11

Explain the bias-variance decomposition of expected squared error and write it out formally for a single prediction.

Accepted Answer

For target y=f(x)+\epsilon with noise independent of the training set, E[\epsilon]=0 and \mathrm{Var}(\epsilon)=\sigma^2, the expected squared error of estimator \hat f over training sets is E[(y-\hat f(x))^2]=\underbrace{(f(x)-E[\hat f(x)])^2}_{\text{bias}^2}+\underbrace{E[(\hat f(x)-E[\hat f(x)])^2]}_{\text{variance}}+\underbrace{\sigma^2}_{\text{irreducible}}. Bias is systematic error from model misspecification/underfitting; variance is sensitivity to the particular training sample/overfitting; the irreducible term is noise no model can remove. Increasing complexity trades bias down for variance up. This clean additive decomposition is specific to squared loss.
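
The decomposition can be estimated by refitting a model on many resampled training sets. This is a Monte Carlo sketch assuming NumPy, a made-up true function sin(2\pi x), a fixed design, and polynomial fits of two degrees; all of these are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3                                   # noise std (illustrative)
x_train = np.linspace(0.0, 1.0, 30)           # fixed design; only the noise is resampled
x0 = 0.25                                     # the single test input

def f(x):
    """Assumed true function for this sketch."""
    return np.sin(2 * np.pi * x)

def decompose(degree, trials=2000):
    """Monte Carlo bias^2 and variance of a polynomial fit's prediction at x0."""
    preds = np.empty(trials)
    for t in range(trials):
        y_train = f(x_train) + sigma * rng.normal(size=x_train.size)
        coeffs = np.polyfit(x_train, y_train, degree)
        preds[t] = np.polyval(coeffs, x0)
    bias2 = (f(x0) - preds.mean()) ** 2
    variance = preds.var()
    mse_to_f = np.mean((f(x0) - preds) ** 2)  # error against the noise-free target
    return bias2, variance, mse_to_f

simple = decompose(degree=1)      # high bias, low variance
flexible = decompose(degree=5)    # low bias, higher variance
```

Adding `sigma**2` to `mse_to_f` gives the expected squared error against a fresh noisy observation, which is the left-hand side of the formula.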

Question 12

What is the multiple comparisons problem, and how do Bonferroni and Benjamini-Hochberg differ in what they control?

Accepted Answer

Running m tests at level \alpha inflates the chance of at least one false positive to \approx 1-(1-\alpha)^m. Bonferroni controls the family-wise error rate (FWER) — probability of any false positive — by testing each at \alpha/m; simple but very conservative, killing power as m grows. Benjamini-Hochberg controls the false discovery rate (FDR) — expected proportion of false positives among rejections — by ranking p-values and rejecting up to the largest k with p_{(k)}\le \frac{k}{m}\alpha. BH is far more powerful and standard in high-dimensional settings like genomics.
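
Both procedures fit in a few lines. This is a plain-Python sketch with made-up p-values and illustrative function names. Note that the standard BH guarantee assumes independent (or positively dependent) tests.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m; controls the FWER."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up BH: reject the k smallest p-values, k = max{k : p_(k) <= k * alpha / m}."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject

p_values = [0.001, 0.008, 0.012, 0.041, 0.20, 0.74]  # made-up p-values
bonf = bonferroni(p_values)           # rejects the two smallest
bh = benjamini_hochberg(p_values)     # rejects the three smallest
```

On this input BH makes one more rejection than Bonferroni, which is the extra power bought by controlling FDR instead of FWER.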

Question 13

Why is the log-sum-exp trick used in softmax and log-likelihood computations? Explain the numerical issue and the fix.

Accepted Answer

Computing \log\sum_i e^{z_i} directly overflows when any z_i is large (e.g. e^{1000} evaluates to \infty in float64), and when every logit is very negative each term underflows to 0, giving \log 0=-\infty. The fix factors out the max: \log\sum_i e^{z_i}=c+\log\sum_i e^{z_i-c} with c=\max_i z_i. Now the largest exponent is e^0=1, so nothing overflows, and the dominant term never underflows. Softmax uses the same shift: p_i=e^{z_i-c}/\sum_j e^{z_j-c}, which is invariant to c. This is why production loss code never exponentiates raw logits.
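
The shift is short to implement. This is a standard-library sketch for finite logits only (it does not handle infinite or empty inputs); the function names are illustrative.

```python
import math

def logsumexp(z):
    """log(sum(exp(z_i))) computed with the max-shift trick."""
    c = max(z)
    return c + math.log(sum(math.exp(zi - c) for zi in z))

def stable_softmax(z):
    """Softmax with the same shift; the result is unchanged by subtracting c."""
    c = max(z)
    exps = [math.exp(zi - c) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

big = [1000.0, 1000.0]          # math.exp(1000.0) alone raises OverflowError
lse_big = logsumexp(big)        # 1000 + log(2)
probs_big = stable_softmax(big)
```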

Question 14

A colleague computes a 95% confidence interval $[2.1, 4.3]$ for a model's mean improvement in F1 and concludes 'there is a 95% probability the true improvement lies in $[2.1, 4.3]$.' Why is this wrong, and what is the correct frequentist interpretation?

Accepted Answer

In the frequentist framework the true parameter is fixed, not random, so a given realized interval either contains it or doesn't: the probability is 0 or 1, not 0.95. The 95% refers to the *procedure*: if you repeated the experiment many times and built a CI each time, ~95% of those intervals would cover the true value. It is a statement about long-run coverage of the method, not about this one interval. The 'probability the parameter is in [2.1,4.3]' framing is a Bayesian credible-interval statement, which requires a prior and a posterior and is a different object entirely.
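
Long-run coverage can be simulated directly. This sketch assumes NumPy and, for simplicity, a known-variance z-interval on Gaussian data; the true mean, noise level, and sample size are made-up values unrelated to the F1 example.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, sigma, n, trials = 3.0, 2.0, 40, 10_000   # illustrative values
z = 1.96                                              # approx. 97.5th percentile of N(0, 1)

samples = rng.normal(true_mean, sigma, size=(trials, n))
means = samples.mean(axis=1)
half_width = z * sigma / np.sqrt(n)       # known-sigma z-interval, for simplicity

# Each realized interval either covers the fixed true mean or it does not.
covered = (means - half_width <= true_mean) & (true_mean <= means + half_width)
coverage = covered.mean()                 # fraction of intervals that cover; near 0.95
```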

Question 15

Why are the eigenvectors of the data covariance matrix exactly the principal components, and what does each eigenvalue represent? Tie it to the Rayleigh quotient.

Accepted Answer

PCA seeks the unit direction w maximizing projected variance w^\top\Sigma w, a Rayleigh quotient. Its maximum over unit vectors is the largest eigenvalue \lambda_1, attained at eigenvector v_1; subsequent components maximize variance subject to orthogonality, giving v_2,v_3,\dots in decreasing \lambda order. Each eigenvalue \lambda_i is the variance captured along v_i, so \lambda_i/\sum_j\lambda_j is the explained-variance ratio. Because \Sigma is symmetric PSD, the eigenvectors form an orthonormal basis and eigenvalues are real and non-negative.
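
A small experiment ties the pieces together. This sketch assumes NumPy and made-up correlated 2-D data; it compares the top eigenvector against random directions under the Rayleigh quotient.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D toy data (made-up mixing matrix).
X = rng.normal(size=(1000, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
Xc = X - X.mean(axis=0)
Sigma = np.cov(Xc, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(Sigma)      # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

v1 = eigvecs[:, 0]
explained_ratio = eigvals / eigvals.sum()

def rayleigh(w):
    """Rayleigh quotient w^T Sigma w / w^T w."""
    return (w @ Sigma @ w) / (w @ w)

# No random direction should beat the top eigenvector.
random_dirs = rng.normal(size=(200, 2))
best_random = max(rayleigh(w) for w in random_dirs)
```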

Question 16

A staff interviewer claims 'the central limit theorem guarantees your sample mean is Gaussian, so you can always use a z-test.' Where does this reasoning break down?

Accepted Answer

Several failure modes. (1) The CLT is asymptotic — for small n or heavily skewed/heavy-tailed data the mean's distribution is far from Gaussian, so use a t-test (and the z-test also assumes known variance). (2) It requires finite variance; for Cauchy-like tails the sample mean never converges to a Gaussian (it stays Cauchy). (3) It requires (near-)i.i.d. samples — autocorrelated or non-stationary data inflate effective variance, so naive standard errors are too small. (4) It concerns the mean's distribution, not individual points, and convergence is only O(1/\sqrt n) — slow under high skew. Approximate normality of the mean also doesn't license tests on other statistics.
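
Failure mode (2) is easy to demonstrate. This sketch assumes NumPy and compares the spread of sample means for Gaussian and Cauchy data; the trial counts are arbitrary, and the interquartile range is used because the Cauchy has no variance.

```python
import numpy as np

rng = np.random.default_rng(0)
trials, n = 2000, 1000   # illustrative sizes

def iqr(values):
    """Interquartile range, a spread measure that exists even without a variance."""
    q25, q75 = np.percentile(values, [25, 75])
    return q75 - q25

# Sample means of n draws, repeated over many trials.
normal_means = rng.standard_normal((trials, n)).mean(axis=1)
cauchy_means = rng.standard_cauchy((trials, n)).mean(axis=1)

normal_iqr = iqr(normal_means)   # shrinks like 1 / sqrt(n)
cauchy_iqr = iqr(cauchy_means)   # stays near 2, the IQR of one standard Cauchy draw
```

Averaging 1000 Cauchy draws gives no more precision than a single draw, so a z-test on that mean is meaningless.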

Question 17

Define Fisher information and explain its role in the Cramér-Rao bound and in why MLE is asymptotically optimal.

Accepted Answer

Fisher information I(\theta)=E[(\partial_\theta \log p(x;\theta))^2]=-E[\partial^2_\theta \log p(x;\theta)] (the two forms agree under standard regularity conditions) measures the curvature/sharpness of the log-likelihood: how much a sample reveals about \theta. The Cramér-Rao bound states any unbiased estimator has variance \ge 1/I(\theta) (or I^{-1} in the matrix case), a fundamental floor on precision. The MLE is asymptotically efficient: consistent, asymptotically unbiased, with \sqrt n(\hat\theta-\theta)\to N(0, I_1(\theta)^{-1}) in distribution, where I_1 is the Fisher information of a single observation, attaining the CRB in the limit. This is why I^{-1} gives asymptotic standard errors, and why I serves as the natural metric in natural-gradient methods.
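
For a Bernoulli model everything can be computed exactly. This is a plain-Python sketch with illustrative values of \theta and n; it checks that both definitions give 1/(\theta(1-\theta)) and that the sample mean attains the bound.

```python
def bernoulli_score(x, theta):
    """d/dtheta log p(x; theta) for p = theta^x (1 - theta)^(1 - x)."""
    return x / theta - (1 - x) / (1 - theta)

def bernoulli_neg_hessian(x, theta):
    """-d^2/dtheta^2 log p(x; theta)."""
    return x / theta**2 + (1 - x) / (1 - theta) ** 2

def expectation(g, theta):
    """E[g(x, theta)] under x ~ Bernoulli(theta), summed exactly over x in {0, 1}."""
    return (1 - theta) * g(0, theta) + theta * g(1, theta)

theta, n = 0.3, 50                                      # illustrative values
info_from_score = expectation(lambda x, t: bernoulli_score(x, t) ** 2, theta)
info_from_curvature = expectation(bernoulli_neg_hessian, theta)
closed_form = 1.0 / (theta * (1.0 - theta))

crb = 1.0 / (n * closed_form)                 # Cramér-Rao bound for n i.i.d. samples
var_sample_mean = theta * (1.0 - theta) / n   # exact variance of the MLE (sample mean)
```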

Question 18

What is the relationship between minimizing cross-entropy loss and maximum likelihood estimation, and why does this make cross-entropy the 'natural' classification loss?

Accepted Answer

For a model outputting class probabilities q_\theta(y|x), the data log-likelihood is \sum_n \log q_\theta(y_n|x_n). Negating and averaging gives exactly the empirical cross-entropy -\frac1N\sum_n \log q_\theta(y_n|x_n), so minimizing cross-entropy is identical to maximizing likelihood. It is natural because it is a strictly proper scoring rule matching the categorical/Bernoulli likelihood: its gradient is the clean predicted-minus-observed form, it is convex in the logits (hence in the parameters of a linear model), and it penalizes confident wrong predictions unboundedly, encouraging calibrated probabilities rather than just a correct argmax, unlike 0-1 or hinge loss.
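
The equality is immediate once the targets are written as one-hot vectors. This is a standard-library sketch with made-up predicted probabilities and labels.

```python
import math

# Made-up predicted class probabilities q_theta(y | x_n) for N = 3 examples, 3 classes.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.3, 0.3, 0.4]]
labels = [0, 1, 2]

# Average negative log-likelihood of the observed labels.
nll = -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def one_hot(y, k):
    return [1.0 if j == y else 0.0 for j in range(k)]

# Empirical cross-entropy between one-hot targets and predictions.
cross_entropy = -sum(
    t * math.log(q)
    for p, y in zip(probs, labels)
    for t, q in zip(one_hot(y, len(p)), p)
) / len(labels)
```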

Question 19

You run an A/B test on 20 model variants vs control at $\alpha=0.05$ and find 2 'significant' wins. A skeptic says you've found nothing. Explain the multiple-comparisons problem, why the naive p-values mislead, and how Bonferroni vs Benjamini-Hochberg differ in what they control.

Accepted Answer

With 20 independent tests under the null, the family-wise probability of \geq 1 false positive is 1-0.95^{20}\approx 0.64 — so 2 'wins' is entirely consistent with pure noise; per-test p-values overstate evidence. Bonferroni controls the family-wise error rate (FWER, probability of *any* false positive) by testing each at \alpha/m=0.0025; it's conservative and loses power as m grows. Benjamini-Hochberg controls the false discovery rate (FDR, expected *proportion* of false positives among rejections): rank p-values, find largest k with p_{(k)}\leq \frac{k}{m}\alpha. BH is more powerful, appropriate when some false discoveries are tolerable, as in screening many variants.
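
The numbers in this answer follow from the binomial distribution. This is a standard-library sketch assuming all 20 nulls are true and the tests are independent; the helper name is illustrative.

```python
from math import comb

m, alpha = 20, 0.05

def prob_at_least(k, m, alpha):
    """P(at least k false positives) among m independent true-null tests."""
    return 1.0 - sum(comb(m, j) * alpha**j * (1 - alpha) ** (m - j) for j in range(k))

fwer = prob_at_least(1, m, alpha)            # about 0.64
two_or_more = prob_at_least(2, m, alpha)     # about 0.26
expected_false_positives = m * alpha         # 1.0
bonferroni_threshold = alpha / m             # 0.0025
```

Seeing two or more "wins" happens roughly a quarter of the time under pure noise, so the skeptic's position is reasonable until the p-values are corrected.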

Question 20

A p-value of 0.04 is reported for a new model beating baseline. List the distinct things this p-value does NOT tell you, and explain the difference between statistical and practical significance plus how power and sample size distort the picture.

Accepted Answer

It does NOT give: the probability the null is true, the probability your hypothesis is true, the probability of replication, or the effect size/its importance. p=P(\text{data this extreme}\mid H_0), not P(H_0\mid\text{data}); conflating them is the prosecutor's fallacy. Statistical significance only says the effect is detectable given n; with huge n a trivial 0.1% F1 gain becomes 'significant' yet practically worthless, while an underpowered study can miss a large real effect (Type II), and significant results from low-power studies have inflated effect sizes (winner's curse / Type M error). Always report effect size and a CI, not just p.
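
The dependence on n can be made concrete with a z-test. This standard-library sketch assumes a known per-sample standard deviation and i.i.d. paired differences; the effect size and spread are made-up values.

```python
import math

def two_sided_z_p_value(effect, sd, n):
    """Two-sided p-value for a z-test of mean = 0, assuming known sd and i.i.d. samples."""
    z = effect / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

effect, sd = 0.001, 0.05            # a 0.1% gain with per-sample sd of 5% (made up)
p_small_n = two_sided_z_p_value(effect, sd, n=100)         # far from significant
p_huge_n = two_sided_z_p_value(effect, sd, n=1_000_000)    # overwhelmingly "significant"
```

The effect is identical in both cases; only the sample size changed, which is why the p-value alone says nothing about practical importance.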

Math & Statistics for ML