What is the difference between a covariance matrix and a correlation matrix, and why is the covariance matrix always positive semi-definite?
Covariance $\Sigma_{ij}=\mathrm{Cov}(X_i,X_j)$ captures linear co-variation in the variables' original units; the correlation matrix normalizes each entry by $\sigma_i\sigma_j$, giving unit-free values in $[-1,1]$ with a diagonal of 1. The covariance matrix is PSD because for any vector $a$, $a^\top\Sigma a = \mathrm{Var}(a^\top X)\ge 0$: a variance can never be negative. Symmetry makes the eigenvalues real and PSD makes them non-negative, which is what makes PCA well-defined; a Gaussian density additionally needs $\Sigma$ to be strictly positive definite so that $\Sigma^{-1}$ exists.
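A minimal sketch of both matrices in plain Python, assuming made-up height/weight data (the data, seed, and helper names are illustrative, not part of the card). It builds the sample covariance, rescales it to a correlation matrix, and checks that $a^\top\Sigma a$ stays non-negative for random vectors $a$.

```python
import random

def covariance_matrix(rows):
    """Sample covariance (divides by n - 1) of a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]

def correlation_matrix(cov):
    """Divide each covariance entry by sigma_i * sigma_j."""
    sd = [cov[i][i] ** 0.5 for i in range(len(cov))]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(len(cov))]
            for i in range(len(cov))]

def quadratic_form(a, m):
    """Compute a^T M a."""
    return sum(a[i] * m[i][j] * a[j] for i in range(len(a)) for j in range(len(a)))

rng = random.Random(0)
# Hypothetical data: height in cm and a noisy, correlated weight in kg.
data = []
for _ in range(200):
    height = rng.gauss(170, 10)
    data.append([height, 0.9 * height - 85 + rng.gauss(0, 5)])

cov = covariance_matrix(data)       # entries carry units (cm^2, cm*kg, kg^2)
corr = correlation_matrix(cov)      # unit-free, diagonal of 1
# a^T Sigma a is the sample variance of a^T X, so it cannot be negative.
forms = [quadratic_form([rng.uniform(-1, 1), rng.uniform(-1, 1)], cov)
         for _ in range(1000)]
print(corr[0][1], min(forms))
```

Rescaling the units (say, cm to m) changes `cov` but leaves `corr` untouched, which is the practical reason to prefer correlation when variables live on different scales.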
#covariance #correlation #psd #linear-algebra
Foundational concept
Define entropy, cross-entropy, and KL divergence, and state the exact relationship among them.
Entropy $H(p)=-\sum_x p(x)\log p(x)$ is the average number of bits (with base-2 logs) needed to encode samples from $p$ under an optimal code. Cross-entropy $H(p,q)=-\sum_x p(x)\log q(x)$ is the cost of coding $p$'s samples with a code optimized for $q$. KL divergence $D_{KL}(p\|q)=\sum_x p(x)\log\frac{p(x)}{q(x)}$ is the excess cost. They relate as $H(p,q)=H(p)+D_{KL}(p\|q)$. Since $H(p)$ is fixed w.r.t. the model $q$, minimizing cross-entropy over $q$ is equivalent to minimizing $D_{KL}(p\|q)$, the KL divergence from the data distribution $p$ to the model.
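The identity can be checked numerically. The sketch below assumes two small hand-picked distributions `p` and `q` (illustrative values only) and uses base-2 logs so the results are in bits.

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # "true" distribution (illustrative)
q = [0.25, 0.25, 0.5]   # model distribution (illustrative)

print(entropy(p))            # 1.5 bits
print(cross_entropy(p, q))   # 1.75 bits
print(kl_divergence(p, q))   # 0.25 bits of excess cost
```

Here $1.75 = 1.5 + 0.25$, matching $H(p,q)=H(p)+D_{KL}(p\|q)$.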
A test for a disease has 99% sensitivity and 95% specificity. Prevalence is 0.5%. Using Bayes' rule, what is the probability a person who tests positive actually has the disease?
$P(D|+)=\frac{P(+|D)P(D)}{P(+|D)P(D)+P(+|\neg D)P(\neg D)}$. Numerator $=0.99\times0.005=0.00495$; false-positive term $=0.05\times0.995=0.04975$. So $P(D|+)=0.00495/0.0547\approx 0.0905$, about 9%. Despite a seemingly accurate test, the low base rate means most positives are false positives — the base-rate fallacy. This is why screening rare conditions requires confirmatory testing.
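The arithmetic above can be reproduced with a few lines of Python; the function name `posterior_positive` is a hypothetical helper, not a library call.

```python
def posterior_positive(sensitivity, specificity, prevalence):
    """P(disease | positive test) via Bayes' rule."""
    true_pos = sensitivity * prevalence                 # P(+|D) P(D)
    false_pos = (1 - specificity) * (1 - prevalence)    # P(+|not D) P(not D)
    return true_pos / (true_pos + false_pos)

ppv = posterior_positive(sensitivity=0.99, specificity=0.95, prevalence=0.005)
print(round(ppv, 4))  # 0.0905
```

Raising the prevalence argument to 0.5 pushes the same test's posterior above 0.95, which shows how strongly the base rate drives the answer.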
#bayes #base-rate #probability #conditional
Foundational concept
Explain the curse of dimensionality and give two concrete ways it degrades ML models.
As dimensionality $d$ grows, volume grows exponentially, so a fixed sample covers a vanishing fraction of the space — data becomes sparse. Concretely: (1) distance concentration — the ratio of nearest to farthest neighbor distances approaches 1, so $k$-NN and clustering lose discriminative power; (2) sample complexity — maintaining density needs exponentially more points, so density estimation and nonparametric methods fail. It also inflates variance and overfitting risk. Mitigations: dimensionality reduction (PCA), feature selection, and strong inductive biases/regularization.
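Distance concentration is easy to see in a small simulation. This sketch assumes points drawn uniformly from the unit hypercube and a fixed seed (both are illustrative choices) and reports the nearest-to-farthest distance ratio from one query point.

```python
import math
import random

def nearest_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of nearest to farthest Euclidean distance from a random query."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return min(dists) / max(dists)

ratios = {d: nearest_farthest_ratio(d) for d in (2, 10, 100, 1000)}
for d, r in ratios.items():
    print(d, round(r, 3))
```

The ratio climbs toward 1 as the dimension grows, so "nearest" and "farthest" neighbors become nearly indistinguishable.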
Derive the maximum likelihood estimator for the mean and variance of a Gaussian, and explain why the MLE variance is biased.
Log-likelihood: $\ell=-\frac{n}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum(x_i-\mu)^2$. Setting $\partial\ell/\partial\mu=0$ gives $\hat\mu=\frac1n\sum x_i$; setting $\partial\ell/\partial\sigma^2=0$ gives $\hat\sigma^2=\frac1n\sum(x_i-\hat\mu)^2$. It is biased because using $\hat\mu$ (estimated from the same data) rather than the true $\mu$ minimizes and thus understates the residual sum of squares; one degree of freedom is consumed. $E[\hat\sigma^2]=\frac{n-1}{n}\sigma^2$, so the unbiased estimator divides by $n-1$ (Bessel's correction).
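The bias factor $\frac{n-1}{n}$ can be checked by simulation. The sketch below assumes $n=5$ draws from $N(0, 2^2)$ and a fixed seed (illustrative settings), averages the MLE variance over many repetitions, and applies Bessel's correction.

```python
import random

def mle_variance(xs):
    """Variance MLE: divides by n and centers on the sample mean."""
    mu_hat = sum(xs) / len(xs)
    return sum((x - mu_hat) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
n, true_var, trials = 5, 4.0, 20000

avg_mle = sum(mle_variance([rng.gauss(0, 2) for _ in range(n)])
              for _ in range(trials)) / trials
avg_unbiased = avg_mle * n / (n - 1)   # Bessel's correction

print(avg_mle)        # close to (n - 1) / n * 4.0 = 3.2
print(avg_unbiased)   # close to 4.0
```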
#mle #gaussian #bias #estimation
Intermediate concept
How does MAP estimation relate to MLE, and what does adding a Gaussian prior on the weights correspond to in a regression loss?
MAP maximizes the posterior $\propto$ likelihood $\times$ prior: $\arg\max_\theta \log p(D|\theta)+\log p(\theta)$. MLE drops the prior term (equivalently assumes a flat/improper prior). A zero-mean Gaussian prior $\theta\sim N(0,\tau^2 I)$ contributes $-\frac{1}{2\tau^2}\|\theta\|^2$ to the log-posterior. With a Gaussian likelihood of noise variance $\sigma^2$, maximizing the posterior is exactly L2 (ridge) regularization, $\min_\theta \sum_i (y_i-\theta^\top x_i)^2+\lambda\|\theta\|^2$ with $\lambda=\sigma^2/\tau^2$. A Laplace prior gives L1/lasso. As data grows, the likelihood dominates and MAP converges to MLE.
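A one-parameter sketch of the MAP-ridge correspondence, assuming a no-intercept model $y=\theta x+\epsilon$ with made-up data and variances (all values are illustrative). The closed-form ridge estimate with $\lambda=\sigma^2/\tau^2$ should sit at the peak of the log-posterior.

```python
def log_posterior(theta, xs, ys, sigma2, tau2):
    """Gaussian log-likelihood plus Gaussian log-prior, constants dropped."""
    sse = sum((y - theta * x) ** 2 for x, y in zip(xs, ys))
    return -sse / (2 * sigma2) - theta ** 2 / (2 * tau2)

def ridge_estimate(xs, ys, lam):
    """Minimizer of sum (y - theta x)^2 + lam * theta^2 in one dimension."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0, 4.0]   # illustrative inputs
ys = [1.1, 1.9, 3.2, 3.9]   # illustrative targets
sigma2, tau2 = 0.5, 2.0     # assumed noise and prior variances

theta_map = ridge_estimate(xs, ys, lam=sigma2 / tau2)
theta_mle = ridge_estimate(xs, ys, lam=0.0)   # flat prior: no penalty
print(theta_mle, theta_map)   # the MAP estimate is shrunk toward zero
```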
#map #mle #prior #regularization
Intermediate concept
What is the geometric and algebraic meaning of the SVD, and how does it connect to eigendecomposition?
SVD factors any $m\times n$ matrix as $A=U\Sigma V^\top$, where $U$ and $V$ have orthonormal columns (they are orthogonal matrices in the full SVD) and $\Sigma$ is diagonal with non-negative singular values. Geometrically it decomposes the linear map into a rotation/reflection ($V^\top$), an axis-aligned scaling ($\Sigma$), and another rotation/reflection ($U$). The right singular vectors $V$ are eigenvectors of $A^\top A$, the left singular vectors $U$ are eigenvectors of $AA^\top$, and the singular values are the square roots $\sqrt{\lambda_i}$ of the eigenvalues of $A^\top A$ (which $AA^\top$ shares, apart from extra zeros). Unlike eigendecomposition, which needs a square, diagonalizable matrix, SVD exists for any matrix, including rectangular and rank-deficient ones.
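The link to eigendecomposition can be made concrete for a $2\times2$ matrix, where the eigenvalues of $A^\top A$ have a closed form. This is a sketch for illustration only (the helper name and the example matrix are assumptions); real code would call a linear-algebra library.

```python
import math

def singular_values_2x2(a):
    """Singular values as square roots of the eigenvalues of A^T A."""
    (p, q), (r, s) = a
    # Entries of the symmetric Gram matrix G = A^T A.
    g11, g12, g22 = p * p + r * r, p * q + r * s, q * q + s * s
    mean = (g11 + g22) / 2
    radius = math.hypot((g11 - g22) / 2, g12)
    # Eigenvalues of G are mean +/- radius; clamp tiny negative rounding error.
    return math.sqrt(mean + radius), math.sqrt(max(mean - radius, 0.0))

A = [[3.0, 0.0], [4.0, 5.0]]   # illustrative matrix
s1, s2 = singular_values_2x2(A)
print(s1, s2)   # sqrt(45) and sqrt(5)
```

Two sanity checks follow from the factorization: $\sigma_1\sigma_2=|\det A|=15$ and $\sigma_1^2+\sigma_2^2=\|A\|_F^2=50$.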
#svd #eigendecomposition #linear-algebra #matrix
Intermediate concept
Precisely state what a p-value is and what a 95% confidence interval is, correcting the two most common misinterpretations.
A p-value is $P(\text{data at least as extreme}\mid H_0\text{ true})$ — NOT the probability $H_0$ is true, and not the probability the result occurred by chance. A 95% CI means: if the experiment were repeated many times, 95% of the constructed intervals would contain the true parameter. It does NOT mean there is a 95% probability the true value lies in this particular interval — the parameter is fixed; the interval is random. Both are frequentist statements about long-run procedure behavior, not about a single hypothesis or interval.
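The long-run reading of a confidence interval can be simulated. The sketch below assumes Gaussian data with a known standard deviation, so a z-interval with half-width $1.96\,\sigma/\sqrt n$ applies; the true mean, sample size, and seed are illustrative.

```python
import math
import random

def coverage(true_mean=3.0, sigma=2.0, n=30, trials=5000, seed=0):
    """Fraction of 95% z-intervals (known sigma) that contain the true mean."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / trials

rate = coverage()
print(rate)   # close to 0.95
```

Each individual interval either covers 3.0 or misses it; only the procedure has a 95% success rate.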
Why is KL divergence not a distance metric, and when would you use forward KL $D(p\|q)$ versus reverse KL $D(q\|p)$ in variational inference?
KL is not a metric: it is asymmetric ($D(p\|q)\ne D(q\|p)$) and violates the triangle inequality, though it is non-negative and zero iff $p=q$. Forward KL $D(p\|q)$ is mass-covering/mean-seeking — $q$ must put mass wherever $p$ does, so it over-spreads (this is what MLE/cross-entropy minimize). Reverse KL $D(q\|p)$ is mode-seeking/zero-forcing — $q$ avoids regions where $p\approx0$, locking onto one mode. Mean-field variational inference minimizes reverse KL, which is why it tends to underestimate posterior variance.
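A small discrete example makes the two behaviors visible. It assumes a hand-built bimodal target `p` and two hypothetical approximations, one spread over all bins and one concentrated on a single mode (all numbers are illustrative).

```python
import math

def kl(p, q):
    """D(p || q) in nats for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.45, 0.04, 0.02, 0.04, 0.45]        # bimodal target
q_spread = [0.2, 0.2, 0.2, 0.2, 0.2]      # covers both modes
q_mode = [0.90, 0.07, 0.01, 0.01, 0.01]   # locks onto the left mode

forward = {"spread": kl(p, q_spread), "mode": kl(p, q_mode)}
reverse = {"spread": kl(q_spread, p), "mode": kl(q_mode, p)}
print(forward)   # forward KL is lower for the spread-out q
print(reverse)   # reverse KL is lower for the single-mode q
```

The same pair of candidates is ranked in opposite order by the two directions, which is the asymmetry in action.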
Derive the gradient of the softmax cross-entropy loss with respect to the logits and explain why the result is so clean.
Let $p_i=\frac{e^{z_i}}{\sum_j e^{z_j}}$ and loss $L=-\sum_i y_i\log p_i$ for one-hot $y$. The softmax Jacobian is $\partial p_i/\partial z_k=p_i(\delta_{ik}-p_k)$. Chaining with $\partial L/\partial p_i=-y_i/p_i$ gives $\partial L/\partial z_k=-\sum_i y_i(\delta_{ik}-p_k)=-y_k+p_k\sum_i y_i$, and since $\sum_i y_i=1$ this collapses to $\partial L/\partial z_k=p_k-y_k$. It is clean because softmax is the inverse of the canonical link of the categorical exponential family: for any exponential-family GLM paired with its canonical link and negative log-likelihood loss, the gradient with respect to the linear predictor (here, the logits) is predicted minus observed.
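The result $\partial L/\partial z_k=p_k-y_k$ can be verified with a finite-difference gradient check. This is a sketch with illustrative logits and a hypothetical one-hot label, written in plain Python rather than an autodiff framework.

```python
import math

def softmax(z):
    c = max(z)                                  # shift for numerical stability
    exps = [math.exp(v - c) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(z, y):
    p = softmax(z)
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)

def analytic_grad(z, y):
    """Closed form: predicted probabilities minus one-hot targets."""
    return [pi - yi for pi, yi in zip(softmax(z), y)]

def numeric_grad(z, y, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grad = []
    for k in range(len(z)):
        up = z[:k] + [z[k] + eps] + z[k + 1:]
        down = z[:k] + [z[k] - eps] + z[k + 1:]
        grad.append((cross_entropy_loss(up, y) - cross_entropy_loss(down, y))
                    / (2 * eps))
    return grad

z = [2.0, -1.0, 0.5]   # illustrative logits
y = [0.0, 1.0, 0.0]    # one-hot label for class 1
print(analytic_grad(z, y))
print(numeric_grad(z, y))
```

The gradient entries sum to zero because both $p$ and $y$ sum to 1.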
#softmax #cross-entropy #gradient #backprop
Advanced math
Explain the bias-variance decomposition of expected squared error and write it out formally for a single prediction.
For target $y=f(x)+\epsilon$ with $\mathrm{Var}(\epsilon)=\sigma^2$, the expected squared error of estimator $\hat f$ over training sets is $E[(y-\hat f(x))^2]=\underbrace{(f(x)-E[\hat f(x)])^2}_{\text{bias}^2}+\underbrace{E[(\hat f(x)-E[\hat f(x)])^2]}_{\text{variance}}+\underbrace{\sigma^2}_{\text{irreducible}}$. Bias is systematic error from model misspecification/underfitting; variance is sensitivity to the particular training sample/overfitting; the irreducible term is noise no model can remove. Increasing complexity trades bias down for variance up. This clean additive decomposition is specific to squared loss.
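The decomposition can be checked by Monte Carlo. The sketch below assumes a deliberately biased estimator, a shrunken sample mean $\hat f = c\cdot\bar y$ at a single $x$, with illustrative values $f(x)=2$, $\sigma=1$, $n=5$, $c=0.5$, so that bias$^2=1$, variance $=0.05$, and noise $=1$.

```python
import random

rng = random.Random(0)
f_x, sigma, n, shrink, trials = 2.0, 1.0, 5, 0.5, 20000

preds, sq_errors = [], []
for _ in range(trials):
    train = [f_x + rng.gauss(0, sigma) for _ in range(n)]   # new training set
    f_hat = shrink * sum(train) / n           # deliberately biased estimator
    y_new = f_x + rng.gauss(0, sigma)         # fresh noisy target at the same x
    preds.append(f_hat)
    sq_errors.append((y_new - f_hat) ** 2)

mean_pred = sum(preds) / trials
bias_sq = (f_x - mean_pred) ** 2
variance = sum((p - mean_pred) ** 2 for p in preds) / trials
mse = sum(sq_errors) / trials

print(bias_sq, variance, sigma ** 2)   # the three components
print(mse)                             # close to their sum, about 2.05
```

Setting `shrink = 1.0` removes the bias but raises the variance to $\sigma^2/n=0.2$, a small instance of the trade-off.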
What is the multiple comparisons problem, and how do Bonferroni and Benjamini-Hochberg differ in what they control?
Running $m$ tests, each at level $\alpha$, inflates the chance of at least one false positive to $1-(1-\alpha)^m$ when the tests are independent and every null is true. Bonferroni controls the family-wise error rate (FWER), the probability of any false positive, by testing each hypothesis at $\alpha/m$; it is simple but very conservative, losing power as $m$ grows. Benjamini-Hochberg controls the false discovery rate (FDR), the expected proportion of false positives among rejections: sort the p-values, find the largest $k$ with $p_{(k)}\le \frac{k}{m}\alpha$, and reject the $k$ smallest. BH is far more powerful and standard in high-dimensional settings like genomics.
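Both procedures fit in a few lines. This is a sketch on ten made-up p-values (the list and function names are illustrative); it returns a reject/keep flag per hypothesis in the original order.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject hypothesis i when p_i <= alpha / m (controls FWER)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, k = largest rank with p_(k) <= k/m * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank          # keep the largest rank that passes
    reject = [False] * m
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject

# Illustrative p-values, not from any real experiment.
p_values = [0.001, 0.008, 0.012, 0.019, 0.030, 0.20, 0.45, 0.60, 0.75, 0.90]
print(sum(bonferroni(p_values)))          # 1 rejection
print(sum(benjamini_hochberg(p_values)))  # 4 rejections
```

With $m=10$, Bonferroni's threshold is 0.005 for every test, while BH's threshold grows with rank (0.005, 0.010, 0.015, 0.020, ...), which is where the extra power comes from.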
Why is the log-sum-exp trick used in softmax and log-likelihood computations? Explain the numerical issue and the fix.
Computing $\log\sum_i e^{z_i}$ directly overflows when any $z_i$ is large (e.g. $e^{1000}$ exceeds the float64 range and evaluates to $\infty$), and when every logit is very negative each term underflows to 0, giving $\log 0=-\infty$. The fix factors out the max: $\log\sum_i e^{z_i}=c+\log\sum_i e^{z_i-c}$ with $c=\max_i z_i$. Now the largest exponent is $e^0=1$, so nothing overflows, and the dominant term never underflows. Softmax uses the same shift: $p_i=e^{z_i-c}/\sum_j e^{z_j-c}$, whose value does not depend on $c$. This is why production loss code never exponentiates raw logits.
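A minimal sketch of the shift in plain Python, with illustrative logits. Note that Python's `math.exp` raises `OverflowError` for large inputs rather than returning infinity, so the naive version fails outright here.

```python
import math

def naive_logsumexp(z):
    return math.log(sum(math.exp(v) for v in z))

def logsumexp(z):
    """Stable version: subtract the max before exponentiating."""
    c = max(z)
    return c + math.log(sum(math.exp(v - c) for v in z))

def stable_softmax(z):
    c = max(z)
    exps = [math.exp(v - c) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1000.0, 999.0, 998.0]
try:
    naive_logsumexp(logits)
    overflowed = False
except OverflowError:
    overflowed = True   # math.exp(1000.0) is outside the float64 range

stable = logsumexp(logits)        # 1000 + log(1 + e^-1 + e^-2)
probs = stable_softmax(logits)    # same as softmax of [2, 1, 0]
print(overflowed, stable, probs)
```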
#log-sum-exp #numerical-stability #softmax #overflow
Advanced concept
A colleague computes a 95% confidence interval $[2.1, 4.3]$ for a model's mean improvement in F1 and concludes 'there is a 95% probability the true improvement lies in $[2.1, 4.3]$.' Why is this wrong, and what is the correct frequentist interpretation?
In the frequentist framework the true parameter is fixed, not random, so a given realized interval either contains it or doesn't: the probability is 0 or 1, not 0.95. The 95% refers to the *procedure*: if you repeated the experiment many times and built a CI each time, ~95% of those intervals would cover the true value. It is a statement about long-run coverage of the method, not about this one interval. The 'probability the parameter is in $[2.1,4.3]$' framing is a Bayesian credible-interval statement, which requires a prior and a posterior and is a different object entirely.
Why are the eigenvectors of the data covariance matrix exactly the principal components, and what does each eigenvalue represent? Tie it to the Rayleigh quotient.
PCA seeks the unit direction $w$ maximizing projected variance $w^\top\Sigma w$ — a Rayleigh quotient. Its maximum over unit vectors is the largest eigenvalue $\lambda_1$, attained at eigenvector $v_1$; subsequent components maximize variance subject to orthogonality, giving $v_2,v_3,\dots$ in decreasing $\lambda$ order. Each eigenvalue $\lambda_i$ is the variance captured along $v_i$, so $\lambda_i/\sum_j\lambda_j$ is the explained-variance ratio. Because $\Sigma$ is symmetric PSD, the eigenvectors form an orthonormal basis and eigenvalues are real and non-negative.
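For a $2\times2$ covariance matrix the top eigenpair has a closed form, so the Rayleigh-quotient claim can be checked directly. The sketch assumes an illustrative covariance matrix and sweeps unit vectors on a one-degree grid; real code would use a linear-algebra library.

```python
import math

def top_eigenpair_2x2(s):
    """Largest eigenvalue and unit eigenvector of a symmetric 2x2 matrix."""
    a, b, d = s[0][0], s[0][1], s[1][1]
    lam = (a + d) / 2 + math.hypot((a - d) / 2, b)
    if b != 0:
        vx, vy = b, lam - a              # solves (S - lam I) v = 0
    else:
        vx, vy = (1.0, 0.0) if a >= d else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return lam, (vx / norm, vy / norm)

def rayleigh(w, s):
    """w^T S w for a unit vector w: the variance of the projection onto w."""
    return sum(w[i] * s[i][j] * w[j] for i in range(2) for j in range(2))

sigma = [[4.0, 1.5], [1.5, 1.0]]   # illustrative covariance matrix
lam1, v1 = top_eigenpair_2x2(sigma)

# Sweep unit vectors; none should beat the top eigenvalue.
angles = [2 * math.pi * k / 360 for k in range(360)]
best_on_grid = max(rayleigh((math.cos(t), math.sin(t)), sigma) for t in angles)

trace = sigma[0][0] + sigma[1][1]   # sum of eigenvalues = total variance
print(lam1, best_on_grid, lam1 / trace)   # explained-variance ratio ~0.924
```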
#pca #eigenvectors #rayleigh-quotient #variance
Expert concept
A staff interviewer claims 'the central limit theorem guarantees your sample mean is Gaussian, so you can always use a z-test.' Where does this reasoning break down?
Several failure modes. (1) The CLT is asymptotic: for small $n$ or heavily skewed/heavy-tailed data the mean's distribution can be far from Gaussian. The z-test also assumes a known variance; when it is estimated from the sample, use a $t$-test. (2) It requires finite variance; for Cauchy data the sample mean is itself Cauchy for every $n$, so it never converges to a Gaussian. (3) It requires (near-)i.i.d. samples: positively autocorrelated or non-stationary data inflate the variance of the mean, so naive standard errors are too small. (4) It concerns the mean's distribution, not individual points, and the error of the normal approximation shrinks only at rate $O(1/\sqrt n)$ (Berry-Esseen), which is slow under high skew. Approximate normality of the mean also doesn't license tests on other statistics.
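The finite-variance requirement can be seen in simulation. This sketch compares the interquartile range (IQR) of sample means for standard Gaussian and standard Cauchy draws; the trial count, seed, and helper names are illustrative assumptions.

```python
import math
import random

def iqr_of_sample_means(draw, n, trials=2000, seed=0):
    """Interquartile range of the mean of n draws, over many repetitions."""
    rng = random.Random(seed)
    means = sorted(sum(draw(rng) for _ in range(n)) / n for _ in range(trials))
    return means[3 * trials // 4] - means[trials // 4]

def draw_gauss(rng):
    return rng.gauss(0, 1)

def draw_cauchy(rng):
    # Inverse-CDF sampling of a standard Cauchy variable.
    return math.tan(math.pi * (rng.random() - 0.5))

gauss_iqr = {n: iqr_of_sample_means(draw_gauss, n) for n in (1, 100)}
cauchy_iqr = {n: iqr_of_sample_means(draw_cauchy, n) for n in (1, 100)}
print(gauss_iqr)    # shrinks by about a factor of 10 from n=1 to n=100
print(cauchy_iqr)   # stays near 2 regardless of n
```

Averaging 100 Gaussian draws tightens the mean by $\sqrt{100}$; averaging 100 Cauchy draws buys nothing.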
#clt #z-test #heavy-tails #iid
Expert math
Define Fisher information and explain its role in the Cramér-Rao bound and in why MLE is asymptotically optimal.
Fisher information $I(\theta)=E[(\partial_\theta \log p(x;\theta))^2]=-E[\partial^2_\theta \log p(x;\theta)]$ (the two forms agree under standard regularity conditions) measures the curvature/sharpness of the log-likelihood, i.e. how much a sample reveals about $\theta$. The Cramér-Rao bound states any unbiased estimator has variance $\ge 1/I(\theta)$ (or $I^{-1}$ in the matrix case), a fundamental floor on precision; for $n$ i.i.d. observations $I(\theta)=nI_1(\theta)$, with $I_1$ the per-observation information. The MLE is asymptotically efficient: consistent, asymptotically unbiased, with $\sqrt n(\hat\theta-\theta)\to N(0, I_1(\theta)^{-1})$ in distribution, attaining the CRB in the limit. This is why $I^{-1}$ gives asymptotic standard errors and why $I$ serves as the metric in natural-gradient methods.
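A Bernoulli model gives a compact check. The sketch assumes $p=0.3$, $n=50$, and a fixed seed (illustrative choices); it computes the per-observation information as the expected squared score and compares the simulated variance of the MLE (the sample proportion) with the Cramér-Rao bound $1/(nI_1)$.

```python
import random

def fisher_information_bernoulli(p):
    """Expected squared score E[(d/dp log f(x; p))^2] for one Bernoulli draw."""
    score_if_one = 1 / p              # d/dp log p
    score_if_zero = -1 / (1 - p)      # d/dp log(1 - p)
    return p * score_if_one ** 2 + (1 - p) * score_if_zero ** 2

p_true, n, trials = 0.3, 50, 20000
rng = random.Random(0)

# The MLE of p is the sample proportion of successes.
estimates = [sum(rng.random() < p_true for _ in range(n)) / n
             for _ in range(trials)]
mean_est = sum(estimates) / trials
var_est = sum((e - mean_est) ** 2 for e in estimates) / trials

cramer_rao = 1 / (n * fisher_information_bernoulli(p_true))
print(fisher_information_bernoulli(p_true))   # 1 / (p (1 - p)), about 4.76
print(var_est, cramer_rao)                    # both close to 0.0042
```

For the Bernoulli proportion the bound is attained exactly at every $n$, not only asymptotically, which makes it a convenient test case.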
#fisher-information #cramer-rao #mle #efficiency
Expert concept
What is the relationship between minimizing cross-entropy loss and maximum likelihood estimation, and why does this make cross-entropy the 'natural' classification loss?
For a model outputting class probabilities $q_\theta(y|x)$, the data log-likelihood is $\sum_n \log q_\theta(y_n|x_n)$. Negating and averaging gives exactly the empirical cross-entropy $-\frac1N\sum_n \log q_\theta(y_n|x_n)$, so minimizing cross-entropy is identical to maximizing likelihood. It is natural because it is the proper scoring rule (the log score) matching the categorical/Bernoulli likelihood: its gradient with respect to the logits is the clean predicted-minus-observed form, it is convex in the logits (and therefore in the weights of a linear model), and it penalizes confident wrong predictions without bound. That pushes the model toward calibrated probabilities, whereas 0-1 and hinge loss reward only a correct argmax or margin.
You run an A/B test on 20 model variants vs control at $\alpha=0.05$ and find 2 'significant' wins. A skeptic says you've found nothing. Explain the multiple-comparisons problem, why the naive p-values mislead, and how Bonferroni vs Benjamini-Hochberg differ in what they control.
With 20 independent tests under the null, the family-wise probability of $\geq 1$ false positive is $1-0.95^{20}\approx 0.64$ — so 2 'wins' is entirely consistent with pure noise; per-test p-values overstate evidence. Bonferroni controls the family-wise error rate (FWER, probability of *any* false positive) by testing each at $\alpha/m=0.0025$; it's conservative and loses power as $m$ grows. Benjamini-Hochberg controls the false discovery rate (FDR, expected *proportion* of false positives among rejections): rank p-values, find largest $k$ with $p_{(k)}\leq \frac{k}{m}\alpha$. BH is more powerful, appropriate when some false discoveries are tolerable, as in screening many variants.
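The numbers in this scenario follow from the binomial distribution. The sketch assumes 20 independent tests with every null true; `prob_at_least` is a hypothetical helper.

```python
from math import comb

def prob_at_least(k, m, alpha):
    """P(at least k false positives) among m independent true-null tests."""
    return 1 - sum(comb(m, j) * alpha ** j * (1 - alpha) ** (m - j)
                   for j in range(k))

m, alpha = 20, 0.05
p_any = prob_at_least(1, m, alpha)    # family-wise error rate, about 0.64
p_two = prob_at_least(2, m, alpha)    # two or more "wins" by chance, about 0.26
expected_false_positives = m * alpha  # 1.0
bonferroni_threshold = alpha / m      # 0.0025

print(round(p_any, 4), round(p_two, 4))
```

Seeing two or more "significant" variants happens about a quarter of the time under pure noise, so the skeptic's objection is reasonable until the p-values survive a correction.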
A p-value of 0.04 is reported for a new model beating baseline. List the distinct things this p-value does NOT tell you, and explain the difference between statistical and practical significance plus how power and sample size distort the picture.
It does NOT give: the probability the null is true, the probability your hypothesis is true, the probability of replication, or the effect size/its importance. $p=P(\text{data this extreme}\mid H_0)$, not $P(H_0\mid\text{data})$ — conflating them is the prosecutor's fallacy. Statistical significance only says the effect is detectable given $n$; with huge $n$ a trivial 0.1% F1 gain becomes 'significant' yet practically worthless, while an underpowered study can miss a large real effect (Type II), and significant results from low-power studies have inflated effect sizes (winner's curse / Type M error). Always report effect size and a CI, not just $p$.
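The effect of sample size on significance can be sketched with a two-sided z-test. The example assumes a fixed true gain of 0.001 (0.1 F1 points on a 0 to 1 scale) and a per-sample standard deviation of 0.05, both made up for illustration, and treats the observed mean as equal to the true gain.

```python
import math

def two_sided_p(effect, sd, n):
    """Two-sided z-test p-value for a mean difference with known sd."""
    z = effect / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))

effect, sd = 0.001, 0.05   # illustrative: tiny gain, moderate noise
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p(effect, sd, n))
```

The same negligible gain is nowhere near significant at $n=100$, crosses $p<0.05$ around $n=10{,}000$, and is overwhelmingly "significant" at a million samples, while its practical value never changes.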