Question 1

What is a perceptron, and what fundamental class of functions can a single-layer perceptron NOT represent?

Accepted Answer

A perceptron computes y=\text{step}(w\cdot x + b): a linear threshold unit producing a binary output from a weighted sum of inputs. It can only separate linearly separable data: classes divisible by a hyperplane. It cannot represent XOR (or any non-linearly-separable function), the famous Minsky-Papert result. The fix is stacking layers with nonlinear activations (an MLP), which can carve nonconvex decision regions. A single linear layer with no nonlinearity, no matter how wide, stays a linear classifier.
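As a rough illustration of the XOR limitation, the sketch below brute-forces a small grid of weights for a two-input threshold unit. The grid range and the helper name `find_perceptron` are assumptions made for this example; a grid search is a demonstration, not a proof, although the impossibility for XOR does hold for all real weights.

```python
from itertools import product

def step(s):
    """Heaviside step: 1 if the weighted sum is non-negative, else 0."""
    return 1 if s >= 0 else 0

def find_perceptron(truth_table, grid):
    """Search (w1, w2, b) on a grid for a unit that reproduces truth_table.

    Returns the first matching triple, or None if no grid point works.
    """
    for w1, w2, b in product(grid, repeat=3):
        if all(step(w1 * x1 + w2 * x2 + b) == y
               for (x1, x2), y in truth_table.items()):
            return (w1, w2, b)
    return None

GRID = [k / 2 for k in range(-8, 9)]  # -4.0, -3.5, ..., 4.0 (arbitrary range)
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

and_solution = find_perceptron(AND, GRID)  # some separating triple exists
xor_solution = find_perceptron(XOR, GRID)  # no single unit fits XOR
```

The search finds weights for AND but comes back empty for XOR, matching the linear-separability argument.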

Question 2

Why must an MLP use nonlinear activation functions between layers? What collapses if you don't?

Accepted Answer

Without nonlinearity, composing linear layers W_2(W_1x+b_1)+b_2 simplifies to a single affine map W'x+b' with W'=W_2W_1 and b'=W_2b_1+b_2. The network collapses to one affine transformation regardless of depth, so as a classifier it can only draw linear decision boundaries. Nonlinear activations (ReLU, GELU, tanh) let each layer warp the representation space, giving the network, with enough hidden units, universal-approximation capacity for continuous functions on compact domains. The nonlinearity is what makes depth meaningful; it breaks the linear-composition collapse and creates the piecewise/curved decision boundaries that distinguish deep nets from linear models.

Question 3

Contrast ReLU, sigmoid, and tanh as hidden activations. Why did ReLU largely replace the sigmoidal pair in deep nets?

Accepted Answer

Sigmoid \sigma(x)\in(0,1) and tanh \in(-1,1) are smooth but saturate: their gradients vanish toward 0 for large |x|, so stacked layers suffer vanishing gradients and slow training; sigmoid is also non-zero-centered. ReLU \max(0,x) has gradient 1 on the positive side (no saturation there), is cheap, and induces sparsity, enabling much deeper trainable nets. ReLU's downsides: zero gradient for x<0 (the 'dying ReLU' problem) and non-zero-centered output; LeakyReLU/GELU address the dead-unit issue. tanh persists in RNNs (for example the LSTM candidate and output activations, while the gates themselves use sigmoid), where bounded zero-centered output helps.

Question 4

Why is softmax the standard output activation for multiclass classification, and what loss pairs with it?

Accepted Answer

Softmax maps logits to a normalized probability simplex: p_i=e^{z_i}/\sum_j e^{z_j}, positive and summing to 1, so outputs are interpretable as class probabilities. It pairs with cross-entropy loss -\sum_i y_i\log p_i. The combination is special: the gradient of softmax-cross-entropy w.r.t. logits simplifies to p-y, which is clean, well-scaled, and has no vanishing factor. For numerical stability subtract \max(z) before exponentiating. For multi-label problems, where classes aren't mutually exclusive, use sigmoid + binary cross-entropy per output instead; binary classification is the single-output special case.

Question 5

Derive the backpropagation update for a single weight $w$ in a layer, stating the chain-rule structure and what gets cached on the forward pass.

Accepted Answer

Backprop applies the chain rule: \frac{\partial L}{\partial w_{ij}}=\frac{\partial L}{\partial z_j}\cdot\frac{\partial z_j}{\partial w_{ij}}=\delta_j a_i, where z_j=\sum_i w_{ij}a_i+b_j, a_i is the input activation, and \delta_j=\frac{\partial L}{\partial z_j} is the error signal. \delta propagates backward: \delta_j=\big(\sum_k w_{jk}\delta_k\big)\,g'(z_j) for hidden units, where g' is the activation derivative and k ranges over units in the next layer. The forward pass caches activations a_i and pre-activations z_j for the backward pass. The full layer gradient is the outer product a\delta^\top in this index convention (equivalently \delta a^\top when W is stored as outputs by inputs); cost is the same order as the forward pass.

Question 6

Explain SGD with momentum. What problem does momentum solve and how does the velocity term work?

Accepted Answer

Plain SGD updates \theta\leftarrow\theta-\eta\nabla L; it zig-zags in ravines (high curvature in one direction, low in another) and crawls through flat/noisy regions. Momentum maintains a velocity v\leftarrow\beta v+\nabla L, then \theta\leftarrow\theta-\eta v, with \beta\approx0.9. It accumulates consistent gradient directions while canceling oscillating components, dampening zig-zag and accelerating along persistent slopes, like a heavy ball with inertia. Nesterov momentum evaluates the gradient at the look-ahead point \theta-\eta\beta v, giving a correction that often converges faster and more stably.

Question 7

Walk through the Adam optimizer's update equations and explain the role of bias correction.

Accepted Answer

Adam keeps EMAs of the gradient (first moment m_t=\beta_1 m_{t-1}+(1-\beta_1)g_t) and squared gradient (second moment v_t=\beta_2 v_{t-1}+(1-\beta_2)g_t^2). Because m,v initialize at zero they're biased toward zero early on, so bias-correct: \hat m_t=m_t/(1-\beta_1^t), \hat v_t=v_t/(1-\beta_2^t). Update: \theta\leftarrow\theta-\eta\,\hat m_t/(\sqrt{\hat v_t}+\epsilon). This gives per-parameter adaptive step sizes. Defaults: \beta_1{=}0.9,\beta_2{=}0.999,\epsilon{=}10^{-8}. Without bias correction the two moments are shrunk by different amounts: at t=1 the ratio m_1/\sqrt{v_1} is (1-\beta_1)/\sqrt{1-\beta_2}\approx3.2 instead of 1, so with the defaults the first steps would be several times too large.
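To make the effect of bias correction concrete, here is a small sketch that computes the size of the very first Adam step direction for a scalar gradient, with and without the correction. The function name `first_step_ratio` is hypothetical; the hyperparameters are the defaults quoted above.

```python
import math

def first_step_ratio(g, beta1=0.9, beta2=0.999, eps=1e-8, correct=True):
    """Return m / (sqrt(v) + eps) after one Adam step from zero moments."""
    t = 1
    m = (1 - beta1) * g        # m_1 = beta1 * 0 + (1 - beta1) * g
    v = (1 - beta2) * g * g    # v_1 = beta2 * 0 + (1 - beta2) * g^2
    if correct:
        m /= 1 - beta1 ** t
        v /= 1 - beta2 ** t
    return m / (math.sqrt(v) + eps)

corrected = first_step_ratio(1.0, correct=True)     # close to 1
uncorrected = first_step_ratio(1.0, correct=False)  # about 0.1 / sqrt(0.001)
```

With correction the first update has magnitude close to the learning rate; without it the ratio is roughly 0.1 / 0.0316, a little above 3.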

Question 8

What is the actual difference between Adam and AdamW, and why does it matter?

Accepted Answer

Adam with L2 'weight decay' folds the penalty into the gradient (g+\lambda\theta), which then passes through the adaptive denominator \sqrt{\hat v}, so parameters with large gradient variance get *less* effective decay, coupling regularization to gradient scale. AdamW decouples weight decay from the gradient: \theta\leftarrow\theta-\eta(\hat m/(\sqrt{\hat v}+\epsilon)+\lambda\theta), shrinking weights directly and uniformly. This makes decay behave as intended, improves generalization, and decouples optimal learning rate from optimal decay. AdamW is the de-facto standard for training transformers and large models.

Question 9

Compare He (Kaiming) and Xavier (Glorot) initialization. When do you use each and what variance do they target?

Accepted Answer

Both keep activation/gradient variance roughly constant across layers to avoid vanishing/exploding signals. Xavier/Glorot targets \text{Var}(W)=\frac{2}{n_{in}+n_{out}} (or \frac{1}{n_{in}}), derived assuming a linear/tanh-like, zero-centered, symmetric activation. He init targets \text{Var}(W)=\frac{2}{n_{in}}, doubling the variance to account for ReLU zeroing half the inputs (halving variance). Rule of thumb: Xavier for tanh/sigmoid, He for ReLU/variants. Wrong init in deep nets causes signals to shrink or blow up exponentially with depth, stalling or destabilizing training before normalization/residuals can help.
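The factor of 2 can be checked empirically. The sketch below pushes ReLU activations of unit-variance Gaussian pre-activations through one random linear layer and estimates the variance of the next layer's pre-activations. The layer sizes, the seed, and the helper name are assumptions chosen for illustration, and the estimate is a noisy Monte Carlo figure rather than an exact value.

```python
import math
import random

def relu_layer_mean_square(n_in, n_out, weight_var, rng):
    """Feed relu(z), z ~ N(0, 1), through one random linear layer.

    Returns the mean squared pre-activation of the next layer, which
    estimates its variance because the weights are zero-mean.
    """
    a = [max(0.0, rng.gauss(0.0, 1.0)) for _ in range(n_in)]
    std = math.sqrt(weight_var)
    total = 0.0
    for _ in range(n_out):
        z = sum(rng.gauss(0.0, std) * a_i for a_i in a)
        total += z * z
    return total / n_out

N_IN, N_OUT = 1024, 512  # illustrative sizes
# Same seed for both runs, so only the weight scale differs.
he = relu_layer_mean_square(N_IN, N_OUT, 2.0 / N_IN, random.Random(0))
xavier = relu_layer_mean_square(N_IN, N_OUT, 1.0 / N_IN, random.Random(0))
```

Under these assumptions `he` should land near 1 (variance preserved through the ReLU layer), while the 1/n_in variant gives about half of that, which compounds to exponential shrinkage over many layers.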

Question 10

BatchNorm behaves differently during training and inference. Explain precisely what statistics are used in each phase and why the switch is necessary.

Accepted Answer

During training, BN normalizes each activation using the mean and variance of the current mini-batch, then applies learnable scale \gamma and shift \beta. It simultaneously maintains an exponential moving average of batch mean/variance (the running stats). At inference, the batch is gone (you may even predict single examples), so BN uses these fixed running stats instead of per-batch statistics. Without the switch, inference output would depend on whichever other examples happened to share the batch, making predictions non-deterministic and batch-composition-dependent. In frameworks you toggle this with model.eval() / training=False; forgetting it is a classic silent bug.
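A toy version of the train/inference switch, for a single scalar feature, might look like the following. The class name and its `training` flag are made up for this sketch (the flag plays the role of `model.eval()`), and it uses the biased batch variance for the running estimate, a simplification relative to some frameworks.

```python
class BatchNorm1Feature:
    """Toy BatchNorm over one scalar feature (not a framework API)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0   # learnable scale and shift
        self.running_mean, self.running_var = 0.0, 1.0
        self.momentum, self.eps = momentum, eps
        self.training = True

    def __call__(self, batch):
        if self.training:
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # Update the moving averages that inference will use later.
            self.running_mean += self.momentum * (mean - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            mean, var = self.running_mean, self.running_var
        scale = self.gamma / (var + self.eps) ** 0.5
        return [scale * (x - mean) + self.beta for x in batch]

bn = BatchNorm1Feature()
for _ in range(200):          # repeated batch, so running stats converge
    bn([1.0, 2.0, 3.0, 4.0])

bn.training = False           # analogue of model.eval()
alone = bn([2.0])[0]
with_others = bn([2.0, 100.0, -50.0])[0]
```

In inference mode the output for the input 2.0 is the same whether it is predicted alone or alongside unrelated examples; in training mode it would shift with the batch composition.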

Question 11

Define what BatchNorm normalizes over versus LayerNorm versus GroupNorm versus InstanceNorm, given a tensor of shape (N, C, H, W). Be explicit about which axes are reduced.

Accepted Answer

For (N,C,H,W): BatchNorm computes one mean/var per channel, reducing over (N,H,W), so statistics depend on the batch. LayerNorm (vision form) reduces over (C,H,W) per example, i.e. per sample across all features, batch-independent. InstanceNorm reduces over (H,W) for each (N,C) pair: per-channel, per-example (used in style transfer). GroupNorm splits the C channels into G groups and reduces over (C/G, H, W) per example, interpolating between LayerNorm (G{=}1) and InstanceNorm (G{=}C). Each can then apply a learnable affine \gamma,\beta: per channel for BN, GN, and IN, and typically elementwise over the normalized shape for LayerNorm. Only BN couples examples within a batch, which is the root of its train/inference behavior split.
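One way to keep the four reductions straight is to count how many (mean, variance) pairs each norm computes for a single tensor. The lookup table and function below are a bookkeeping sketch with made-up names, not any library's API.

```python
# Axes of an (N, C, H, W) tensor that each norm averages over.
# For GroupNorm, C is viewed as (G, C // G) and the within-group
# channel axis is reduced together with H and W.
REDUCED_AXES = {
    "batch": ("N", "H", "W"),
    "layer": ("C", "H", "W"),
    "instance": ("H", "W"),
    "group": ("C/G", "H", "W"),
}

def num_statistics(norm, N, C, num_groups=1):
    """Number of (mean, variance) pairs computed for one tensor."""
    counts = {
        "batch": C,               # one pair per channel
        "layer": N,               # one pair per example
        "instance": N * C,        # one pair per example and channel
        "group": N * num_groups,  # one pair per example and group
    }
    return counts[norm]
```

For instance, with N=8 and C=32, GroupNorm with one group computes as many statistics as LayerNorm, and GroupNorm with 32 groups computes as many as InstanceNorm.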

Question 12

Why is BatchNorm typically placed before the nonlinearity and how does it interact with the bias term of the preceding linear/conv layer?

Accepted Answer

Placing BN before the activation (the original proposal) normalizes the pre-activation so the nonlinearity operates in a well-conditioned regime — keeping inputs near the active part of, say, a sigmoid/tanh and stabilizing the variance feeding into ReLU. Because BN subtracts the batch mean, any bias added by the preceding linear/conv layer is immediately removed by the centering step, so that bias is redundant — you set bias=False on the layer feeding BN, and BN's own learnable \beta serves as the effective bias. (Post-activation BN also works and some find it competitive, but pre-activation is the canonical placement and the bias-elimination argument is the practical takeaway.)

Question 13

Why does batch normalization help training, and what subtle problems does it have at inference and small batch sizes?

Accepted Answer

BatchNorm normalizes each feature over the minibatch to zero-mean/unit-variance, then rescales with learned \gamma,\beta. It smooths the loss landscape (the modern explanation, vs the older 'internal covariate shift' story), permits higher learning rates, and adds mild regularizing noise. Problems: it couples examples in a batch (one example's output depends on others), so it behaves differently at inference — you must use running-average statistics, creating a train/test discrepancy. With small or size-1 batches the batch statistics are noisy or undefined, degrading it badly; that's where LayerNorm/GroupNorm are preferred (e.g., transformers, RNNs).

Question 14

Contrast LayerNorm and RMSNorm. Why has RMSNorm become popular in large transformers?

Accepted Answer

LayerNorm normalizes over the feature dimension of a single token: subtract the mean, divide by std, then scale+shift with \gamma,\beta — per-example, so no batch coupling (ideal for sequences/transformers). RMSNorm drops the mean-centering and bias: it divides by the root-mean-square \sqrt{\frac1d\sum x_i^2} and scales by \gamma only. RMSNorm is cheaper (no mean, no subtraction, fewer params) and empirically matches LayerNorm quality, so it's the default in many recent LLMs (e.g., the LLaMA family). The implicit claim is that re-centering contributes little; only re-scaling matters for stable training.

Question 15

At inference, dropout is disabled. Explain the inverted-dropout scaling trick and why it preserves expected activations.

Accepted Answer

During training dropout zeros each unit with probability p (keep prob 1-p), so the expected sum of activations shrinks by factor (1-p). Inverted dropout divides surviving activations by (1-p) *at training time*, restoring the expected magnitude so it matches the no-dropout case. Then at test time you do nothing — use the full network as-is, with no scaling. The alternative (scale at test by (1-p)) is equivalent in expectation, but inverted dropout is preferred because the deployed inference path stays a clean, unmodified forward pass. Dropout acts as an implicit ensemble over subnetworks.
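A minimal sketch of inverted dropout on a list of activations follows. The function name, the seed, and the use of Python's `random` module are illustrative choices; a real implementation would operate on tensors.

```python
import random

def inverted_dropout(activations, p, rng, training=True):
    """Drop each unit with probability p; scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(activations)   # inference: pass-through, no scaling
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
x = [1.0] * 100_000
train_mean = sum(inverted_dropout(x, p=0.5, rng=rng)) / len(x)
eval_out = inverted_dropout(x, p=0.5, rng=rng, training=False)
```

Averaged over many units, the training-time output keeps a mean close to the input mean of 1.0, and the inference path returns the input untouched.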

Question 16

Explain the vanishing-gradient problem in deep nets mathematically, and list three architectural mitigations.

Accepted Answer

Backprop multiplies Jacobians layer by layer: \frac{\partial L}{\partial h_1}=\frac{\partial L}{\partial h_L}\prod_{l=1}^{L-1}\frac{\partial h_{l+1}}{\partial h_l}. If each factor has spectral norm <1 (e.g., saturating sigmoids, derivative \le0.25), the product shrinks exponentially with depth, so early layers get near-zero gradients and barely learn; norms >1 explode instead. Mitigations: (1) non-saturating activations (ReLU/GELU) keeping derivatives near 1; (2) residual/skip connections adding an identity path so the Jacobian is I+\dots, preserving gradient flow; (3) normalization (Batch/Layer) keeping pre-activations well-conditioned. Also careful init (He/Xavier) and gradient clipping for the exploding case.
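The exponential effect is easy to see in a scalar caricature where every layer contributes the same factor. The sketch ignores the weight matrices and only tracks the activation-derivative factor, which is an assumption made to keep the arithmetic transparent.

```python
def gradient_scale(per_layer_factor, depth):
    """Magnitude of a product of `depth` identical scalar factors."""
    return abs(per_layer_factor) ** depth

sigmoid_bound = gradient_scale(0.25, 20)  # sigmoid derivative is at most 0.25
relu_active = gradient_scale(1.0, 20)     # ReLU derivative on the active side
mild_growth = gradient_scale(1.5, 20)     # factors above 1 blow up instead
```

Twenty sigmoid layers bound the gradient scale by 0.25^20, on the order of 1e-12, while a factor of 1.5 per layer grows past 1000 over the same depth.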

Question 17

Why do residual connections enable training very deep networks? Address both optimization and the identity-mapping argument.

Accepted Answer

A residual block computes y=x+F(x). The Jacobian is I+\partial F/\partial x, so gradients flow back through the identity term even when \partial F is small — preventing vanishing gradients and giving the loss surface fewer pathological flat regions (empirically much smoother). The identity-mapping argument: it's easier to learn a residual F(x)\approx0 (block defaults to identity) than to learn the full mapping from scratch, so adding depth never *degrades* a network that's already good — extra blocks can no-op. This is why 100+ and 1000+ layer ResNets train, unlike plain deep nets that degrade.

Question 18

Implement vanilla SGD-with-momentum and the AdamW update in NumPy-style pseudocode, highlighting the key per-step state.

Accepted Answer

Momentum (state: one buffer v per parameter):

    v = beta*v + grad
    theta -= lr*v

AdamW (state: two buffers m, v per parameter, plus the step counter t, starting at 1, for bias correction):

    m = b1*m + (1-b1)*grad
    v = b2*v + (1-b2)*grad**2
    mhat = m/(1-b1**t)
    vhat = v/(1-b2**t)
    theta -= lr*(mhat/(np.sqrt(vhat)+eps) + wd*theta)

Crucially, in AdamW weight decay (wd*theta) is added *outside* the adaptive term, not folded into grad. Initialize all buffers to zero. Typical: lr~3e-4, b1=0.9, b2=0.999, wd~0.01.
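A runnable version of the two updates, written as pure functions, is sketched here. It uses `** 0.5` instead of `np.sqrt` so the same code works on Python floats and on NumPy arrays; the function names and the toy objective f(theta) = theta^2 are assumptions for the example.

```python
def momentum_step(theta, v, grad, lr=0.01, beta=0.9):
    """One SGD-with-momentum step; returns the new (theta, v)."""
    v = beta * v + grad
    return theta - lr * v, v

def adamw_step(theta, m, v, t, grad, lr=3e-4, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW step at 1-based step t; returns the new (theta, m, v)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    mhat = m / (1 - b1 ** t)
    vhat = v / (1 - b2 ** t)
    # Weight decay sits outside the adaptive ratio (decoupled).
    theta = theta - lr * (mhat / (vhat ** 0.5 + eps) + wd * theta)
    return theta, m, v

# Minimize f(theta) = theta^2 (gradient 2 * theta) from theta = 1.0.
theta_m, vel = 1.0, 0.0
theta_a, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    theta_m, vel = momentum_step(theta_m, vel, 2 * theta_m)
    theta_a, m, v = adamw_step(theta_a, m, v, t, 2 * theta_a, lr=0.01)
```

Returning the updated state explicitly makes the per-step bookkeeping visible: one buffer for momentum, two buffers plus the step index for AdamW.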

Question 19

Compare L1 vs L2 regularization geometrically and in their gradient effect. Why does L1 induce sparsity?

Accepted Answer

L2 (\lambda\|w\|_2^2) has gradient 2\lambda w, shrinking weights proportionally, so they get small but rarely exactly zero (the penalty's contour is a smooth ball). L1 (\lambda\|w\|_1) has constant-magnitude (sub)gradient \lambda\,\text{sign}(w), pushing weights toward zero by a fixed amount each step regardless of size, so small weights hit and stay at exactly zero, yielding sparse, feature-selecting solutions. Geometrically, the L1 constraint region is a diamond whose corners lie on axes, so the loss contour typically first touches it at a corner (a zero coordinate). L2 = weight decay, smooth, good for conditioning; L1 = sparsity. ElasticNet combines both.
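The difference in shrinkage shows up when only the penalty acts on a weight. The sketch below omits the data term entirely and, for L1, uses the proximal (soft-threshold) form of the step rather than a raw subgradient step; that substitution is a choice made here so that the weight lands exactly on zero instead of oscillating around it.

```python
def l2_step(w, lr, lam):
    """Gradient step on lam * w^2: multiplicative shrinkage."""
    return w - lr * 2 * lam * w

def l1_prox_step(w, lr, lam):
    """Soft-threshold step for lam * |w|: move lr*lam toward zero."""
    shrink = lr * lam
    if abs(w) <= shrink:
        return 0.0            # small weights are clipped to exactly zero
    return w - shrink if w > 0 else w + shrink

w_l2 = w_l1 = 0.05
for _ in range(100):
    w_l2 = l2_step(w_l2, lr=0.1, lam=0.1)
    w_l1 = l1_prox_step(w_l1, lr=0.1, lam=0.1)
```

After 100 steps the L2 weight has decayed geometrically but is still nonzero, while the L1 weight reached exactly zero within the first few steps and stayed there.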

Question 20

Why do Transformers use LayerNorm instead of BatchNorm, even though BN works well in CNNs?

Accepted Answer

BN normalizes each feature across the batch dimension, so it needs a stable, reasonably large batch and an i.i.d. notion of "the same feature across examples." In sequence models, variable sequence lengths, padding, small effective batches, and autoregressive single-token decoding make batch statistics unstable and inference awkward (you'd need correct running stats per position). LayerNorm instead normalizes across the feature dimension within a single token, independent of batch size and sequence length, giving identical behavior at train and inference and no running-stat bookkeeping. It also handles the highly variable activation scales across token positions better, which is why it became the Transformer default.

Question 21

RMSNorm is now standard in modern LLMs (LLaMA, etc.) instead of LayerNorm. Write its formula, contrast it with LayerNorm, and explain why the simplification is justified.

Accepted Answer

LayerNorm computes y=\gamma\odot\frac{x-\mu}{\sqrt{\sigma^2+\epsilon}}+\beta over the feature dim. RMSNorm drops the mean-centering and the bias: y=\gamma\odot\frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2+\epsilon}}, i.e. it rescales by the root-mean-square only. The empirical claim is that LN's benefit comes mainly from the re-scaling (variance normalization), not the re-centering, so removing the mean subtraction costs little accuracy while cutting one reduction pass and the \beta parameter. At LLM scale that yields a measurable throughput/memory win per layer with comparable or better stability, which is why it became the default in many recent architectures.
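The two formulas can be compared directly on a single feature vector. This is a plain-Python sketch with illustrative function names and an arbitrary input; real implementations operate on batched tensors.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over one feature vector: center, scale, then affine."""
    d = len(x)
    mu = sum(x) / d
    var = sum((xi - mu) ** 2 for xi in x) / d
    denom = math.sqrt(var + eps)
    return [g * (xi - mu) / denom + b for xi, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: divide by the root-mean-square, then scale. No centering."""
    d = len(x)
    rms = math.sqrt(sum(xi * xi for xi in x) / d + eps)
    return [g * xi / rms for xi, g in zip(x, gamma)]

x = [1.0, 2.0, 3.0, 6.0]
ones, zeros = [1.0] * 4, [0.0] * 4
ln_out = layer_norm(x, ones, zeros)   # zero mean, unit variance
rms_out = rms_norm(x, ones)           # unit mean-square, mean not removed
```

When the input already has zero mean the two outputs coincide (with \gamma=1, \beta=0), which is one way to read the claim that re-centering contributes little.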

Question 22

BatchNorm degrades sharply with very small batch sizes. Explain the mechanism and name two alternatives that fix it.

Accepted Answer

BN estimates per-feature mean and variance from the mini-batch. With small batches (e.g. 1-4, common in detection/segmentation with large images), these estimates are high-variance and noisy, so the normalization itself becomes a noisy, batch-composition-dependent operation, and the running-stat EMA accumulates from poor samples — hurting both training stability and the train/inference statistics mismatch. The variance estimate is especially unreliable. Fixes: GroupNorm, which normalizes over channel groups within a single example (batch-independent, the standard detection fix); and LayerNorm/InstanceNorm, also per-example. SyncBatchNorm (aggregating stats across GPUs) helps when the small batch is only per-device but the global batch is large.

Question 23

Why do warmup-then-decay learning-rate schedules (e.g., linear warmup + cosine decay) help large transformer training specifically?

Accepted Answer

Early in training, weights and Adam's second-moment estimates are poorly conditioned and gradients are noisy; a full learning rate then can cause divergence, especially with large batches and adaptive optimizers whose variance estimates are unreliable in the first steps. Linear warmup ramps the LR from ~0 over a few thousand steps, letting moment estimates stabilize and avoiding early large steps that blow up LayerNorm/residual scales. Cosine (or inverse-sqrt) decay then anneals the LR so the optimizer can settle into a minimum with less late-stage noise. The combination consistently improves stability and final loss for deep/large models.
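A linear-warmup-plus-cosine schedule can be written as a single function of the step index. The function name, the peak learning rate, and the step counts below are illustrative values, not recommendations from the text.

```python
import math

def lr_at(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

checkpoints = (0, 1999, 2000, 51_000, 100_000)
schedule = [
    lr_at(s, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000)
    for s in checkpoints
]
```

The learning rate starts near zero, reaches the peak at the end of warmup, passes through half the peak midway through the decay, and ends at `min_lr`.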

Question 24

A practitioner reports Adam converges faster than SGD on the training set but SGD+momentum generalizes better at test time. Why might this happen?

Accepted Answer

Adam's per-coordinate adaptive scaling dives quickly into the nearest minimum, but tends to find *sharper* minima and can over-adapt to gradient noise, often generalizing slightly worse — the documented adaptive-methods generalization gap. SGD's isotropic, noisier updates have an implicit-regularization bias toward *flatter*, wider minima, and its noise scale (tied to LR/batch) acts like a regularizer. Mitigations: AdamW (decoupled decay) closes much of the gap; switching Adam→SGD late in training; tuning weight decay and LR. It's not universal — for transformers AdamW usually wins outright — but the flat-minima/implicit-bias argument explains the classic CNN observation.

Question 25

Why does the softmax-cross-entropy gradient simplify to $p-y$, and why is this property numerically and optimization-wise valuable?

Accepted Answer

With logits z, softmax p_i=e^{z_i}/\sum e^{z_j}, and loss L=-\sum_k y_k\log p_k. Differentiating: \partial L/\partial z_i=\sum_k(-y_k/p_k)(\partial p_k/\partial z_i), and the softmax Jacobian \partial p_k/\partial z_i=p_k(\delta_{ki}-p_i) collapses the sum (using \sum_k y_k=1) to p_i-y_i. This is valuable because it's a clean, bounded residual: no activation-derivative factor that could vanish (unlike sigmoid+MSE, which carries a saturating \sigma' term), the gradient magnitude scales with prediction error, and it's cheap. Computing log-softmax directly (with the max-subtraction trick) keeps it numerically stable.
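The p-y result can be sanity-checked numerically by comparing it with central finite differences of the loss. The logits, target, and step size below are arbitrary, and the helper names are made up for this sketch.

```python
import math

def softmax(z):
    """Numerically stable softmax: subtract max(z) before exponentiating."""
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    """Loss -sum_k y_k log p_k, computed via a stable log-softmax."""
    m = max(z)
    log_norm = m + math.log(sum(math.exp(zi - m) for zi in z))
    return -sum(yk * (zk - log_norm) for zk, yk in zip(z, y))

def numeric_grad(z, y, h=1e-6):
    """Central finite differences of the loss w.r.t. each logit."""
    grads = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + h] + z[i + 1:]
        down = z[:i] + [z[i] - h] + z[i + 1:]
        grads.append((cross_entropy(up, y) - cross_entropy(down, y)) / (2 * h))
    return grads

z = [2.0, -1.0, 0.5]
y = [0.0, 1.0, 0.0]   # one-hot target
analytic = [p - t for p, t in zip(softmax(z), y)]
numeric = numeric_grad(z, y)
```

The two gradient vectors agree to within finite-difference error, and the max-subtraction keeps `softmax` finite even for very large logits.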

Question 26

You stack BatchNorm before a ReLU inside a residual block and training is unstable at high LR with large batches. Diagnose plausible causes and fixes a staff engineer would consider.

Accepted Answer

Several interacting issues: (1) BN's batch statistics are a moving target with very large batches/high LR, and BN inside residual branches can amplify variance across depth — fix with the zero-init last-BN-gamma trick so each block starts as identity, stabilizing early training. (2) High LR + large batch needs LR warmup and possibly linear LR scaling with batch size; without warmup the first steps destabilize BN running stats. (3) BN train/inference statistic mismatch hurts if batches are non-i.i.d. (4) Consider gradient clipping, switching to GroupNorm/LayerNorm (batch-independent), or LARS/LAMB optimizers designed for large-batch regimes. (5) Check BN placement (pre-activation ordering) and that residual scale isn't compounding.

Question 27

GELU has largely replaced ReLU in transformers. What does GELU compute, and why might its smoothness matter for these models?

Accepted Answer

GELU (Gaussian Error Linear Unit) is x\cdot\Phi(x), where \Phi is the standard-normal CDF: a smooth gate that weights the input by the probability that a standard-normal sample falls below it; often approximated as 0.5x(1+\tanh[\sqrt{2/\pi}(x+0.044715x^3)]). Unlike ReLU's hard zero cutoff, GELU is differentiable everywhere with a small negative-region response, so it passes a little gradient for slightly-negative inputs (no hard 'dead' units) and gives a smoother loss landscape. The smoothness/stochastic-gating interpretation empirically improves optimization for transformer-scale models; SiLU/Swish (x\sigma(x)) is a closely related smooth alternative.
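Both forms of GELU are short enough to write out with the standard library, which also lets one check how close the tanh approximation is. The evaluation grid of [-5, 5] is an arbitrary choice for this sketch.

```python
import math

def gelu_exact(x):
    """x * Phi(x), with the standard-normal CDF written via erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """The tanh approximation with the 0.044715 cubic coefficient."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

xs = [k / 10 for k in range(-50, 51)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
negative_response = gelu_exact(-1.0)   # small but nonzero, unlike ReLU
```

On this grid the two curves stay very close, and the value at x = -1 is a small negative number rather than ReLU's hard zero.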

Question 28

BatchNorm was originally motivated by reducing 'internal covariate shift.' Why is that explanation now considered a myth, and what is the better-supported account of why BN helps?

Accepted Answer

The ICS story claimed BN works by stabilizing the distribution of each layer's inputs as earlier layers update. Santurkar et al. (2018) showed empirically you can inject noise after BN to deliberately re-introduce distribution shift and still get BN's benefits, and that BN doesn't even reliably reduce ICS. The better-supported account: BN smooths the optimization landscape — it improves the Lipschitzness of the loss and gradients (smaller, more predictive gradient steps), which permits higher learning rates and faster, more stable convergence. The reparameterization decoupling activation scale from weight scale is the real lever, not distributional stabilization.

Question 29

In a residual Transformer block, contrast Pre-LN vs Post-LN placement. Why did large models largely move to Pre-LN (or variants), and what's the tradeoff?

Accepted Answer

Post-LN (original Transformer) applies LayerNorm after the residual add: x_{l+1}=\text{LN}(x_l+\text{Sublayer}(x_l)). Pre-LN normalizes inside the branch: x_{l+1}=x_l+\text{Sublayer}(\text{LN}(x_l)). Post-LN puts LN on the residual path, so gradients are repeatedly rescaled and can explode/vanish with depth, forcing careful learning-rate warmup and making deep stacks unstable. Pre-LN keeps a clean identity residual path, so gradients flow straight through and very deep models train stably with less warmup, hence its adoption at scale. The tradeoff: Pre-LN can slightly underperform a well-tuned Post-LN and tends to grow residual-stream magnitude with depth, motivating fixes like final-LN, DeepNorm, or sandwich/QK-norm.

Question 30

You fine-tune a pretrained CNN, freezing most layers, but validation accuracy is far worse than training accuracy in a way regularization doesn't explain. How could BatchNorm be the culprit, and how do you fix it?

Accepted Answer

Two classic BN pitfalls. (1) You left the model in train() mode at eval, so BN used batch stats from the (possibly tiny or differently-distributed) validation batches instead of running stats — call model.eval(). (2) When freezing for transfer learning, you froze \gamma,\beta but left BN in training mode so its running stats kept updating toward the new small dataset's distribution, or conversely the frozen running stats from pretraining don't match your data. Fix: explicitly set BN layers to eval mode (freeze running stats) when freezing the backbone, or recalibrate running stats on the new data. The train/inference stat mismatch — not overfitting — drives the gap.

Question 31

Derive the backward pass for BatchNorm: given upstream gradient $\partial L/\partial \hat{x}_i$ on the normalized activations, express the gradient with respect to the inputs $x_i$. Why is this more than just dividing by the std?

Accepted Answer

With \hat{x}_i=(x_i-\mu)/\sqrt{\sigma^2+\epsilon}, \mu and \sigma^2 both depend on every x_i, so the chain rule has three paths. Letting g_i=\partial L/\partial\hat{x}_i and \sigma_\epsilon=\sqrt{\sigma^2+\epsilon}: \frac{\partial L}{\partial x_i}=\frac{1}{m\,\sigma_\epsilon}\Big(m\,g_i-\sum_j g_j-\hat{x}_i\sum_j g_j\hat{x}_j\Big). The two subtracted terms re-center and re-scale the incoming gradient because the batch mean and variance are themselves functions of x_i. This is exactly why BN couples examples in a batch at train time — the gradient for one example depends on all others — and why naive single-example normalization at inference would not match training dynamics.
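The closed-form gradient can be verified against finite differences on a small batch of scalars. The test loss L = \sum_i c_i \hat{x}_i, the input values, and the function names are hypothetical choices for this check; with that loss the upstream gradient g_i is simply c_i.

```python
def bn_forward(x, eps=1e-5):
    """Normalize a batch of scalars; returns (xhat, sigma_eps)."""
    m = len(x)
    mu = sum(x) / m
    var = sum((xi - mu) ** 2 for xi in x) / m
    sigma_eps = (var + eps) ** 0.5
    return [(xi - mu) / sigma_eps for xi in x], sigma_eps

def bn_backward(g, xhat, sigma_eps):
    """(m*g_i - sum_j g_j - xhat_i * sum_j g_j*xhat_j) / (m*sigma_eps)."""
    m = len(g)
    sum_g = sum(g)
    sum_gx = sum(gj * xj for gj, xj in zip(g, xhat))
    return [(m * gi - sum_g - xi * sum_gx) / (m * sigma_eps)
            for gi, xi in zip(g, xhat)]

def loss(x, c):
    """Hypothetical test loss L = sum_i c_i * xhat_i."""
    xhat, _ = bn_forward(x)
    return sum(ci * xi for ci, xi in zip(c, xhat))

x = [0.5, -1.2, 3.0, 0.7]
c = [1.0, -2.0, 0.5, 3.0]
xhat, sigma_eps = bn_forward(x)
analytic = bn_backward(c, xhat, sigma_eps)

h = 1e-6
numeric = []
for i in range(len(x)):
    up = x[:i] + [x[i] + h] + x[i + 1:]
    down = x[:i] + [x[i] - h] + x[i + 1:]
    numeric.append((loss(up, c) - loss(down, c)) / (2 * h))
```

The analytic and numeric gradients match, and the analytic gradients sum to zero across the batch, reflecting that shifting every input by the same constant leaves the normalized outputs unchanged.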

Deep Learning Fundamentals & Training