Question 1

What is a convolution in a CNN, and why is it preferred over a fully-connected layer for images?

Accepted Answer

A convolution slides a small learnable kernel across the spatial input, computing a dot product at each position to produce a feature map. It exploits two priors: locality (pixels near each other are correlated, so small kernels suffice) and translation equivariance via parameter sharing (the same weights detect a feature anywhere). Versus a fully-connected layer it has far fewer parameters (kernel size, not image size), generalizes better, and a shift in the input shifts the output rather than scrambling it.

Question 2

Define stride and padding, and give the formula for the output spatial size of a conv layer.

Accepted Answer

Stride s is the step the kernel moves between positions; larger stride downsamples. Padding p adds border pixels (often zeros) so edge pixels are covered and spatial size can be preserved. For input size W, kernel k, padding p, stride s, the output is \lfloor (W - k + 2p)/s \rfloor + 1. 'Same' padding picks p to keep output equal to input (at stride 1, p=(k-1)/2 for odd k); 'valid' padding uses p=0 and shrinks the map.
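
As a quick check of the formula, here is a minimal Python sketch. The helper name `conv_output_size` and the example sizes are illustrative assumptions, not part of any library.

```python
def conv_output_size(w: int, k: int, p: int = 0, s: int = 1) -> int:
    """Output spatial size: floor((W - k + 2p) / s) + 1."""
    if s < 1:
        raise ValueError("stride must be >= 1")
    return (w - k + 2 * p) // s + 1


# 'valid' padding (p=0) shrinks the map; 'same' padding at stride 1 keeps it.
valid = conv_output_size(32, k=3, p=0, s=1)     # 30
same = conv_output_size(32, k=3, p=1, s=1)      # 32
strided = conv_output_size(224, k=7, p=3, s=2)  # 112
```

The last call shows stride doing the downsampling: a 7x7 kernel with p=3 and s=2 halves a 224-pixel side.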

Question 3

What is pooling, and how do max pooling and average pooling differ in effect?

Accepted Answer

Pooling downsamples a feature map by aggregating over local windows, giving a degree of translation invariance and reducing spatial resolution and compute. Max pooling takes the maximum, preserving the strongest activation (sharp edges/textures) and acting as a feature detector; it backprops gradient only to the winning unit. Average pooling smooths by taking the mean, which can dilute strong responses but is robust to noise. Modern nets often replace pooling with strided convolutions; global average pooling is widely used before the classifier to remove fully-connected layers.
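
The difference between the two reductions is easy to see on a tiny map. The sketch below is a pure-Python illustration with non-overlapping windows; `pool2d`, `mean`, and the sample values are assumptions made for the example.

```python
def pool2d(x, size, reduce_fn):
    """Non-overlapping pooling: apply reduce_fn to each size x size window."""
    out = []
    for i in range(0, len(x) - size + 1, size):
        row = []
        for j in range(0, len(x[0]) - size + 1, size):
            window = [x[a][b] for a in range(i, i + size) for b in range(j, j + size)]
            row.append(reduce_fn(window))
        out.append(row)
    return out


def mean(values):
    return sum(values) / len(values)


# One strong activation (9) in an otherwise weak top-left window.
feature_map = [
    [9, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 0, 4],
    [2, 2, 4, 0],
]
max_pooled = pool2d(feature_map, 2, max)   # [[9, 1], [2, 4]]
avg_pooled = pool2d(feature_map, 2, mean)  # [[2.25, 1.0], [2.0, 2.0]]
```

Max pooling keeps the 9 intact, while average pooling dilutes it to 2.25.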

Question 4

Why does ResNet's residual connection enable training of very deep networks?

Accepted Answer

A residual block computes y = F(x) + x, so the layer only has to learn the residual F(x) relative to the identity. This makes the identity mapping trivial to represent (set F=0), avoiding the degradation problem where deeper plain nets get worse training error. The skip path gives gradients an unobstructed route backward — \partial y/\partial x = I + \partial F/\partial x — so the +I term keeps gradient magnitude from vanishing through many layers. This let ResNet train 100+ and 1000+ layer networks where plain stacks failed.
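
A scalar toy model makes the effect of the +I term concrete. This is only a sketch under strong assumptions: each layer is reduced to a single number dF/dx, and matrix structure and nonlinearities are ignored. The function name and the value 0.01 are made up for illustration.

```python
def chain_gradient(local_grads, residual):
    """Multiply per-layer derivatives; a residual block contributes 1 + dF/dx."""
    total = 1.0
    for g in local_grads:
        total *= (1.0 + g) if residual else g
    return total


# Assumed toy setting: 50 layers whose branch derivative dF/dx is 0.01 each.
local_grads = [0.01] * 50
plain = chain_gradient(local_grads, residual=False)     # 0.01**50, vanishes
with_skip = chain_gradient(local_grads, residual=True)  # 1.01**50, stays O(1)
```

In the plain stack the product collapses toward zero, whereas the identity term keeps each factor near 1.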

Question 5

Compute the receptive field of a single unit after three stacked 3x3 conv layers (stride 1, no dilation).

Accepted Answer

Receptive field grows additively for stride-1 stacks: r_L = r_{L-1} + (k_L - 1)\prod_{i<L} s_i, with stride product 1 here. Starting at r_0=1: after one 3x3 layer r=3, after two r=5, after three r=7. So each unit sees a 7\times7 region. This is why VGG stacks three 3x3 convs instead of one 7x7: same receptive field, fewer parameters (3\cdot 9 = 27 vs 49 per channel-pair), and more nonlinearities.
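
The recurrence can be sketched in a few lines of Python. The function name `receptive_field` and the strided example are illustrative assumptions; the stride-1 case reproduces the arithmetic above.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs, ordered input to output."""
    r, jump = 1, 1  # jump = product of strides of the layers so far
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r


three_3x3 = receptive_field([(3, 1)] * 3)  # 7
one_7x7 = receptive_field([(7, 1)])        # 7
# With a stride-2 first layer the later kernels cover more input pixels.
strided = receptive_field([(3, 2), (3, 1), (3, 1)])  # 1 + 2 + 4 + 4 = 11
```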

Question 6

How does batch normalization behave differently at training vs inference in a CNN, and over what axes is it computed?

Accepted Answer

In a CNN, BatchNorm normalizes per channel over the batch and spatial dimensions (N, H, W), so each channel gets one mean/variance — preserving the conv's spatial parameter sharing. At training it uses the current mini-batch statistics and updates running estimates via exponential moving average. At inference it uses those fixed running mean/variance (no batch dependence), so outputs are deterministic and independent of batch composition. A train/eval mode mismatch (forgetting to switch) is a classic bug causing poor or unstable inference, especially with small batches.
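
The two modes can be sketched with NumPy. This is a simplified toy, not any framework's implementation: it omits the learnable scale/shift, and real libraries differ in details such as whether the running variance uses the unbiased estimate. The class name, momentum value, and input shapes are assumptions.

```python
import numpy as np


class BatchNorm2d:
    """Toy BatchNorm for NCHW inputs (no learnable scale/shift)."""

    def __init__(self, channels, momentum=0.1, eps=1e-5):
        self.running_mean = np.zeros(channels)
        self.running_var = np.ones(channels)
        self.momentum, self.eps = momentum, eps
        self.training = True

    def __call__(self, x):
        if self.training:
            # One mean/variance per channel, reduced over N, H, W.
            mean = x.mean(axis=(0, 2, 3))
            var = x.var(axis=(0, 2, 3))
            # Exponential moving average of the batch statistics.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            mean, var = self.running_mean, self.running_var
        mean = mean.reshape(1, -1, 1, 1)
        var = var.reshape(1, -1, 1, 1)
        return (x - mean) / np.sqrt(var + self.eps)


rng = np.random.default_rng(0)
bn = BatchNorm2d(channels=3)
batch = rng.normal(loc=5.0, scale=2.0, size=(8, 3, 4, 4))
train_out = bn(batch)          # normalized with this batch's statistics

bn.training = False
eval_alone = bn(batch[:1])     # uses running statistics only
eval_in_batch = bn(batch)[:1]  # same result regardless of batch-mates
```

In eval mode the first sample gets the same output whether it is passed alone or inside the full batch, which is the batch independence described above.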

Question 7

In transfer learning with a pretrained CNN, when should you freeze the backbone vs fine-tune all layers, and why?

Accepted Answer

Freeze the backbone (train only a new head) when your dataset is small and similar to the pretraining domain — early/mid features are generic (edges, textures, parts) and tuning them on little data overfits. Fine-tune all layers when you have enough data and/or a domain shift, lowering the learning rate (often with discriminative/layer-wise LRs) so pretrained features aren't destroyed. A common middle ground: freeze early layers, fine-tune later (more task-specific) ones. Keep BatchNorm running stats in mind — freezing BN or using a small LR avoids corrupting statistics on a tiny target set.
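
One way to express the freeze/fine-tune spectrum is as a per-layer learning-rate schedule. The sketch below is hypothetical: `layerwise_lrs`, the geometric decay, and all numeric values are assumptions chosen for illustration, not a recommended recipe.

```python
def layerwise_lrs(num_layers, head_lr=1e-3, decay=0.5, frozen=0):
    """Per-layer learning rates, index 0 = earliest layer.

    Layers below `frozen` get lr 0.0 (frozen); the rest decay geometrically
    from head_lr at the last layer toward smaller values at earlier layers.
    """
    lrs = []
    for i in range(num_layers):
        if i < frozen:
            lrs.append(0.0)
        else:
            lrs.append(head_lr * decay ** (num_layers - 1 - i))
    return lrs


# Middle ground: freeze the two earliest of five stages, fine-tune the rest.
lrs = layerwise_lrs(5, head_lr=1e-3, decay=0.5, frozen=2)
# roughly [0.0, 0.0, 2.5e-4, 5e-4, 1e-3]
```

Setting `frozen` to the full backbone depth corresponds to head-only training, and `frozen=0` with a small `head_lr` corresponds to full fine-tuning.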

Question 8

Compute the parameter count of a conv layer with 64 input channels, 128 output channels, 3x3 kernels, and bias.

Accepted Answer

A conv layer's weights are (k_h \times k_w \times C_{in}) \times C_{out} plus one bias per output channel: (3\times3\times64)\times128 + 128 = 576\times128 + 128 = 73728 + 128 = 73856 parameters. Note this is independent of input spatial size; that's parameter sharing. By contrast a fully-connected layer over even a modest feature map would have orders of magnitude more parameters, which is the core efficiency argument for convolution.
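
The arithmetic can be checked with a short Python sketch. The helper names are illustrative, and the 32x32 feature-map size in the dense comparison is an assumed example, not something stated in the question.

```python
def conv_params(c_in, c_out, k_h, k_w, bias=True):
    """Weights are k_h * k_w * c_in per output channel, plus optional biases."""
    return k_h * k_w * c_in * c_out + (c_out if bias else 0)


def dense_params(n_in, n_out, bias=True):
    return n_in * n_out + (n_out if bias else 0)


conv = conv_params(64, 128, 3, 3)  # 73856, for any input height/width

# Assumed comparison: a dense layer mapping a 64x32x32 map to 128x32x32.
dense = dense_params(64 * 32 * 32, 128 * 32 * 32)
```

Under that assumption the dense layer needs billions of weights, while the conv layer's count does not change with the map size.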

Question 9

Contrast the two-stage R-CNN family with single-stage detectors like YOLO and SSD on the accuracy/speed tradeoff.

Accepted Answer

Two-stage detectors (Fast/Faster R-CNN) first propose regions (from an external method such as selective search in Fast R-CNN, or a learned RPN in Faster R-CNN), then classify and regress each proposal, giving high accuracy especially on small/overlapping objects but slower inference. Single-stage detectors (YOLO, SSD) predict class and box directly over a dense grid/anchor set in one pass: much faster and real-time capable, historically slightly less accurate on small objects. SSD uses multi-scale feature maps; YOLO frames detection as regression over a grid. RetinaNet narrowed the gap via focal loss to fix foreground/background class imbalance, the main weakness of dense single-stage prediction.
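
To make the focal-loss point concrete, here is a minimal sketch of the binary form FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t) for a single anchor. The defaults \gamma=2 and \alpha=0.25 follow the values commonly quoted for RetinaNet; the function names and example probabilities are assumptions.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction p = P(y=1); y is 0 or 1."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(p, y):
    return -math.log(p if y == 1 else 1.0 - p)


# An easy background anchor (p=0.01, y=0) is down-weighted far more than
# a hard foreground anchor (p=0.1, y=1).
easy_ratio = focal_loss(0.01, 0) / cross_entropy(0.01, 0)
hard_ratio = focal_loss(0.1, 1) / cross_entropy(0.1, 1)
```

Because the many easy background anchors contribute almost nothing, the rare hard examples dominate the gradient.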

Question 10

Explain the U-Net architecture and why skip connections are critical for segmentation quality.

Accepted Answer

U-Net is an encoder–decoder: the contracting path downsamples to capture semantic context, the expanding path upsamples (via transposed convs or interpolation) to recover resolution for per-pixel labels. Skip connections concatenate encoder feature maps into the matching decoder stage. They're critical because downsampling discards precise spatial/boundary information; the skips reinject high-resolution, low-level detail so the decoder can place sharp object boundaries instead of blurry blobs. They also ease gradient flow. This is why U-Net excels with limited data (e.g., biomedical) where accurate edges matter.
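
The skip connection itself is just a channel-wise concatenation at matching resolution. The NumPy sketch below shows only the tensor shapes involved; the channel counts, the nearest-neighbour upsampling, and the helper name are assumptions for illustration.

```python
import numpy as np


def upsample_nearest(x, factor=2):
    """Nearest-neighbour upsampling of a (C, H, W) array."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


encoder_feat = np.ones((64, 32, 32))  # high-resolution, low-level features
bottleneck = np.zeros((128, 16, 16))  # low-resolution, semantic features

upsampled = upsample_nearest(bottleneck)  # (128, 32, 32)
# Skip connection: stack encoder detail next to the upsampled context.
decoder_in = np.concatenate([encoder_feat, upsampled], axis=0)  # (192, 32, 32)
```

The following decoder convolution then sees both the fine encoder detail and the coarse semantic context at every pixel.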

Question 11

How does Mask R-CNN extend Faster R-CNN to instance segmentation, and what problem does RoIAlign solve?

Accepted Answer

Mask R-CNN adds a small FCN mask-prediction branch in parallel with the existing box classification/regression heads, outputting a per-class binary mask for each RoI — giving per-instance segmentation. The key fix is RoIAlign replacing RoIPool: RoIPool quantizes RoI coordinates to the feature grid, introducing misalignments that are harmless for boxes but ruin pixel-accurate masks. RoIAlign uses bilinear interpolation at exact (non-quantized) sampling points, preserving spatial correspondence. Masks are predicted per class and decoupled from classification, which empirically beats class-competitive masks.
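
The sampling step at the heart of RoIAlign is plain bilinear interpolation. The sketch below covers only that step: a full RoIAlign samples several points per output bin and pools them. The function name, the tiny 2x2 grid, and the use of truncation to mimic quantization are assumptions.

```python
import math


def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a 2D grid at fractional coordinates (y, x)."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, len(feat) - 1), min(x0 + 1, len(feat[0]) - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


feat = [
    [0.0, 10.0],
    [20.0, 30.0],
]
aligned = bilinear_sample(feat, 0.5, 0.5)  # 15.0, blends all four cells
quantized = feat[int(0.5)][int(0.5)]       # 0.0, snapped to a single cell
```

Snapping to a cell discards the half-pixel offset, which is exactly the misalignment that hurts mask accuracy.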

Question 12

What does EfficientNet's compound scaling do, and why is scaling depth, width, and resolution jointly better than scaling one alone?

Accepted Answer

Compound scaling uniformly grows network depth (d=\alpha^\phi), width (w=\beta^\phi), and input resolution (r=\gamma^\phi) with a single coefficient \phi, under the constraint \alpha\cdot\beta^2\cdot\gamma^2\approx2 so that total FLOPs grow by roughly 2^\phi (FLOPs scale linearly with depth but quadratically with width and resolution). Scaling one dimension alone saturates: more depth without resolution can't use fine detail; higher resolution without depth/width lacks capacity to process it. Balancing all three matches representational capacity to input information, giving better accuracy per FLOP. EfficientNet pairs this with a mobile-inverted-bottleneck (MBConv) base found by neural architecture search.
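
The constraint can be checked numerically. The sketch below uses \alpha=1.2, \beta=1.1, \gamma=1.15, the values reported for the EfficientNet-B0 grid search; the function name and the choice of \phi values are illustrative.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers and the FLOPs factor."""
    d, w, r = ALPHA ** phi, BETA ** phi, GAMMA ** phi
    # FLOPs scale linearly with depth and quadratically with width/resolution.
    flops = d * w ** 2 * r ** 2
    return d, w, r, flops


d, w, r, flops = compound_scale(1)   # flops is about 1.92, close to 2
_, _, _, flops3 = compound_scale(3)  # about 1.92**3, close to 2**3
```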

Question 13

A 1x1 convolution seems to do nothing spatially — what does it actually compute, and why is it used in Inception/ResNet bottlenecks?

Accepted Answer

A 1x1 conv is a learned linear projection across channels at each spatial location: it mixes/reweights channel features and changes channel count without touching spatial extent. In Inception and ResNet bottlenecks it reduces channels before an expensive 3x3 conv and restores them after, cutting FLOPs/parameters dramatically (a '3x3 sandwiched between two 1x1s'). It also adds a nonlinearity (with activation) and acts as cross-channel feature pooling. Network-in-Network introduced it; it's the cheapest way to control channel depth and increase representational mixing.
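
Both claims, the per-pixel projection and the parameter savings, can be sketched with NumPy. The 256 -> 64 -> 64 -> 256 channel layout is an assumed bottleneck configuration, and biases are ignored; `conv_weights` is a helper made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8, 8))  # (C_in, H, W) feature map
w = rng.normal(size=(64, 256))    # 1x1 kernel: (C_out, C_in)

# A 1x1 conv is the same matrix applied to the channel vector at every pixel.
y = np.einsum("oc,chw->ohw", w, x)  # (64, 8, 8)
pixel = w @ x[:, 3, 5]              # channel projection at one location


def conv_weights(c_in, c_out, k):
    return c_in * c_out * k * k


# Bottleneck (256 -> 64 -> 64 -> 256) vs a direct 3x3 at 256 channels.
bottleneck = conv_weights(256, 64, 1) + conv_weights(64, 64, 3) + conv_weights(64, 256, 1)
direct = conv_weights(256, 256, 3)
```

Here the bottleneck uses 69,632 weights against 589,824 for the direct 3x3, roughly an 8.5x reduction.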

Question 14

Implement, in NumPy-style pseudocode, a single-channel 2D valid convolution and state its time complexity.

Accepted Answer

    import numpy as np

    def conv2d(X, K):
        """Valid 2D cross-correlation of X (H x W) with K (kh x kw)."""
        H, W = X.shape
        kh, kw = K.shape
        oh, ow = H - kh + 1, W - kw + 1
        Y = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                # Dot product of the kernel with the window at (i, j).
                Y[i, j] = np.sum(X[i:i+kh, j:j+kw] * K)
        return Y

Note this is technically cross-correlation (no kernel flip), which is what deep-learning 'convolution' computes. Complexity is O(oh\cdot ow\cdot kh\cdot kw) for one input/output channel pair; with C_{in} input and C_{out} output channels it's O(C_{in}C_{out}\,H\,W\,kh\,kw). Real implementations use im2col+GEMM or FFT/Winograd to exploit BLAS and reduce multiplies.

Question 15

Why do Vision Transformers need either large pretraining data or strong inductive-bias injection to match CNNs, and where do ViTs win?

Accepted Answer

ViTs split an image into patch tokens and use global self-attention, so they lack the CNN's built-in locality and translation-equivariance priors. With those priors absent, ViTs must learn spatial structure from data, so on mid-size datasets (e.g. ImageNet-1k from scratch) they underperform CNNs; they catch up and surpass them only with large-scale pretraining (JFT/ImageNet-21k) or bias-injecting recipes (DeiT distillation, convolutional stems, hybrid/Swin's local windows). ViTs win on long-range dependencies, scaling with data/compute, and unified multimodal architectures; CNNs remain strong in low-data and high-resolution-efficiency regimes.
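
The high-resolution cost of global attention follows from simple token counting. The sketch below assumes non-overlapping 16x16 patches and counts only token pairs, ignoring the class token and everything else in the model; the function name is illustrative.

```python
def num_patches(height, width, patch):
    """Token count when an image is cut into non-overlapping patch x patch tiles."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    return (height // patch) * (width // patch)


tokens_224 = num_patches(224, 224, 16)  # 196
tokens_448 = num_patches(448, 448, 16)  # 784
# Self-attention compares every token pair, so cost grows with tokens squared.
attention_growth = (tokens_448 / tokens_224) ** 2  # 16.0
```

Doubling the image side quadruples the token count and multiplies the pairwise attention cost by 16, which is one reason local-window designs such as Swin restrict attention.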

Question 16

You train a ResNet with BatchNorm and standard augmentation; train accuracy is 99% but validation accuracy oscillates wildly and is far lower. Diagnose the likely causes and fixes.

Accepted Answer

The gap signals overfitting and/or train/eval inconsistency. Likely causes: (1) BatchNorm using running stats poorly estimated from too-small batches, or model left in train mode at eval — fix by ensuring eval mode, larger batches, or GroupNorm. (2) Insufficient regularization — add weight decay, dropout, stronger augmentation (RandAugment, Mixup/CutMix), early stopping. (3) Data leakage or a train/val distribution mismatch — verify splits and that augmentation isn't applied to validation. (4) Learning rate too high near convergence causing oscillation — use LR decay/warmup. Check whether the val curve's noise correlates with LR and BN behavior.

Question 17

Data augmentations like Mixup and CutMix combine images and labels. Why does training on these 'unrealistic' samples improve generalization and calibration?

Accepted Answer

Mixup trains on convex combinations of input pairs and their labels; CutMix pastes a patch of one image into another and mixes labels by area. They act as a strong regularizer by enforcing linear/locally-smooth behavior between examples, expanding the data manifold's support and discouraging memorization of sharp decision boundaries. The soft mixed labels reduce overconfidence, improving calibration and robustness to label noise and adversarial perturbation. CutMix specifically keeps informative local regions (better than Mixup's global blend for localization) while still mixing labels, forcing the model to attend to multiple object parts rather than one discriminative cue.
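
Both augmentations reduce to a few array operations. The NumPy sketch below passes the mixing coefficient and patch location explicitly for clarity; in practice they are usually drawn at random (for Mixup, typically from a Beta distribution). Function names, shapes, and class labels are assumptions.

```python
import numpy as np


def mixup(x1, y1, x2, y2, lam):
    """Convex combination of two inputs and their one-hot labels."""
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2


def cutmix(x1, y1, x2, y2, top, left, h, w):
    """Paste an h x w patch of x2 into x1; mix labels by pasted area."""
    out = x1.copy()
    out[..., top:top + h, left:left + w] = x2[..., top:top + h, left:left + w]
    lam = 1.0 - (h * w) / (x1.shape[-2] * x1.shape[-1])
    return out, lam * y1 + (1 - lam) * y2


cat, dog = np.zeros((3, 8, 8)), np.ones((3, 8, 8))
y_cat, y_dog = np.array([1.0, 0.0]), np.array([0.0, 1.0])

mx, my = mixup(cat, y_cat, dog, y_dog, lam=0.7)      # my is about [0.7, 0.3]
cx, cy = cutmix(cat, y_cat, dog, y_dog, 0, 0, 4, 4)  # cy = [0.75, 0.25]
```

Mixup blends every pixel, while CutMix leaves each region purely from one image, so local structure stays intact.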

CNNs & Computer Vision