Question 1

What is the next-token prediction objective, and why is it sufficient to learn general language ability?

Accepted Answer

Autoregressive pretraining minimizes the cross-entropy of predicting token x_t given all prior tokens, i.e. maximizing \sum_t \log p_\theta(x_t \mid x_{<t}). Equivalently it minimizes per-token negative log-likelihood, whose exponential is perplexity. It is sufficient because accurately predicting the next token over diverse web-scale text forces the model to internalize syntax, facts, reasoning patterns, and world structure: the only way to drive the loss down on hard continuations is to model the underlying generative process. It is self-supervised: labels are the text itself, so it scales without human annotation.
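As a small illustration of the objective above, the sketch below computes mean per-token negative log-likelihood and perplexity from the probabilities a model assigned to the tokens that actually occurred. The function names and the probability values are made up for the example.

```python
import math

def next_token_nll(probs_of_true_tokens):
    """Mean negative log-likelihood (in nats) of the observed tokens.

    probs_of_true_tokens[t] is the model's p(x_t | x_<t) for the token
    that actually appeared at position t.
    """
    return -sum(math.log(p) for p in probs_of_true_tokens) / len(probs_of_true_tokens)

def perplexity(probs_of_true_tokens):
    """Perplexity is the exponential of the mean per-token NLL."""
    return math.exp(next_token_nll(probs_of_true_tokens))

# Hypothetical probabilities assigned to four observed tokens.
probs = [0.5, 0.25, 0.5, 0.25]
nll = next_token_nll(probs)
ppl = perplexity(probs)
print(f"NLL = {nll:.4f} nats, perplexity = {ppl:.4f}")
```

Perplexity here equals the geometric mean of 1/p over the sequence, which is why it is often read as an effective branching factor.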

Question 2

Distinguish SFT from instruction tuning — are they the same thing?

Accepted Answer

SFT (supervised fine-tuning) is the general mechanism: continue next-token training on curated (prompt, target) pairs, with the loss (usually) computed only on the response tokens. Instruction tuning is a specific application of SFT where the data is a broad mix of tasks phrased as instructions with desired outputs, aiming to make the model follow arbitrary natural-language directives (generalize to unseen instructions). So instruction tuning is SFT, but not all SFT is instruction tuning: you can SFT on a single narrow task (e.g. classification) without any instruction-following goal.
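The response-only loss can be sketched as a masked average. This is a toy version under the assumption that per-token log-probabilities are already available; `sft_loss` and the mask layout are illustrative names, not any library's API.

```python
import math

def sft_loss(token_logprobs, is_response):
    """Mean NLL over response tokens only.

    token_logprobs[t]: log p(x_t | x_<t) under the model.
    is_response[t]:    True for target tokens, False for prompt tokens.
    """
    kept = [-lp for lp, keep in zip(token_logprobs, is_response) if keep]
    if not kept:
        raise ValueError("no response tokens to train on")
    return sum(kept) / len(kept)

# Two prompt tokens (masked out) followed by two response tokens.
logprobs = [math.log(0.1), math.log(0.2), math.log(0.5), math.log(0.5)]
mask = [False, False, True, True]
loss = sft_loss(logprobs, mask)
print(f"masked SFT loss = {loss:.4f}")
```

The poorly predicted prompt tokens do not contribute, so the gradient only pushes the model toward producing the target response.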

Question 3

State the Chinchilla scaling result and contrast it with the earlier Kaplan compute-optimal recipe.

Accepted Answer

Hoffmann et al. (Chinchilla, 2022) found that for a fixed compute budget C \approx 6ND, loss is minimized when parameters N and training tokens D scale roughly equally (each about \propto C^{0.5}), which works out to about 20 tokens per parameter, so most prior models (GPT-3, Gopher) were badly undertrained. Chinchilla (70B, 1.4T tokens) beat the 280B Gopher at a comparable compute budget. This corrected the earlier Kaplan et al. (2020) recommendation to grow model size far faster than data (roughly N \propto C^{0.73}). The practical lesson: at fixed FLOPs, smaller models trained on more data win, and inference cost favors that even more.

Question 4

Compute roughly how many training tokens a Chinchilla-optimal 13B model wants, and the FLOPs that implies.

Accepted Answer

Chinchilla's rule is ~20 tokens per parameter, so 13\text{B} \times 20 \approx 260\text{B} tokens. Training FLOPs follow C \approx 6ND = 6 \times 13\times10^{9} \times 260\times10^{9} \approx 2.0\times10^{22} FLOPs. (The factor 6 is 2 for the forward pass plus ~4 for the backward pass per parameter per token.) In practice modern models are deliberately trained well past 20:1 (e.g. hundreds or thousands of tokens per parameter) to cut inference cost, accepting a less compute-optimal training point.
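The arithmetic above is easy to reproduce. The helper below is a back-of-the-envelope sketch that takes the ~20 tokens-per-parameter rule and the C ≈ 6ND approximation as given; `chinchilla_budget` is a made-up name.

```python
def chinchilla_budget(n_params, tokens_per_param=20.0):
    """Return (training tokens, training FLOPs) under D = ratio * N and C = 6 N D."""
    tokens = n_params * tokens_per_param
    flops = 6.0 * n_params * tokens
    return tokens, flops

tokens_13b, flops_13b = chinchilla_budget(13e9)
tokens_70b, _ = chinchilla_budget(70e9)
print(f"13B: {tokens_13b:.3g} tokens, {flops_13b:.3g} FLOPs")
print(f"70B: {tokens_70b:.3g} tokens")
```

The 70B case recovers the 1.4T tokens Chinchilla itself was trained on, which is a quick sanity check on the 20:1 rule.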

Question 5

Why does data mixture (and deduplication) matter as much as raw token count in pretraining?

Accepted Answer

Quality and composition change the shape of the loss-vs-data curve, not just how far along it you are. Heavy duplication causes memorization and wastes effective tokens, so near-duplicate removal improves generalization per FLOP. Domain proportions (code, math, multilingual, high-quality prose) determine downstream skills: code in the mix is widely reported to help reasoning, while over-weighting low-quality web text hurts. Practitioners up-weight high-signal sources, schedule a high-quality 'annealing' phase late in training, and balance domains to avoid one swamping rare-but-valuable data. The objective is fixed, but the data distribution it's computed over is the real lever on capability.

Question 6

Explain how LoRA works and why it dramatically reduces trainable parameters without adding inference latency.

Accepted Answer

LoRA freezes the pretrained weight W_0 \in \mathbb{R}^{d\times d} and learns a low-rank update \Delta W = BA, where A \in \mathbb{R}^{r\times d}, B \in \mathbb{R}^{d\times r}, with rank r \ll d. The forward pass becomes h = W_0 x + \frac{\alpha}{r} BA x. Only A and B train (2dr parameters instead of d^2), cutting trainable params by orders of magnitude (and optimizer-state memory with them). It works to the extent that fine-tuning updates have low intrinsic rank, which is LoRA's core hypothesis. At inference you can merge \frac{\alpha}{r}BA into W_0, so there is zero added latency, unlike adapters, which insert extra sequential layers.
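A tiny pure-Python check of the two claims above: the parameter saving, and the fact that merging the scaled update into W_0 gives the same output as the separate low-rank path. The dimensions are toy values, and B is given random values only for this demo (real LoRA initializes B to zero so training starts from the base model).

```python
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_trainable_fraction(d, r):
    """Trainable params in A and B (2*d*r) relative to a full d x d update."""
    return (2 * d * r) / (d * d)

random.seed(0)
d, r, alpha = 6, 2, 4.0
W0 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # frozen, d x d
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]   # r x d
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]   # d x r (random for demo only)
x = [random.gauss(0, 1) for _ in range(d)]
scale = alpha / r

# Training-time path: h = W0 x + (alpha / r) * B (A x)
low_rank = matvec(B, matvec(A, x))
h_unmerged = [w + scale * u for w, u in zip(matvec(W0, x), low_rank)]

# Inference-time path: fold the scaled update into a single matrix.
BA = matmul(B, A)
W_merged = [[w + scale * u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W0, BA)]
h_merged = matvec(W_merged, x)

max_diff = max(abs(a - b) for a, b in zip(h_unmerged, h_merged))
print(f"max |unmerged - merged| = {max_diff:.2e}")
print(f"trainable fraction at d=4096, r=8: {lora_trainable_fraction(4096, 8):.4%}")
```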

Question 7

Compare LoRA, adapters, and prefix/prompt tuning along where they inject parameters and their inference-time cost.

Accepted Answer

Adapters insert small bottleneck MLP layers between transformer sublayers — extra sequential compute at inference. LoRA adds a parallel low-rank delta to existing weight matrices — mergeable, so no inference overhead. Prefix/prompt tuning prepends trainable continuous 'virtual token' vectors to the keys/values (prefix) or input embeddings (prompt) and trains only those, leaving all weights frozen — cheapest to store but it consumes context length and tends to underperform LoRA on harder tasks. All are PEFT; LoRA dominates in practice for its quality/no-latency tradeoff, while prompt tuning shines for many cheaply-swappable tasks.

Question 8

Walk through the classic three-stage RLHF pipeline (reward model + PPO).

Accepted Answer

Stage 1: SFT on demonstration data to get a competent instruction-follower. Stage 2: collect human preference comparisons (A vs B) and train a reward model r_\phi, typically with the Bradley-Terry loss -\log \sigma(r_\phi(x,y_w)-r_\phi(x,y_l)) on chosen/rejected pairs. Stage 3: optimize the policy with PPO to maximize expected reward minus a per-token KL penalty \beta\,\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}}) against the frozen SFT model, which prevents reward hacking and keeps generations on-distribution. The KL term and clipping are what keep RLHF stable.
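The Stage 2 loss is short enough to write out. This sketch assumes the reward model has already produced scalar scores for the chosen and rejected responses; `bradley_terry_loss` is an illustrative name.

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected), written to avoid overflow."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) = log(1 + exp(-m)); use the equivalent form for m < 0.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

tied = bradley_terry_loss(0.0, 0.0)       # reward model is indifferent
correct = bradley_terry_loss(2.0, 0.0)    # chosen scored higher: small loss
wrong = bradley_terry_loss(0.0, 2.0)      # rejected scored higher: large loss
print(tied, correct, wrong)
```

Only the score difference matters, so the reward model's absolute scale is unidentified; that is one reason the learned reward is only meaningful as a relative signal.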

Question 9

What is RLAIF, and how does Constitutional AI use it?

Accepted Answer

RLAIF (RL from AI Feedback) replaces human preference labels with an LLM judge that ranks or critiques responses, scaling alignment data cheaply. Anthropic's Constitutional AI is a concrete instance: in the SFT phase the model critiques and revises its own outputs against a written 'constitution' of principles, producing improved training targets; in the RL phase a model generates preference labels by judging pairs against those principles, and that AI feedback trains the reward signal. The constitution makes the value targets explicit, auditable, and steerable without per-example human labeling.

Question 10

What is catastrophic forgetting in fine-tuning, and what mitigations exist?

Accepted Answer

Catastrophic forgetting is the loss of previously-learned capabilities when fine-tuning shifts weights toward the new task distribution, overwriting the features that supported old behaviors. Mitigations: PEFT (LoRA/adapters) freezes the base weights, which limits (but does not eliminate) drift from the original knowledge; replay/rehearsal mixes in pretraining or general instruction data; lower learning rates and fewer epochs; regularization toward the base (KL or L2, e.g. EWC weighting by Fisher information); and a KL-to-reference penalty in RLHF. The general principle: constrain how far weights drift from the pretrained checkpoint, or keep the original capabilities represented in the training mix.

Question 11

Explain knowledge distillation for LLMs, and the difference between hard-label and soft-label (logit) distillation.

Accepted Answer

Distillation trains a smaller student to mimic a larger teacher. Hard-label distillation trains the student on the teacher's generated text (sequence-level / behavior cloning) with standard cross-entropy: simple, model-agnostic, and the basis of most 'synthetic data' distillation. Soft-label (logit) distillation matches the teacher's full output distribution, minimizing the KL divergence between the teacher's and student's (often temperature-scaled) softmax distributions, transferring 'dark knowledge' about relative token probabilities. It is typically more sample-efficient but requires teacher logits and matched tokenizers. On-policy variants (the student generates, the teacher scores) reduce exposure bias from pure offline distillation.
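The two losses can be contrasted on a single token position. This is a minimal sketch with made-up logits over a three-token vocabulary; it uses KL(teacher || student) on temperature-softened distributions, which is one common choice rather than the only one.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

def hard_label_loss(teacher_token_id, student_logits):
    """Cross-entropy against the single token the teacher actually emitted."""
    return -math.log(softmax(student_logits)[teacher_token_id])

teacher = [2.0, 1.0, 0.1]   # hypothetical teacher logits
student = [0.1, 1.0, 2.0]   # hypothetical student logits
kd = soft_label_kd_loss(teacher, student)
ce = hard_label_loss(0, student)
print(f"soft-label KL = {kd:.4f}, hard-label CE = {ce:.4f}")
```

The soft-label loss sees how the teacher ranks every token, while the hard-label loss sees only which token was sampled.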

Question 12

How does a Mixture-of-Experts layer work, and why does it decouple parameter count from per-token compute?

Accepted Answer

An MoE replaces a dense FFN with E expert FFNs plus a router (gating network) that, per token, selects top-k experts (e.g. k=2). Only those k experts run, so FLOPs scale with k, not E: you get a huge total parameter count (capacity/knowledge) at the active compute of a much smaller dense model. The router outputs softmax weights over experts; the layer output is the weighted sum of the chosen experts' outputs. This sparse activation is why MoE models advertise total vs active parameters separately.
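A toy single-token forward pass makes the sparse activation concrete. The "experts" here are hypothetical stand-ins that just scale their input and record that they ran, and renormalizing the gate weights over the selected experts is one common convention, assumed for the sketch.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def make_expert(scale, calls):
    """Hypothetical toy expert: scales its input and logs that it was executed."""
    def expert(x):
        calls.append(scale)
        return [scale * xi for xi in x]
    return expert

def moe_forward(x, router_logits, experts, k=2):
    probs = softmax(router_logits)
    # Pick the k highest-probability experts for this token.
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    out = [0.0] * len(x)
    for i in top:                      # only the chosen experts are evaluated
        weight = probs[i] / norm       # renormalize over the selected experts
        y = experts[i](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, top

calls = []
experts = [make_expert(s, calls) for s in (0.0, 1.0, 2.0, 3.0)]
out, top = moe_forward([1.0, 2.0], router_logits=[0.0, 2.0, 1.0, 2.0], experts=experts, k=2)
print(f"selected experts {top}, executed {len(calls)} of {len(experts)}, output {out}")
```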

Question 13

What does QLoRA add on top of LoRA, and what is the role of NF4 and double quantization?

Accepted Answer

QLoRA fine-tunes a 4-bit-quantized frozen base model while keeping LoRA adapters in higher precision (bf16), making fine-tuning of large models feasible on a single GPU. NF4 (4-bit NormalFloat) is a data type designed to be information-theoretically optimal for normally distributed weights, giving better fidelity than int4 at the same bit width. Double quantization further quantizes the quantization constants, saving extra memory. Paged optimizers handle memory spikes. Crucially, gradients flow through the dequantized base into the LoRA adapters; the base stays frozen and quantized, so quality stays close to 16-bit LoRA.
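To see why double quantization is worth the trouble, the overhead of the quantization constants can be estimated per parameter. The block sizes and bit widths below (64-weight blocks with 32-bit constants, then 8-bit constants grouped in blocks of 256) are taken as assumptions matching the setup usually described for QLoRA; adjust them for a different configuration.

```python
def constant_overhead_bits(block_size=64, const_bits=32):
    """Extra bits per weight from storing one scale constant per block."""
    return const_bits / block_size

def double_quant_overhead_bits(block_size=64, const_bits=8,
                               second_block=256, second_const_bits=32):
    """Overhead when the first-level constants are themselves quantized."""
    first_level = const_bits / block_size
    second_level = second_const_bits / (block_size * second_block)
    return first_level + second_level

plain = constant_overhead_bits()
double = double_quant_overhead_bits()
print(f"overhead without double quantization: {plain:.3f} bits/param")
print(f"overhead with double quantization:    {double:.3f} bits/param")
```

Under these assumptions the saving is a bit under 0.4 bits per parameter, which is small per weight but adds up to gigabytes on a model with tens of billions of parameters.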

Question 14

Derive the DPO objective and explain why it eliminates the explicit reward model.

Accepted Answer

RLHF's KL-constrained reward-maximization has the closed-form optimum \pi^*(y|x) \propto \pi_{\text{ref}}(y|x)\exp(r(x,y)/\beta), so r(x,y) = \beta\log\frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta\log Z(x). Substituting this reward into the Bradley-Terry preference likelihood cancels the intractable partition Z(x), giving DPO's loss -\log\sigma\big(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\big). The policy is implicitly its own reward model, so you train directly on preference pairs with a simple supervised loss, with no separate RM and no sampling/PPO loop.
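The resulting loss needs only four sequence log-probabilities per preference pair. The sketch below assumes those have already been summed over tokens; the function and argument names are illustrative.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair from sequence log-probs."""
    implicit_reward_w = beta * (policy_logp_w - ref_logp_w)
    implicit_reward_l = beta * (policy_logp_l - ref_logp_l)
    return -log_sigmoid(implicit_reward_w - implicit_reward_l)

# Policy identical to the reference: zero margin, loss = log 2.
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has raised the chosen and lowered the rejected log-prob.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
print(f"loss at init = {at_init:.4f}, after improvement = {improved:.4f}")
```

Note that the reference log-probs are constants during training, so they can be computed once and cached.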

Question 15

What are the practical advantages and failure modes of DPO versus PPO-based RLHF?

Accepted Answer

DPO is far simpler and more stable: offline, no reward model, no rollouts, no PPO clipping/value-function tuning, lower compute and variance. Failure modes: it's offline so it only learns from the fixed preference set (no on-policy exploration), can over-optimize and push probability mass off-distribution, is sensitive to the reference model and to noisy/length-biased preferences, and the implicit reward can drift. PPO with an online reward model can keep generating fresh on-policy samples and reward-shape more flexibly, at the cost of a fragile, expensive training loop. Variants (IPO, KTO, online DPO) target DPO's overfitting and offline limits.

Question 16

When would you choose full fine-tuning over PEFT despite PEFT's efficiency?

Accepted Answer

Choose full fine-tuning when you need maximal capability shift that low-rank updates can't capture: large domain adaptation (a very different language, modality, or knowledge regime), when you have abundant high-quality data and compute, when continued pretraining is the goal, or when you'll serve a single specialized model (no need to swap adapters). Full FT can exceed LoRA on hard tasks because it isn't constrained to a low-rank subspace. PEFT wins when you need many task-specific variants, limited VRAM, fast iteration, or preservation of base capabilities. Rule of thumb: small targeted adaptation → PEFT; deep capability or distribution change → full FT.

Question 17

What is the load-balancing problem in MoE training, and how is it addressed?

Accepted Answer

Without intervention the router collapses, sending most tokens to a few favored experts — they get all the gradient, the rest stay undertrained, and capacity is wasted (plus uneven expert load causes dropped tokens under fixed capacity factors). Fixes: an auxiliary load-balancing loss penalizing the product of fraction-of-tokens-routed and mean-router-probability per expert (encouraging uniform usage); expert capacity limits with token dropping/overflow; noisy top-k gating to encourage exploration; and newer auxiliary-loss-free schemes (e.g. DeepSeek-V3) that add per-expert bias terms adjusted to equalize load. Balanced routing is essential for both quality and hardware utilization in distributed expert-parallel training.

Question 18

What is continued (continual) pretraining, and what are the key pitfalls when adapting a base model to a new domain?

Accepted Answer

Continued pretraining resumes the next-token objective on domain or language-specific corpora (medical, legal, code, a new language) to inject knowledge before any SFT. Pitfalls: catastrophic forgetting of general ability (mitigate by mixing in a slice of the original distribution, e.g. 5–30% replay); learning-rate scheduling — restart with a warmup and a peak LR well below the original to avoid destabilizing the checkpoint; data quality and dedup matter even more at smaller scale; tokenizer mismatch for new languages may need vocabulary extension. It precedes, not replaces, instruction tuning and alignment.

Question 19

In LoRA, what do rank r and scaling α control, what is the typical α/r convention, and what does rank tell you about a task?

Accepted Answer

r is the dimensionality (capacity) of the low-rank update — higher r captures more complex adaptations but costs more params; \alpha scales the update via the factor \alpha/r, effectively setting the update's learning-rate magnitude relative to the frozen base. A common convention is \alpha = 2r (or \alpha=r), chosen so the effective scale stays stable as you sweep r. Empirically, simple stylistic/format adaptation needs low rank (4–16); tasks demanding new knowledge or large behavioral change benefit from higher rank or full FT. If raising r keeps helping, the task's update isn't low-rank — a signal to consider full fine-tuning.

Question 20

In PPO-based RLHF, what failure occurs if you remove the per-token KL penalty against the reference policy, and mechanistically why does the KL term prevent it?

Accepted Answer

Without the KL penalty the policy over-optimizes the learned reward and undergoes reward hacking / mode collapse: it drifts far off-distribution into degenerate, high-reward-but-low-quality text (repetition, sycophancy, gibberish the reward model mis-scores) because the reward model is only accurate near \pi_{ref}'s support and is exploitable elsewhere. The KL term -\beta\,\mathrm{KL}(\pi_\theta\|\pi_{ref}) adds a per-token cost for diverging, acting as a trust region / regularizer that keeps the policy in the region where reward estimates are reliable and preserves the base model's fluency and diversity. It trades reward magnitude for staying on-distribution, mitigating Goodhart's law on a proxy reward.
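One common way to apply the per-token penalty is to fold it into the reward sequence PPO sees: each sampled token is charged \beta times its log-ratio against the reference, and the reward-model score is added at the final token. The sketch below assumes that layout; implementations differ in the KL estimator and in where the score is placed.

```python
def kl_shaped_rewards(policy_logprobs, ref_logprobs, rm_score, beta):
    """Per-token rewards: -beta * (log pi_theta - log pi_ref), plus RM score at the end."""
    if not policy_logprobs:
        raise ValueError("empty response")
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += rm_score   # sequence-level score credited to the last token
    return rewards

ref = [-1.0, -2.0, -0.5]
on_dist = kl_shaped_rewards([-1.0, -2.0, -0.5], ref, rm_score=1.5, beta=0.1)
drifted = kl_shaped_rewards([-0.5, -1.0, -0.1], ref, rm_score=1.5, beta=0.1)
print(f"return with no drift = {sum(on_dist):.2f}, with drift = {sum(drifted):.2f}")
```

With the same reward-model score, the drifted sample earns a lower total return, so the policy only benefits from diverging when the reward gain outweighs the KL cost.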

Question 21

Compare DPO and PPO for RLHF on stability, distribution coverage, and susceptibility to reward hacking. When would a staff engineer still choose PPO despite DPO's simplicity?

Accepted Answer

DPO is offline and contrastive: simpler, cheaper, no reward model or rollouts, lower-variance gradients, but it only sees the fixed preference dataset's (y_w,y_l) pairs — it cannot explore, so it inherits and can amplify dataset biases (length, annotator quirks) and can push mass to unobserved OOD outputs. PPO is online: it samples fresh completions scored by an explicit reward model, giving broader coverage and an explicit KL trust region, which often yields higher peak quality and lets reward-model errors be caught/iterated — at the cost of instability, hyperparameter sensitivity, and compute. Choose PPO when you have a strong reusable reward model, want on-policy exploration beyond the static preference set, need to combine multiple reward signals (safety + helpfulness), or DPO has plateaued/over-optimized. Reward hacking hits both, but PPO's explicit RM is the more classically exploitable proxy.

Question 22

In a top-k MoE layer, derive why the model gets far more parameters than a dense model at roughly the same per-token FLOPs, and state precisely what does and doesn't scale. Use Mixtral-style numbers (8 experts, top-2) to ground the answer.

Accepted Answer

Total parameters scale with the number of experts E (each expert is a full FFN), but per-token compute scales only with the number of activated experts k. With E=8, top-2, the FFN parameter count is 8\times a single expert, yet each token routes through only 2 experts, so FFN FLOPs match a dense model with 2 FFNs (plus a tiny router and gather/scatter cost). So you store ~8\times FFN params but pay ~2\times FFN compute. What does NOT shrink: memory/VRAM (all experts must be resident), so MoE trades cheap FLOPs for expensive memory and bandwidth. Attention and embeddings are unchanged. The win is conditional computation: more capacity per FLOP, not per byte.
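The stored-versus-active gap can be counted directly for one MoE layer. The dimensions below (d_model = 4096, d_ff = 14336, three weight matrices per gated FFN) are illustrative Mixtral-like assumptions, and the router is modeled as a single d_model x E linear map.

```python
def ffn_params(d_model, d_ff, n_matrices=3):
    """Weights in one gated FFN expert (biases ignored)."""
    return n_matrices * d_model * d_ff

def moe_ffn_stats(d_model, d_ff, n_experts, top_k):
    """Return (stored params, params touched per token) for one MoE FFN layer."""
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * n_experts
    total = n_experts * per_expert + router
    active = top_k * per_expert + router
    return total, active

total, active = moe_ffn_stats(d_model=4096, d_ff=14336, n_experts=8, top_k=2)
print(f"stored: {total / 1e9:.2f}B params per layer")
print(f"active: {active / 1e9:.2f}B params per token per layer")
print(f"stored / active = {total / active:.3f}")
```

Per-token FFN FLOPs are roughly twice the active parameter count, so the ratio printed here is also approximately the capacity gained per unit of FFN compute.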

Question 23

Define capacity factor in an MoE layer and analyze the tradeoffs of setting it to 1.0 vs 1.25 vs 2.0. Include what happens to dropped tokens, padding waste, and why training and inference often use different values.

Accepted Answer

Capacity factor C sets each expert's buffer to C \cdot (\text{tokens}/E) slots; tokens beyond that are dropped (skip the FFN via the residual). C=1.0 means zero slack: under any imbalance, tokens drop, degrading quality, but padding compute is minimal. C=1.25 is the common train compromise: tolerates moderate imbalance with ~25% padding overhead. C=2.0 nearly eliminates drops but doubles dispatch buffers and FLOPs/memory, mostly wasted on padding. The drop rate depends on how well load-balancing works, so good balancing lets you lower C. Inference (especially batch-1 or expert-choice variants) often uses higher or effectively-unbounded capacity to avoid quality-hurting drops, since latency tolerance and batch shapes differ from training; training picks C to bound the fixed dispatch tensor for static shapes/efficiency.
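The drop-versus-padding tradeoff is easy to tabulate for a given routing outcome. The per-expert loads below are a made-up, mildly imbalanced top-1 assignment of 1024 tokens to 8 experts, and capacity follows the C * tokens / E definition used above.

```python
import math

def expert_capacity(tokens_in_batch, n_experts, capacity_factor):
    """Slots reserved per expert."""
    return math.ceil(capacity_factor * tokens_in_batch / n_experts)

def dispatch_stats(tokens_per_expert, capacity):
    """Return (tokens dropped, padded slots) for a given routing outcome."""
    dropped = sum(max(0, n - capacity) for n in tokens_per_expert)
    padding = sum(max(0, capacity - n) for n in tokens_per_expert)
    return dropped, padding

loads = [200, 150, 128, 128, 128, 110, 100, 80]   # hypothetical, sums to 1024
results = {}
for c in (1.0, 1.25, 2.0):
    cap = expert_capacity(sum(loads), len(loads), c)
    results[c] = dispatch_stats(loads, cap)
    print(f"C={c}: capacity={cap}, dropped={results[c][0]}, padding={results[c][1]}")
```

Raising C trades dropped tokens for padded slots, and better load balancing shrinks both at once.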

Question 24

Synthetic data for training: what is model collapse, and how do teams use synthetic data without triggering it?

Accepted Answer

Model collapse is degradation that occurs when models are trained recursively on their own (or other models') unfiltered outputs: tail/rare modes vanish, errors compound, and the distribution narrows toward the mean over generations. Safe use relies on not closing the loop blindly: keep real human data anchoring the mix, use a stronger teacher (distillation is not self-training), aggressively filter/verify synthetic data (reward-model scoring, execution feedback for code/math, rejection sampling, deduplication), and use synthetic data to cover targeted gaps (instruction diversity, hard reasoning traces) rather than to replace the base corpus. Verification is the key — synthetic data with a ground-truth checker is the strongest case.

Question 25

Why does the KL-to-reference penalty in RLHF/DPO actually prevent reward hacking, and what does setting β too low or too high do?

Accepted Answer

The reward model is an imperfect proxy learned on a limited preference distribution; an unconstrained policy will find out-of-distribution outputs that score spuriously high (reward hacking, Goodhart's law). The KL penalty \beta\,\mathrm{KL}(\pi_\theta\|\pi_{\text{ref}}) keeps the policy near the trusted SFT distribution where the reward model is reliable, trading a little reward for staying on-distribution. Too-high \beta pins the policy to the reference and it barely improves (under-optimization); too-low \beta lets it drift into reward-hacked, degenerate, or repetitive outputs that game the proxy while real quality drops. This is the classic over-optimization curve: true reward first rises and then falls (an inverted U) while proxy reward keeps climbing.

Question 26

Design an alignment pipeline for a customer-facing assistant that must be helpful, refuse unsafe requests, and stay cheap to iterate on. Justify each stage.

Accepted Answer

Start with SFT on high-quality demonstrations (incl. ideal refusals) to set baseline behavior and format. Add preference optimization for nuanced 'helpful vs harmful' tradeoffs — prefer DPO (or KTO) for stability/cost over PPO unless you need online exploration; use a mix of human and RLAIF/constitutional preferences against an explicit policy document for scalable, auditable safety targets. Keep a frozen reference and a KL constraint to avoid reward hacking and capability regression. Use LoRA so you can cheaply re-tune and swap variants. Guard at inference with a separate classifier/system prompt and red-team continuously; evaluate helpfulness and safety on held-out adversarial sets, watching for over-refusal. Mix in general data to prevent forgetting.

Question 27

Derive the DPO loss from the RLHF objective and the Bradley-Terry preference model. Show explicitly why DPO eliminates the need to train a separate reward model.

Accepted Answer

The KL-constrained RLHF objective \max_\pi \mathbb{E}[r(x,y)] - \beta\,\mathrm{KL}(\pi\|\pi_{ref}) has the closed-form optimum \pi^*(y|x)=\frac{1}{Z(x)}\pi_{ref}(y|x)\exp(r(x,y)/\beta). Inverting gives the implicit reward r(x,y)=\beta\log\frac{\pi^*(y|x)}{\pi_{ref}(y|x)}+\beta\log Z(x). Substituting into the Bradley-Terry model P(y_w\succ y_l)=\sigma(r(x,y_w)-r(x,y_l)), the intractable \log Z(x) cancels (same x), yielding \mathcal{L}_{DPO}=-\mathbb{E}\big[\log\sigma(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)}-\beta\log\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)})\big]. The policy IS the reward model, so no separate reward net or RL rollout is needed.
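Written out step by step, the cancellation of the partition function looks as follows. This only expands the substitution described above, using the same symbols; the final line replaces \pi^* with the trainable policy \pi_\theta.

```latex
\begin{align*}
r(x,y_w) - r(x,y_l)
  &= \Big[\beta\log\frac{\pi^*(y_w|x)}{\pi_{ref}(y_w|x)} + \beta\log Z(x)\Big]
   - \Big[\beta\log\frac{\pi^*(y_l|x)}{\pi_{ref}(y_l|x)} + \beta\log Z(x)\Big] \\
  &= \beta\log\frac{\pi^*(y_w|x)}{\pi_{ref}(y_w|x)}
   - \beta\log\frac{\pi^*(y_l|x)}{\pi_{ref}(y_l|x)} \\[4pt]
P(y_w \succ y_l \mid x)
  &= \sigma\Big(\beta\log\frac{\pi^*(y_w|x)}{\pi_{ref}(y_w|x)}
   - \beta\log\frac{\pi^*(y_l|x)}{\pi_{ref}(y_l|x)}\Big) \\[4pt]
\mathcal{L}_{DPO}(\theta)
  &= -\mathbb{E}_{(x,y_w,y_l)}\Big[\log\sigma\Big(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)}
   - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\Big)\Big]
\end{align*}
```

The cancellation works because both responses share the same prompt x, so Z(x) enters both rewards identically.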

Question 28

DPO has no explicit KL coefficient like PPO, yet practitioners observe it can still over-optimize and degrade. Explain how reward over-optimization manifests in DPO and what role β and the reference model play.

Accepted Answer

DPO's \beta is the implicit KL strength: smaller \beta lets \pi_\theta diverge further from \pi_{ref}. Over-optimization shows as the loss continuing to push \log\frac{\pi_\theta(y_w)}{\pi_{ref}(y_w)} up and \log\frac{\pi_\theta(y_l)}{\pi_{ref}(y_l)} down unboundedly, but a known pathology is that DPO often achieves the margin by *decreasing the chosen logprob slightly while crashing the rejected logprob much more*, i.e. it can reduce \pi_\theta(y_w) in absolute terms. This drains probability mass to unseen out-of-distribution responses, degrading fluency. Because \pi_{ref} is fixed, late in training the policy is far from it so the implicit reward is unreliable, and small \beta plus noisy/length-biased preferences amplify this. Mitigations: tune \beta, add an SFT/NLL anchor on y_w, or length-normalize.

Question 29

Why is an auxiliary load-balancing loss needed in token-choice MoE, what failure mode does it prevent, and how does the Switch/GShard auxiliary loss actually work? Then explain DeepSeek's argument for replacing it with an auxiliary-loss-free bias scheme.

Accepted Answer

Without balancing, routing self-reinforces: a few experts get most tokens, train faster, and attract more tokens (expert collapse), wasting capacity and causing dropped tokens elsewhere. The Switch/GShard aux loss is \alpha \cdot E \sum_i f_i \cdot P_i, where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for i; minimizing it pushes both toward uniform 1/E. It's added to the LM loss, scaled by a small \alpha. DeepSeek-V3 argues this aux loss fights the LM objective (interference gradients hurt quality), so it instead adds a per-expert bias to the routing scores used for top-k selection, nudged up for underloaded experts and down for overloaded ones after each training step, which balances the assignment without injecting a gradient into the loss.
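The auxiliary loss \alpha \cdot E \sum_i f_i P_i can be evaluated on a toy batch to see how it separates balanced from collapsed routing. The sketch assumes top-1 routing, and the router probabilities and assignments are made up for the example.

```python
def switch_aux_loss(router_probs, assignments, n_experts, alpha=0.01):
    """alpha * E * sum_i f_i * P_i for a batch of tokens.

    router_probs[t]: router softmax over experts for token t.
    assignments[t]:  index of the expert token t was dispatched to (top-1).
    """
    n_tokens = len(assignments)
    f = [0.0] * n_experts                      # fraction of tokens per expert
    for e in assignments:
        f[e] += 1.0 / n_tokens
    P = [sum(p[i] for p in router_probs) / n_tokens for i in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, P))

E = 4
balanced = switch_aux_loss([[0.25] * E] * 4, [0, 1, 2, 3], E)
collapsed = switch_aux_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [0, 0, 0, 0], E)
print(f"balanced = {balanced:.4f}, collapsed = {collapsed:.4f}")
```

Perfectly uniform routing gives the minimum value \alpha, while full collapse onto one expert gives \alpha \cdot E. In a real model only P_i carries gradient, since f_i comes from a discrete assignment.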

LLM Pretraining, Fine-tuning, PEFT & Alignment