LLM Pretraining, Fine-tuning, PEFT & Alignment

How large language models are pretrained, fine-tuned (LoRA/PEFT), and aligned (RLHF, DPO).

Foundational concept

What is the next-token prediction objective, and why is it sufficient to learn general language ability?

Autoregressive pretraining minimizes the cross-entropy of predicting token $x_t$ given all prior tokens, i.e. it maximizes $\sum_t \log p_\theta(x_t \mid x_{<t})$. Equivalently, it minimizes the per-token negative log-likelihood, and perplexity is the exponential of that average. It is sufficient because accurately predicting the next token over diverse web-scale text forces the model to internalize syntax, facts, reasoning patterns, and world structure: the only way to drive the loss down on hard continuations is to model the underlying generative process. It is self-supervised, since the labels are the text itself, so it scales without human annotation.
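
As a minimal sketch of the relationship between the objective and perplexity, the snippet below computes the mean negative log-likelihood from per-token probabilities and exponentiates it. The probabilities and the function name `next_token_nll` are made up for illustration; they are not from any real model.

```python
import math

def next_token_nll(token_probs):
    """Mean negative log-likelihood, given p(x_t | x_<t) for each observed token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical probabilities a model assigned to the tokens that actually occurred.
probs = [0.5, 0.25, 0.125, 0.5]

nll = next_token_nll(probs)   # average cross-entropy, in nats per token
perplexity = math.exp(nll)    # perplexity is exp(mean NLL), not the NLL itself

print(f"mean NLL = {nll:.4f} nats, perplexity = {perplexity:.4f}")
```

Lowering the loss and lowering perplexity are the same optimization, but the two numbers are on different scales.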
#pretraining #autoregressive #cross-entropy #perplexity
Foundational concept

Distinguish SFT from instruction tuning — are they the same thing?

SFT (supervised fine-tuning) is the general mechanism: continue next-token training on curated (prompt, target) pairs, with the loss (usually) computed only on the response tokens. Instruction tuning is a specific application of SFT where the data is a broad mix of tasks phrased as instructions with desired outputs, aiming to make the model follow arbitrary natural-language directives (and generalize to unseen instructions). So instruction tuning is SFT, but not all SFT is instruction tuning: you can SFT on a single narrow task (e.g. classification) without any instruction-following goal.
#sft #instruction-tuning #fine-tuning
Intermediate concept

State the Chinchilla scaling result and contrast it with the earlier Kaplan compute-optimal recipe.

Hoffmann et al. (Chinchilla, 2022) found that for a fixed compute budget $C \approx 6ND$, loss is minimized when parameters $N$ and training tokens $D$ scale roughly equally — about 20 tokens per parameter — so most prior models (GPT-3, Gopher) were badly undertrained. Chinchilla (70B, 1.4T tokens) beat the 280B Gopher. This corrected the earlier Kaplan et al. (2020) recommendation to grow model size far faster than data. The practical lesson: at fixed FLOPs, smaller models trained on more data win, and inference cost favors that even more.
#scaling-laws #chinchilla #compute-optimal #data
Intermediate math

Compute roughly how many training tokens a Chinchilla-optimal 13B model wants, and the FLOPs that implies.

Chinchilla's rule is ~20 tokens per parameter, so $13\text{B} \times 20 \approx 260\text{B}$ tokens. Training FLOPs follow $C \approx 6ND = 6 \times 13\times10^{9} \times 260\times10^{9} \approx 2.0\times10^{22}$ FLOPs. (The factor 6 is 2 for the forward pass plus ~4 for the backward pass per parameter per token.) In practice modern models are deliberately trained well past 20:1 — e.g. hundreds or thousands of tokens per parameter — to cut inference cost, accepting a less compute-optimal training point.
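
The arithmetic above can be checked with a short sketch. The helper name `chinchilla_budget` and its `tokens_per_param` argument are illustrative; the 20:1 ratio and the $C \approx 6ND$ approximation are the rules of thumb stated in the answer, not exact laws.

```python
def chinchilla_budget(n_params, tokens_per_param=20.0):
    """Return (training tokens, training FLOPs) using D = ratio * N and C ~= 6 * N * D."""
    tokens = n_params * tokens_per_param
    flops = 6.0 * n_params * tokens
    return tokens, flops

tokens, flops = chinchilla_budget(13e9)
print(f"tokens = {tokens:.3g}, FLOPs = {flops:.3g}")

# An intentionally over-trained point (200 tokens per parameter) costs 10x the FLOPs.
tokens_over, flops_over = chinchilla_budget(13e9, tokens_per_param=200.0)
print(f"tokens = {tokens_over:.3g}, FLOPs = {flops_over:.3g}")
```

Because FLOPs are linear in the tokens-per-parameter ratio at fixed $N$, training at 200:1 instead of 20:1 multiplies training compute by ten while leaving inference cost unchanged.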
#scaling-laws #flops #compute #chinchilla
Intermediate concept

Why does data mixture (and deduplication) matter as much as raw token count in pretraining?

Quality and composition change the shape of the loss-vs-data curve, not just how far along it you are. Heavy duplication causes memorization and wastes effective tokens, so near-duplicate removal improves generalization per FLOP. Domain proportions (code, math, multilingual, high-quality prose) determine downstream skills: code in the mix is widely reported to improve reasoning, while over-weighting low-quality web text hurts. Practitioners up-weight high-signal sources, schedule a high-quality 'annealing' phase late in training, and balance domains so that no single source swamps rare-but-valuable data. The objective is fixed, but the data distribution it's computed over is the real lever on capability.
#data-mixture #deduplication #pretraining #data-quality
Intermediate concept

Explain how LoRA works and why it dramatically reduces trainable parameters without adding inference latency.

LoRA freezes the pretrained weight $W_0$ and learns a low-rank update $\Delta W = BA$, where $A \in \mathbb{R}^{r\times d}$, $B \in \mathbb{R}^{d\times r}$, with rank $r \ll d$ (taking $W_0 \in \mathbb{R}^{d\times d}$ for simplicity). The forward pass becomes $h = W_0 x + \frac{\alpha}{r} BA x$. Only $A,B$ train, cutting trainable params by orders of magnitude (and optimizer-state memory with them). It works because fine-tuning updates empirically tend to have low intrinsic rank. At inference you can merge $\frac{\alpha}{r}BA$ into $W_0$, so there is zero added latency, unlike adapters, which insert extra sequential layers.
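
The following is a small pure-Python sketch, not a production implementation, showing that the adapter-style forward pass and the merged weight give the same output, and how the trainable parameter count compares. The dimensions, the helper names, and the random initialization of $B$ are assumptions for illustration (real LoRA initializes $B$ to zeros so training starts from the base model).

```python
import random

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_forward(W0, A, B, x, alpha):
    """h = W0 x + (alpha / r) * B (A x), with A of shape r x d and B of shape d x r."""
    scale = alpha / len(A)
    base = matvec(W0, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]

def merge(W0, A, B, alpha):
    """Fold the scaled low-rank update into a single d x d matrix."""
    scale = alpha / len(A)
    BA = matmul(B, A)
    return [[w + scale * u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W0, BA)]

random.seed(0)
d, r, alpha = 6, 2, 4.0
W0 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]
# Assumption: B is random here only so the delta is non-zero in this demo.
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]
x = [random.gauss(0, 1) for _ in range(d)]

y_adapter = lora_forward(W0, A, B, x, alpha)
y_merged = matvec(merge(W0, A, B, alpha), x)
max_diff = max(abs(a - b) for a, b in zip(y_adapter, y_merged))

full_params = d * d       # trainable weights under full fine-tuning of this matrix
lora_params = 2 * d * r   # trainable weights under LoRA (A and B)
print(max_diff, full_params, lora_params)
```

At realistic sizes the gap is much larger: with $d = 4096$ and $r = 8$, the ratio $2dr / d^2 = 2r/d$ is about 0.4%.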
#lora #peft #low-rank #fine-tuning
Intermediate concept

Compare LoRA, adapters, and prefix/prompt tuning along where they inject parameters and their inference-time cost.

Adapters insert small bottleneck MLP layers between transformer sublayers — extra sequential compute at inference. LoRA adds a parallel low-rank delta to existing weight matrices — mergeable, so no inference overhead. Prefix/prompt tuning prepends trainable continuous 'virtual token' vectors to the keys/values (prefix) or input embeddings (prompt) and trains only those, leaving all weights frozen — cheapest to store but it consumes context length and tends to underperform LoRA on harder tasks. All are PEFT; LoRA dominates in practice for its quality/no-latency tradeoff, while prompt tuning shines for many cheaply-swappable tasks.
#peft #adapters #prefix-tuning #lora
Intermediate concept

Walk through the classic three-stage RLHF pipeline (reward model + PPO).

Stage 1: SFT on demonstration data to get a competent instruction-follower. Stage 2: collect human preference comparisons (A vs B) and train a reward model $r_\phi$, typically with the Bradley-Terry loss $-\log \sigma(r_\phi(x,y_w)-r_\phi(x,y_l))$ on chosen/rejected pairs. Stage 3: optimize the policy with PPO to maximize expected reward minus a per-token KL penalty $\beta\,\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$ against the frozen SFT model, which prevents reward hacking and keeps generations on-distribution. The KL term and clipping are what keep RLHF stable.
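
A minimal sketch of the Stage 2 Bradley-Terry loss on a single chosen/rejected pair, assuming the reward model has already produced scalar scores. The function name and the example scores are hypothetical.

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected), written as a numerically stable softplus."""
    margin = r_chosen - r_rejected
    # softplus(-margin) = max(-margin, 0) + log(1 + exp(-|margin|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

print(bradley_terry_loss(0.0, 0.0))   # no preference expressed: log(2)
print(bradley_terry_loss(2.0, 0.0))   # chosen scored higher: small loss
print(bradley_terry_loss(0.0, 2.0))   # rejected scored higher: large loss
```

The loss depends only on the score difference, which is why reward models trained this way are identified only up to an additive constant per prompt.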
#rlhf #ppo #reward-model #bradley-terry
Intermediate concept

What is RLAIF, and how does Constitutional AI use it?

RLAIF (RL from AI Feedback) replaces human preference labels with an LLM judge that ranks or critiques responses, scaling alignment data cheaply. Anthropic's Constitutional AI is a concrete instance: in the SFT phase the model critiques and revises its own outputs against a written 'constitution' of principles, producing improved training targets; in the RL phase a model generates preference labels by judging pairs of responses against those principles, and those AI-generated labels train the preference (reward) model used for RL. The constitution makes the value targets explicit, auditable, and steerable without per-example human labeling.
#rlaif #constitutional-ai #alignment #ai-feedback
Intermediate concept

What is catastrophic forgetting in fine-tuning, and what mitigations exist?

Catastrophic forgetting is the loss of previously-learned capabilities when fine-tuning shifts weights toward the new task distribution, overwriting the features that supported old behaviors. Mitigations: PEFT (LoRA/adapters) freezes the base, which limits drift and lets you recover the original model by removing the adapter (the adapted model can still forget, just less); replay/rehearsal mixes in pretraining or general instruction data; lower learning rates and fewer epochs; regularization toward the base (KL or L2, e.g. EWC weighting by Fisher information); and a KL-to-reference penalty in RLHF. The general principle: constrain how far weights drift from the pretrained checkpoint, or keep the original capabilities represented in the training mix.
#catastrophic-forgetting #peft #ewc #replay
Intermediate concept

Explain knowledge distillation for LLMs, and the difference between hard-label and soft-label (logit) distillation.

Distillation trains a smaller student to mimic a larger teacher. Hard-label distillation trains the student on the teacher's generated text (sequence-level / behavior cloning) with standard cross-entropy: simple, model-agnostic, and the basis of most 'synthetic data' distillation. Soft-label (logit) distillation matches the teacher's full output distribution, minimizing the KL divergence between the teacher's and student's next-token distributions (the softmax of their logits, often temperature-scaled), transferring 'dark knowledge' about relative token probabilities. It is far more sample-efficient but requires teacher logits and matched tokenizers. On-policy variants (the student generates, the teacher scores) reduce exposure bias from pure offline distillation.
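
To make the hard-label versus soft-label contrast concrete, here is a sketch of both losses at a single token position. The logits are invented, and the $T^2$ rescaling of the soft loss is a common convention that is assumed here rather than taken from the text above.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_label_loss(target_index, student_logits):
    """Cross-entropy against a single token sampled from (or chosen by) the teacher."""
    return -math.log(softmax(student_logits)[target_index])

def soft_label_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
    return temperature ** 2 * kl  # assumed T^2 rescaling convention

teacher = [3.0, 2.5, -1.0, 0.0]   # hypothetical logits over a 4-token vocabulary
student = [1.0, 0.0, 0.5, 0.2]

print(hard_label_loss(0, student))        # only sees the teacher's top token
print(soft_label_loss(teacher, student))  # also sees that token 1 is nearly as likely
```

The hard-label loss would be unchanged if the teacher's second logit were very negative, while the soft-label loss would differ; that extra signal is the 'dark knowledge' the answer refers to.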
#distillation #kl-divergence #soft-labels #compression
Intermediate concept

How does a Mixture-of-Experts layer work, and why does it decouple parameter count from per-token compute?

An MoE replaces a dense FFN with $N$ expert FFNs plus a router (gating network) that, per token, selects top-$k$ experts (e.g. $k=2$). Only those $k$ experts run, so FLOPs scale with $k$, not $N$ — you get a huge total parameter count (capacity/knowledge) at the active compute of a much smaller dense model. The router outputs softmax weights over experts; outputs are the weighted sum of the chosen experts. This sparse activation is why MoE models advertise total vs active parameters separately.
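
A toy sketch of top-$k$ routing for one token follows. The "experts" are stand-in functions that just scale the input, and the choice to renormalize the softmax over only the selected logits is one common design (other implementations softmax over all experts first); both are assumptions for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, router_logits, experts, k=2):
    """Route one token: run only the top-k experts and mix their outputs."""
    top = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)[:k]
    weights = softmax([router_logits[i] for i in top])  # renormalized over the chosen k
    out = [0.0] * len(x)
    for w, i in zip(weights, top):
        y = experts[i](x)  # the other experts are never evaluated
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top

calls = []  # records which experts actually ran

def make_expert(scale, idx):
    def expert(x):
        calls.append(idx)
        return [scale * xi for xi in x]
    return expert

experts = [make_expert(float(i + 1), i) for i in range(8)]
router_logits = [0.1, 2.0, -1.0, 0.3, 1.0, 0.0, -0.5, 0.2]  # hypothetical router output

out, top = moe_layer([1.0, -2.0], router_logits, experts, k=2)
print(top, sorted(calls), out)
```

Eight experts hold parameters, but `calls` shows only two did any work for this token, which is the total-versus-active distinction.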
#moe #sparse #routing #experts
Advanced concept

What does QLoRA add on top of LoRA, and what is the role of NF4 and double quantization?

QLoRA fine-tunes a 4-bit-quantized frozen base model while keeping LoRA adapters in higher precision (bf16), making fine-tuning of large models feasible on a single GPU. NF4 (4-bit NormalFloat) is a data type designed to be information-theoretically optimal for normally-distributed weights, giving better fidelity than int4 at the same bit width. Double quantization further quantizes the quantization constants, saving extra memory. Paged optimizers handle memory spikes. Crucially, gradients flow through the dequantized base into the LoRA adapters; the base stays frozen and quantized, so quality stays close to 16-bit LoRA.
#qlora #nf4 #quantization #peft
Advanced math

Derive the DPO objective and explain why it eliminates the explicit reward model.

RLHF's KL-constrained reward-maximization has the closed-form optimum $\pi^*(y|x) \propto \pi_{\text{ref}}(y|x)\exp(r(x,y)/\beta)$, so $r(x,y) = \beta\log\frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta\log Z(x)$. Substituting this reward into the Bradley-Terry preference likelihood cancels the intractable partition $Z(x)$, giving DPO's loss $-\log\sigma\big(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\big)$. The policy is implicitly its own reward model, so you train directly on preference pairs with a simple supervised loss — no separate RM, no sampling/PPO loop.
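
The loss above can be sketched for one preference pair as follows, assuming the summed sequence log-probabilities under the policy and the frozen reference have already been computed. The function and argument names are illustrative, not any library's API.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, plus the two implicit rewards."""
    reward_w = beta * (logp_w - ref_logp_w)   # implicit reward of the chosen response
    reward_l = beta * (logp_l - ref_logp_l)   # implicit reward of the rejected response
    margin = reward_w - reward_l
    # -log sigmoid(margin), written as a stable softplus(-margin)
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return loss, reward_w, reward_l

# Policy identical to the reference: margin is 0, loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))

# Policy has raised the chosen and lowered the rejected log-prob: loss falls.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

No reward network appears anywhere: the only learned quantity is the policy's log-probability, with the reference log-probabilities acting as fixed offsets.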
#dpo #rlhf #bradley-terry #derivation
Advanced concept

What are the practical advantages and failure modes of DPO versus PPO-based RLHF?

DPO is far simpler and more stable: offline, no reward model, no rollouts, no PPO clipping/value-function tuning, lower compute and variance. Failure modes: it's offline so it only learns from the fixed preference set (no on-policy exploration), can over-optimize and push probability mass off-distribution, is sensitive to the reference model and to noisy/length-biased preferences, and the implicit reward can drift. PPO with an online reward model can keep generating fresh on-policy samples and reward-shape more flexibly, at the cost of a fragile, expensive training loop. Variants (IPO, KTO, online DPO) target DPO's overfitting and offline limits.
#dpo #ppo #alignment #tradeoffs
Advanced concept

When would you choose full fine-tuning over PEFT despite PEFT's efficiency?

Choose full fine-tuning when you need maximal capability shift that low-rank updates can't capture: large domain adaptation (a very different language, modality, or knowledge regime), when you have abundant high-quality data and compute, when continued pretraining is the goal, or when you'll serve a single specialized model (no need to swap adapters). Full FT can exceed LoRA on hard tasks because it isn't constrained to a low-rank subspace. PEFT wins when you need many task-specific variants, limited VRAM, fast iteration, or preservation of base capabilities. Rule of thumb: small targeted adaptation → PEFT; deep capability or distribution change → full FT.
#full-fine-tuning #peft #domain-adaptation
Advanced concept

What is the load-balancing problem in MoE training, and how is it addressed?

Without intervention the router collapses, sending most tokens to a few favored experts — they get all the gradient, the rest stay undertrained, and capacity is wasted (plus uneven expert load causes dropped tokens under fixed capacity factors). Fixes: an auxiliary load-balancing loss penalizing the product of fraction-of-tokens-routed and mean-router-probability per expert (encouraging uniform usage); expert capacity limits with token dropping/overflow; noisy top-k gating to encourage exploration; and newer auxiliary-loss-free schemes (e.g. DeepSeek-V3) that add per-expert bias terms adjusted to equalize load. Balanced routing is essential for both quality and hardware utilization in distributed expert-parallel training.
#moe #load-balancing #routing #auxiliary-loss
Advanced concept

What is continued (continual) pretraining, and what are the key pitfalls when adapting a base model to a new domain?

Continued pretraining resumes the next-token objective on domain or language-specific corpora (medical, legal, code, a new language) to inject knowledge before any SFT. Pitfalls: catastrophic forgetting of general ability (mitigate by mixing in a slice of the original distribution, e.g. 5–30% replay); learning-rate scheduling — restart with a warmup and a peak LR well below the original to avoid destabilizing the checkpoint; data quality and dedup matter even more at smaller scale; tokenizer mismatch for new languages may need vocabulary extension. It precedes, not replaces, instruction tuning and alignment.
#continued-pretraining #domain-adaptation #replay #forgetting
Advanced concept

In LoRA, what do rank r and scaling α control, what is the typical α/r convention, and what does rank tell you about a task?

$r$ is the dimensionality (capacity) of the low-rank update — higher $r$ captures more complex adaptations but costs more params; $\alpha$ scales the update via the factor $\alpha/r$, effectively setting the update's learning-rate magnitude relative to the frozen base. A common convention is $\alpha = 2r$ (or $\alpha=r$), chosen so the effective scale stays stable as you sweep $r$. Empirically, simple stylistic/format adaptation needs low rank (4–16); tasks demanding new knowledge or large behavioral change benefit from higher rank or full FT. If raising $r$ keeps helping, the task's update isn't low-rank — a signal to consider full fine-tuning.
#lora #rank #alpha #hyperparameters
Advanced concept

In PPO-based RLHF, what failure occurs if you remove the per-token KL penalty against the reference policy, and mechanistically why does the KL term prevent it?

Without the KL penalty the policy over-optimizes the learned reward and undergoes reward hacking / mode collapse: it drifts far off-distribution into degenerate, high-reward-but-low-quality text (repetition, sycophancy, gibberish the reward model mis-scores) because the reward model is only accurate near $\pi_{ref}$'s support and is exploitable elsewhere. The KL term $-\beta\,\mathrm{KL}(\pi_\theta\|\pi_{ref})$ adds a per-token cost for diverging, acting as a trust region / regularizer that keeps the policy in the region where reward estimates are reliable and preserves the base model's fluency and diversity. It trades reward magnitude for staying on-distribution, mitigating Goodhart's law on a proxy reward.
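
One common way to implement the per-token penalty, sketched here under assumptions, is to give every generated token a reward of $-\beta(\log\pi_\theta - \log\pi_{ref})$ and add the scalar reward-model score on the final token. The function name, the log-probabilities, and the placement of the score on the last token are illustrative choices, not a statement about any specific codebase.

```python
def kl_shaped_rewards(policy_logps, ref_logps, rm_score, beta=0.1):
    """Per-token rewards: a KL-style penalty everywhere, RM score added at the end."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logps, ref_logps)]
    rewards[-1] += rm_score
    return rewards

# Hypothetical log-probs of three sampled tokens under the policy and the reference.
policy_logps = [-1.0, -0.5, -0.2]
ref_logps = [-1.2, -1.5, -2.2]   # the reference finds the later tokens unlikely

shaped = kl_shaped_rewards(policy_logps, ref_logps, rm_score=1.0, beta=0.1)
unshaped = kl_shaped_rewards(policy_logps, ref_logps, rm_score=1.0, beta=0.0)

print(shaped, sum(shaped))      # penalty eats into the RM score
print(unshaped, sum(unshaped))  # beta = 0: only the RM score remains
```

With $\beta = 0$ nothing discourages tokens the reference considers implausible, so the policy is free to chase whatever the reward model over-scores.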
#ppo #rlhf #kl-penalty #reward-hacking
Advanced system-design

Compare DPO and PPO for RLHF on stability, distribution coverage, and susceptibility to reward hacking. When would a staff engineer still choose PPO despite DPO's simplicity?

DPO is offline and contrastive: simpler, cheaper, no reward model or rollouts, lower-variance gradients, but it only sees the fixed preference dataset's $(y_w,y_l)$ pairs — it cannot explore, so it inherits and can amplify dataset biases (length, annotator quirks) and can push mass to unobserved OOD outputs. PPO is online: it samples fresh completions scored by an explicit reward model, giving broader coverage and an explicit KL trust region, which often yields higher peak quality and lets reward-model errors be caught/iterated — at the cost of instability, hyperparameter sensitivity, and compute. Choose PPO when you have a strong reusable reward model, want on-policy exploration beyond the static preference set, need to combine multiple reward signals (safety + helpfulness), or DPO has plateaued/over-optimized. Reward hacking hits both, but PPO's explicit RM is the more classically exploitable proxy.
#dpo #ppo #rlhf #alignment #tradeoffs
Advanced math

In a top-k MoE layer, derive why the model gets far more parameters than a dense model at roughly the same per-token FLOPs, and state precisely what does and doesn't scale. Use Mixtral-style numbers (8 experts, top-2) to ground the answer.

Total parameters scale with the number of experts $E$ (each expert is a full FFN), but per-token compute scales only with the number of activated experts $k$. With $E=8$, top-$2$, the FFN parameter count is $8\times$ a single expert, yet each token routes through only $2$ experts, so FFN FLOPs match a dense model with $2$ FFNs (plus a tiny router and gather/scatter cost). So you store ~$8\times$ FFN params but pay ~$2\times$ FFN compute. What does NOT shrink: memory/VRAM (all experts must be resident), so MoE trades cheap FLOPs for expensive memory and bandwidth. Attention and embeddings are unchanged. The win is conditional computation — more capacity per FLOP, not per byte.
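
The counting argument can be made concrete with a per-layer sketch. The dimensions below (`d_model=4096`, `d_ff=14336`, three weight matrices per gated FFN) are assumed, Mixtral-like values chosen for illustration, and the FLOP estimate uses the rough rule of 2 FLOPs per active weight in the forward pass.

```python
def ffn_params(d_model, d_ff, n_matrices=3):
    """Weights in one gated FFN expert (assumed: three d_model x d_ff matrices)."""
    return n_matrices * d_model * d_ff

def moe_ffn_stats(d_model, d_ff, n_experts, top_k):
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * n_experts               # one linear layer producing E logits
    total = n_experts * per_expert + router    # must all be resident in memory
    active = top_k * per_expert + router       # weights actually used per token
    forward_flops = 2 * active                 # ~2 FLOPs per active weight
    return total, active, forward_flops

total, active, flops = moe_ffn_stats(d_model=4096, d_ff=14336, n_experts=8, top_k=2)
print(f"stored FFN params per layer : {total:,}")
print(f"active FFN params per token : {active:,}")
print(f"stored / active             : {total / active:.2f}")
```

The stored-to-active ratio comes out just under $E/k = 4$, with the router accounting for the small difference; memory follows `total`, compute follows `active`.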
#moe #sparse-models #flops #mixtral #conditional-compute
Advanced concept

Define capacity factor in an MoE layer and analyze the tradeoffs of setting it to 1.0 vs 1.25 vs 2.0. Include what happens to dropped tokens, padding waste, and why training and inference often use different values.

Capacity factor $C$ sets each expert's buffer to $C \cdot (\text{tokens}/E)$ slots; tokens beyond that are dropped (they skip the FFN and pass through via the residual connection). $C=1.0$ means zero slack: under any imbalance tokens drop, degrading quality, but padding compute is minimal. $C=1.25$ is the common training compromise: it tolerates moderate imbalance at roughly 25% extra buffer. $C=2.0$ nearly eliminates drops but doubles dispatch buffers and the associated FLOPs/memory, most of it wasted on padding. The drop rate depends on how well load-balancing works, so good balancing lets you lower $C$. Inference (especially batch-1 or expert-choice variants) often uses higher or effectively-unbounded capacity to avoid quality-hurting drops, since latency tolerance and batch shapes differ from training; training picks $C$ to bound the fixed dispatch tensor for static shapes/efficiency.
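
The tradeoff can be seen in a small simulation. This sketch assumes top-1 routing, first-come-first-served filling of expert buffers, and an invented, imbalanced assignment of 16 tokens to 4 experts; the helper names are hypothetical.

```python
import math

def expert_capacity(n_tokens, n_experts, capacity_factor):
    return math.ceil(capacity_factor * n_tokens / n_experts)

def dispatch_stats(assignments, n_experts, capacity_factor):
    """Return (capacity per expert, dropped tokens, padded slots)."""
    cap = expert_capacity(len(assignments), n_experts, capacity_factor)
    load = [0] * n_experts
    dropped = 0
    for e in assignments:
        if load[e] < cap:
            load[e] += 1
        else:
            dropped += 1   # over capacity: token skips the FFN via the residual
    padding = sum(cap - used for used in load)   # allocated but unused slots
    return cap, dropped, padding

# Imbalanced routing: expert 0 is popular, expert 3 is not.
assignments = [0] * 7 + [1] * 4 + [2] * 3 + [3] * 2

for c in (1.0, 1.25, 2.0):
    print(c, dispatch_stats(assignments, n_experts=4, capacity_factor=c))
```

Raising $C$ trades dropped tokens for padded slots: in this toy case drops go 3, 2, 0 while padding goes 3, 6, 16.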
#moe #capacity-factor #token-dropping #throughput #padding
Expert concept

Synthetic data for training: what is model collapse, and how do teams use synthetic data without triggering it?

Model collapse is degradation that occurs when models are trained recursively on their own (or other models') unfiltered outputs: tail/rare modes vanish, errors compound, and the distribution narrows toward the mean over generations. Safe use relies on not closing the loop blindly: keep real human data anchoring the mix, use a stronger teacher (distillation is not self-training), aggressively filter/verify synthetic data (reward-model scoring, execution feedback for code/math, rejection sampling, deduplication), and use synthetic data to cover targeted gaps (instruction diversity, hard reasoning traces) rather than to replace the base corpus. Verification is the key — synthetic data with a ground-truth checker is the strongest case.
#synthetic-data #model-collapse #rejection-sampling #verification
Expert concept

Why does the KL-to-reference penalty in RLHF/DPO actually prevent reward hacking, and what does setting β too low or too high do?

The reward model is an imperfect proxy learned on a limited preference distribution; an unconstrained policy will find out-of-distribution outputs that score spuriously high (reward hacking, Goodhart's law). The KL penalty $\beta\,\mathrm{KL}(\pi_\theta\|\pi_{\text{ref}})$ keeps the policy near the trusted SFT distribution where the reward model is reliable, trading a little reward for staying on-distribution. Too-high $\beta$ pins the policy to the reference and it barely improves (under-optimization); too-low $\beta$ lets it drift into reward-hacked, degenerate, or repetitive outputs that game the proxy while real quality drops. This is the classic over-optimization curve: as the policy moves away from the reference, proxy reward keeps rising while true reward first rises and then falls (an inverted U).
#rlhf #kl-penalty #reward-hacking #goodhart
Expert system-design

Design an alignment pipeline for a customer-facing assistant that must be helpful, refuse unsafe requests, and stay cheap to iterate on. Justify each stage.

Start with SFT on high-quality demonstrations (incl. ideal refusals) to set baseline behavior and format. Add preference optimization for nuanced 'helpful vs harmful' tradeoffs — prefer DPO (or KTO) for stability/cost over PPO unless you need online exploration; use a mix of human and RLAIF/constitutional preferences against an explicit policy document for scalable, auditable safety targets. Keep a frozen reference and a KL constraint to avoid reward hacking and capability regression. Use LoRA so you can cheaply re-tune and swap variants. Guard at inference with a separate classifier/system prompt and red-team continuously; evaluate helpfulness and safety on held-out adversarial sets, watching for over-refusal. Mix in general data to prevent forgetting.
#alignment #system-design #dpo #rlaif #safety
Expert math

Derive the DPO loss from the RLHF objective and the Bradley-Terry preference model. Show explicitly why DPO eliminates the need to train a separate reward model.

The KL-constrained RLHF objective $\max_\pi \mathbb{E}[r(x,y)] - \beta\,\mathrm{KL}(\pi\|\pi_{ref})$ has the closed-form optimum $\pi^*(y|x)=\frac{1}{Z(x)}\pi_{ref}(y|x)\exp(r(x,y)/\beta)$. Inverting gives the implicit reward $r(x,y)=\beta\log\frac{\pi^*(y|x)}{\pi_{ref}(y|x)}+\beta\log Z(x)$. Substituting into the Bradley-Terry model $P(y_w\succ y_l)=\sigma(r(x,y_w)-r(x,y_l))$, the intractable $\log Z(x)$ cancels (same $x$), yielding $\mathcal{L}_{DPO}=-\mathbb{E}\big[\log\sigma(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)}-\beta\log\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)})\big]$. The policy IS the reward model, so no separate reward net or RL rollout is needed.
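
The cancellation step, written out term by term using the same symbols as above (a worked restatement, not an additional result):

```latex
\begin{align*}
r(x,y_w) - r(x,y_l)
 &= \Big[\beta\log\frac{\pi^*(y_w|x)}{\pi_{ref}(y_w|x)} + \beta\log Z(x)\Big]
  - \Big[\beta\log\frac{\pi^*(y_l|x)}{\pi_{ref}(y_l|x)} + \beta\log Z(x)\Big] \\
 &= \beta\log\frac{\pi^*(y_w|x)}{\pi_{ref}(y_w|x)}
  - \beta\log\frac{\pi^*(y_l|x)}{\pi_{ref}(y_l|x)}
\end{align*}
```

Both responses share the same prompt $x$, so the $\beta\log Z(x)$ terms are identical and subtract out. Replacing the unknown optimum $\pi^*$ with the trainable $\pi_\theta$ and taking the negative log-likelihood of the observed preferences gives $\mathcal{L}_{DPO}$.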
#dpo #rlhf #bradley-terry #derivation
Expert concept

DPO has no explicit per-token KL penalty term like PPO, yet practitioners observe it can still over-optimize and degrade. Explain how reward over-optimization manifests in DPO and what role $\beta$ and the reference model play.

DPO's $\beta$ is the implicit KL strength: smaller $\beta$ lets $\pi_\theta$ diverge further from $\pi_{ref}$. Over-optimization shows as the loss continuing to push $\log\frac{\pi_\theta(y_w)}{\pi_{ref}(y_w)}$ up and $\log\frac{\pi_\theta(y_l)}{\pi_{ref}(y_l)}$ down unboundedly — but a known pathology is that DPO often achieves the margin by *decreasing the chosen logprob slightly while crashing the rejected logprob much more*, i.e. it can reduce $\pi_\theta(y_w)$ in absolute terms. This drains probability mass to unseen out-of-distribution responses, degrading fluency. Because $\pi_{ref}$ is fixed, late in training the policy is far from it so the implicit reward is unreliable, and small $\beta$ plus noisy/length-biased preferences amplify this. Mitigations: tune $\beta$, add an SFT/NLL anchor on $y_w$, or length-normalize.
#dpo #over-optimization #beta #reward-hacking
Expert concept

Why is an auxiliary load-balancing loss needed in token-choice MoE, what failure mode does it prevent, and how does the Switch/GShard auxiliary loss actually work? Then explain DeepSeek's argument for replacing it with an auxiliary-loss-free bias scheme.

Without balancing, routing self-reinforces: a few experts get most tokens, train faster, and attract still more tokens (expert collapse), wasting capacity and causing dropped tokens at the overloaded experts. The Switch/GShard aux loss is $\alpha \cdot E \sum_i f_i \cdot P_i$, where $f_i$ is the fraction of tokens dispatched to expert $i$ and $P_i$ is the mean router probability for $i$; minimizing it pushes both toward uniform $1/E$. It's added to the LM loss, scaled by a small $\alpha$. DeepSeek-V3 argues this aux loss fights the LM objective (interference gradients hurt quality), so it instead adds a per-expert bias to the routing scores used for top-$k$ selection, nudged up for underloaded experts and down for overloaded ones after each training step, which balances the assignment without injecting a balancing gradient into the loss.
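
Both mechanisms can be sketched in a few lines. The first function evaluates $\alpha \cdot E \sum_i f_i P_i$ under top-1 routing; the second is a simplified, assumed form of a bias-based balancer (a fixed step `gamma` toward the mean load). Names, step sizes, and the example routing tables are hypothetical.

```python
def load_balance_loss(router_probs, assignments, alpha=0.01):
    """alpha * E * sum_i f_i * P_i for top-1 routing.

    router_probs: one softmax row over E experts per token.
    assignments:  the expert index each token was dispatched to.
    """
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    f = [0.0] * n_experts
    for e in assignments:
        f[e] += 1.0 / n_tokens                       # fraction of tokens per expert
    P = [sum(row[i] for row in router_probs) / n_tokens for i in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, P))

def update_routing_bias(bias, load, gamma=0.001):
    """Assumed aux-loss-free rule: raise bias of underloaded experts, lower overloaded."""
    mean_load = sum(load) / len(load)
    return [b + gamma if l < mean_load else b - gamma if l > mean_load else b
            for b, l in zip(bias, load)]

uniform = load_balance_loss([[0.25] * 4] * 4, [0, 1, 2, 3])
collapsed = load_balance_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [0, 0, 0, 0])
print(uniform, collapsed)   # balanced routing gives alpha; full collapse gives alpha * E

print(update_routing_bias([0.0, 0.0], load=[3, 1], gamma=0.1))
```

The aux loss reaches its minimum of $\alpha$ at uniform routing and grows to $\alpha E$ under full collapse, while the bias update changes only which experts get selected and contributes no gradient.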
#moe #load-balancing #expert-collapse #deepseek #auxiliary-loss