Transformers & Attention

How self-attention and the Transformer architecture work: the engine behind nearly every modern LLM.

Foundational · concept

Write the scaled dot-product attention formula and explain what Q, K, and V represent.

Scaled dot-product attention is $\text{Attention}(Q,K,V)=\text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$. Queries $Q$ encode what each token is looking for, keys $K$ encode what each token offers, and values $V$ carry the content actually aggregated. The dot product $QK^\top$ scores query-key compatibility, softmax turns each row of scores into a probability distribution over positions, and that distribution forms a weighted average of the value vectors. Each output row is thus a content-based mixture of all tokens' values, weighted by relevance.
#attention #qkv #softmax
Foundational · concept

What does multi-head attention add over a single attention head, and how are heads combined?

Multi-head attention runs $h$ attention functions in parallel, each on a learned linear projection of the input into a $d_k=d_{model}/h$ subspace, then concatenates outputs and applies a final projection $W^O$. Each head can attend to a different relational pattern (syntax, coreference, positional offsets) in its own subspace, whereas a single head with one softmax averages everything into one distribution and cannot represent multiple relations at once. Cost is comparable to one full-dimension head because each head is narrower. It increases representational diversity, not just raw capacity.
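The project, split, attend, concatenate, project pipeline described above can be sketched in NumPy. This is a minimal illustration for a single unbatched sequence; the function name `multi_head_attention`, the weight names, and the toy sizes are assumptions made for the example, not part of the original card.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (n, d_model); every W_*: (d_model, d_model); h must divide d_model."""
    n, d_model = X.shape
    d_k = d_model // h

    def split(M):
        # (n, d_model) -> (h, n, d_k): each head gets its own d_k-wide subspace
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)        # (h, n, n)
    heads = softmax(scores) @ V                              # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)    # concatenate heads
    return concat @ W_o                                      # final projection W^O

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4   # hypothetical toy sizes
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
print(out.shape)
```

Each head produces its own $n\times n$ softmax, which is what lets different heads hold different attention patterns over the same tokens.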
#multi-head #projection #subspace
Foundational · concept

Contrast encoder-only, decoder-only, and encoder-decoder transformer architectures with a representative model for each.

Encoder-only (BERT) uses bidirectional self-attention so every token sees full left and right context; ideal for understanding tasks like classification and span extraction. Decoder-only (GPT) uses causal/masked self-attention so each token sees only prior tokens; ideal for autoregressive generation. Encoder-decoder (T5, original Transformer) encodes the source bidirectionally then a decoder generates autoregressively while cross-attending to encoder outputs; ideal for seq2seq like translation and summarization. The defining difference is the attention mask and whether generation is conditioned on a separately encoded input.
#bert #gpt #t5 #encoder-decoder
Foundational · concept

What does the position-wise feed-forward block do, and why the expand-then-contract shape?

After attention, each token passes independently through an FFN: two linear layers with a nonlinearity, $\text{FFN}(x)=W_2\,\sigma(W_1 x+b_1)+b_2$, typically expanding to $4d_{model}$ then back. Attention mixes information across positions but is largely linear in the values; the FFN adds per-token nonlinear transformation and is where much factual/feature computation lives. The wide hidden layer gives capacity to compute many features (often viewed as key-value memories), while contracting back keeps the residual-stream dimension fixed. Modern variants use gated units (SwiGLU/GeGLU) instead of plain ReLU/GELU for better quality.
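A small sketch of the expand-then-contract block may help. It assumes a row-vector convention (so the formula's $W_1 x$ becomes `X @ W1`), a tanh-approximated GELU as the nonlinearity, and toy dimensions; all names here are illustrative.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(X, W1, b1, W2, b2):
    """X: (n, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model)."""
    hidden = gelu(X @ W1 + b1)   # expand to d_ff and apply the nonlinearity
    return hidden @ W2 + b2      # contract back to d_model

rng = np.random.default_rng(0)
n, d_model = 6, 8
d_ff = 4 * d_model               # the typical 4x expansion
W1 = rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model)
W2 = rng.normal(size=(d_ff, d_model)) / np.sqrt(d_ff)
b1, b2 = np.zeros(d_ff), np.zeros(d_model)
X = rng.normal(size=(n, d_model))
Y = ffn(X, W1, b1, W2, b2)
print(Y.shape)
```

Because the same weights are applied to every row, processing one token alone gives the same result as processing it inside the full sequence, which is what "position-wise" means here.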
#ffn #mlp #swiglu
Intermediate · math

Why is the dot product scaled by $\sqrt{d_k}$ in scaled dot-product attention?

If query and key components are independent with mean 0 and variance 1, their dot product over $d_k$ dimensions has variance $d_k$, so its magnitude grows like $\sqrt{d_k}$. Large logits push softmax into saturated regions where one weight nears 1 and the rest near 0, making gradients vanishingly small and training unstable. Dividing by $\sqrt{d_k}$ rescales the dot product back to unit variance, keeping softmax in a well-conditioned regime with usable gradients. It is fixed variance normalization, not a learned parameter.
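The variance argument can be checked empirically. The sketch below assumes i.i.d. standard-normal query and key components (the idealized setting of the derivation, not what a trained model necessarily produces); the helper name `dot_product_variance` is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_variance(d_k, samples=10000):
    """Return (variance of q.k, variance of q.k / sqrt(d_k)) over random pairs."""
    q = rng.standard_normal((samples, d_k))
    k = rng.standard_normal((samples, d_k))
    dots = (q * k).sum(axis=1)
    return dots.var(), (dots / np.sqrt(d_k)).var()

for d_k in (16, 64, 256):
    raw, scaled = dot_product_variance(d_k)
    print(f"d_k={d_k:4d}  var(raw)={raw:8.1f}  var(scaled)={scaled:.2f}")
```

The raw variance should track $d_k$ while the scaled variance stays near 1, independent of head width.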
#scaling #softmax #variance
Intermediate · concept

Why is self-attention's time and memory complexity $O(n^2)$ in sequence length, and why does it parallelize so well?

Attention computes a score for every query-key pair, an $n\times n$ matrix, costing $O(n^2 d)$ time and $O(n^2)$ memory for the score matrix. This quadratic cost is the main bottleneck for long sequences. It parallelizes because, unlike an RNN's sequential recurrence, every position's representation is computed from all others in a single matrix multiply with no time-step dependency; the whole sequence is processed at once and GPUs exploit this as dense matmuls. The tradeoff is classic: more compute/memory for full parallelism and direct long-range paths (max path length $O(1)$ vs an RNN's $O(n)$).
#complexity #parallelism #quadratic
Intermediate · concept

How does causal (masked) self-attention work in a decoder, and how is the mask implemented?

Causal attention prevents a position from attending to future tokens, preserving the autoregressive property that prediction of token $t$ uses only tokens $<t$. It is implemented by adding a mask to the pre-softmax scores: entries above the diagonal (future positions) are set to $-\infty$ (in practice a large negative number), so after softmax those weights become 0. This lets the whole sequence train in parallel with teacher forcing while still respecting left-to-right causality, since masking reproduces exactly what a strictly sequential model would compute. Cross-attention in encoder-decoders is not causally masked.
#causal-mask #decoder #autoregressive
Intermediate · concept

Why do transformers need positional encodings at all, and what is sinusoidal positional encoding?

Self-attention is permutation-equivariant: reordering inputs reorders outputs identically because softmax over a set carries no inherent notion of position. Without positional information the model cannot distinguish 'dog bites man' from 'man bites dog'. Sinusoidal encoding adds to each token's embedding a vector whose dimensions come in sine/cosine pairs at geometrically spaced frequencies: $PE_{(pos,2i)}=\sin(pos/10000^{2i/d})$ and $PE_{(pos,2i+1)}=\cos(pos/10000^{2i/d})$. Different frequencies give a unique, smooth, bounded code per position, and the fixed form lets the model express relative offsets via linear combinations and extrapolate somewhat beyond trained lengths.
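As a concrete illustration, the table of sinusoidal codes can be built in a few lines. This sketch assumes an even `d_model` and the usual layout of sine on even dimensions and cosine on odd dimensions; the function name is hypothetical.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model, base=10000.0):
    """Return an (n_positions, d_model) table; d_model is assumed even."""
    pos = np.arange(n_positions)[:, None]        # (n, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d/2) pair index
    angles = pos / base ** (2 * i / d_model)     # pos / 10000^(2i/d)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

pe = sinusoidal_positions(50, 16)
print(pe.shape, pe.min() >= -1.0, pe.max() <= 1.0)
```

The resulting rows are added to the token embeddings before the first layer; every entry is bounded in $[-1, 1]$ regardless of position.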
#positional-encoding #sinusoidal #permutation
Intermediate · concept

What is the KV cache, why does it speed up autoregressive decoding, and what does it cost?

During generation, each new token's attention needs the keys and values of all previous tokens. Without caching, every step recomputes K/V for the entire prefix, so step $t$ redoes $O(t)$ projections and generating $n$ tokens costs $O(n^2)$ projections in total. The KV cache stores past K and V tensors so each step only computes K/V for the one new token and attends against the cache, making per-step cost linear in context length. The cost is memory: with standard multi-head attention the cache holds $2\times L\times n\times d_{model}$ elements per sequence (2 for K and V, $L$ layers, $n$ tokens, $d_{model}$ dims), often dominating GPU memory for long contexts and large batches. Multi-query (MQA) and grouped-query (GQA) attention shrink it by sharing K/V across heads.
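A back-of-the-envelope calculation makes the memory cost concrete. The model shape below (32 layers, 32 heads of width 128, FP16 cache, 8192 tokens) is a hypothetical example chosen for round numbers, and `kv_cache_bytes` is an illustrative helper, not a library function.

```python
def kv_cache_bytes(n_layers, n_tokens, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    # The leading 2 accounts for storing both K and V.
    return 2 * n_layers * n_tokens * n_kv_heads * head_dim * bytes_per_elem * batch

GIB = 2**30
# Hypothetical model: 32 layers, d_model = 32 heads * 128 dims = 4096, FP16 cache.
mha = kv_cache_bytes(32, 8192, n_kv_heads=32, head_dim=128)  # one K/V per head
gqa = kv_cache_bytes(32, 8192, n_kv_heads=8, head_dim=128)   # 4 query heads share each K/V
mqa = kv_cache_bytes(32, 8192, n_kv_heads=1, head_dim=128)   # all heads share one K/V
print(mha / GIB, gqa / GIB, mqa / GIB)  # 4.0 1.0 0.125
```

With full multi-head attention, `n_kv_heads * head_dim` equals $d_{model}$, recovering the $2\times L\times n\times d_{model}$ element count; GQA and MQA reduce only the `n_kv_heads` factor.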
#kv-cache #inference #gqa #mqa
Intermediate · concept

Contrast BERT, GPT, and T5 on their pretraining objectives, and explain why each objective fits its architecture.

BERT: encoder-only, bidirectional attention, pretrained with masked language modeling (predict masked tokens), originally plus next-sentence prediction (later shown unnecessary by RoBERTa). GPT: decoder-only, causal attention, pretrained with autoregressive next-token prediction; this enables open-ended generation and in-context/few-shot learning at scale. T5: encoder-decoder with a bidirectional encoder and causal cross-attending decoder, pretrained with span corruption (mask spans, generate them) under a text-to-text framing. The split is principled: a masked/denoising objective needs bidirectional context (encoder), a causal objective needs a decoder, and span infilling needs the seq2seq encoder-decoder.
#bert #gpt #t5 #pretraining
Intermediate · coding

Implement scaled dot-product attention with an optional causal mask in NumPy-style pseudocode.

def attention(Q, K, V, causal=False): compute d_k = Q.shape[-1] and n = Q.shape[-2]; scores = Q @ swapaxes(K, -1, -2) / sqrt(d_k). If causal: build mask = triu(ones((n, n)), k=1).astype(bool) and set scores[..., mask] = -inf (broadcasting over batch/heads). Stable softmax: scores = scores - scores.max(-1, keepdims=True); w = exp(scores); w = w / w.sum(-1, keepdims=True). Return w @ V. Shapes: Q,K,V are (batch, heads, n, d_k); scores and weights are (batch, heads, n, n); output (batch, heads, n, d_k). The max-subtraction prevents overflow; the upper-triangular $-\infty$ enforces causality before softmax.
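The pseudocode above can be written as runnable NumPy along the following lines. This is one possible rendering, assuming self-attention with equal query and key lengths; `np.swapaxes` transposes the last two axes and `np.where` applies the mask without mutating the input.

```python
import numpy as np

def attention(Q, K, V, causal=False):
    """Q, K, V: (batch, heads, n, d_k). Returns (batch, heads, n, d_k)."""
    d_k = Q.shape[-1]
    n = Q.shape[-2]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)     # (batch, heads, n, n)
    if causal:
        # True strictly above the diagonal, i.e. at future positions
        mask = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)            # broadcasts over batch/heads
    # Numerically stable softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(2, 3, 4, 8)) for _ in range(3))
out = attention(Q, K, V, causal=True)
print(out.shape)
```

A quick sanity check on causality: with the mask on, the first position can only attend to itself, so its output equals its own value vector, and changing the last token's key or value leaves all earlier outputs unchanged.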
#implementation #softmax #causal-mask
Intermediate · cert

You are migrating a decoder model from absolute learned positional embeddings to RoPE to extend context. Which behavior should you most expect, and why? (a) Better length extrapolation since RoPE encodes relative position (b) Larger parameter count (c) Loss of parallel training (d) Need for an encoder

(a). RoPE injects position by rotating Q and K so attention scores depend on relative offset $n-m$ rather than absolute index, which generalizes better beyond the training length, especially with NTK-aware or YaRN frequency scaling. It adds essentially no parameters (rotations are deterministic functions of position), so (b) is wrong. Training stays fully parallel because masking and rotation are applied per position simultaneously, so (c) is wrong. RoPE is a property of decoder/self-attention and needs no encoder, so (d) is wrong. The migration's headline benefit is improved relative-position handling and length extrapolation.
#rope #mcq #length-extrapolation
Advanced · concept

Compare sinusoidal, learned, RoPE, and ALiBi positional encodings on how they inject position and their length-extrapolation behavior.

Sinusoidal adds fixed multi-frequency vectors to inputs: parameter-free, mild extrapolation. Learned absolute embeddings are a trainable table indexed by position: flexible but cannot index positions beyond the trained max, so they fail to extrapolate. RoPE rotates Q and K by a position-dependent angle so their dot product depends only on relative offset; it injects position multiplicatively inside attention, extrapolates moderately, and underlies most modern LLMs (often with NTK-aware/YaRN scaling). ALiBi adds a fixed linear distance penalty to attention scores (no embeddings), is cheap, and extrapolates strongly to longer contexts. RoPE and ALiBi are inherently relative; absolute schemes are not.
#rope #alibi #positional-encoding #extrapolation
Advanced · concept

What is the difference between pre-layer-norm and post-layer-norm transformer placement, and why has pre-LN become standard?

Post-LN (original Transformer) applies LayerNorm after the residual add: $\text{LN}(x+\text{Sublayer}(x))$. Pre-LN applies it inside the branch: $x+\text{Sublayer}(\text{LN}(x))$. In post-LN the residual stream passes through normalization, so gradients to early layers can explode or vanish, typically requiring careful learning-rate warmup. Pre-LN keeps a clean identity residual path from input to output, giving well-behaved gradients at initialization and enabling stable training of very deep models with little or no warmup. The cost is that the residual stream grows unnormalized, sometimes hurting final quality slightly; hybrids like sandwich-LN and DeepNorm address this.
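The two placements differ by a single line of code. The sketch below uses a LayerNorm without learned gain or bias and random linear maps as toy stand-ins for attention/FFN sublayers; both simplifications are assumptions made to keep the example short.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LayerNorm without learned gain/bias, for illustration only
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    return layer_norm(x + sublayer(x))    # LN(x + Sublayer(x))

def pre_ln_block(x, sublayer):
    return x + sublayer(layer_norm(x))    # x + Sublayer(LN(x))

rng = np.random.default_rng(0)
d = 32
x_post = x_pre = rng.normal(size=(4, d))
for _ in range(12):
    W = rng.normal(size=(d, d)) / np.sqrt(d)   # toy sublayer weights
    sublayer = lambda h, W=W: h @ W
    x_post = post_ln_block(x_post, sublayer)
    x_pre = pre_ln_block(x_pre, sublayer)

print("post-LN per-token std:", x_post.std(axis=-1).round(2))
print("pre-LN  per-token std:", x_pre.std(axis=-1).round(2))
```

In the pre-LN form the input `x` reaches the output untouched whenever the sublayer contributes nothing, which is the identity path the text refers to; the printed statistics let you compare how the unnormalized pre-LN stream behaves against the renormalized post-LN stream in this toy setup.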
#layernorm #pre-ln #residual #training-stability
Advanced · concept

How does FlashAttention achieve speedups and memory savings without approximating attention?

FlashAttention computes exact attention but is IO-aware: standard attention materializes the full $n\times n$ score matrix in slow HBM, making it memory-bandwidth bound. FlashAttention tiles Q, K, V into blocks that fit in fast on-chip SRAM and fuses the matmul, softmax, and value aggregation in a single kernel, never writing the full score matrix to HBM. It uses online (streaming) softmax to combine block partial sums while maintaining a running max and normalizer, so the result is mathematically identical to standard softmax attention (up to floating-point rounding). Memory drops from $O(n^2)$ to $O(n)$ and wall-clock improves by avoiding HBM traffic; the backward pass recomputes scores instead of storing them.
#flash-attention #io-aware #online-softmax #gpu
Advanced · system-design

Design a transformer-based long-document QA system that must handle 200K-token contexts under tight latency and memory budgets. What attention and serving choices do you make?

Use a decoder-only LLM with RoPE (NTK-aware/YaRN-scaled) or ALiBi for length generalization, and a FlashAttention kernel (v2/v3) for exact $O(n)$-memory attention. Cut KV-cache memory with grouped-query attention and quantize the cache (e.g., FP8/INT8). For 200K tokens, avoid recomputing the prompt each request via prefix/prompt caching, and consider sliding-window or sparse attention only if recall tests allow. Serve with paged KV-cache (vLLM-style) and continuous batching for throughput; chunked prefill to bound latency. If full attention is too costly, fall back to retrieval (RAG) over chunks, attending only to retrieved spans. Validate with needle-in-a-haystack and long-context recall benchmarks, not just perplexity.
#long-context #kv-cache #vllm #rag
Advanced · concept

Standard softmax attention is permutation-equivariant and content-based, so two identical tokens at different positions get identical Q/K/V before positional info. Walk through exactly where position must enter so the model can distinguish them, comparing additive (sinusoidal) vs rotary injection.

Before any positional signal, identical tokens produce identical embeddings, hence identical Q/K/V and identical attention behavior, so position must be injected. Additive schemes (sinusoidal, learned) add a position vector to the token embedding at the input, which propagates into Q and K via the linear projections; position then affects scores through cross terms in $QK^\top$ but is entangled with content and applied once before the stack. RoPE instead injects position inside attention by rotating Q and K at every layer just before the dot product, so position influences only the relative phase between query and key and never the value content; this is cleaner and relative by construction. Either way, without injection the two tokens stay indistinguishable.
#positional-encoding #permutation-equivariance #rope #sinusoidal
Advanced · concept

Why does RoPE enable a degree of length extrapolation that learned absolute positional embeddings cannot, and what is the mechanism that still limits it?

RoPE encodes position by rotating query/key vectors by an angle proportional to position, so the dot product $q_m^\top k_n$ depends only on the relative offset $m-n$, not absolute indices. This relative formulation means no new parameters or unseen absolute embeddings are needed at longer lengths, unlike learned absolute tables that simply have no row for position $>L_{train}$. The limit: low-frequency (long-wavelength) dimensions complete fewer than one full rotation within training length, so at inference the model sees rotation phases it never observed during training, degrading attention. This out-of-distribution phase, not parameter count, caps naive RoPE extrapolation.
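To see which dimensions are affected, one can compute each rotary pair's wavelength and compare it to the training length. The sketch assumes the common frequency schedule $\theta_i = \text{base}^{-2i/d}$ with base 10000; the head width of 128 and training length of 4096 are hypothetical values, and both helper names are made up for the example.

```python
import math

def rope_wavelengths(head_dim, base=10000.0):
    # Pair i rotates by theta_i = base**(-2i/head_dim) radians per position,
    # so one full turn takes 2*pi / theta_i positions.
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]

def unseen_phase_pairs(head_dim, train_len, base=10000.0):
    """Indices of pairs that never complete a full rotation within train_len."""
    return [i for i, lam in enumerate(rope_wavelengths(head_dim, base))
            if lam > train_len]

pairs = unseen_phase_pairs(head_dim=128, train_len=4096)
print(f"{len(pairs)} of 64 pairs have wavelength > 4096; first such index: {pairs[0]}")
```

Only the slowest-rotating pairs are flagged; the fast pairs have already cycled through every phase many times during training, which is why frequency-aware scaling methods treat the two groups differently.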
#rope #positional-encoding #extrapolation #attention
Expert · math

Derive why RoPE makes the attention dot product depend only on relative position.

RoPE multiplies each 2D sub-pair of a query/key by a rotation matrix $R(\theta_m)$ whose angle scales with absolute position $m$ and the dimension's frequency. For a query at position $m$ and key at position $n$, the inner product becomes $(R(\theta_m)q)^\top(R(\theta_n)k)=q^\top R(\theta_m)^\top R(\theta_n)k=q^\top R(\theta_{n-m})k$, using $R(a)^\top R(b)=R(b-a)$ since rotations are orthogonal and compose additively. The absolute positions cancel, leaving dependence only on the offset $n-m$. Thus RoPE encodes relative position exactly through the algebra of 2D rotations, with no added bias term.
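The cancellation can be verified numerically. The sketch below rotates consecutive pairs $(x_0,x_1),(x_2,x_3),\dots$ of a 1-D vector; that interleaved pairing and the base of 10000 are assumptions (some implementations pair dimension $i$ with $i+d/2$ instead), and `rope` is an illustrative helper rather than any library's API.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each consecutive pair of a 1-D vector x by the angle pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)    # one frequency per pair
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin       # 2-D rotation, first coordinate
    out[1::2] = x_even * sin + x_odd * cos       # 2-D rotation, second coordinate
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
s1 = rope(q, 3) @ rope(k, 7)        # offset n - m = 4
s2 = rope(q, 103) @ rope(k, 107)    # same offset, both positions shifted by 100
s3 = rope(q, 3) @ rope(k, 8)        # different offset
print(np.isclose(s1, s2), np.isclose(s1, s3))
```

Shifting both positions by the same amount leaves the score unchanged, while changing the offset changes it, matching $q^\top R(\theta_{n-m})k$.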
#rope #rotation #relative-position #derivation
Expert · math

Derive the online (streaming) softmax recurrence that lets attention be computed block-by-block, as in FlashAttention.

For numerically stable softmax we track a running max $m$, running denominator $\ell=\sum e^{x_j-m}$, and the running weighted value sum $o$. Given a new block with local max $\tilde m$ and local sums $\tilde\ell$ and $\tilde o$ (both computed relative to $\tilde m$), update $m^{new}=\max(m,\tilde m)$. Rescale the old accumulators by $e^{m-m^{new}}$ and the new block's by $e^{\tilde m-m^{new}}$ so all terms share the same shift: $\ell^{new}=e^{m-m^{new}}\ell+e^{\tilde m-m^{new}}\tilde\ell$, and likewise $o^{new}=e^{m-m^{new}}o+e^{\tilde m-m^{new}}\tilde o$. After the last block, divide $o$ by $\ell$. Because every term is renormalized to the common max, the result equals the full-sequence softmax exactly, with only two extra scalars ($m$ and $\ell$) of state per row beyond the output accumulator $o$.
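The recurrence translates directly into a loop over key blocks. This sketch handles a single query row in plain NumPy to show the bookkeeping only; it is not the fused GPU kernel, and the function name and block size are illustrative.

```python
import numpy as np

def online_softmax_attention(scores, V, block=4):
    """scores: (n,) logits for one query; V: (n, d). Keys are consumed block by block."""
    m = -np.inf                   # running max
    l = 0.0                       # running denominator
    o = np.zeros(V.shape[1])      # running unnormalized weighted value sum
    for start in range(0, len(scores), block):
        s = scores[start:start + block]
        v = V[start:start + block]
        m_blk = s.max()
        p = np.exp(s - m_blk)                 # weights relative to the block's own max
        l_blk, o_blk = p.sum(), p @ v
        m_new = max(m, m_blk)
        a = np.exp(m - m_new)                 # rescale factor for the old accumulators
        b = np.exp(m_blk - m_new)             # rescale factor for the new block
        l = a * l + b * l_blk
        o = a * o + b * o_blk
        m = m_new
    return o / l                              # normalize once, after the last block

rng = np.random.default_rng(1)
scores = rng.normal(size=10) * 50.0           # large logits to stress stability
V = rng.normal(size=(10, 3))
w = np.exp(scores - scores.max())
reference = (w / w.sum()) @ V
print(np.allclose(online_softmax_attention(scores, V), reference))
```

On the first block the running max is $-\infty$, so the old accumulators are multiplied by $e^{-\infty}=0$ and the update reduces to simply adopting the block's statistics.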
#online-softmax #derivation #numerical-stability
Expert · concept

Attention weights are often cited as model explanations. Why is treating them as faithful explanations problematic?

Attention weights show where a layer mixes information, but they are not a reliable causal account of the prediction. Studies (Jain & Wallace 2019; Serrano & Smith 2019) show weights can be altered, even to near-uniform or alternative distributions, while leaving outputs largely unchanged, so they are neither unique nor necessary. Information also flows through residual connections, the value projection, and many layers/heads, so a single layer's weights ignore most of the computation. High attention to a token does not imply high influence on the logit. Faithful attribution needs gradient- or perturbation-based methods (integrated gradients, causal ablation, attention rollout used with care), not raw softmax weights.
#interpretability #attention-weights #faithfulness
Expert · concept

Why does dot-product attention use a softmax rather than raw normalized scores, and what failure mode appears with very long sequences?

Softmax turns arbitrary real scores into a nonnegative distribution summing to 1, giving a differentiable, sharpenable weighting that competes positions against each other so larger logits win exponentially. With very long sequences softmax tends to over-disperse: the normalizer grows, individual weights shrink, and the model struggles to concentrate (an attention-entropy/dilution problem), while irrelevant tokens still leak nonzero mass. This contributes to long-context recall degradation and 'lost in the middle' effects. Mitigations include temperature/logit scaling, attention sinks (a dump token absorbing surplus mass), and distance-decaying biases like ALiBi, plus better positional scaling such as YaRN.
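A toy calculation shows the dilution effect. It assumes an idealized row with one relevant key whose logit exceeds all others by a fixed `gap`, with the remaining $n-1$ keys tied at logit 0; real score distributions are messier, and both helper names are made up for the example.

```python
import math

def top_weight(n, gap):
    # One relevant key has logit `gap`; the other n - 1 keys have logit 0.
    return math.exp(gap) / (math.exp(gap) + (n - 1))

def gap_needed(n, p):
    # Logit gap required for the relevant key to keep weight p among n keys.
    return math.log(p / (1.0 - p) * (n - 1))

for n in (1_000, 10_000, 100_000):
    print(n, round(top_weight(n, gap=8.0), 3), round(gap_needed(n, 0.9), 2))
```

With a fixed gap the relevant key's weight falls as $n$ grows, and holding its weight constant requires the gap to grow like $\log n$, which is the motivation for length-dependent logit or temperature scaling.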
#softmax #long-context #attention-dilution
Expert · concept

Contrast Position Interpolation, NTK-aware scaling, and YaRN for extending RoPE context. Why does NTK-aware scaling beat uniform interpolation, and what does YaRN add on top?

Linear Position Interpolation divides all positions by scale $s$ so position $L_{new}$ maps to $L_{train}$; uniform scaling compresses high-frequency dimensions too, blurring the local detail the model relies on for adjacent tokens. NTK-aware scaling instead enlarges the RoPE base (the constant, typically 10000, that sets the frequency schedule) so high frequencies are barely touched while low frequencies are interpolated most, matching where the OOD phase problem actually lives, so it often works with little or no fine-tuning. YaRN refines this with per-dimension frequency-band treatment (NTK-by-parts: interpolate, extrapolate, or blend by wavelength) plus a temperature/attention-scaling factor that rescales logits to preserve entropy at long context, giving strong extension with minimal fine-tuning.
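The difference between uniform interpolation and NTK-aware scaling is easiest to see on the per-pair frequencies. The sketch assumes the schedule $\theta_i=\text{base}^{-2i/d}$ and the commonly cited NTK-aware rule $\text{base}' = \text{base}\cdot s^{d/(d-2)}$; that rule is not stated in the card and is used here as an assumption, and YaRN's by-parts blending is not shown.

```python
import numpy as np

def rope_freqs(d, base=10000.0):
    return base ** (-np.arange(0, d, 2) / d)       # theta_i for i = 0 .. d/2 - 1

def position_interpolation(d, s, base=10000.0):
    # Dividing every position by s is equivalent to dividing every frequency by s.
    return rope_freqs(d, base) / s

def ntk_aware(d, s, base=10000.0):
    # Assumed rule: enlarge the base so only the lowest frequency is scaled by 1/s.
    return rope_freqs(d, base * s ** (d / (d - 2)))

d, s = 128, 8.0   # hypothetical head width and context-extension factor
orig = rope_freqs(d)
pi_ratio = position_interpolation(d, s) / orig
ntk_ratio = ntk_aware(d, s) / orig
print("PI  ratio, fastest/slowest pair:", pi_ratio[0], pi_ratio[-1])
print("NTK ratio, fastest/slowest pair:", ntk_ratio[0], ntk_ratio[-1])
```

Position Interpolation scales every pair by $1/s$, whereas the NTK-aware ratio slides smoothly from 1 at the fastest pair down to $1/s$ at the slowest, leaving local (high-frequency) detail largely intact.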
#yarn #ntk-scaling #position-interpolation #rope #long-context
Expert · math

Derive ALiBi's attention bias and explain how the per-head slope schedule is chosen and why ALiBi extrapolates to unseen lengths essentially for free.

ALiBi adds a non-learned bias to pre-softmax attention: for query $i$, key $j$, score $=q_i^\top k_j - m\,(i-j)$ for $j\le i$, a linear penalty growing with distance, scaled by a head-specific slope $m$. With $h$ heads (for $h$ a power of two), slopes form a geometric sequence $m_k = 2^{-8k/h}$ for $k=1,\dots,h$ (i.e. ratio $2^{-8/h}$, starting at $2^{-8/h}$), giving each head a different effective receptive window. Because the bias is purely a function of relative distance $i-j$ and is computed analytically at any length, no positional state is learned or stored, so longer sequences just extend the same linear ramp; the recency prior degrades gracefully rather than hitting unseen embeddings, yielding strong train-short/test-long extrapolation.
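The slope schedule and bias matrix can be sketched as follows. The code assumes the number of heads is a power of two, so the closed form $2^{-8k/h}$ applies directly, and it leaves the causal mask for future positions to be applied separately; the function names are illustrative.

```python
import numpy as np

def alibi_slopes(h):
    # m_k = 2^(-8k/h) for k = 1..h; assumes h is a power of two
    return np.array([2.0 ** (-8.0 * k / h) for k in range(1, h + 1)])

def alibi_bias(n, h):
    """Return an (h, n, n) bias to add to pre-softmax scores."""
    i = np.arange(n)[:, None]          # query index
    j = np.arange(n)[None, :]          # key index
    dist = np.maximum(i - j, 0)        # distance to past keys; future keys are masked elsewhere
    return -alibi_slopes(h)[:, None, None] * dist

print(alibi_slopes(8))                 # 1/2, 1/4, ..., 1/256
short, long = alibi_bias(8, 8), alibi_bias(16, 8)
print(np.allclose(long[:, :8, :8], short))
```

The bias for a longer sequence contains the shorter one as its top-left block, which is the sense in which longer inputs "just extend the same linear ramp" with nothing new to learn.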
#alibi #attention-bias #slopes #extrapolation #long-context