What is the fundamental difference between the prefill phase and the decode phase of autoregressive LLM inference, and why does each have a different performance bottleneck?
Prefill processes the entire prompt in one forward pass, computing all input tokens' attention in parallel and populating the KV cache. It is compute-bound (high arithmetic intensity, GPU FLOPs saturated). Decode then generates one token per forward pass, each attending to the cached keys/values; it is memory-bandwidth-bound because for batch size 1 you stream all weights (and the KV cache) from HBM to produce a single token, so FLOP utilization is tiny. This split is why TTFT (time-to-first-token) tracks prompt length while inter-token latency tracks model size and bandwidth, and why serving systems batch decode steps to amortize the weight reads.
#prefill#decode#memory-bound#ttft
Intermediateconcept
Explain greedy decoding, temperature, top-k, top-p (nucleus), and min-p sampling, and state precisely what min-p changes relative to top-p.
Greedy always picks $\arg\max$ logit — deterministic, prone to repetition. Temperature $T$ rescales logits ($z/T$) before softmax: $T<1$ sharpens, $T>1$ flattens. Top-k keeps the k highest-probability tokens then renormalizes. Top-p (nucleus) keeps the smallest set whose cumulative probability $\geq p$, so the candidate set size adapts to the distribution. Min-p instead sets a relative floor: keep tokens with $P \geq p_{\text{min}}\cdot P_{\max}$ (the top token's probability). When the model is confident ($P_{\max}$ high) the threshold is high and the set is narrow; when uncertain it widens. This preserves coherence at high temperature better than fixed top-p, which can admit long low-probability tails.
#sampling#top-p#min-p#temperature
Intermediateconcept
How do Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce KV-cache pressure, and what is the tradeoff against full Multi-Head Attention?
MHA stores separate K/V for every query head. MQA shares a single K/V head across all query heads, shrinking the KV cache (and the KV bandwidth read in decode) by roughly the head count — a large win since decode is bandwidth-bound. GQA interpolates: query heads are split into g groups, each sharing one K/V head, giving an $n_q/g$ reduction. The tradeoff is representational capacity: MQA can degrade quality on some tasks; GQA (e.g. 8 KV heads) recovers nearly all MHA quality while keeping most of the memory/bandwidth savings, which is why modern models default to GQA. It also enables longer contexts at fixed memory.
#gqa#mqa#attention#kv-cache
Intermediatesystem-design
What is continuous (in-flight) batching and why does it dramatically outperform static batching for LLM serving?
Static batching forms a fixed batch, runs all requests to completion, then accepts new ones — so the whole batch waits on the longest generation and the GPU idles as short requests finish (head-of-line blocking, padding waste). Continuous batching schedules at the iteration (token) level: finished sequences are evicted the moment they emit EOS and waiting requests are admitted into the freed slots mid-flight, with ragged/no padding. Combined with chunked prefill it mixes a compute-bound prefill chunk and many bandwidth-bound decodes in one batch to keep the GPU saturated. The result is far higher throughput and lower average latency under realistic variable-length, bursty traffic.
Contrast INT8, INT4, and FP8 weight quantization for inference as of 2026: precision, quality, and hardware. Why has FP8 become a practical sweet spot on recent GPUs?
INT8 (1 byte) halves memory vs BF16 with near-lossless quality and is well supported. INT4 (½ byte) quarters memory — letting a 70B model fit in ~35–40 GB — but needs careful per-group scaling/calibration (GPTQ/AWQ) to limit accuracy loss, and is favored for memory-constrained, single/low-batch decode. FP8 (1 byte) keeps a floating-point exponent, so it represents dynamic range and outliers far better than INT8 at the same width, and Hopper/Blackwell tensor cores execute FP8 matmuls natively at high speed. So FP8 W8A8 gives near-BF16 quality with both memory savings and a compute speedup, whereas INT4 saves the most memory but is bandwidth-driven and quality-sensitive — making FP8 a default sweet spot on modern hardware.
#fp8#int8#int4#blackwell
Intermediateconcept
What is the GGUF format and the role of llama.cpp, and what problem class is it designed for versus a vLLM/TensorRT-LLM deployment?
GGUF is the single-file model container used by llama.cpp (successor to GGML), packing weights plus metadata/tokenizer and supporting a spectrum of quantization types (e.g. Q4_K_M, Q5_K, Q8_0) with mixed per-block precision. It targets local/edge and CPU-or-mixed-GPU inference: it mmaps weights, runs on commodity CPUs with optional partial GPU offload of layers, and is optimized for low-batch, single-user latency on consumer hardware. vLLM/TensorRT-LLM instead target datacenter GPUs and high-concurrency throughput with PagedAttention, continuous batching, and tensor parallelism. So GGUF/llama.cpp = portable single-user/edge inference; vLLM/TRT-LLM = multi-tenant server throughput. They optimize opposite ends of the batch-size/hardware spectrum.
#gguf#llama.cpp#edge#cpu-inference
Intermediatesystem-design
Define the key serving cost/performance metrics — TTFT, ITL/TPOT, throughput (tokens/s), goodput — and explain the throughput-vs-latency tension a serving config must trade off.
TTFT (time-to-first-token) is prefill latency, felt as responsiveness. ITL/TPOT (inter-token latency / time-per-output-token) is the streaming cadence during decode. Throughput is aggregate tokens/sec across all concurrent requests; goodput is throughput that meets SLOs (e.g. only requests under a TTFT/ITL target count). The tension: larger batches amortize weight reads and raise GPU utilization and throughput, but queueing and bigger forward passes raise per-request TTFT and ITL. Prioritizing prefill improves TTFT but stalls decodes (raising ITL); prioritizing decode does the reverse. You tune batch size, chunked-prefill token budget, and scheduling policy to maximize goodput — throughput subject to latency SLOs — rather than either metric alone.
#ttft#tpot#throughput#goodput
Intermediateconcept
Explain the core mechanism of speculative decoding. Why does it speed up autoregressive generation when it does extra work (running both a draft and a target model)?
A small fast draft model autoregressively proposes $k$ candidate tokens; the large target model then scores all $k$ in a single parallel forward pass (one matmul over the whole block, since the tokens already exist). Speedup comes from converting $k$ sequential target decode steps into one batched verification — memory-bandwidth-bound decoding is dominated by loading weights once per step, so verifying $k$ tokens in one pass amortizes that load. Accepted tokens advance the sequence; on the first rejection you resample from a corrected distribution and discard the rest. Net win when draft is cheap and acceptance is high.
Why does beam search rarely beat sampling for open-ended LLM generation despite maximizing sequence likelihood, and where is it still preferred?
Beam search approximately maximizes total sequence log-probability, but for open-ended text the highest-likelihood sequences are degenerate — bland, repetitive, prone to loops — because the model's most likely continuations collapse onto safe high-frequency phrases (the 'likelihood trap'/neural text degeneration). Human text is not the mode of the distribution; it occupies a higher-entropy region that sampling methods reach. Beam search also costs $O(\text{beam width})$ more compute and breaks per-token streaming. It remains preferred in constrained, low-entropy tasks where one correct target dominates: machine translation, speech recognition, and short structured extraction, where maximizing likelihood aligns with correctness.
Derive the memory footprint of the KV cache for a transformer and explain why it, not the weights, often limits how many concurrent requests a server can hold.
Per token you store a key and value vector in every layer across all KV heads. Per-token size $= 2 \times L \times n_{kv} \times d_{head} \times \text{bytes}$, where $L$ is layers, $n_{kv}$ KV heads, $d_{head}$ head dim. For a request multiply by sequence length $S$ and batch $B$. Example: an MHA model with $L=80$, hidden $8192$ (so $n_{kv}\cdot d_{head}=8192$), FP16 ($2$ bytes) is $2\cdot80\cdot8192\cdot2 \approx 2.6$ MB per token, so an 8K-token request is ~21 GB. Weights are fixed and shared across all requests, but the KV cache grows linearly with $B\times S$, so at long contexts and high concurrency the cache exhausts HBM first, capping batch size and thus throughput.
#kv-cache#memory#gqa#context-length
Advancedconcept
Explain speculative decoding: how it works, why it preserves the target model's exact output distribution, and the conditions under which it gives no speedup.
A small cheap draft model proposes $k$ tokens autoregressively; the large target model then verifies them in a single parallel forward pass. Each draft token is accepted with a probability derived from the ratio of target to draft probabilities; on the first rejection you resample from a corrected residual distribution and discard the rest. This rejection-sampling scheme is provably distribution-preserving — output is statistically identical to sampling from the target alone. Speedup comes from turning many memory-bound single-token decodes into one batched verification. It fails to help when draft acceptance is low (draft poorly matches target), when k is mistuned, or when the target is already compute-bound (large batch) with no spare parallelism; drafting overhead can then make it slower.
What problem does PagedAttention (vLLM) solve, and how does borrowing the OS virtual-memory paging idea improve serving throughput?
Naive serving allocates a contiguous KV-cache buffer sized for each request's max possible length, causing internal and external fragmentation and reserving memory that is never used — so few requests fit. PagedAttention splits the KV cache into fixed-size blocks (pages) and maintains a per-sequence block table mapping logical positions to physically non-contiguous blocks, exactly like OS virtual memory. Memory is allocated on demand as tokens are generated, near-eliminating waste (vLLM reports <4% vs 60–80% before). This packs many more concurrent sequences into HBM, raising batch size and throughput, and it enables copy-free prefix sharing: identical prompt prefixes (system prompts, beams) point to the same physical blocks via copy-on-write.
#paged-attention#vllm#fragmentation#throughput
Advancedconcept
Compare GPTQ and AWQ as 4-bit post-training quantization methods. What is the core algorithmic idea of each, and when would you choose one over the other?
Both are calibration-based PTQ targeting W4. GPTQ quantizes weights column-by-column using approximate second-order (Hessian) information from calibration activations, updating remaining weights to compensate for each rounding error (an OBQ/OBS-style greedy solve) — accurate but calibration- and runtime-heavier. AWQ (Activation-aware Weight Quantization) observes that a small fraction (~1%) of weight channels are salient — the ones multiplied by large-magnitude activations. Rather than mixed precision, it applies a per-channel scale that protects those salient channels before uniform INT4 rounding, which is cheaper, calibration-light, and generalizes well. In practice AWQ often gives slightly better accuracy/robustness and faster quantization; GPTQ is mature with broad kernel support (e.g. Marlin). Choose based on your runtime's kernel support and measured perplexity/task accuracy.
#gptq#awq#ptq#4-bit
Advancedmath
Explain how FlashAttention achieves exact attention with O(N) instead of O(N^2) HBM memory, and why it is faster despite doing the same FLOPs.
Standard attention materializes the full $N\times N$ score matrix $QK^T$ in HBM, then softmax, then multiplies by $V$ — so it is dominated by slow HBM reads/writes of an $O(N^2)$ matrix, i.e. memory-bound. FlashAttention is IO-aware: it tiles Q, K, V into blocks that fit in fast on-chip SRAM and computes attention block-by-block, using the online (streaming) softmax trick — maintaining a running max and running normalizer/rescale — so it never instantiates the full score matrix. It only writes the $O(N)$ output (and small softmax stats) to HBM. The FLOPs are identical, but it slashes HBM traffic and the memory footprint to linear, so it runs faster and enables longer sequences; recomputation in the backward pass avoids storing the matrix.
#flash-attention#online-softmax#io-aware#sram
Advancedsystem-design
Distinguish tensor parallelism from pipeline parallelism for serving a model too large for one GPU. What does each cost in communication and latency, and when do you use which?
Tensor parallelism (TP) shards each layer's weight matrices across GPUs; every layer's matmul is split and results combined with an all-reduce (typically two per transformer block). It cuts per-GPU memory and per-token latency but is communication-heavy, so it needs a high-bandwidth interconnect (NVLink) and is usually kept within a node. Pipeline parallelism (PP) places different layer ranges on different GPUs; activations pass between stages with cheap point-to-point sends, tolerating slower links (across nodes), but a request traverses stages serially, adding latency and creating pipeline bubbles unless many micro-batches keep stages busy. Use TP intra-node to reduce latency/memory; add PP across nodes when the model exceeds a node and you can batch enough to fill the pipeline. They compose (3D with data parallel).
GPTQ, AWQ, and SmoothQuant all do weight/activation quantization but handle outliers very differently. Compare their core strategies and where each shines.
GPTQ is weight-only PTQ: layer-by-layer, it quantizes columns greedily and uses the inverse-Hessian (from a calibration set) to compensate remaining weights for each rounding error — minimizing output MSE, robust to 3–4 bit. AWQ is also weight-only but observes that ~1% of channels (those with large activations) dominate loss; it per-channel scales weights up (and activations down) to protect salient channels without mixed precision, then quantizes uniformly — calibration-light, fast, strong at 4-bit. SmoothQuant targets W8A8 (activations too): it migrates activation outliers into weights via a per-channel smoothing factor $s$, balancing both into easy-to-quantize ranges so INT8 GEMM works. GPTQ/AWQ for low-bit weight-only LLM serving; SmoothQuant for full INT8 throughput.
#gptq#awq#smoothquant#quantization#outliers
Expertconcept
A teammate quantizes weights to INT4 but leaves activations and the KV cache in FP16 and is surprised decode latency barely improves at batch size 1. Explain.
Decode at batch 1 is memory-bandwidth-bound: latency is dominated by streaming bytes from HBM, not arithmetic. INT4 weights do cut weight-read bytes ~4x, which helps if weights dominate the read — but the matmul still runs in FP16 (W4A16: weights dequantized on the fly to multiply FP16 activations), so there is no tensor-core compute speedup, and the dequant adds overhead. More importantly, at long context the FP16 KV-cache reads can dominate bandwidth, and quantizing only weights leaves that untouched. To actually speed up you need W4A16 with optimized fused kernels (Marlin) AND/OR KV-cache quantization (e.g. FP8/INT8 KV) so the bandwidth-limiting reads shrink too.
#int4#kv-cache-quant#bandwidth-bound#marlin
Expertsystem-design
Constrained/structured decoding (e.g. forcing valid JSON against a schema). How is it enforced at the token level, and what is the subtle risk it introduces to output quality?
At each step you compute a mask of which next tokens keep the output on a valid path — derived from a grammar/regex/JSON-schema compiled to a finite-state machine (or pushdown automaton). Logits for disallowed tokens are set to $-\infty$ before sampling, so the emitted string is guaranteed parseable. Efficient implementations (e.g. XGrammar, Outlines, llguidance) precompute per-state token masks to keep overhead small. The subtle risk: hard masking can force the model off its natural high-probability trajectory — e.g. requiring a key the model didn't 'intend', truncating reasoning, or steering it into low-quality completions because the only allowed tokens are improbable. It guarantees syntactic validity, not semantic correctness, and can degrade accuracy versus prompting + validation if the schema fights the model. Mitigate by allowing whitespace/ordering freedom and pairing with retries.
Estimate the cost per 1M output tokens for self-hosting a model, given an 8xH100 node at $2/GPU/hr that sustains 4,000 output tokens/sec. How does this framing explain why providers price input far cheaper than output?
Node cost is 8 GPUs × 2 USD = 16 USD/hr = 0.00444 USD/sec. At 4,000 tok/s that is 0.00444/4000 ≈ 1.11e-6 USD per token, i.e. ~1.11 USD per 1M output tokens at full utilization (real-world: divide by utilization, e.g. 50% → ~2.2 USD). Input is far cheaper because prefill is compute-bound and processes the whole prompt in parallel — thousands of input tokens per forward pass — so per-token GPU-time is tiny. Output tokens are produced one per decode step, each a full bandwidth-bound forward pass, consuming far more GPU-seconds per token. Hence the typical ~3–5x input/output price gap, and why long prompts are cheap but long generations expensive; prompt caching cuts input cost further by reusing prefill KV.
#cost-per-token#economics#prefill#utilization
Expertconcept
Why is the usable context window in production often shorter than a model's advertised maximum, and name three distinct mechanisms that constrain or degrade long-context performance.
Advertised length is what the model can attend to; it is neither free nor uniformly effective. (1) Memory: KV cache grows linearly with length and concurrency, so at the max window you serve far fewer requests — operators cap context to protect throughput/SLOs. (2) Quality degradation: 'lost in the middle' — models retrieve facts near the start/end far better than the middle, and effective context (measured by needle-in-haystack/RULER) is often well below the nominal limit; positional extrapolation (RoPE scaling, YaRN) beyond training length further erodes accuracy. (3) Latency and cost: prefill attention is roughly $O(N^2)$ (mitigated but not erased by FlashAttention), so TTFT and dollar cost climb with prompt length. Together these mean teams budget context far below the ceiling.
#context-window#lost-in-the-middle#rope#kv-cache
Expertmath
Derive the speculative-decoding acceptance/rejection rule and prove it leaves the target distribution unchanged. What happens on rejection?
For a draft token $x$ from $q(x)$, accept with probability $\min(1, p(x)/q(x))$ where $p$ is the target. If rejected, resample from the residual $p'(x)=\frac{\max(0,p(x)-q(x))}{\sum_x \max(0,p(x)-q(x))}$. Proof: the probability of finally emitting $x$ is $q(x)\min(1,p(x)/q(x)) + (\text{P(reject)})p'(x)$. The first term is $\min(q(x),p(x))$; the rejection mass equals $\sum \max(0,p-q)$ and renormalizes $p'$ so the second term is $\max(0,p(x)-q(x))$. Sum $=\min(p,q)+\max(0,p-q)=p(x)$. So emitted tokens are exactly target-distributed — lossless. On rejection, all subsequent drafted tokens are discarded.
A serving team wants 4-bit weights with minimal accuracy loss and no calibration headaches, but later needs CPU/edge deployment. How do AWQ, GPTQ, and GGUF differ in tradeoffs for this path, and what determines speculative-decoding's benefit on top?
AWQ needs only light activation-statistics calibration (no backprop/Hessian), is fast to produce, and generalizes well across domains since it protects salient channels rather than overfitting a calibration set — good default for 4-bit GPU serving. GPTQ can edge out AWQ on perplexity but its Hessian fit risks calibration-set overfit and is slower to quantize. GGUF (llama.cpp) is a container with k-quant schemes (Q4_K_M etc.) optimized for CPU/Metal/edge with mixed per-block bit allocation — the right choice for the CPU/edge leg. Speculative decoding's gain is orthogonal and governed by acceptance rate $\alpha$: expected tokens per target pass $\approx (1-\alpha^{k+1})/(1-\alpha)$, so it pays only when the draft (often a smaller quant of the same family) tracks the target well; low $\alpha$ or expensive drafts erase it.