Question 1

What is the fundamental difference between the prefill phase and the decode phase of autoregressive LLM inference, and why does each have a different performance bottleneck?

Accepted Answer

Prefill processes the entire prompt in one forward pass, computing all input tokens' attention in parallel and populating the KV cache. It is compute-bound (high arithmetic intensity, GPU FLOPs saturated). Decode then generates one token per forward pass, each attending to the cached keys/values; it is memory-bandwidth-bound because for batch size 1 you stream all weights (and the KV cache) from HBM to produce a single token, so FLOP utilization is tiny. This split is why TTFT (time-to-first-token) tracks prompt length while inter-token latency tracks model size and bandwidth, and why serving systems batch decode steps to amortize the weight reads.

Question 2

Explain greedy decoding, temperature, top-k, top-p (nucleus), and min-p sampling, and state precisely what min-p changes relative to top-p.

Accepted Answer

Greedy always picks \arg\max logit — deterministic, prone to repetition. Temperature T rescales logits (z/T) before softmax: T<1 sharpens, T>1 flattens. Top-k keeps the k highest-probability tokens then renormalizes. Top-p (nucleus) keeps the smallest set whose cumulative probability \geq p, so the candidate set size adapts to the distribution. Min-p instead sets a relative floor: keep tokens with P \geq p_{	ext{min}}\cdot P_{\max} (the top token's probability). When the model is confident (P_{\max} high) the threshold is high and the set is narrow; when uncertain it widens. This preserves coherence at high temperature better than fixed top-p, which can admit long low-probability tails.

Question 3

How do Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce KV-cache pressure, and what is the tradeoff against full Multi-Head Attention?

Accepted Answer

MHA stores separate K/V for every query head. MQA shares a single K/V head across all query heads, shrinking the KV cache (and the KV bandwidth read in decode) by roughly the head count — a large win since decode is bandwidth-bound. GQA interpolates: query heads are split into g groups, each sharing one K/V head, giving an n_q/g reduction. The tradeoff is representational capacity: MQA can degrade quality on some tasks; GQA (e.g. 8 KV heads) recovers nearly all MHA quality while keeping most of the memory/bandwidth savings, which is why modern models default to GQA. It also enables longer contexts at fixed memory.

Question 4

What is continuous (in-flight) batching and why does it dramatically outperform static batching for LLM serving?

Accepted Answer

Static batching forms a fixed batch, runs all requests to completion, then accepts new ones — so the whole batch waits on the longest generation and the GPU idles as short requests finish (head-of-line blocking, padding waste). Continuous batching schedules at the iteration (token) level: finished sequences are evicted the moment they emit EOS and waiting requests are admitted into the freed slots mid-flight, with ragged/no padding. Combined with chunked prefill it mixes a compute-bound prefill chunk and many bandwidth-bound decodes in one batch to keep the GPU saturated. The result is far higher throughput and lower average latency under realistic variable-length, bursty traffic.

Question 5

Contrast INT8, INT4, and FP8 weight quantization for inference as of 2026: precision, quality, and hardware. Why has FP8 become a practical sweet spot on recent GPUs?

Accepted Answer

INT8 (1 byte) halves memory vs BF16 with near-lossless quality and is well supported. INT4 (½ byte) quarters memory — letting a 70B model fit in ~35–40 GB — but needs careful per-group scaling/calibration (GPTQ/AWQ) to limit accuracy loss, and is favored for memory-constrained, single/low-batch decode. FP8 (1 byte) keeps a floating-point exponent, so it represents dynamic range and outliers far better than INT8 at the same width, and Hopper/Blackwell tensor cores execute FP8 matmuls natively at high speed. So FP8 W8A8 gives near-BF16 quality with both memory savings and a compute speedup, whereas INT4 saves the most memory but is bandwidth-driven and quality-sensitive — making FP8 a default sweet spot on modern hardware.

Question 6

What is the GGUF format and the role of llama.cpp, and what problem class is it designed for versus a vLLM/TensorRT-LLM deployment?

Accepted Answer

GGUF is the single-file model container used by llama.cpp (successor to GGML), packing weights plus metadata/tokenizer and supporting a spectrum of quantization types (e.g. Q4_K_M, Q5_K, Q8_0) with mixed per-block precision. It targets local/edge and CPU-or-mixed-GPU inference: it mmaps weights, runs on commodity CPUs with optional partial GPU offload of layers, and is optimized for low-batch, single-user latency on consumer hardware. vLLM/TensorRT-LLM instead target datacenter GPUs and high-concurrency throughput with PagedAttention, continuous batching, and tensor parallelism. So GGUF/llama.cpp = portable single-user/edge inference; vLLM/TRT-LLM = multi-tenant server throughput. They optimize opposite ends of the batch-size/hardware spectrum.

Question 7

Define the key serving cost/performance metrics — TTFT, ITL/TPOT, throughput (tokens/s), goodput — and explain the throughput-vs-latency tension a serving config must trade off.

Accepted Answer

TTFT (time-to-first-token) is prefill latency, felt as responsiveness. ITL/TPOT (inter-token latency / time-per-output-token) is the streaming cadence during decode. Throughput is aggregate tokens/sec across all concurrent requests; goodput is throughput that meets SLOs (e.g. only requests under a TTFT/ITL target count). The tension: larger batches amortize weight reads and raise GPU utilization and throughput, but queueing and bigger forward passes raise per-request TTFT and ITL. Prioritizing prefill improves TTFT but stalls decodes (raising ITL); prioritizing decode does the reverse. You tune batch size, chunked-prefill token budget, and scheduling policy to maximize goodput — throughput subject to latency SLOs — rather than either metric alone.

Question 8

Explain the core mechanism of speculative decoding. Why does it speed up autoregressive generation when it does extra work (running both a draft and a target model)?

Accepted Answer

A small fast draft model autoregressively proposes k candidate tokens; the large target model then scores all k in a single parallel forward pass (one matmul over the whole block, since the tokens already exist). Speedup comes from converting k sequential target decode steps into one batched verification — memory-bandwidth-bound decoding is dominated by loading weights once per step, so verifying k tokens in one pass amortizes that load. Accepted tokens advance the sequence; on the first rejection you resample from a corrected distribution and discard the rest. Net win when draft is cheap and acceptance is high.

Question 9

Why does beam search rarely beat sampling for open-ended LLM generation despite maximizing sequence likelihood, and where is it still preferred?

Accepted Answer

Beam search approximately maximizes total sequence log-probability, but for open-ended text the highest-likelihood sequences are degenerate — bland, repetitive, prone to loops — because the model's most likely continuations collapse onto safe high-frequency phrases (the 'likelihood trap'/neural text degeneration). Human text is not the mode of the distribution; it occupies a higher-entropy region that sampling methods reach. Beam search also costs O(	ext{beam width}) more compute and breaks per-token streaming. It remains preferred in constrained, low-entropy tasks where one correct target dominates: machine translation, speech recognition, and short structured extraction, where maximizing likelihood aligns with correctness.

Question 10

Derive the memory footprint of the KV cache for a transformer and explain why it, not the weights, often limits how many concurrent requests a server can hold.

Accepted Answer

Per token you store a key and value vector in every layer across all KV heads. Per-token size = 2 	imes L 	imes n_{kv} 	imes d_{head} 	imes 	ext{bytes}, where L is layers, n_{kv} KV heads, d_{head} head dim. For a request multiply by sequence length S and batch B. Example: an MHA model with L=80, hidden 8192 (so n_{kv}\cdot d_{head}=8192), FP16 (2 bytes) is 2\cdot80\cdot8192\cdot2 \approx 2.6 MB per token, so an 8K-token request is ~21 GB. Weights are fixed and shared across all requests, but the KV cache grows linearly with B	imes S, so at long contexts and high concurrency the cache exhausts HBM first, capping batch size and thus throughput.

Question 11

Explain speculative decoding: how it works, why it preserves the target model's exact output distribution, and the conditions under which it gives no speedup.

Accepted Answer

A small cheap draft model proposes k tokens autoregressively; the large target model then verifies them in a single parallel forward pass. Each draft token is accepted with a probability derived from the ratio of target to draft probabilities; on the first rejection you resample from a corrected residual distribution and discard the rest. This rejection-sampling scheme is provably distribution-preserving — output is statistically identical to sampling from the target alone. Speedup comes from turning many memory-bound single-token decodes into one batched verification. It fails to help when draft acceptance is low (draft poorly matches target), when k is mistuned, or when the target is already compute-bound (large batch) with no spare parallelism; drafting overhead can then make it slower.

Question 12

What problem does PagedAttention (vLLM) solve, and how does borrowing the OS virtual-memory paging idea improve serving throughput?

Accepted Answer

Naive serving allocates a contiguous KV-cache buffer sized for each request's max possible length, causing internal and external fragmentation and reserving memory that is never used — so few requests fit. PagedAttention splits the KV cache into fixed-size blocks (pages) and maintains a per-sequence block table mapping logical positions to physically non-contiguous blocks, exactly like OS virtual memory. Memory is allocated on demand as tokens are generated, near-eliminating waste (vLLM reports <4% vs 60–80% before). This packs many more concurrent sequences into HBM, raising batch size and throughput, and it enables copy-free prefix sharing: identical prompt prefixes (system prompts, beams) point to the same physical blocks via copy-on-write.

Question 13

Compare GPTQ and AWQ as 4-bit post-training quantization methods. What is the core algorithmic idea of each, and when would you choose one over the other?

Accepted Answer

Both are calibration-based PTQ targeting W4. GPTQ quantizes weights column-by-column using approximate second-order (Hessian) information from calibration activations, updating remaining weights to compensate for each rounding error (an OBQ/OBS-style greedy solve) — accurate but calibration- and runtime-heavier. AWQ (Activation-aware Weight Quantization) observes that a small fraction (~1%) of weight channels are salient — the ones multiplied by large-magnitude activations. Rather than mixed precision, it applies a per-channel scale that protects those salient channels before uniform INT4 rounding, which is cheaper, calibration-light, and generalizes well. In practice AWQ often gives slightly better accuracy/robustness and faster quantization; GPTQ is mature with broad kernel support (e.g. Marlin). Choose based on your runtime's kernel support and measured perplexity/task accuracy.

Question 14

Explain how FlashAttention achieves exact attention with O(N) instead of O(N^2) HBM memory, and why it is faster despite doing the same FLOPs.

Accepted Answer

Standard attention materializes the full N	imes N score matrix QK^T in HBM, then softmax, then multiplies by V — so it is dominated by slow HBM reads/writes of an O(N^2) matrix, i.e. memory-bound. FlashAttention is IO-aware: it tiles Q, K, V into blocks that fit in fast on-chip SRAM and computes attention block-by-block, using the online (streaming) softmax trick — maintaining a running max and running normalizer/rescale — so it never instantiates the full score matrix. It only writes the O(N) output (and small softmax stats) to HBM. The FLOPs are identical, but it slashes HBM traffic and the memory footprint to linear, so it runs faster and enables longer sequences; recomputation in the backward pass avoids storing the matrix.

Question 15

Distinguish tensor parallelism from pipeline parallelism for serving a model too large for one GPU. What does each cost in communication and latency, and when do you use which?

Accepted Answer

Tensor parallelism (TP) shards each layer's weight matrices across GPUs; every layer's matmul is split and results combined with an all-reduce (typically two per transformer block). It cuts per-GPU memory and per-token latency but is communication-heavy, so it needs a high-bandwidth interconnect (NVLink) and is usually kept within a node. Pipeline parallelism (PP) places different layer ranges on different GPUs; activations pass between stages with cheap point-to-point sends, tolerating slower links (across nodes), but a request traverses stages serially, adding latency and creating pipeline bubbles unless many micro-batches keep stages busy. Use TP intra-node to reduce latency/memory; add PP across nodes when the model exceeds a node and you can batch enough to fill the pipeline. They compose (3D with data parallel).

Question 16

GPTQ, AWQ, and SmoothQuant all do weight/activation quantization but handle outliers very differently. Compare their core strategies and where each shines.

Accepted Answer

GPTQ is weight-only PTQ: layer-by-layer, it quantizes columns greedily and uses the inverse-Hessian (from a calibration set) to compensate remaining weights for each rounding error — minimizing output MSE, robust to 3–4 bit. AWQ is also weight-only but observes that ~1% of channels (those with large activations) dominate loss; it per-channel scales weights up (and activations down) to protect salient channels without mixed precision, then quantizes uniformly — calibration-light, fast, strong at 4-bit. SmoothQuant targets W8A8 (activations too): it migrates activation outliers into weights via a per-channel smoothing factor s, balancing both into easy-to-quantize ranges so INT8 GEMM works. GPTQ/AWQ for low-bit weight-only LLM serving; SmoothQuant for full INT8 throughput.

Question 17

A teammate quantizes weights to INT4 but leaves activations and the KV cache in FP16 and is surprised decode latency barely improves at batch size 1. Explain.

Accepted Answer

Decode at batch 1 is memory-bandwidth-bound: latency is dominated by streaming bytes from HBM, not arithmetic. INT4 weights do cut weight-read bytes ~4x, which helps if weights dominate the read — but the matmul still runs in FP16 (W4A16: weights dequantized on the fly to multiply FP16 activations), so there is no tensor-core compute speedup, and the dequant adds overhead. More importantly, at long context the FP16 KV-cache reads can dominate bandwidth, and quantizing only weights leaves that untouched. To actually speed up you need W4A16 with optimized fused kernels (Marlin) AND/OR KV-cache quantization (e.g. FP8/INT8 KV) so the bandwidth-limiting reads shrink too.

Question 18

Constrained/structured decoding (e.g. forcing valid JSON against a schema). How is it enforced at the token level, and what is the subtle risk it introduces to output quality?

Accepted Answer

At each step you compute a mask of which next tokens keep the output on a valid path — derived from a grammar/regex/JSON-schema compiled to a finite-state machine (or pushdown automaton). Logits for disallowed tokens are set to -\infty before sampling, so the emitted string is guaranteed parseable. Efficient implementations (e.g. XGrammar, Outlines, llguidance) precompute per-state token masks to keep overhead small. The subtle risk: hard masking can force the model off its natural high-probability trajectory — e.g. requiring a key the model didn't 'intend', truncating reasoning, or steering it into low-quality completions because the only allowed tokens are improbable. It guarantees syntactic validity, not semantic correctness, and can degrade accuracy versus prompting + validation if the schema fights the model. Mitigate by allowing whitespace/ordering freedom and pairing with retries.

Question 19

Estimate the cost per 1M output tokens for self-hosting a model, given an 8xH100 node at $2/GPU/hr that sustains 4,000 output tokens/sec. How does this framing explain why providers price input far cheaper than output?

Accepted Answer

Node cost is 8 GPUs × 2 USD = 16 USD/hr = 0.00444 USD/sec. At 4,000 tok/s that is 0.00444/4000 ≈ 1.11e-6 USD per token, i.e. ~1.11 USD per 1M output tokens at full utilization (real-world: divide by utilization, e.g. 50% → ~2.2 USD). Input is far cheaper because prefill is compute-bound and processes the whole prompt in parallel — thousands of input tokens per forward pass — so per-token GPU-time is tiny. Output tokens are produced one per decode step, each a full bandwidth-bound forward pass, consuming far more GPU-seconds per token. Hence the typical ~3–5x input/output price gap, and why long prompts are cheap but long generations expensive; prompt caching cuts input cost further by reusing prefill KV.

Question 20

Why is the usable context window in production often shorter than a model's advertised maximum, and name three distinct mechanisms that constrain or degrade long-context performance.

Accepted Answer

Advertised length is what the model can attend to; it is neither free nor uniformly effective. (1) Memory: KV cache grows linearly with length and concurrency, so at the max window you serve far fewer requests — operators cap context to protect throughput/SLOs. (2) Quality degradation: 'lost in the middle' — models retrieve facts near the start/end far better than the middle, and effective context (measured by needle-in-haystack/RULER) is often well below the nominal limit; positional extrapolation (RoPE scaling, YaRN) beyond training length further erodes accuracy. (3) Latency and cost: prefill attention is roughly O(N^2) (mitigated but not erased by FlashAttention), so TTFT and dollar cost climb with prompt length. Together these mean teams budget context far below the ceiling.

Question 21

Derive the speculative-decoding acceptance/rejection rule and prove it leaves the target distribution unchanged. What happens on rejection?

Accepted Answer

For a draft token x from q(x), accept with probability \min(1, p(x)/q(x)) where p is the target. If rejected, resample from the residual p'(x)=\frac{\max(0,p(x)-q(x))}{\sum_x \max(0,p(x)-q(x))}. Proof: the probability of finally emitting x is q(x)\min(1,p(x)/q(x)) + (	ext{P(reject)})p'(x). The first term is \min(q(x),p(x)); the rejection mass equals \sum \max(0,p-q) and renormalizes p' so the second term is \max(0,p(x)-q(x)). Sum =\min(p,q)+\max(0,p-q)=p(x). So emitted tokens are exactly target-distributed — lossless. On rejection, all subsequent drafted tokens are discarded.

Question 22

A serving team wants 4-bit weights with minimal accuracy loss and no calibration headaches, but later needs CPU/edge deployment. How do AWQ, GPTQ, and GGUF differ in tradeoffs for this path, and what determines speculative-decoding's benefit on top?

Accepted Answer

AWQ needs only light activation-statistics calibration (no backprop/Hessian), is fast to produce, and generalizes well across domains since it protects salient channels rather than overfitting a calibration set — good default for 4-bit GPU serving. GPTQ can edge out AWQ on perplexity but its Hessian fit risks calibration-set overfit and is slower to quantize. GGUF (llama.cpp) is a container with k-quant schemes (Q4_K_M etc.) optimized for CPU/Metal/edge with mixed per-block bit allocation — the right choice for the CPU/edge leg. Speculative decoding's gain is orthogonal and governed by acceptance rate \alpha: expected tokens per target pass \approx (1-\alpha^{k+1})/(1-\alpha), so it pays only when the draft (often a smaller quant of the same family) tracks the target well; low \alpha or expensive drafts erase it.

LLM Inference, Decoding, Quantization & Serving