Softmax & the Attention Formula, Demystified

Explained simply for developers.

Basics · concept

What is a "logit" or raw score in an LLM, before softmax touches it?

A logit is just a plain number the model spits out for each option, before any cleanup. Think of an array like [2.0, 1.0, 0.1] where each slot is one possible next word and the number is "how much I like this one." These numbers can be negative, huge, or tiny, and they do NOT add up to anything nice. They're raw opinions. Softmax is the step that turns this messy array into clean probabilities you can actually use.
#logits #softmax #basics #next-token
Basics · concept

What does softmax actually do, in plain terms?

Softmax takes an array of raw scores (logits) and turns them into probabilities: positive numbers that all add up to 1. The recipe is two steps. First, exponentiate every score with exp() so everything becomes positive (and bigger scores get exaggerated). Then divide each by the total of all the exponentials so the whole array sums to 1. So softmax(x)[i] = exp(x[i]) / sum(exp(x[j]) for all j). Output is a clean probability distribution you can sample from.
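A minimal Python sketch of that two-step recipe, using only the standard library. The function name `softmax` and the sample scores are illustrative choices; production code usually adds a numerical-safety tweak on top of this.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities: exponentiate, then normalize."""
    exps = [math.exp(s) for s in scores]  # step 1: make every score positive
    total = sum(exps)                     # step 2: divide each by the total
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(sum(probs))                    # 1.0, up to floating-point rounding
```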
#softmax #probability #basics #normalization
Basics · concept

How does softmax power next-token prediction in a chatbot?

At each step the model produces one logit for EVERY token in its vocabulary — that's an array maybe 100,000 long, one score per possible next word-piece. Softmax turns that giant array into probabilities summing to 1. Then the model samples one token from that distribution (like a weighted random pick), appends it, and repeats for the next word. So softmax is literally the step that converts "raw opinions about every word" into "a real probability I can roll the dice on."
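One decoding step can be sketched like this. The four-token vocabulary and the logits are made up for illustration (a real model has a far larger vocabulary and computes the logits itself), and the seeded random generator is only there to make the pick repeatable.

```python
import math
import random

def softmax(logits):
    m = max(logits)                          # subtract the max for numerical safety
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and logits for a single step.
vocab = ["cat", "dog", "the", "ran"]
logits = [3.1, 2.4, -0.5, 0.2]

probs = softmax(logits)
rng = random.Random(0)                                   # seeded for repeatability
next_token = rng.choices(vocab, weights=probs, k=1)[0]   # weighted random pick
print(next_token)
```

In a real loop you would append `next_token` to the text, recompute the logits, and repeat.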
#softmax #next-token #vocabulary #sampling
Basics · concept

What are the three vectors Q, K, and V that every token gets in attention?

For each token (each word-piece), attention builds three arrays of numbers. Query (Q) = "what am I looking for?" Key (K) = "what do I offer/what am I about?" Value (V) = "the actual content I'll hand over if you pick me." They're made by multiplying the token's embedding by three learned matrices. The classic analogy: Q is your search box text, K is each document's keywords, V is the document's body. Attention matches queries against keys, then pulls the matching values.
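As a rough sketch of "multiply the embedding by three learned matrices": the embedding and the matrices `W_q`, `W_k`, `W_v` below are invented toy numbers, not values from any real model, and `matvec` is a hypothetical helper name.

```python
def matvec(vector, matrix):
    """Multiply a row vector (length d_in) by a d_in x d_out matrix."""
    return [sum(vector[i] * matrix[i][j] for i in range(len(vector)))
            for j in range(len(matrix[0]))]

# Hypothetical 3-number embedding and three 3x2 "learned" matrices.
embedding = [1.0, 0.0, 2.0]
W_q = [[1, 0], [0, 1], [1, 1]]
W_k = [[0, 1], [1, 0], [1, 0]]
W_v = [[1, 1], [0, 0], [0, 1]]

q = matvec(embedding, W_q)  # "what am I looking for?"
k = matvec(embedding, W_k)  # "what do I offer?"
v = matvec(embedding, W_v)  # "the content I hand over"
print(q, k, v)  # [3.0, 2.0] [2.0, 1.0] [1.0, 3.0]
```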
#attention #qkv #vectors #basics
Basics · concept

What is multi-head attention, briefly?

Instead of doing attention once, you do it several times in parallel — each "head" has its own learned Q/K/V matrices, so each head can focus on a different kind of relationship (one head tracks grammar, another tracks which noun a pronoun refers to, etc.). You run all heads, then concatenate their output vectors back into one and mix them with a final matrix. It's like running several specialist search engines over the same text and stapling their findings together.
#multi-head #attention #parallel #basics
Core idea · concept

Why exponentiate the scores in softmax instead of just dividing each score by the total?

Two reasons. First, raw logits can be negative, and you can't have a negative probability — exp() makes everything positive no matter what. Second, exp() exaggerates gaps: a score that's a bit higher becomes a LOT more likely, which is usually what you want (the model leans toward its favorite). If you just divided raw scores by their sum, negatives would break it and the differences would stay flat and boring. Exponentiating is the trick that makes softmax behave.
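To see the first reason concretely, here is a small comparison on made-up scores that include negatives. The numbers are chosen only to make the failure obvious.

```python
import math

scores = [2.0, -1.0, -0.5]  # made-up logits, two of them negative

# Naive approach: divide each raw score by the raw total.
raw_total = sum(scores)                  # 0.5
naive = [s / raw_total for s in scores]  # contains negatives and values above 1

# Softmax: exponentiate first, then divide.
exps = [math.exp(s) for s in scores]
soft = [e / sum(exps) for e in exps]

print(naive)                        # [4.0, -2.0, -1.0]
print([round(p, 3) for p in soft])  # all positive, sums to 1
```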
#softmax #exp #why #intuition
Core idea · how-to

Can you walk me through a tiny softmax by hand on `[2.0, 1.0, 0.1]`?

Sure. Exponentiate each: exp(2.0)=7.389, exp(1.0)=2.718, exp(0.1)=1.105. Add them: 7.389 + 2.718 + 1.105 = 11.212. Now divide each by that total: 7.389/11.212=0.659, 2.718/11.212=0.242, 1.105/11.212=0.099. So softmax gives about [0.659, 0.242, 0.099]. Check: they're all positive and sum to 1.0. The biggest logit (2.0) grabbed 66% of the probability — exactly the "favor the favorite" behavior.
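If you want to check the hand arithmetic, this short script prints each intermediate value. It is only a verification aid, not part of any library.

```python
import math

logits = [2.0, 1.0, 0.1]

exps = [math.exp(x) for x in logits]  # exponentiate each score
total = sum(exps)                     # add them up
probs = [e / total for e in exps]     # divide each by the total

print([round(e, 3) for e in exps])   # [7.389, 2.718, 1.105]
print(round(total, 3))               # 11.213
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

The total prints as 11.213 rather than 11.212 because the script keeps full precision instead of adding the already-rounded exponentials. The final probabilities are the same to three decimals.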
#softmax #worked-example #arithmetic #probability
Core idea · how-to

I learned about temperature already — where does it plug into the softmax?

Temperature is a single number you divide every logit by BEFORE softmax. So you compute softmax(logits / temperature). With your [2.0, 1.0, 0.1]: at temperature 0.5 you'd softmax [4.0, 2.0, 0.2] (gaps stretched → spikier, more confident), and at temperature 2.0 you'd softmax [1.0, 0.5, 0.05] (gaps squished → flatter, more random). It's just a rescale of the inputs. Softmax does the same exp-and-divide; temperature only changes how dramatic the gaps are first.
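A small sketch of where the division happens. The function name `softmax_with_temperature` is a made-up label for illustration, and it assumes the temperature is greater than 0.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide every logit by the temperature, then apply ordinary softmax."""
    scaled = [x / temperature for x in logits]  # the only temperature-specific line
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it, the top probability should shrink as the temperature rises from 0.5 to 2.0, while the lowest one grows.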
#temperature #softmax #logits #sampling
Core idea · concept

Why does low temperature make output "greedy" and high temperature make it "random"?

Because dividing by a small temperature (like 0.2) blows up the gaps between logits, so after exp() the top choice dominates and softmax gives it almost all the probability — the model basically always picks its favorite (greedy, repetitive). Dividing by a big temperature (like 2.0) shrinks the gaps, so the probabilities flatten toward equal — now even unlikely tokens get a real shot (creative, sometimes nonsense). Temperature literally controls how peaked vs. flat the post-softmax distribution is.
#temperature #greedy #softmax #intuition
Core idea · concept

How are the attention scores computed from Q and K?

You take each token's Query and dot it with every token's Key. A dot product of two arrays is dot(a,b) = a[0]*b[0] + a[1]*b[1] + ..., and a large positive dot product means the two vectors point in a similar direction, which attention reads as high relevance (longer vectors inflate it too). Doing this for all query-key pairs at once is the matrix multiply Q times K-transpose (written Q · K^T). The result is a grid of scores: row i, column j = "how relevant is token j to token i." These scores are the attention logits.
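Here is the score grid built from plain dot products. The `Q` and `K` values are invented for three tokens with vectors of length 2.

```python
def dot(a, b):
    """a[0]*b[0] + a[1]*b[1] + ..."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical Query and Key vectors: 3 tokens, each vector of length 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

# scores[i][j] = how relevant token j is to token i (this is Q times K^T).
scores = [[dot(q, k) for k in K] for q in Q]
for row in scores:
    print(row)
# [1.0, 0.0, 1.0]
# [0.0, 2.0, 1.0]
# [1.0, 2.0, 2.0]
```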
#attention #dot-product #QK #scores
Core idea · concept

What does softmax do to each row of the attention score grid?

After scaling, you softmax each row on its own. A row holds one token's relevance scores against all tokens; softmax turns that row into weights that are all positive and sum to 1 — "of all the tokens I could pay attention to, here's the fraction of my focus each one gets." These are the attention WEIGHTS. Same exp-and-divide you saw for next-token prediction, just applied per row. A weight near 1 means "I'm mostly looking at that token."
#softmax #attention-weights #rows #focus
Core idea · concept

How do the attention weights and the Value vectors combine into the output?

Each token's output is a weighted blend of all the Value vectors, using its attention weights as the mix. If a token's weights are [0.7, 0.2, 0.1] over three tokens, its output is 0.7*V1 + 0.2*V2 + 0.1*V3. So it mostly grabs the content of the token it focused on, plus a splash of the others. Done for all tokens at once, that's the matrix multiply weights times V. Output is "a custom summary of the context, tilted toward what each token cared about."
#attention #values #weighted-sum #output
Core idea · how-to

What is the whole attention recipe, start to finish, in one walkthrough?

Five steps. 1) Each token gets Q, K, V arrays from its embedding. 2) Score every pair with dot products: Q times K^T → an n-by-n grid of relevance. 3) Divide by the square root of d_k to keep numbers tame. 4) Softmax each row → attention weights summing to 1. 5) Multiply weights by V → each token's output is a weighted blend of everyone's Values. In short: score, scale, softmax, blend. The output is every token re-expressed in terms of what it found relevant in the context.
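The recipe can be sketched in plain Python as below. It assumes step 1 is already done (the `Q`, `K`, `V` lists are made-up inputs), handles a single head with no masking, and uses hypothetical function names. Real implementations do the same math with batched matrix operations.

```python
import math

def softmax(row):
    """Exp-and-divide, with the max subtracted first for numerical safety."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Steps 2 to 5 for lists of per-token vectors (step 1 is assumed done)."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Steps 2 and 3: dot this query with every key, then scale by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 4: turn this row of scores into weights that sum to 1.
        weights = softmax(scores)
        # Step 5: blend the Value vectors using those weights.
        blended = [sum(w * v[c] for w, v in zip(weights, V))
                   for c in range(len(V[0]))]
        outputs.append(blended)
    return outputs

# Made-up Q, K, V for 3 tokens with vectors of length 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

for row in attention(Q, K, V):
    print([round(x, 3) for x in row])
```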
#attention #recipe #qkv #overview
Core idea · concept

How does stacking attention get you an actual GPT or Claude?

One attention step lets every token peek at the others once. But meaning is layered — you stack dozens of attention-plus-feedforward blocks, each refining the token vectors a bit more, so deep layers can capture grammar, then meaning, then intent. That whole stack is a Transformer. Feed it tokens, run all the layers, softmax the final logits, sample the next token, repeat — that loop IS GPT and Claude. There's no extra magic ingredient; it's this recipe scaled to billions of numbers.
#transformer #stacking #gpt #claude
Hands-on · gotcha

What happens at temperature exactly 0?

Mathematically, dividing by 0 blows up, so in practice temperature 0 is treated as "pure greedy": skip the random sampling and just pick the single highest logit every time, the argmax of the array. The same prompt then gives the same answer, so decoding is deterministic in principle (hosted models can still show small run-to-run differences from floating-point and batching effects). You can think of it as the limit of shrinking temperature toward 0, where one token's probability approaches 1.0 and all others approach 0. That's why this project pins temperature 0 when it wants reproducible recommendations.
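One way a sampler might special-case temperature 0 is sketched below. The function `pick_token` and its `rng` parameter are hypothetical names, and this is not a claim about how any particular API implements it.

```python
import math
import random

def pick_token(logits, temperature, rng=random):
    """Return the index of the chosen token."""
    if temperature == 0:
        # Greedy: no division, no sampling, just the argmax.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalized softmax weights
    # random.choices normalizes the weights itself.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

print(pick_token([2.0, 1.0, 0.1], temperature=0))  # 0, every time
```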
#temperature #argmax #deterministic #gotcha
Hands-on · gotcha

Why is it written Q times K-transpose (Q · K^T) instead of just Q times K?

Because of shapes, the same rule that errors out in numpy. Say you have n tokens and each Q/K vector has length d_k. Stacked, Q is n by d_k and K is n by d_k. To multiply two matrices, the column count of the first must equal the row count of the second, but here that would be d_k against n. Transposing K flips it to d_k by n, so Q (n by d_k) times K^T (d_k by n) gives an n by n score grid, one score for every token pair. Transpose is just the shape fix.
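The shape rule can be demonstrated without numpy. The helpers `shape`, `transpose`, and `matmul` are small stand-ins written for this example, and the matrices are made up with n = 3 and d_k = 2.

```python
def shape(m):
    return (len(m), len(m[0]))

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    # Inner dimensions must match: columns of a against rows of b.
    if shape(a)[1] != shape(b)[0]:
        raise ValueError(f"shape mismatch: {shape(a)} times {shape(b)}")
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

Q = [[1, 0], [0, 1], [1, 1]]  # 3 by 2
K = [[1, 2], [3, 4], [5, 6]]  # 3 by 2

try:
    matmul(Q, K)              # inner sizes 2 and 3 differ
except ValueError as err:
    print(err)                # shape mismatch: (3, 2) times (3, 2)

scores = matmul(Q, transpose(K))  # (3, 2) times (2, 3)
print(shape(scores))              # (3, 3)
```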
#transpose #shapes #matmul #QK
Hands-on · concept

Why divide the attention scores by sqrt(d_k) before softmax?

Here d_k means the length of each Query/Key array (the dimension of those vectors). When you dot two long vectors, you're summing many products, so the totals can get large just because the vectors are long, not because relevance is genuinely high. Big numbers make softmax too spiky (one token hogs everything) and training unstable. Dividing by the square root of d_k scales the scores back to a reasonable range. Why the square root: if the components are roughly independent with unit variance, the typical size of the dot product grows like sqrt(d_k), so this division brings it back to about 1. Treat it as housekeeping to keep the numbers tame, not a derivation you need to memorize.
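A quick experiment can make the growth visible. It assumes the vector components are independent random numbers with mean 0 and variance 1, which is a simplification chosen for this sketch and not a property of any specific model. The function name `typical_dot_size` is made up.

```python
import math
import random

rng = random.Random(0)  # seeded so reruns give the same numbers

def typical_dot_size(d_k, trials=2000):
    """Root-mean-square dot product of two random vectors of length d_k."""
    total = 0.0
    for _ in range(trials):
        q = [rng.gauss(0, 1) for _ in range(d_k)]
        k = [rng.gauss(0, 1) for _ in range(d_k)]
        total += sum(a * b for a, b in zip(q, k)) ** 2
    return math.sqrt(total / trials)

for d_k in (4, 16, 64):
    raw = typical_dot_size(d_k)
    # Columns: d_k, typical raw score, typical score after dividing by sqrt(d_k)
    print(d_k, round(raw, 2), round(raw / math.sqrt(d_k), 2))
```

Under that assumption the raw column should come out near 2, 4, and 8, while the scaled column stays close to 1 regardless of d_k.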
#scaling #sqrt-dk #softmax #stability
Hands-on · how-to

Can you show me a tiny end-to-end attention blend with real numbers?

Say one token's softmaxed weights over 3 tokens are [0.7, 0.2, 0.1] and the Values are 2-long arrays: V1=[1,0], V2=[0,1], V3=[1,1]. Output = 0.7*[1,0] + 0.2*[0,1] + 0.1*[1,1]. Component 0: 0.7*1 + 0.2*0 + 0.1*1 = 0.8. Component 1: 0.7*0 + 0.2*1 + 0.1*1 = 0.3. So output = [0.8, 0.3]: mostly V1, lightly tinted by V2 and V3. That's attention producing one context-aware vector.
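The same blend in a few lines of Python, using the numbers from this example. The variable names are arbitrary.

```python
weights = [0.7, 0.2, 0.1]
values = [[1, 0], [0, 1], [1, 1]]  # V1, V2, V3

# For each component c, sum weight * value[c] across the three tokens.
output = [sum(w * v[c] for w, v in zip(weights, values))
          for c in range(len(values[0]))]

print([round(x, 3) for x in output])  # [0.8, 0.3]
```

The rounding is only there to hide floating-point noise such as 0.30000000000000004.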
#worked-example #weighted-sum #values #arithmetic
Hands-on · decision

Softmax shows up in two different places in an LLM — what are they, and are they the same operation?

Yes, identical operation, two jobs. Inside attention, softmax turns each row of relevance scores into attention WEIGHTS (how much focus each token gets), summing to 1. At the very end, softmax turns the final logits over the whole vocabulary into next-token PROBABILITIES, also summing to 1. Same exp()-then-divide math both times. The only difference is what the array means: "focus distribution over context tokens" vs. "probability distribution over possible next words."
#softmax #attention-weights #next-token #comparison
Hands-on · concept

How does attention secretly reuse dot products, softmax, and matrix multiply all at once?

It's literally the other two topics stacked. The dot products (Q times K^T) measure similarity — same operation behind cosine similarity and semantic search. Softmax turns those similarities into clean weights that sum to 1. Then a matrix multiply (weights times V) does the weighted sum — the same nested-loop-of-dot-products that any neural layer runs on a GPU. So attention = similarity + weighting + weighted blend. If those three felt separate before, attention is where they finally click together.
#attention #dot-product #softmax #matmul
Gotchas · gotcha

What's a common gotcha when computing softmax in real code?

exp() of a large logit overflows — exp(1000) is Infinity, and Infinity / Infinity gives NaN, silently poisoning your output. The standard fix is the "max trick": subtract the largest logit from every logit before exponentiating. So you compute exp(x[i] - max(x)) instead of exp(x[i]). The result is mathematically identical (the constant cancels in the divide) but the numbers stay small and safe. Every real softmax implementation does this — you rarely write raw exp(x).
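A sketch of the max trick in Python. Note that Python's `math.exp(1000)` raises `OverflowError` instead of returning infinity, but the cure is the same. The function name `stable_softmax` is just a label for this example.

```python
import math

def stable_softmax(logits):
    m = max(logits)
    # After subtracting the max, the largest exponent is exp(0) = 1.
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

big = [1000.0, 999.0, 998.1]  # same gaps as [2.0, 1.0, 0.1], shifted up by 998
print([round(p, 3) for p in stable_softmax(big)])  # [0.659, 0.242, 0.099]
```

The result matches the small-number example because only the gaps between logits matter, not their absolute size.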
#softmax #overflow #numerical-stability #gotcha
Gotchas · gotcha

Does softmax ever give a token exactly 0 probability?

No, and that's a feature. Because exp() of any finite number is strictly positive, every option ends up with a tiny but nonzero probability after dividing (in exact math, that is; in floating point an extremely low score can still round down to 0). So even the model's least-favorite next word has a sliver of a chance (unless you force greedy/temperature-0 decoding). That's why high temperature can occasionally surface a surprising word: it was never truly impossible, just very unlikely. If you genuinely want to forbid tokens, you mask them to negative infinity BEFORE softmax so their exp() becomes 0.
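Masking can be sketched like this. The function `masked_softmax` and its `banned` parameter are hypothetical names, and the sketch assumes at least one token is left unbanned.

```python
import math

def masked_softmax(logits, banned):
    """Softmax where the indexes in `banned` are forced to probability 0."""
    masked = [-math.inf if i in banned else x for i, x in enumerate(logits)]
    m = max(masked)                           # finite, as long as one token survives
    exps = [math.exp(x - m) for x in masked]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

probs = masked_softmax([2.0, 1.0, 0.1], banned={1})
print([round(p, 3) for p in probs])  # [0.87, 0.0, 0.13]
```

The banned token's share is redistributed across the remaining tokens, so the result still sums to 1.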
#softmax #probability #masking #gotcha