Question 1

What is a matrix, in terms I already know as a web dev?

Accepted Answer

A matrix is just a 2D array — an array of arrays. Each inner array is a row. So [[1, 2, 3], [4, 5, 6]] is a matrix with 2 rows and 3 columns, exactly like a small spreadsheet or a grid you'd render as an HTML table. A single array like [1, 2, 3] is a vector (one row). That's the whole idea: a matrix is a grid of numbers you can loop over with two indices, m[row][col].
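
As a rough illustration of the two-index loop described above, here is a minimal sketch. The variable names (m, cells) are chosen only for this example.

```javascript
// A 2 x 3 matrix as an array of arrays: each inner array is one row.
const m = [
  [1, 2, 3],
  [4, 5, 6],
];

// Visit every cell with two indices: row first, then column.
const cells = [];
for (let row = 0; row < m.length; row++) {
  for (let col = 0; col < m[row].length; col++) {
    cells.push(`m[${row}][${col}] = ${m[row][col]}`);
  }
}

console.log(cells.join("\n"));
console.log(m[1][2]); // 6: second row, third column
```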

Question 2

What does the "shape" of a matrix mean?

Accepted Answer

Shape is just rows-by-columns, written rows x columns. The matrix [[1, 2, 3], [4, 5, 6]] has 2 rows and 3 columns, so its shape is 2 x 3. In code it's m.length (rows) by m[0].length (columns). Shape is the single most important thing to track in LLM math — almost every error you'll hit is a shape mismatch, the same way a wrong array length breaks a loop. Get in the habit of saying the shape out loud.
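
One way to make "say the shape out loud" a habit is a tiny helper. This is a sketch, not a library function: the name shape is made up here, and it assumes a non-empty matrix whose rows should all be the same length.

```javascript
// Returns [rows, columns] for a 2D array.
// Assumes a non-empty matrix; throws if the rows have different lengths.
function shape(m) {
  const rows = m.length;
  const cols = m[0].length;
  for (const row of m) {
    if (row.length !== cols) {
      throw new Error("ragged matrix: rows have different lengths");
    }
  }
  return [rows, cols];
}

console.log(shape([[1, 2, 3], [4, 5, 6]])); // [ 2, 3 ]
```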

Question 3

What is a dot product, and how do I compute one?

Accepted Answer

A dot product multiplies two equal-length arrays position-by-position, then sums the results into a single number. For a = [1, 2, 3] and b = [4, 5, 6]: 1*4 + 2*5 + 3*6 = 4 + 10 + 18 = 32. In code it's a one-line loop: let s = 0; for (let i = 0; i < a.length; i++) s += a[i] * b[i];. Hold onto this — the dot product is the atom that every matrix multiply and every attention score is built from.
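
The inline loop above can be wrapped in a small reusable function. The name dot and the length check are additions for illustration.

```javascript
// Multiplies two equal-length arrays position by position and sums the results.
function dot(a, b) {
  if (a.length !== b.length) {
    throw new Error(`length mismatch: ${a.length} vs ${b.length}`);
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

console.log(dot([1, 2, 3], [4, 5, 6])); // 32
```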

Question 4

What are "weights" really — where do those numbers come from?

Accepted Answer

Weights are just the numbers inside a layer's matrix, and they're what the model learned during training. Before training they're set to small random values; training nudges each one, step by step, in whatever direction makes the network's outputs less wrong. When people say a model has "7 billion parameters," they mean roughly 7 billion learned numbers (mostly weights, plus a few biases) spread across all its matrices. So a trained LLM is, concretely, a huge pile of saved matrices full of these learned numbers: that file you download IS the weights.

Question 5

What is a transpose?

Accepted Answer

Transpose flips a matrix over its diagonal: rows become columns and columns become rows. [[1, 2, 3], [4, 5, 6]] (shape 2 x 3) transposed is [[1, 4], [2, 5], [3, 6]] (shape 3 x 2). In code, the transposed cell t[j][i] equals the original m[i][j]. Nothing is added or lost — it's the same numbers re-laid-out. You'll reach for it constantly to make shapes line up for a multiply.
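
The rule t[j][i] = m[i][j] translates almost directly into code. A minimal sketch, assuming a non-empty rectangular matrix (the function name transpose is our own):

```javascript
// Builds a new matrix t where t[j][i] = m[i][j].
// Assumes m is non-empty and rectangular.
function transpose(m) {
  const rows = m.length;
  const cols = m[0].length;
  const t = [];
  for (let j = 0; j < cols; j++) {
    t.push(new Array(rows));
    for (let i = 0; i < rows; i++) {
      t[j][i] = m[i][j];
    }
  }
  return t;
}

console.log(transpose([[1, 2, 3], [4, 5, 6]])); // [ [ 1, 4 ], [ 2, 5 ], [ 3, 6 ] ]
```

Transposing twice gets you back to the original layout, which is a handy sanity check.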

Question 6

What is an embedding, as an array?

Accepted Answer

An embedding is just an array of numbers that captures the meaning of something: a word, token, sentence, or image. The token "cat" might become [0.21, -0.6, 0.05, ...] with, say, 768 numbers. Similar meanings land on nearby arrays. The model turns text into these arrays first, then does all its matrix math on them. So an embedding is the bridge from "words" to "numbers a layer can multiply": meaning, stored as coordinates. This is also what powers retrieval in RAG: compare the query's embedding with each document's embedding to find related text.
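
To make "compare embedding arrays" concrete, here is a sketch using cosine similarity, one common way to measure closeness (the text above does not prescribe a specific measure, so treat that choice as an assumption). The three-number vectors are invented for illustration; real embeddings are much longer and come from a model.

```javascript
// Cosine similarity: dot product divided by the two vector lengths.
// Close to 1 means "pointing the same way"; close to 0 means unrelated.
function cosineSimilarity(a, b) {
  let dotProduct = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Made-up toy embeddings (hypothetical values, not from any real model).
const cat = [0.9, 0.1, 0.2];
const kitten = [0.8, 0.2, 0.3];
const car = [0.1, 0.9, 0.4];

console.log(cosineSimilarity(cat, kitten) > cosineSimilarity(cat, car)); // true
```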

Question 7

What does matrix-times-vector actually compute, and why is it a neural net layer?

Accepted Answer

A layer takes an input vector and produces an output vector by output = weights * input + bias. The weights are a matrix the model learned during training; each output number is one dot product of a weights row with the input, plus a bias number. So a 2 x 3 weights matrix turns a length-3 input into a length-2 output. That single operation — matrix times vector plus bias — is literally what one dense (fully-connected) layer does. Stack many and you have a network.

Question 8

Walk me through a tiny matrix-times-vector with real numbers.

Accepted Answer

Take weights W = [[1, 0, 2], [3, 1, 0]] (shape 2 x 3), input x = [4, 5, 6], bias b = [10, 20]. Output row 0: 1*4 + 0*5 + 2*6 = 4 + 0 + 12 = 16, then + 10 = 26. Output row 1: 3*4 + 1*5 + 0*6 = 12 + 5 + 0 = 17, then + 20 = 37. So output = [26, 37]. A length-3 input became a length-2 output — that's one layer firing.
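
The same walkthrough can be checked in code. This is a minimal sketch of output = weights * input + bias with no nonlinearity applied; the function name denseLayer is made up for this example.

```javascript
// One dense layer: each output is a dot product of a weights row
// with the input, plus that row's bias.
function denseLayer(weights, input, bias) {
  const output = [];
  for (let i = 0; i < weights.length; i++) {
    if (weights[i].length !== input.length) {
      throw new Error(`row ${i} has ${weights[i].length} weights but input has ${input.length} numbers`);
    }
    let sum = 0;
    for (let j = 0; j < input.length; j++) {
      sum += weights[i][j] * input[j];
    }
    output.push(sum + bias[i]);
  }
  return output;
}

const W = [[1, 0, 2], [3, 1, 0]]; // shape 2 x 3
const x = [4, 5, 6];              // length 3
const b = [10, 20];               // one bias per output

console.log(denseLayer(W, x, b)); // [ 26, 37 ]
```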

Question 9

How does matrix-times-matrix work — what's the rule for each output cell?

Accepted Answer

Each output cell is the dot product of a row from the first matrix and a column from the second. Output cell at [i][j] = dot of row i of A with column j of B. So if A is 2 x 3 and B is 3 x 2, you fill a 2 x 2 result, and each of the 4 cells is one dot product over 3 numbers. In code it's three nested loops: outer over rows of A, middle over columns of B, inner doing the dot product.
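
The three nested loops look roughly like this. It is a naive sketch for learning purposes (real libraries are far more optimized), and the sample matrices are made up to match the 2 x 3 times 3 x 2 case.

```javascript
// Naive matrix multiply: out[i][j] = dot(row i of a, column j of b).
function matmul(a, b) {
  const m = a.length;    // rows of a
  const n = a[0].length; // columns of a, must equal rows of b
  const p = b[0].length; // columns of b
  if (b.length !== n) {
    throw new Error(`cannot multiply (${m}x${n}) and (${b.length}x${p})`);
  }
  const out = [];
  for (let i = 0; i < m; i++) {       // outer: rows of a
    const row = [];
    for (let j = 0; j < p; j++) {     // middle: columns of b
      let sum = 0;
      for (let k = 0; k < n; k++) {   // inner: the dot product
        sum += a[i][k] * b[k][j];
      }
      row.push(sum);
    }
    out.push(row);
  }
  return out;
}

const A = [[1, 2, 3], [4, 5, 6]];   // 2 x 3
const B = [[1, 0], [0, 1], [1, 1]]; // 3 x 2
console.log(matmul(A, B)); // [ [ 4, 5 ], [ 10, 11 ] ]
```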

Question 10

Show me a full 2x2 times 2x2 multiply with real numbers.

Accepted Answer

A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]]. Top-left = row0 dot col0 = 1*5 + 2*7 = 5 + 14 = 19. Top-right = row0 dot col1 = 1*6 + 2*8 = 6 + 16 = 22. Bottom-left = row1 dot col0 = 3*5 + 4*7 = 15 + 28 = 43. Bottom-right = row1 dot col1 = 3*6 + 4*8 = 18 + 32 = 50. Result = [[19, 22], [43, 50]]. Four cells, four dot products.

Question 11

What's the shape rule for multiplying two matrices?

Accepted Answer

The inner dimensions must match: an m x n times an n x p gives an m x p. The two ns in the middle have to be equal — that's the length of every dot product. The outer numbers m and p become the result's shape. Example: 2 x 3 times 3 x 4 works and gives 2 x 4; but 2 x 3 times 2 x 4 errors, because 3 doesn't equal 2. It's the math version of "array lengths must line up."
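
The rule is simple enough to check before doing any arithmetic. A sketch, with a hypothetical helper name (resultShape) and shapes passed as [rows, columns] pairs:

```javascript
// Given two shapes as [rows, columns], returns the shape of their product
// or throws if the inner dimensions differ.
function resultShape(left, right) {
  const [m, n] = left;
  const [n2, p] = right;
  if (n !== n2) {
    throw new Error(`cannot multiply (${m}x${n}) and (${n2}x${p}): inner dimensions ${n} and ${n2} differ`);
  }
  return [m, p];
}

console.log(resultShape([2, 3], [3, 4])); // [ 2, 4 ]

try {
  resultShape([2, 3], [2, 4]);
} catch (err) {
  console.log(err.message); // cannot multiply (2x3) and (2x4): inner dimensions 3 and 2 differ
}
```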

Question 12

Why is almost all of a neural net / transformer just matrix multiplications?

Accepted Answer

Because every layer's core job is output = weights * input (plus a bias and a simple nonlinearity applied to each number). A transformer stacks these: attention is matrix multiplies (Q times K-transpose, a softmax, then times V), and the feed-forward blocks are matrix multiplies too. So running an LLM is mostly doing one big matrix multiply after another, layer after layer, for every token it generates. If you understand matrix multiply, you understand the load-bearing bulk of what the model is doing when it generates a token.

Question 13

Why does a bigger matrix mean a more capable but more expensive model?

Accepted Answer

A bigger matrix has more cells, and each cell is a learned weight — a parameter. More parameters give the model more room to store patterns, so it can be more capable. But every extra cell is extra multiply-add work at every step and extra memory to hold, so bigger means slower and pricier to run. A 1000 x 1000 layer has a million weights; doubling each side to 2000 x 2000 gives four million — capacity and cost both jump.

Question 14

What does the bias do, and could a layer work without it?

Accepted Answer

The bias is a small array, one number per output, added after the matrix multiply: output = weights * input + bias. It lets each output shift up or down independently of the inputs — like the intercept in y = m*x + b. Without it, an all-zero input could only ever produce zero output, which is limiting. It's cheap (just one number per output neuron) but gives the layer extra freedom to fit the data.

Question 15

Why do I keep getting "shape mismatch" errors, and how do I read them?

Accepted Answer

Because multiplication needs the first matrix's column count to equal the second's row count. An error like "cannot multiply (2x3) and (2x4)" is telling you 3 (inner of the left) doesn't equal 2 (inner of the right). The fix is to check shapes before multiplying, just like checking a.length === b.length before a paired loop. Often the cure is transposing one matrix so the inner numbers line up — which is exactly why attention transposes K.

Question 16

Why does attention compute Q times K-transpose instead of Q times K?

Accepted Answer

Q and K are the "query" and "key" matrices, both shaped tokens x d_k (d_k is the size of each token's key/query vector). To get a score between every token and every other token, you want a tokens x tokens grid. Q * K won't multiply — inner dims are d_k and tokens, which differ. Transposing K to d_k x tokens makes the inner dims both d_k, so Q * K-transpose gives tokens x tokens: each cell is one token's query dotted with another's key — an attention score.
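
Here is a small sketch of that shape argument with made-up numbers (3 tokens, d_k = 2). The helpers transpose and matmul are written inline for this example. It only computes the raw scores; real attention also scales them by the square root of d_k, which is left out here.

```javascript
// Hypothetical helper: transpose of a rectangular 2D array.
function transpose(m) {
  return m[0].map((_, j) => m.map((row) => row[j]));
}

// Hypothetical helper: matrix multiply with a shape check.
function matmul(a, b) {
  if (a[0].length !== b.length) {
    throw new Error(`cannot multiply (${a.length}x${a[0].length}) and (${b.length}x${b[0].length})`);
  }
  return a.map((row) =>
    b[0].map((_, j) => row.reduce((sum, value, k) => sum + value * b[k][j], 0))
  );
}

// Made-up numbers: 3 tokens, each with a query/key vector of length 2.
const Q = [[1, 0], [0, 1], [1, 1]]; // tokens x d_k = 3 x 2
const K = [[1, 2], [3, 4], [5, 6]]; // tokens x d_k = 3 x 2

// Q * K would throw: inner dimensions are 2 and 3.
// Transposing K gives 2 x 3, so (3 x 2) * (2 x 3) = 3 x 3.
const scores = matmul(Q, transpose(K));
console.log(scores); // [ [ 1, 3, 5 ], [ 2, 4, 6 ], [ 3, 7, 11 ] ]
```

Cell scores[i][j] is token i's query dotted with token j's key, so the grid is tokens x tokens.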

Question 17

Why are GPUs so much faster than CPUs for AI?

Accepted Answer

Every cell of a matrix-multiply output is an independent dot product — none depends on another, so they can all be computed at the same time. That's "embarrassingly parallel." A CPU has a handful of cores doing cells almost one-by-one; a GPU has thousands of tiny cores doing thousands of multiply-add cells in parallel. Since an LLM is just stacked matrix multiplies, a GPU finishes them far faster. It's not that GPUs are smarter — they just do the same simple work massively in parallel.

Question 18

What does "batching" mean and how does it show up in the math?

Accepted Answer

Batching is processing many inputs at once by stacking them as extra rows. One input is a 1 x n row; stack 32 of them into a 32 x n matrix and one matrix multiply against the weights handles all 32 at the same time, giving 32 rows out. For the shapes to line up, the weights sit on the right as an n x outputs matrix (the transpose of the outputs x n layout used in the matrix-times-vector examples). It's like running your loop body on a whole array in one shot instead of one item per iteration. GPUs love this: more rows fill more of those parallel cores.
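
A sketch of a batched layer, assuming the weights are stored as outputs x n (the layout used in the matrix-times-vector questions), so the math is effectively batch times weights-transpose plus bias. The function name denseBatch is made up. W, b, and the first input row reuse the numbers from Question 8; the other two rows are invented.

```javascript
// Applies one dense layer to every row of a batch.
// batch: rows x n, weights: outputs x n (assumed layout), bias: length outputs.
function denseBatch(batch, weights, bias) {
  return batch.map((input) =>
    weights.map((weightRow, i) => {
      let sum = 0;
      for (let j = 0; j < input.length; j++) {
        sum += weightRow[j] * input[j];
      }
      return sum + bias[i];
    })
  );
}

const W = [[1, 0, 2], [3, 1, 0]]; // 2 outputs x 3 inputs
const b = [10, 20];
const batch = [
  [4, 5, 6],
  [1, 1, 1],
  [0, 0, 0],
]; // 3 inputs stacked as rows: shape 3 x 3

console.log(denseBatch(batch, W, b)); // [ [ 26, 37 ], [ 13, 24 ], [ 10, 20 ] ]
```

Three rows in, three rows out. The all-zero row comes back as just the bias.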

Question 19

How does the embedding lookup itself use a matrix?

Accepted Answer

There's an embedding matrix shaped vocab_size x d — one row per possible token, each row being that token's learned vector of length d. "Looking up" a token is just grabbing its row by index, like embeddingMatrix[tokenId]. No multiply needed for the lookup — it's array indexing. But that matrix is huge (tens of thousands of rows), so it's a big chunk of the model's parameters, and it's learned just like any other weights.
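
A toy version of the lookup, with an invented 4-token vocabulary and d = 3 (real embedding matrices are vastly larger, and their values are learned rather than typed in):

```javascript
// Hypothetical embedding matrix: vocab_size = 4 rows, d = 3 numbers per row.
const embeddingMatrix = [
  [0.10, 0.30, -0.20], // token id 0
  [0.21, -0.60, 0.05], // token id 1
  [-0.40, 0.12, 0.33], // token id 2
  [0.07, 0.07, -0.90], // token id 3
];

// A made-up sequence of token ids, as a tokenizer might produce.
const tokenIds = [1, 3, 1];

// "Lookup" is plain indexing: one row per token, no multiplication.
const vectors = tokenIds.map((id) => embeddingMatrix[id]);

console.log(vectors.length, vectors[0].length); // 3 3  (tokens x d)
console.log(vectors[0]); // [ 0.21, -0.6, 0.05 ]
```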

Question 20

How do I count the parameters in a single layer?

Accepted Answer

For a dense layer it's just the weights matrix size plus the biases: rows * columns + rows. A layer mapping a length-512 input to a length-256 output has a 256 x 512 weights matrix = 256 * 512 = 131072 weights, plus 256 biases, so 131328 parameters. Summing this across every layer (and the embedding matrix) gives the model's headline parameter count. It's literally counting cells in arrays.
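
The count is a one-liner in code. A sketch with a made-up function name (denseParamCount), assuming a plain dense layer with one bias per output:

```javascript
// Parameters in one dense layer: a weights matrix (outputs x inputs)
// plus one bias per output.
function denseParamCount(inputSize, outputSize) {
  const weights = outputSize * inputSize;
  const biases = outputSize;
  return weights + biases;
}

console.log(denseParamCount(512, 256)); // 131328
```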

Question 21

After Q times K-transpose gives scores, what turns them into attention weights?

Accepted Answer

Softmax. The score row for one token is a list of raw numbers; softmax turns it into positive fractions that add to 1: how much this token should "pay attention" to each other token. For scores [2, 1, 0]: exponentiate to about [7.39, 2.72, 1.0], sum to about 11.11, then divide each by the sum to get about [0.67, 0.24, 0.09]. Those weights then multiply V (the values) to blend information. The same softmax also turns the model's final scores into next-token probabilities.
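
A minimal softmax sketch for one row of scores. Subtracting the largest score before exponentiating is a standard numerical-stability trick added here; it does not change the result.

```javascript
// Turns raw scores into positive fractions that sum to 1.
function softmax(scores) {
  const max = Math.max(...scores); // stability: keeps Math.exp from overflowing
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

const attentionWeights = softmax([2, 1, 0]);
console.log(attentionWeights.map((w) => w.toFixed(2))); // [ '0.67', '0.24', '0.09' ]
```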

Question 22

Is the order of matrix multiplication important — is A times B the same as B times A?

Accepted Answer

No. Matrix multiply is not commutative. A * B and B * A usually differ, and often one of them won't even be a legal shape. With A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], A * B = [[19, 22], [43, 50]] but B * A = [[23, 34], [31, 46]]: different numbers. So in a network, layer order matters: input flows through layer 1, then layer 2, and you can't swap them. Unlike multiplying plain numbers, position matters here.
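
The two products above are easy to confirm in code. The matmul helper here is a compact sketch written for this example.

```javascript
// Hypothetical helper: matrix multiply with a shape check.
function matmul(a, b) {
  if (a[0].length !== b.length) {
    throw new Error(`cannot multiply (${a.length}x${a[0].length}) and (${b.length}x${b[0].length})`);
  }
  return a.map((row) =>
    b[0].map((_, j) => row.reduce((sum, value, k) => sum + value * b[k][j], 0))
  );
}

const A = [[1, 2], [3, 4]];
const B = [[5, 6], [7, 8]];

const AB = matmul(A, B);
const BA = matmul(B, A);

console.log(AB); // [ [ 19, 22 ], [ 43, 50 ] ]
console.log(BA); // [ [ 23, 34 ], [ 31, 46 ] ]
```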

Matrices & How a Layer Computes