
Interactive Explainer

Attention, Calculated

Pick an ambiguous word, drag its Query vector, and watch a Transformer figure out which neighbors to listen to—with every dot product, softmax weight, and value blend computed live as you move the arrows.

Prelude

The ambiguity problem

A word in isolation is almost meaningless. Consider the word bank. Is it a river bank or a money bank? The word alone can't tell you—you have to look at its neighbors.

Self-attention is how Transformers let each word ask around. Every word in a sentence says to every other word: "Hey, how much should I care about you?" The answer comes back as a single number between 0 and 1. Then each word rebuilds itself as a weighted mix of its neighbors' meanings.

This page makes all of that explicit. By the end you'll have pushed raw scores through a softmax, watched a blended value vector form, and built the intuition for why attention makes Transformers so good at disambiguation.

Pick your sentence

We'll work through a handful of classic ambiguous words. Pick one and the whole page follows it:

The focus word bank needs to figure out whether it is water-related or money-related by listening to its neighbors.

Step 1

Three roles for every word

In a Transformer, every word is associated with three separate vectors, each playing a different role:

Queries and keys determine how much to attend. Values determine what gets mixed in. Keeping those two roles apart is what makes attention expressive.

The library analogy. Imagine walking into a library with a question written on a sticky note (your query). Every book has a spine label (key) and a bunch of text inside (value). You match your sticky note to the spine labels, pull out the books that fit, and read their text. That's one step of attention.
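Those three roles can be sketched in a few lines of numpy. The embeddings and projection matrices below are random stand-ins (not the page's preset vectors): each word's embedding is multiplied by three separate learned matrices to produce its query, key, and value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings: a 4-word sentence, one 2-D vector per word.
X = rng.normal(size=(4, 2))

# Three separate learned projections give each word its three roles.
W_Q, W_K, W_V = rng.normal(size=(3, 2, 2))

Q = X @ W_Q  # what each word is asking for
K = X @ W_K  # what each word advertises on its "spine label"
V = X @ W_V  # what each word contributes if attended to

print(Q.shape, K.shape, V.shape)  # three (4, 2) matrices
```

Because the three projections are independent, a word can advertise one thing (key) while contributing another (value), which is exactly the separation the text above describes.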
Step 2

Dot product = compatibility score

How do we measure how well a Query matches a Key? We use the dot product. If the two vectors point the same way, the score is large and positive. If they're perpendicular, zero. If they point opposite ways, negative.

Drag the orange Query arrow for the focus word, and the blue Key arrows for every other word. Watch the raw scores update live below. Drag the query so it aligns with one specific key and you'll see that word win.

2-D stand-in for the real high-dimensional Q/K space. All arrows are constrained to the unit circle.
Live table columns: Word · Query vector · Key vector · Dot product
Scaled dot product in real Transformers. In production, the dot product is divided by $\sqrt{d_k}$ (where $d_k$ is the key dimension). In high dimensions, raw dot products grow large and saturate the softmax toward one-hot, killing the gradient; the scale factor keeps gradients healthy. We ignore it in our 2-D demo because the effect is tiny.
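The geometry above is easy to check numerically. A quick sketch, with vectors chosen for illustration:

```python
import numpy as np

q = np.array([1.0, 0.0])            # query pointing along x

print(q @ np.array([1.0, 0.0]))     # aligned key     ->  1.0
print(q @ np.array([0.0, 1.0]))     # perpendicular   ->  0.0
print(q @ np.array([-1.0, 0.0]))    # opposite key    -> -1.0

# The scaled variant used in real Transformers:
d_k = 64
rng = np.random.default_rng(1)
q64, k64 = rng.normal(size=(2, d_k))
scaled_score = (q64 @ k64) / np.sqrt(d_k)
```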
Step 3

Softmax turns scores into percentages

Raw dot products aren't weights—they can be negative, or gigantic. We need them to behave like probabilities that sum to one. Softmax does exactly that:

$$w_i = \frac{e^{s_i}}{\sum_j e^{s_j}}$$

Exponentiating amplifies differences: a score that's just a bit higher than the others rockets upward, dominating the weights. This is why softmax is called "soft" argmax—it picks a winner, but still passes non-zero signal to everyone else.

Step-by-step calculation (live)

Here's the full pipeline applied to your current dragged arrangement. Every number comes from the canvas above:

Live table columns: Word · Raw score · exp(score) · Weight (softmax)
Notice the winner-take-most behavior. Make one of your dot products visibly larger than the rest. You'll see its softmax weight jump above 80% while the others shrink to single digits. Now nudge another key vector toward the query—the new word steals a huge chunk of attention even for a tiny tilt. Small changes in scores mean large changes in weights.
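The same pipeline fits in a few lines of numpy. The scores here are made up; try nudging the runner-up upward to reproduce the steal effect:

```python
import numpy as np

def softmax(scores):
    # Subtracting the max is a standard numerical-stability trick;
    # it doesn't change the resulting weights.
    exps = np.exp(scores - scores.max())
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.5, -1.0])
weights = softmax(scores)
print(weights.round(3))   # the top score takes most of the mass
print(weights.sum())      # 1.0 (up to float error)

# Nudge the runner-up slightly: it steals a big chunk of weight.
print(softmax(np.array([2.0, 1.9, 0.5, -1.0])).round(3))
```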
Step 4

Blend the values into a new meaning

Now the payoff. The focus word rebuilds its meaning as a weighted sum of the Value vectors, using the softmax weights we just computed:

$$V_{\text{final}} = \sum_i w_i V_i$$

Below, the teal arrows are the value vectors for every word (drag them to simulate different "meanings"). The dashed orange arrow is $V_{\text{final}}$—the new contextual meaning for the focus word. Watch it slide toward whichever value has the highest attention weight.

Blend of value vectors by the current attention weights. Dashed orange: the output $V_{\text{final}}$.
Why keys and values are kept separate. At first glance you might ask: why not use the value as both the key and the thing to be averaged? Because then you couldn't express "attend to the word river because it's a water cue, but grab its broader geography meaning, not the raw token." Keys ask "does this match?"; values answer "here's what to contribute if I do."
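Numerically, the blend is a single matrix-vector product. A small sketch with made-up weights and values:

```python
import numpy as np

weights = np.array([0.72, 0.18, 0.07, 0.03])  # softmax weights (sum to 1)
V = np.array([[ 0.9,  0.1],                   # one value vector per word
              [-0.2,  0.8],
              [ 0.5, -0.5],
              [ 0.0,  1.0]])

# The focus word's new contextual meaning: a weighted sum of values.
v_final = weights @ V
print(v_final)  # pulled strongly toward the heavily attended first value
```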
Step 5

Run it on all five sentences

Here's one more look at the pipeline, this time fully automated. The table below evaluates the same Q, K, V vectors (the pre-set "sensible" ones for each ambiguous word) through the entire attention formula and shows which neighbor dominates. Click a different sentence to see a different winner.

Table columns: Sentence · Focus word · Top-attended neighbor · Attention % · Interpretation
Observe. In every one of the five sentences, the ambiguous word's query lines up with exactly one neighbor that disambiguates it. "Bank" listens to "river" in one case, to "money" in another. "Apple" tunes in to "pie" versus "phone". That's disambiguation by attention—no hand-coded rules, just vector alignment.
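The whole pipeline (scores, softmax, blend) fits in one small function. A sketch with random Q/K/V rather than the page's presets:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention for one head (sketch)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # Step 2: compatibility scores
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)        # Step 3: softmax each row
    return w @ V, w                           # Step 4: blend the values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 5, 2))  # 5 words, 2-D vectors

out, w = attention(Q, K, V)
winners = w.argmax(axis=-1)  # each word's top-attended neighbor
print(out.shape, winners)
```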
Step 6

Three things attention is not

Myth

"Attention is just a weighted average, nothing new."
The weights themselves are computed from the data through $QK^\top$—they depend on every pair of tokens in the input. No static weighted sum gives you that: it's a weighted average whose coefficients are computed on the fly, anew for every input.

Myth

"Attention tells you what the model cares about."
Attention weights are one signal inside a deep model. Cutting a low-weight neighbor doesn't necessarily change the prediction, and the same logit can be reached via many weight patterns. Reading them as explanations is risky.

Myth

"Every word attends to every other word equally in cost."
Self-attention is $O(n^2)$ in sequence length: doubling the sentence quadruples the compute. That's why long-context models use tricks (sparse, linear, sliding-window) to cut this down.
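The quadratic blow-up is easy to see by counting score-matrix entries:

```python
# One score per (query, key) pair: the attention matrix is n x n.
for n in (128, 256, 512):
    print(n, n * n)  # doubling n quadruples the score count
```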

Step 7

What we swept under the rug

Everything you built above is one head of one layer. Real Transformers stack a bunch of these: multiple heads attending in parallel, and many layers in depth, with residual connections and feed-forward blocks in between.
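A rough sketch of the "in parallel" part: several heads, each with its own projections, attend independently and their outputs are concatenated. Real models also re-project the concatenated result and repeat the whole block across layers; the dimensions here are made up for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head(X, n_heads=2, d_head=2, seed=0):
    rng = np.random.default_rng(seed)
    outs = []
    for _ in range(n_heads):
        # Each head gets its own Q/K/V projections and attends on its own.
        W_Q, W_K, W_V = rng.normal(size=(3, X.shape[1], d_head))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        w = softmax(Q @ K.T / np.sqrt(d_head))
        outs.append(w @ V)
    return np.concatenate(outs, axis=-1)  # head outputs side by side

X = np.random.default_rng(1).normal(size=(4, 4))  # 4 words, 4-D embeddings
print(multi_head(X).shape)  # (4, n_heads * d_head) = (4, 4)
```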

Final takeaway. Self-attention is a Transformer's universal way of asking "which neighbors should I listen to?" You just computed every step of that answer by hand. Stacking this operation many times, in parallel and in depth, is what builds GPT-level understanding from raw tokens.