Interactive Explainer
Attention, Calculated
Pick an ambiguous word, drag its Query vector, and watch a Transformer figure out which neighbors to listen to—with every dot product, softmax weight, and value blend computed live as you move the arrows.
The ambiguity problem
A word in isolation is almost meaningless. Consider the word bank. Is it a river bank or a money bank? The word alone can't tell you—you have to look at its neighbors.
Self-attention is how Transformers let each word ask around. Every word in a sentence says to every other word: "Hey, how much should I care about you?" The answer comes back as a single number between 0 and 1. Then each word rebuilds itself as a weighted mix of its neighbors' meanings.
This page makes all of that explicit. By the end you'll have pushed raw scores through a softmax, watched a blended value vector form, and built the intuition for why attention handles disambiguation so much more naturally than fixed-window or strictly sequential architectures.
Pick your sentence
We'll work through a handful of classic ambiguous words. Pick one and the whole page follows it:
The focus word bank needs to figure out whether it is water-related or money-related by listening to its neighbors.
Three roles for every word
In a Transformer, every word is associated with three separate vectors, each playing a different role:
- Query ($Q$): "What am I looking for?" This is the perspective of the word that wants to update itself.
- Key ($K$): "What do I offer?" Every other word advertises its flavor here.
- Value ($V$): "If you decide to attend to me, here is the actual meaning I contribute."
Queries and keys determine how much to attend. Values determine what gets mixed in. Keeping those two roles apart is what makes attention expressive.
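In code, the three roles come from three separate learned projection matrices applied to the same word embeddings. A minimal NumPy sketch, with toy dimensions and random matrices standing in for trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4                  # embedding size and head size (toy values)
x = rng.normal(size=(5, d_model))    # 5 words, one embedding each

# Three separate projections -- in a real model these are learned.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = x @ W_q, x @ W_k, x @ W_v  # every word gets all three roles
print(Q.shape, K.shape, V.shape)     # (5, 4) three times
```

Because the three matrices are independent, a word can *search* for one thing (its Query), *advertise* another (its Key), and *contribute* a third (its Value).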
Dot product = compatibility score
How do we measure how well a Query matches a Key? We use the dot product. If the two vectors point the same way, the score is large and positive. If they're perpendicular, zero. If they point opposite ways, negative.
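The three cases take one line each to check in NumPy (toy 2-D vectors, like the arrows on the canvas):

```python
import numpy as np

q = np.array([1.0, 0.0])              # the Query arrow

same     = np.dot(q, [1.0, 0.0])      # same direction:  1.0
perp     = np.dot(q, [0.0, 1.0])      # perpendicular:   0.0
opposite = np.dot(q, [-1.0, 0.0])     # opposite:       -1.0
print(same, perp, opposite)           # 1.0 0.0 -1.0
```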
Drag the orange Query arrow for the focus word, and the blue Key arrows for every other word. Watch the raw scores update live below. Drag the query so it aligns with one specific key and you'll see that word win.
| Word | Query vector | Key vector | Dot product |
|---|---|---|---|
Softmax turns scores into percentages
Raw dot products aren't weights—they can be negative, or gigantic. We need them to behave like probabilities that sum to one. Softmax does exactly that:

$$w_i = \frac{e^{s_i}}{\sum_j e^{s_j}}$$
Exponentiating amplifies differences: a score that's just a bit higher than the others rockets upward, dominating the weights. This is why softmax is called "soft" argmax—it picks a winner, but still passes non-zero signal to everyone else.
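You can see the amplification with a three-line softmax and some made-up scores:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())   # subtracting the max avoids overflow
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5])   # toy raw dot products
w = softmax(scores)
print(w)          # the leader takes well over half the weight...
print(w.sum())    # ...but every word keeps a non-zero share, summing to 1
```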
Step-by-step calculation (live)
Here's the full pipeline applied to your current dragged arrangement. Every number comes from the canvas above:
| Word | Raw score | exp(score) | Weight (softmax) |
|---|---|---|---|
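Off the canvas, the same three columns can be reproduced in a few lines of NumPy. The vectors below are made-up stand-ins for whatever you dragged:

```python
import numpy as np

q = np.array([0.9, 0.2])              # the focus word's Query (toy values)
keys = np.array([[1.0, 0.0],          # one Key vector per neighbor
                 [0.0, 1.0],
                 [-0.5, 0.5]])

scores = keys @ q                     # column 1: raw dot products
exps = np.exp(scores)                 # column 2: exp(score)
weights = exps / exps.sum()           # column 3: softmax weights
for s, e, w in zip(scores, exps, weights):
    print(f"{s:7.2f} {e:7.2f} {w:7.2%}")
```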
Blend the values into a new meaning
Now the payoff. The focus word rebuilds its meaning as a weighted sum of the Value vectors, using the softmax weights we just computed:

$$V_{\text{final}} = \sum_i w_i V_i$$
Below, the teal arrows are the value vectors for every word (drag them to simulate different "meanings"). The dashed orange arrow is $V_{\text{final}}$—the new contextual meaning for the focus word. Watch it slide toward whichever value has the highest attention weight.
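The blend itself is a single matrix product: the weight vector times the stacked Value vectors. With made-up numbers:

```python
import numpy as np

weights = np.array([0.7, 0.2, 0.1])   # softmax weights (toy values)
values = np.array([[ 1.0,  0.5],      # each row is one word's Value vector
                   [-1.0,  0.5],
                   [ 0.0, -0.2]])

v_final = weights @ values            # the dashed orange arrow
print(v_final)                        # pulled toward the 0.7-weighted row
```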
Run it on all five sentences
Here's one more look at the pipeline, this time fully automated. The table below evaluates the same Q, K, V vectors (the pre-set "sensible" ones for each ambiguous word) through the entire attention formula and shows which neighbor dominates. Click a different sentence to see a different winner.
| Sentence | Focus word | Top-attended neighbor | Attention % | Interpretation |
|---|---|---|---|---|
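All of the steps above fold into one short function. This sketch follows the standard scaled dot-product formulation, which also divides the scores by $\sqrt{d_k}$ (a detail the hands-on version above omits to keep the arrows readable):

```python
import numpy as np

def attention(Q, K, V):
    """One head of scaled dot-product attention (no batching, no masking)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # all pairwise compatibilities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights                     # new meanings + who attends to whom

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(5, 4))                 # 5 words, 4-dim head (toy)
out, w = attention(Q, K, V)
print(out.shape, w.shape)                           # (5, 4) (5, 5)
```

Note that the weight matrix is $5 \times 5$: every word attends to every word, including itself.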
Three things attention is not
"Attention is just a weighted average, nothing new."
The weights themselves are computed from the data
through $QK^\top$—they depend on every pair of tokens. No
static weighted sum gives you that. It's a function whose
coefficients are learned on the fly.
"Attention tells you what the model cares about."
Attention weights are one signal inside a deep model.
Cutting a low-weight neighbor doesn't necessarily change the
prediction, and the same logit can be reached via many weight
patterns. Reading them as explanations is risky.
"Every word attends to every other word equally in cost."
Self-attention is $O(n^2)$ in sequence length: doubling the
sentence quadruples the compute. That's why long-context models
use tricks (sparse, linear, sliding-window) to cut this down.
What we swept under the rug
Everything you built above is one head of one layer. Real Transformers stack a bunch of these:
- Multi-head attention. Run the same Q/K/V game 8 or 16 times in parallel, each with different learned projections. Different heads end up specializing—some pick up syntax, some coreference, some long-range dependencies.
- Many layers. The output of one attention block becomes the input to the next. After a few layers, "bank" doesn't just carry its own meaning plus its neighbors'—it carries meaning shaped by chains of inference across the whole sentence.
- Positional encodings. Attention itself is permutation-invariant—it has no idea which word came first. So we add a position signal to the embeddings before the first attention block.
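The sinusoidal encodings from the original Transformer paper are one common choice of position signal; a compact sketch:

```python
import numpy as np

def positional_encoding(n_pos, d_model):
    """Sinusoidal position signal, added to embeddings before attention."""
    pos = np.arange(n_pos)[:, None]       # positions 0..n_pos-1
    i = np.arange(d_model)[None, :]       # embedding dimensions
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    # Even dimensions get sine, odd dimensions get cosine.
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

pe = positional_encoding(10, 8)
print(pe.shape)   # one d_model-sized vector per position
```

Each position gets a unique fingerprint of sines and cosines at different frequencies, so attention can recover word order even though the mechanism itself ignores it.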