
Interactive Explainer

Softmax & Temperature

The softmax function turns raw scores into probabilities. The temperature knob $T$ controls how peaky or flat that distribution is, and it is the same dial that powers creative sampling in LLMs and soft targets in knowledge distillation. This page lets you build the intuition with live formulas, live samples, and live entropy.

Prelude

Why softmax exists

Your network's last layer outputs logits: raw real-valued scores $z_1, z_2, \ldots, z_K$ for $K$ classes. They can be negative, larger than one, anything. But to do classification or sample from the model, you need a probability distribution: numbers in $[0,1]$ that sum to one.

You could just normalize: $p_k = z_k / \sum_j z_j$. But that fails if any $z_k$ is negative. You could shift then normalize. But why pick that particular shift? Softmax is the principled answer: it arises naturally from the maximum likelihood framework, and it has a beautiful gradient.

$$p_k = \mathrm{softmax}(\mathbf{z})_k = \dfrac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$

Exponentiating guarantees positivity. Dividing by the sum guarantees they add to one. The "temperature" version, which we'll meet in Step 2, scales the logits before exponentiating—and with it comes one of the most useful knobs in modern deep learning.
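If you want to poke at the formula outside this page, here is a minimal NumPy sketch (the names are mine, not any particular library's API). Subtracting the max logit before exponentiating is a standard numerical-stability trick; it changes nothing, because softmax is invariant to adding a constant to every logit:

```python
import numpy as np

def softmax(z):
    """Exponentiate, then normalize so the outputs sum to one."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # shift by max(z) to avoid overflow; result unchanged
    return e / e.sum()

print(softmax([1.0, 3.0, 0.5]))  # largest logit -> largest probability
```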

Step 1

From logits to probabilities

Edit the five logit values below. The bar chart on the right shows the resulting softmax probabilities at temperature $T = 1$. Notice how the relative order is preserved: the largest logit becomes the largest probability. Notice also how exponentiation amplifies differences—a logit gap of 2 becomes a probability ratio of $e^2 \approx 7.4$.

Bars show softmax probabilities $p_k$ at $T = 1$. Numbers above each bar are the probability values.
Try this. Set all five logits to the same value (e.g. $z_k = 0$ for all $k$). What happens? You get a uniform distribution—softmax has no preference when scores are tied. Now make $z_1$ huge (say $10$). The probability mass collapses onto class 1. That's the "soft argmax" behavior the function is named for.
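The same two experiments in code, assuming the softmax sketch from the Prelude:

```python
print(softmax([0.0, 0.0, 0.0, 0.0, 0.0]))
# [0.2 0.2 0.2 0.2 0.2] -- tied logits give the uniform distribution

print(softmax([10.0, 0.0, 0.0, 0.0, 0.0]))
# ~[0.9998, 4.5e-05, ...] -- mass collapses onto class 1: "soft argmax"
```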
Step 2

The temperature dial

Now we add a knob. Divide every logit by a positive number $T$ before taking softmax:

$$p_k = \dfrac{e^{z_k / T}}{\sum_{j=1}^{K} e^{z_j / T}}$$

As $T \to 0^+$, dividing by $T$ stretches the gaps between logits, so the largest one dominates and softmax approaches a one-hot argmax. As $T \to \infty$, every scaled logit approaches zero and the distribution approaches uniform. So $T$ smoothly interpolates between “always pick the best” and “pick uniformly at random”. Slide the temperature below and watch the distribution shift in real time.
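The dial is a one-line extension of the earlier sketch; printing a few temperatures shows the peaky-to-uniform morph:

```python
def softmax_t(z, T=1.0):
    """Temperature softmax: divide the logits by T > 0, then softmax."""
    return softmax(np.asarray(z, dtype=float) / T)

z = [2.0, 1.0, 0.2]
for T in (0.1, 1.0, 10.0):
    print(T, softmax_t(z, T))
# T=0.1  -> nearly one-hot on the largest logit
# T=1.0  -> the plain softmax of Step 1
# T=10.0 -> close to uniform
```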

Same logits as Step 1, now scaled by $1/T$. Drag the slider to see the distribution morph from peaky to uniform.
Readouts beneath the chart: entropy $H = -\sum_k p_k \log p_k$, the maximum possible entropy $\log K$ (uniform), and the confidence (top probability).
Connection to entropy. As $T$ grows, the distribution flattens and its entropy $H$ rises toward $\log K$ (the entropy of uniform). As $T \to 0$, $H \to 0$. Temperature is literally a knob on the entropy of the output distribution.
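You can check the entropy claim numerically; with the softmax_t sketch above, $H$ climbs toward $\log K$ as $T$ grows:

```python
def entropy(p):
    """Shannon entropy H = -sum(p log p), with 0 log 0 treated as 0."""
    p = p[p > 0]
    return -(p * np.log(p)).sum()

z = np.array([2.0, 1.0, 0.2, -0.5, -1.0])
for T in (0.1, 0.5, 1.0, 4.0, 100.0):
    print(f"T={T:6.1f}  H={entropy(softmax_t(z, T)):.3f}")
print("log K =", np.log(len(z)))  # entropy of the uniform distribution, ~1.609
```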
Step 3

Sampling: turning probabilities into picks

A probability distribution is just a recipe for randomness. To sample from softmax, you draw a uniform random number $u \in [0,1]$ and walk along the cumulative probabilities until the running sum first exceeds $u$.

$$\text{Pick class } k \text{ if } \sum_{j < k} p_j \le u < \sum_{j \le k} p_j$$
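In code the walk is a cumulative sum plus a search; np.searchsorted with side="right" finds exactly the class the formula picks (a sketch, reusing the softmax from the Prelude):

```python
rng = np.random.default_rng(0)

def sample(p, rng):
    """Inverse-CDF sampling: draw u, return the first class whose
    cumulative probability exceeds u."""
    u = rng.random()  # u ~ Uniform[0, 1)
    return int(np.searchsorted(np.cumsum(p), u, side="right"))

p = softmax([2.0, 1.0, 0.2, -0.5, -1.0])
print([sample(p, rng) for _ in range(10)])  # mostly class 0, occasionally others
```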

Click Draw one sample below. A vertical orange line drops onto the cumulative bar at position $u$, and the class it lands in is the pick. Try it a few times. Now slide the temperature high—you'll see every class sometimes get picked. Slide it low—always the same class.

Top bar: cumulative probabilities by class (each colored segment has width $p_k$). Orange marker: the random $u$ value. Beneath: a tally of recent draws.
Why this matters. When an LLM “writes the next word”, this is exactly what's happening: the model outputs logits over the vocabulary, softmax converts them to probabilities, and the next token is sampled. Low $T$ → safe, repetitive text. High $T$ → creative but sometimes incoherent. You have just controlled the dial that ChatGPT exposes as “temperature”.
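Here is that decode step in miniature, with a made-up five-word vocabulary and made-up logits (everything below is illustrative, not a real model's output), reusing softmax_t and sample from above:

```python
vocab = ["the", "cat", "sat", "on", "mat"]     # toy vocabulary (illustrative)
logits = np.array([2.1, 0.3, 1.4, -0.7, 0.0])  # stand-in for model outputs

def next_token(logits, T, rng):
    """One LLM-style decode step: temperature softmax, then sample."""
    return vocab[sample(softmax_t(logits, T), rng)]

rng = np.random.default_rng(1)
print([next_token(logits, 0.2, rng) for _ in range(5)])  # low T: repetitive
print([next_token(logits, 2.0, rng) for _ in range(5)])  # high T: varied
```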
Step 4

Sampling at scale: the law of large numbers

One sample is random. A thousand samples should look like the underlying distribution. Click Draw 1000 and watch the empirical histogram (filled) converge onto the true softmax probabilities (outline).

This is the cleanest sanity check that softmax actually computes a valid probability distribution—the long-run frequencies match.

Filled bars: empirical frequencies from your samples. Outlined bars: the exact softmax probabilities. As $N$ grows, they converge.
Readouts: total samples drawn so far, and the maximum absolute error between the empirical frequencies and the true probabilities.
Try this. Draw 100 samples, then 1,000, then 10,000 (use the slider). The filled bars should snap closer to the outlines each time. The error shrinks like $1/\sqrt{N}$—classic Monte Carlo convergence.
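The convergence check is a few lines; assuming the sketches above, the max-abs-error column should shrink roughly like $1/\sqrt{N}$:

```python
p = softmax_t(np.array([2.0, 1.0, 0.2, -0.5, -1.0]), T=1.0)
rng = np.random.default_rng(42)

for n in (100, 1_000, 10_000, 100_000):
    draws = rng.choice(len(p), size=n, p=p)          # n i.i.d. samples from p
    freq = np.bincount(draws, minlength=len(p)) / n  # empirical frequencies
    print(f"N={n:>6}  max abs error = {np.abs(freq - p).max():.4f}")
```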
Step 5

Where temperature shows up in practice

| Setting | What $T$ does | Typical value |
|---|---|---|
| LLM text generation | Controls creativity vs. coherence of generated tokens | 0.7–1.0 (creative), 0.0–0.3 (deterministic) |
| Knowledge distillation | Higher $T$ exposes the “dark knowledge” in the teacher's soft targets | $T \approx 4$ (Hinton et al. 2015) |
| Diffusion classifier-free guidance | Plays a similar role: trades fidelity for diversity | guidance scale $w \approx 3$–$7$ |
| Reinforcement learning (Boltzmann exploration) | Selects actions: high $T$ explores, low $T$ exploits | annealed from high to low |
| Calibration | A single learned $T$ rescales overconfident model outputs | usually $T > 1$ (temperature scaling) |

The same one-line formula—divide logits by $T$ before softmax—runs through all of these. Once you understand the bar chart above, you understand all of them.
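As one concrete instance, here is a minimal sketch of temperature scaling (the calibration row): learn a single scalar $T$ that minimizes negative log-likelihood on held-out logits. The random tensors below are stand-ins for a real validation set:

```python
import torch

# Stand-ins for held-out validation data: raw logits and true labels.
val_logits = torch.randn(1000, 10)           # (N examples, K classes)
val_labels = torch.randint(0, 10, (1000,))

log_T = torch.zeros(1, requires_grad=True)   # optimize log T so T stays positive
opt = torch.optim.LBFGS([log_T], lr=0.1, max_iter=50)

def closure():
    opt.zero_grad()
    # Cross-entropy on T-scaled logits = NLL of softmax(z / T).
    loss = torch.nn.functional.cross_entropy(val_logits / log_T.exp(), val_labels)
    loss.backward()
    return loss

opt.step(closure)
print("learned T:", log_T.exp().item())  # typically > 1 for overconfident nets
```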

Step 6

Three things people get wrong about softmax

Myth

“Softmax outputs are calibrated probabilities.”
They sum to one, so they look like probabilities. But modern deep networks are typically overconfident—the largest softmax value is much higher than the model's true accuracy on those examples. Calibration (like learned temperature scaling) fixes this.

Myth

“You should apply softmax then cross-entropy loss.”
You should not. PyTorch's CrossEntropyLoss applies log-softmax internally; feeding it probabilities that have already been through softmax computes softmax twice, which silently produces the wrong loss and badly attenuated gradients. Pass raw logits.
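A minimal illustration of the pitfall and the fix:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 5)           # raw scores from the last layer
labels = torch.randint(0, 5, (8,))

right = F.cross_entropy(logits, labels)                  # correct: raw logits in
wrong = F.cross_entropy(logits.softmax(dim=-1), labels)  # bug: softmax applied twice

print(right.item(), wrong.item())  # the two losses differ
```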

Myth

“Lower temperature is always more accurate.”
For picking a single answer (greedy decoding, the $T \to 0$ limit), yes. But when you need the probabilities themselves, calibrated outputs (a tuned $T$, typically near 1) give better uncertainty estimates and behave better in downstream pipelines that consume those probabilities.

Final takeaway. Softmax is the canonical way to turn unbounded scores into a probability distribution. The temperature parameter $T$ smoothly trades determinism for diversity by reshaping the entropy of that distribution. You now know what the slider in every LLM API is actually doing.