Interactive Explainer
Softmax & Temperature
The softmax function turns raw scores into probabilities. The temperature knob $T$ controls how peaky or flat that distribution is, and it is the same dial that drives creativity in LLM sampling and soft targets in knowledge distillation. This page lets you build the intuition with live formulas, live samples, and a live entropy readout.
Why softmax exists
Your network's last layer outputs logits: raw real-valued scores $z_1, z_2, \ldots, z_K$ for $K$ classes. They can be negative, larger than one, anything. But to do classification or sample from the model, you need a probability distribution: numbers in $[0,1]$ that sum to one.
You could just normalize: $p_k = z_k / \sum_j z_j$. But that fails the moment any $z_k$ is negative. You could shift everything up first and then normalize, but why pick that particular shift? Softmax is the principled answer: it is what maximum-likelihood multinomial logistic regression gives you, and paired with cross-entropy loss its gradient with respect to the logits is simply $p - y$.
Exponentiating guarantees positivity; dividing by the sum guarantees the outputs add to one: $\operatorname{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$. The "temperature" version, which we'll meet in the next section, scales the logits before exponentiating, and with it comes one of the most useful knobs in modern deep learning.
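In code, that's a few lines. Here is a minimal NumPy sketch (the helper name and example logits are ours, not from the page); subtracting the max logit before exponentiating is the standard trick to avoid overflow without changing the result:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Softmax: exponentiate, then normalize. Subtracting max(z) first
    cancels in the ratio but keeps exp() from overflowing."""
    exps = np.exp(z - np.max(z))
    return exps / exps.sum()

z = np.array([2.0, 1.0, 0.5, -1.0, 0.0])   # any real-valued logits
p = softmax(z)
print(np.round(p, 3))   # every entry lands in (0, 1)
print(p.sum())          # 1.0
```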
From logits to probabilities
Edit the five logit values below. The bar chart on the right shows the resulting softmax probabilities at temperature $T = 1$. Notice how the relative order is preserved: the largest logit becomes the largest probability. Notice also how exponentiation amplifies differences—a logit gap of 2 becomes a probability ratio of $e^2 \approx 7.4$.
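You can check the amplification claim numerically. In this sketch the logits are arbitrary, with a gap of exactly 2 between the first two:

```python
import numpy as np

z = np.array([3.0, 1.0, 0.0, -0.5, 2.0])   # z[0] - z[1] == 2
exps = np.exp(z - z.max())
p = exps / exps.sum()

# The ratio depends only on the gap, not on the other logits.
print(p[0] / p[1])   # ~7.389
print(np.exp(2.0))   # ~7.389
```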
The temperature dial
Now we add a knob: divide every logit by a positive number $T$ before taking softmax, so $p_k = \frac{e^{z_k/T}}{\sum_j e^{z_j/T}}$. Three regimes matter:
- $T \to 0$: the largest logit dominates. Distribution becomes one-hot (argmax).
- $T = 1$: standard softmax.
- $T \to \infty$: all scaled logits $z_k/T$ shrink toward $0$. Distribution becomes uniform.
So $T$ smoothly interpolates between “always pick the best” and “pick uniformly at random”. Slide the temperature below and watch the distribution shift in real time.
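The same behavior in a sketch, reusing the stable-softmax pattern from above; sweep $T$ and watch the distribution move from one-hot toward uniform:

```python
import numpy as np

def softmax_T(z: np.ndarray, T: float) -> np.ndarray:
    """Temperature-scaled softmax: divide logits by T, then softmax."""
    exps = np.exp((z - z.max()) / T)   # max-subtraction keeps exp() stable
    return exps / exps.sum()

z = np.array([2.0, 1.0, 0.5, -1.0, 0.0])
for T in (0.1, 1.0, 10.0, 100.0):
    print(f"T={T:6.1f} -> {np.round(softmax_T(z, T), 3)}")
# T=0.1   -> nearly one-hot on the largest logit (argmax)
# T=1.0   -> standard softmax
# T=100.0 -> nearly uniform (every entry close to 0.2)
```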
Sampling: turning probabilities into picks
A probability distribution is just a recipe for randomness. To sample from softmax, draw a uniform random number $u \in [0, 1)$ and walk along the cumulative probabilities; the first class whose cumulative sum exceeds $u$ is your pick.
Click Draw one sample below. A vertical orange line drops onto the cumulative bar at position $u$, and the class it lands in is the pick. Try it a few times. Now slide the temperature high—you'll see every class sometimes get picked. Slide it low—always the same class.
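In code, this is inverse-CDF sampling. A sketch with a made-up distribution (any valid `p` works):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample(p: np.ndarray) -> int:
    """Inverse-CDF sampling: draw u ~ Uniform[0, 1), then return the
    first class whose cumulative probability exceeds u."""
    u = rng.random()
    idx = np.searchsorted(np.cumsum(p), u)
    return int(min(idx, len(p) - 1))   # clamp guards against cumsum < 1 from rounding

p = np.array([0.55, 0.25, 0.12, 0.05, 0.03])
print([sample(p) for _ in range(10)])   # mostly class 0, occasionally the others
```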
Sampling at scale: the law of large numbers
One sample is random. A thousand samples should look like the underlying distribution. Click Draw 1000 and watch the empirical histogram (filled) converge onto the true softmax probabilities (outline).
This is the cleanest sanity check that softmax actually computes a valid probability distribution—the long-run frequencies match.
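The same check in a sketch, reusing the example logits from above; `rng.choice` draws directly from the softmax distribution:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

z = np.array([2.0, 1.0, 0.5, -1.0, 0.0])
exps = np.exp(z - z.max())
p = exps / exps.sum()

draws = rng.choice(len(p), size=10_000, p=p)                  # 10k samples
empirical = np.bincount(draws, minlength=len(p)) / len(draws)

print(np.round(p, 3))           # true softmax probabilities
print(np.round(empirical, 3))   # long-run frequencies land right on top
```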
Where temperature shows up in practice
| Setting | What $T$ does | Typical value |
|---|---|---|
| LLM text generation | Controls creativity vs. coherence of generated tokens | 0.7 – 1.0 (creative), 0.0 – 0.3 (deterministic) |
| Knowledge distillation | Higher $T$ exposes the “dark knowledge” in the teacher's soft logits | $T \approx 4$ (Hinton et al. 2015) |
| Diffusion classifier-free guidance | Plays a similar role: trades fidelity for diversity | guidance scale $w \approx 3-7$ |
| Reinforcement learning (Boltzmann exploration) | Selects actions: high $T$ explores, low $T$ exploits | annealed from high to low |
| Calibration | A single learned $T$ rescales overconfident model outputs | $T > 1$ usually (temperature scaling) |
The same one-line formula—divide logits by $T$ before softmax—runs through all of these. Once you understand the bar chart above, you understand all of them.
Three things people get wrong about softmax
“Softmax outputs are calibrated probabilities.”
They sum to one, so they look like probabilities. But modern deep
networks are typically overconfident—the largest softmax
value is much higher than the model's true accuracy on those examples.
Calibration (like learned temperature scaling) fixes this.
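For the curious, temperature scaling (Guo et al., 2017) is a one-parameter fix: learn a single scalar $T$ on held-out logits by minimizing NLL. A minimal PyTorch sketch, where `val_logits` and `val_labels` are hypothetical tensors from your own validation set:

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor) -> float:
    """Learn one scalar T that minimizes NLL on held-out validation logits."""
    log_T = torch.zeros(1, requires_grad=True)   # optimize log T so T stays positive
    opt = torch.optim.LBFGS([log_T], lr=0.1, max_iter=50)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_T.exp(), val_labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_T.exp().item()   # typically > 1 for overconfident networks
```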
“You should apply softmax then cross-entropy loss.”
You should not. PyTorch's CrossEntropyLoss applies log-softmax internally, so applying softmax yourself means it effectively happens twice, which computes the wrong loss and degrades the gradients. Pass raw logits.
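A quick demonstration of the bug, with random tensors standing in for a real model's outputs:

```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()        # log-softmax + NLL, fused internally

logits = torch.randn(8, 5)             # raw last-layer scores, batch of 8
labels = torch.randint(0, 5, (8,))

right = loss_fn(logits, labels)                    # correct: raw logits in
wrong = loss_fn(logits.softmax(dim=-1), labels)    # bug: softmax applied twice
print(right.item(), wrong.item())                  # different losses, different gradients
```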
“Lower temperature is always more accurate.”
For greedy decoding ($T \to 0$) it makes no difference: argmax is invariant to $T$. But for anything that consumes the probabilities themselves, calibrated probabilities ($T \approx 1$, or a learned $T$) give better uncertainty estimates and better downstream pipeline behavior.