Interactive Explainer
Numerical Tricks of the Trade
Floating point math has limits—$e^{1000}$ overflows to infinity, $\log(0)$ explodes to $-\infty$, and small differences vanish to zero. Modern deep learning would be impossible without a handful of tricks that keep the math honest. This page lets you watch the naive computation break, then watch the trick fix it.
The problem with $e^{1000}$
Float64 (the default in NumPy) can represent numbers up to about $10^{308}$. Float32 (the default in PyTorch) only up to about $10^{38}$. That sounds enormous—until you realize that $e^{100} \approx 10^{43}$ already overflows float32, and $e^{1000} \approx 10^{434}$ overflows everything.
This is not a hypothetical concern. Logits in a real network easily reach magnitudes of 50 or 100. Cross-entropy losses divide by tiny probabilities. Sigmoids saturate at 0 or 1. Every one of these creates a numerical landmine.
Float32 limits: max ≈ $3.4 \times 10^{38}$, min positive normal ≈ $1.2 \times 10^{-38}$
The good news: every common pitfall has a well-known fix. By the end of this page you will have watched the four most important ones break and be rescued, on the same logits.
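A quick NumPy sketch of those limits (the specific values are just the ones quoted above):

```python
import numpy as np

# e^100 ~ 2.7e43 fits comfortably in float64 but already exceeds float32's ~3.4e38.
print(np.exp(np.float64(100.0)))   # ~2.69e+43
print(np.exp(np.float32(100.0)))   # inf (with an overflow warning)

# e^1000 ~ 1e434 exceeds even float64's ~1.8e308.
print(np.exp(np.float64(1000.0)))  # inf
```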
Log-sum-exp: the universal stabilizer
Many quantities in ML reduce to computing $\log\sum_i e^{x_i}$. Sounds harmless. Until any $x_i$ is large—then $e^{x_i}$ overflows, the sum is $\infty$, and $\log\infty = \infty$.
The trick is to factor out the maximum $m = \max_i x_i$:

$$\log\sum_i e^{x_i} = m + \log\sum_i e^{x_i - m}$$

Now the largest argument inside the exp is exactly $0$, so $e^0 = 1$ is always representable, and every other term is in $[0, 1]$. Mathematically identical to the naive form, but it never overflows.
Vector: $\mathbf{x} = [x_1,\ 1.5,\ 0.2,\ -2.0]$. Compute $\log\sum e^{x_i}$ both ways.
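A minimal NumPy sketch of both forms, assuming $x_1 = 1000$ for the slider value in the vector above:

```python
import numpy as np

def logsumexp_naive(x):
    # exp(1000) overflows to inf, so the whole result is inf.
    return np.log(np.sum(np.exp(x)))

def logsumexp_stable(x):
    # Factor out the max: the largest exponent becomes exactly 0.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1.5, 0.2, -2.0])
print(logsumexp_naive(x))   # inf (with an overflow warning)
print(logsumexp_stable(x))  # ~1000.0, correct
```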
Log-softmax: not (log ∘ softmax)
Cross-entropy needs $\log p_k$. The naive composition is:

$$\log p_k = \log\!\left(\frac{e^{z_k}}{\sum_j e^{z_j}}\right)$$

Computed naively (softmax first, then log), this fails twice:
- The exp-sum can overflow (Trick 1 again).
- If any $p_k \approx 0$, then $\log p_k \to -\infty$ and gradients explode.
The clean form uses log-sum-exp directly:

$$\log p_k = z_k - \log\sum_j e^{z_j} = z_k - \mathrm{LSE}(\mathbf{z})$$
The tiny probability is never materialized, so its log is never taken. PyTorch exposes this as F.log_softmax exactly because composing log(softmax(x)) is brittle.
Vector: $\mathbf{z} = [z_{\max},\ 0,\ z_{\min}]$. Compute $\log p_{\min}$ — the log-probability of the smallest class — which is the dangerous one.
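A minimal PyTorch sketch of the same comparison, assuming $z_{\max} = 80$ and $z_{\min} = -80$ for the slider values:

```python
import torch
import torch.nn.functional as F

z = torch.tensor([80.0, 0.0, -80.0])

# Naive: softmax underflows the smallest class to 0 in float32, so its log is -inf.
p = torch.softmax(z, dim=0)
print(torch.log(p))                    # tensor([0., -80., -inf])

# Stable: log p_k = z_k - logsumexp(z); the tiny probability is never materialized.
print(F.log_softmax(z, dim=0))         # tensor([0., -80., -160.])
print(z - torch.logsumexp(z, dim=0))   # same thing, written out
```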
BCE-with-logits: avoid the log-of-sigmoid trap
Binary cross-entropy is:

$$\mathcal{L} = -y\log\sigma(z) - (1-y)\log\bigl(1 - \sigma(z)\bigr)$$
Both $\log\sigma(z)$ and $\log(1 - \sigma(z))$ are time bombs. If $z$ is strongly negative, $\sigma(z) \to 0$, and $\log 0 = -\infty$. If $z$ is strongly positive, $1 - \sigma(z) \to 0$, same problem.
The fix uses the identity $\log\sigma(z) = -\log(1 + e^{-z}) = -\mathrm{softplus}(-z)$, and $\log(1 - \sigma(z)) = -z - \log(1 + e^{-z}) = -z - \mathrm{softplus}(-z)$. Substituting and simplifying:

$$\mathcal{L} = \max(z, 0) - zy + \log\bigl(1 + e^{-|z|}\bigr)$$
This never overflows: the only exp is $e^{-|z|}$, which is in $(0, 1]$.
PyTorch implements this as F.binary_cross_entropy_with_logits. Composing BCE(sigmoid(z)) by hand is the wrong way, for the same reason as log_softmax: never compose unstable building blocks when a stable atomic exists.
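A minimal PyTorch sketch of both routes, with $z = \pm 120$ assumed as the saturating logits and the worst-case target for each:

```python
import torch
import torch.nn.functional as F

z = torch.tensor([-120.0, 120.0])
y = torch.tensor([1.0, 0.0])   # worst-case target for each logit

# Naive: sigmoid saturates to exactly 0 or 1 in float32, so one log term is -inf.
p = torch.sigmoid(z)
print(-(y * torch.log(p) + (1 - y) * torch.log(1 - p)))   # tensor([inf, inf])

# Stable: max(z, 0) - z*y + log1p(exp(-|z|)); every exp and log stays bounded.
stable = torch.clamp(z, min=0) - z * y + torch.log1p(torch.exp(-torch.abs(z)))
print(stable)                                             # tensor([120., 120.])

# The fused PyTorch op computes the stable form directly from logits.
print(F.binary_cross_entropy_with_logits(z, y, reduction='none'))
```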
log(1 - exp(-x)): the catastrophic-cancellation cousin
The function $\log(1 - e^{-x})$ shows up in survival analysis, in Beta-distribution log-densities, and in some KL divergences. For small $x$ (say $x = 10^{-5}$), $e^{-x} \approx 1 - x$, so $1 - e^{-x} \approx x$, but the subtraction $1 - e^{-x}$ cancels almost all the significant bits of the floating-point representation. You lose precision.
Two specialized library functions, expm1 ($e^x - 1$) and log1p ($\log(1 + x)$), preserve precision in exactly the regions where the naive formula loses it. They are in the C standard library, NumPy, and PyTorch for this single reason.
Drop $x$ to $10^{-7}$. The naive version returns either $-\infty$ (if $1 - e^{-x}$ rounds to exactly 0) or a value missing 6 digits of precision. The stable version is correct to machine epsilon.
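A minimal NumPy sketch, assuming $x = 10^{-7}$ as in the paragraph above (NumPy defaults to float64, so the naive version loses digits rather than hitting $-\infty$):

```python
import numpy as np

x = 1e-7

# Naive: 1 - exp(-x) cancels most significant bits before the log sees them.
naive = np.log(1.0 - np.exp(-x))

# Stable: expm1(-x) = e^(-x) - 1 is computed at full precision, no cancellation.
stable = np.log(-np.expm1(-x))

print(naive)   # ~ -16.1181..., with the trailing digits wrong
print(stable)  # ~ -16.1181..., correct to machine epsilon
```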
The other usual suspects
A reference card of tricks you'll meet repeatedly:
| Quantity | Naive | Stable | Why |
|---|---|---|---|
| softmax | $e^{z_k} / \sum e^{z_j}$ | $e^{z_k - m} / \sum e^{z_j - m}$, $m = \max z$ | Avoids $e^{\text{huge}}$ |
| log_softmax | $\log(\text{softmax}(z))$ | $z_k - \mathrm{LSE}(z)$ | Skips $\log$ of underflowed probability |
| BCE | $-y\log\sigma(z) - (1-y)\log(1-\sigma(z))$ | $\max(z,0) - zy + \log(1 + e^{-|z|})$ | Single $\log$, single $\exp$, both bounded |
| $\log(1+x)$ for tiny $x$ | $\log(1 + x)$ | $\mathrm{log1p}(x)$ | Adding $1$ swamps the low bits of $x$; in float32, $1 + 10^{-10}$ rounds to exactly $1$ |
| $e^x - 1$ for tiny $x$ | $\exp(x) - 1$ | $\mathrm{expm1}(x)$ | Same cancellation: $\exp(x)$ rounds to $1$, then $1 - 1 = 0$ loses everything |
| std of $n$ values | $\sqrt{\frac{1}{n}\sum x_i^2 - \bar{x}^2}$ | Welford's online algorithm | Two large numbers minus each other → cancellation |
| $\log\Gamma(x)$ for large $x$ | $\log(\Gamma(x))$ | $\mathrm{lgamma}(x)$ (Stirling-based) | $\Gamma(171) = 170! \approx 7.3 \times 10^{306}$ is the last representable factorial in float64 |
| variance in BatchNorm | divide by $\sqrt{\sigma^2}$ | divide by $\sqrt{\sigma^2 + \epsilon}$ | Tiny variance in some channels would otherwise divide by zero |
| division by softmax weights | $x / p_k$ | multiply by $\exp(-\log p_k)$ via log_softmax | Avoids dividing by an underflowed $p_k$ |
| L2 norm of huge vector | $\sqrt{\sum x_i^2}$ | $\max|x| \cdot \sqrt{\sum (x_i / \max|x|)^2}$ | $x_i^2$ overflows for $x_i > \sqrt{\text{max float}}$ |
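A minimal NumPy sketch of the last row; it is the same factor-out-the-largest-term idea as log-sum-exp, applied to the L2 norm:

```python
import numpy as np

def norm_naive(x):
    # Squaring overflows float32 once |x_i| exceeds ~1.8e19 (sqrt of the max float).
    return np.sqrt(np.sum(x * x))

def norm_scaled(x):
    # Divide by the largest magnitude first so every squared term is at most 1.
    m = np.max(np.abs(x))
    if m == 0:
        return m
    return m * np.sqrt(np.sum((x / m) ** 2))

x = np.array([3e25, 4e25], dtype=np.float32)
print(norm_naive(x))    # inf: (3e25)^2 ~ 9e50 overflows float32
print(norm_scaled(x))   # 5e25
```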
Three rules of thumb
1. Never compose log with something that can be tiny. log(softmax(x)), log(sigmoid(x)), and log(1 - sigmoid(x)) are all forbidden; use the fused stable atomic instead.
2. Never compute exp of an unbounded quantity. Always shift first: subtract the max, normalize, or factor out the largest term. The mathematical answer is unchanged; the numerical answer is finite.
3. Trust the framework's *_with_logits functions. They exist because the naive composition is unstable. cross_entropy(logits, y), bce_with_logits(z, y), log_softmax(z), logsumexp(z): use them.
And when a loss does come back nan, check your logs and your exps first. You'll usually find the culprit there.
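As a closing sketch, here is what Rule 3 looks like in practice with PyTorch (tensor shapes and the ×50 logit scale are just for illustration):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 10) * 50                      # large-magnitude logits
targets = torch.randint(0, 10, (8,))

loss = F.cross_entropy(logits, targets)               # not log(softmax(...)) by hand
log_p = F.log_softmax(logits, dim=-1)                 # not torch.log(torch.softmax(...))
lse = torch.logsumexp(logits, dim=-1)                 # not log(exp(logits).sum())

z = torch.randn(8) * 50                               # binary logits
y = torch.randint(0, 2, (8,)).float()
bce = F.binary_cross_entropy_with_logits(z, y)        # not BCE(sigmoid(z), y) by hand
```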