Interactive Explainer

Numerical Tricks of the Trade

Floating point math has limits—$e^{1000}$ overflows to infinity, $\log(0)$ explodes to $-\infty$, and small differences vanish to zero. Modern deep learning would be impossible without a handful of tricks that keep the math honest. This page lets you watch the naive computation break, then watch the trick fix it.

Prelude

The problem with $e^{1000}$

Float64 (the default in NumPy) can represent numbers up to about $10^{308}$. Float32 (the default in PyTorch) only up to about $10^{38}$. That sounds enormous—until you realize that $e^{100} \approx 10^{43}$ already overflows float32, and $e^{1000} \approx 10^{434}$ overflows everything.

This is not a hypothetical concern. Logits in a real network easily reach magnitudes of 50 or 100. Cross-entropy losses divide by tiny probabilities. Sigmoids saturate at 0 or 1. Every one of these creates a numerical landmine.

Float64 limits: max $\approx 1.8 \times 10^{308}$, min positive normal $\approx 2.2 \times 10^{-308}$
Float32 limits: max $\approx 3.4 \times 10^{38}$, min positive normal $\approx 1.2 \times 10^{-38}$
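
You can query these limits and watch the overflow happen directly; a minimal NumPy sketch:

```python
import numpy as np

print(np.finfo(np.float64).max)   # ~1.8e308
print(np.finfo(np.float32).max)   # ~3.4e38

print(np.exp(np.float64(100)))    # about 2.7e43, comfortably inside float64
print(np.exp(np.float32(100)))    # inf: already past the float32 ceiling
print(np.exp(np.float64(1000)))   # inf: past everything
```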

The good news: every common pitfall has a well-known fix. By the end of this page you will have watched the four most important ones break and be rescued, on the same logits.

Trick 1

Log-sum-exp: the universal stabilizer

Many quantities in ML reduce to computing $\log\sum_i e^{x_i}$. Sounds harmless. Until any $x_i$ is large—then $e^{x_i}$ overflows, the sum is $\infty$, and $\log\infty = \infty$.

$$\mathrm{LSE}(\mathbf{x}) = \log \sum_{i=1}^n e^{x_i}$$

The trick is to factor out the maximum:

$$\log\sum_i e^{x_i} \;=\; \log\!\left[e^{m}\sum_i e^{x_i - m}\right] \;=\; m + \log\sum_i e^{x_i - m}, \qquad m = \max_i x_i$$

Now the largest argument inside the exp is exactly $0$, so $e^0 = 1$ is always representable, and every other term is in $[0, 1]$. Mathematically identical to the naive form, but never overflows.

Vector: $\mathbf{x} = [x_1,\ 1.5,\ 0.2,\ -2.0]$. Compute $\log\sum e^{x_i}$ both ways.

Naive — direct formula
Stable — log-sum-exp trick
Where you'll see this. Inside every softmax, log-softmax, and cross-entropy implementation in PyTorch, JAX, and TensorFlow. Inside the partition function of energy-based models. Inside beam search for sequence models. Inside the KL-divergence computation between two categorical distributions. If you remember one numerical trick, this is the one.
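
A minimal NumPy sketch of the two panels, with $x_1$ set to 1000 so that even float64 breaks in the naive form:

```python
import numpy as np

def logsumexp_naive(x):
    # Direct formula: exp(x_i) overflows as soon as any x_i is large.
    return np.log(np.sum(np.exp(x)))

def logsumexp_stable(x):
    # Factor out the max: the largest exponent becomes exp(0) = 1.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1.5, 0.2, -2.0])
print(logsumexp_naive(x))   # inf (e^1000 overflows even float64)
print(logsumexp_stable(x))  # 1000.0, since the max term dominates
```

SciPy ships the stable version as scipy.special.logsumexp, and torch.logsumexp computes it along a chosen dimension.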
Trick 2

Log-softmax: not (log ∘ softmax)

Cross-entropy needs $\log p_k$. The naive composition is:

$$\log p_k = \log\left(\frac{e^{z_k}}{\sum_j e^{z_j}}\right)$$

Computed naively (compute softmax, then take log), this fails twice: the exponentials inside the softmax can overflow for large logits, and the probability of an unlikely class can underflow to exactly zero, at which point $\log 0 = -\infty$.

The clean form uses log-sum-exp directly:

$$\log p_k = z_k - \log\sum_j e^{z_j} = z_k - \mathrm{LSE}(\mathbf{z}) = z_k - \max(\mathbf{z}) - \log\sum_j e^{z_j - \max(\mathbf{z})}$$

No logarithm of an underflowed probability is ever taken. PyTorch exposes this as F.log_softmax precisely because composing log(softmax(x)) is brittle.

Vector: $\mathbf{z} = [z_{\max},\ 0,\ z_{\min}]$. Compute $\log p_{\min}$ — the log-probability of the smallest class — which is the dangerous one.

Naive — log(softmax(z))
Stable — log_softmax(z)
Try the danger zone. Push $z_{\max}$ to 200 — the naive softmax overflows in float32 ($e^{200} = \infty$), so the normalizing sum is $\infty$: the largest class gets $\infty/\infty = \mathrm{NaN}$ and the smallest gets $0$, whose log is $-\infty$. Push $z_{\min}$ to -200 — naive computes $e^{-200}$, which underflows to 0, then $\log 0 = -\infty$. The stable version handles both: it just returns $z_{\min} - z_{\max}$ minus a small correction.
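
A minimal NumPy sketch of the same comparison, in float32 so that the danger zone above is reproducible:

```python
import numpy as np

def log_softmax_naive(z):
    # softmax first, then log: the sum overflows and tiny classes hit log(0)
    p = np.exp(z) / np.sum(np.exp(z))
    return np.log(p)

def log_softmax_stable(z):
    # z_k - LSE(z), with the max shifted out before exponentiating
    m = np.max(z)
    return z - (m + np.log(np.sum(np.exp(z - m))))

z = np.array([200.0, 0.0, -200.0], dtype=np.float32)
print(log_softmax_naive(z))   # [ nan -inf -inf]   (exp(200) is inf in float32)
print(log_softmax_stable(z))  # [   0. -200. -400.]
```

In PyTorch the stable panel is simply F.log_softmax(z, dim=-1).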
Trick 3

BCE-with-logits: avoid the log-of-sigmoid trap

Binary cross-entropy is:

$$\mathcal{L} = -y\log\sigma(z) - (1-y)\log(1 - \sigma(z))$$

Both $\log\sigma(z)$ and $\log(1 - \sigma(z))$ are time bombs. If $z$ is strongly negative, $\sigma(z) \to 0$, and $\log 0 = -\infty$. If $z$ is strongly positive, $1 - \sigma(z) \to 0$, same problem.

The fix uses the identity $\log\sigma(z) = -\log(1 + e^{-z}) = -\mathrm{softplus}(-z)$, and $\log(1 - \sigma(z)) = -z - \log(1 + e^{-z}) = -z - \mathrm{softplus}(-z)$. Substituting:

$$\mathcal{L} = \max(z, 0) - z\,y + \log\!\left(1 + e^{-\lvert z\rvert}\right)$$

This never overflows: the only exp is $e^{-|z|}$, which is in $(0, 1]$. PyTorch implements this as F.binary_cross_entropy_with_logits. Composing BCE(sigmoid(z)) by hand is the wrong way.

Naive — log(sigmoid(z))
Stable — BCE-with-logits
The pattern. Whenever you see "apply $\sigma$ then take $\log$" in your code, pause. There is almost always a single fused operation in the framework that does it stably. Same lesson as log_softmax: never compose unstable building blocks when a stable atomic exists.
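
A minimal NumPy sketch of the two panels, using a confidently wrong prediction ($z = 40$, label $y = 0$) as the stress test:

```python
import numpy as np

def bce_naive(z, y):
    # sigmoid first, then log: 1 - sigmoid(z) rounds to exactly 0 for large z
    p = 1.0 / (1.0 + np.exp(-z))
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

def bce_with_logits(z, y):
    # max(z, 0) - z*y + log(1 + exp(-|z|)): the only exp is bounded by 1
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

z, y = 40.0, 0.0
print(bce_naive(z, y))        # inf: log(1 - sigmoid(40)) is log(0)
print(bce_with_logits(z, y))  # 40.0, the correct (large but finite) loss
```

The framework version is F.binary_cross_entropy_with_logits, which also handles batching and optional class weights.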
Trick 4

log(1 - exp(-x)) and the catastrophic cancellation cousin

The function $\log(1 - e^{-x})$ shows up in survival analysis, in Beta-distribution log-densities, and in some KL divergences. For small $x$ (say $x = 10^{-5}$), $e^{-x} \approx 1 - x$, so $1 - e^{-x} \approx x$, but the subtraction $1 - e^{-x}$ cancels almost all the significant bits of the floating-point representation. You lose precision.

$$\log(1 - e^{-x}) = \begin{cases} \log(-\mathrm{expm1}(-x)) & x \le \log 2 \\[2pt] \mathrm{log1p}(-e^{-x}) & x > \log 2 \end{cases}$$

Two specialized library functions, expm1 ($e^x - 1$) and log1p ($\log(1 + x)$), preserve precision in exactly the regions where the naive formula loses it. They are in the C standard library, NumPy, and PyTorch for this single reason.

slider value is $\log_{10}(x)$, so the actual $x$ ranges from $10^{-12}$ to $10^3$
Naive — log(1 - exp(-x))
Stable — using expm1 / log1p

Drop $x$ to $10^{-7}$. The naive version returns either $-\infty$ (if $1 - e^{-x}$ rounds to exactly 0) or a value missing 6 digits of precision. The stable version is correct to machine epsilon.
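
A minimal NumPy sketch implementing the case split above, in float64:

```python
import numpy as np

def log1mexp_naive(x):
    # 1 - exp(-x) cancels catastrophically when x is small
    return np.log(1.0 - np.exp(-x))

def log1mexp_stable(x):
    # the case split from above: expm1 below log 2, log1p above it
    if x <= np.log(2.0):
        return np.log(-np.expm1(-x))
    return np.log1p(-np.exp(-x))

x = 1e-7
print(log1mexp_naive(x))   # close to -16.1180957, but the trailing digits are noise
print(log1mexp_stable(x))  # -16.1180957..., correct to machine precision
```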

Trick 5

The other usual suspects

A reference card of tricks you'll meet repeatedly:

| Quantity | Naive | Stable | Why |
| --- | --- | --- | --- |
| softmax | $e^{z_k} / \sum_j e^{z_j}$ | $e^{z_k - m} / \sum_j e^{z_j - m}$, $m = \max_j z_j$ | Avoids $e^{\text{huge}}$ |
| log_softmax | $\log(\mathrm{softmax}(z))$ | $z_k - \mathrm{LSE}(\mathbf{z})$ | Skips the $\log$ of an underflowed probability |
| BCE | $-y\log\sigma(z) - (1-y)\log(1-\sigma(z))$ | $\max(z,0) - zy + \log(1 + e^{-\lvert z\rvert})$ | Single $\log$, single $\exp$, both bounded |
| $\log(1+x)$ for tiny $x$ | $\log(1 + x)$ | $\mathrm{log1p}(x)$ | $1 + 10^{-10}$ rounds to exactly $1$ in float32, losing all the information in $x$ |
| $e^x - 1$ for tiny $x$ | $\exp(x) - 1$ | $\mathrm{expm1}(x)$ | Same cancellation: $e^x$ rounds to $1$, and $1 - 1 = 0$ loses everything |
| std of $n$ values | $\sqrt{\tfrac{1}{n}\sum x_i^2 - \bar{x}^2}$ | Welford's online algorithm | Two large, nearly equal numbers are subtracted → cancellation |
| $\log\Gamma(x)$ for large $x$ | $\log(\Gamma(x))$ | $\mathrm{lgamma}(x)$ (Stirling-based) | $\Gamma$ overflows float64 just past $x \approx 171$; $170! \approx 7 \times 10^{306}$ is the last representable factorial |
| normalizing by a batch std | $(x - \bar{x})/\sigma$ | $(x - \bar{x})/\sqrt{\sigma^2 + \epsilon}$ (as in BatchNorm) | A near-zero variance in some channel would otherwise divide by zero |
| division by softmax weights | $x / p_k$ | multiply by $e^{-\log p_k}$, with $\log p_k$ from log_softmax | Avoids dividing by an underflowed $p_k$ |
| L2 norm of a huge vector | $\sqrt{\sum x_i^2}$ | $m \cdot \sqrt{\sum (x_i / m)^2}$, $m = \max_i \lvert x_i\rvert$ | $x_i^2$ overflows once $\lvert x_i\rvert > \sqrt{\text{max float}}$ |
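
The only entry above that is not a one-line substitution is Welford's online algorithm; a minimal sketch of the idea:

```python
import numpy as np

def std_naive(xs):
    # E[x^2] - E[x]^2: two huge, nearly equal numbers cancel
    xs = np.asarray(xs)
    return np.sqrt(np.mean(xs**2) - np.mean(xs)**2)

def std_welford(xs):
    # Track the running mean and the sum of squared deviations from the
    # *current* mean, so no large intermediate quantities ever cancel.
    mean, m2 = 0.0, 0.0
    for n, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)   # note: uses the updated mean
    return np.sqrt(m2 / len(xs))

xs = 1e8 + np.random.randn(10_000)   # tiny spread around a huge offset
print(std_naive(xs))     # garbage, possibly even nan from sqrt of a negative
print(std_welford(xs))   # about 1.0
```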
Step 6

Three rules of thumb

Rule 1

Never compose log with something that can be tiny. log(softmax(x)), log(sigmoid(x)), and log(1 - sigmoid(x)) are all forbidden. Use the fused stable atomic instead.

Rule 2

Never compute exp of an unbounded quantity. Always shift first: subtract the max, normalize, or factor out the largest term. The mathematical answer is unchanged, the numerical answer is finite.

Rule 3

Trust the framework's *_with_logits functions. They exist because the naive composition is unstable. cross_entropy(logits, y), bce_with_logits(z, y), log_softmax(z), logsumexp(z) — use them.
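
A minimal PyTorch sketch of what that looks like in practice, with tensor values chosen to match the danger zones above:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[200.0, 0.0, -200.0]])   # one sample, three classes
target = torch.tensor([2])                       # the unlikely class

# instead of torch.log(torch.softmax(logits, dim=-1)):
log_probs = F.log_softmax(logits, dim=-1)        # finite: [0., -200., -400.]

# instead of -torch.log(probs[0, target]):
loss = F.cross_entropy(logits, target)           # about 400.0, not inf

# instead of BCE(sigmoid(z), y) composed by hand:
z = torch.tensor([40.0])
y = torch.tensor([0.0])
bce = F.binary_cross_entropy_with_logits(z, y)   # 40.0, not inf

# instead of torch.log(torch.exp(logits).sum(dim=-1)):
lse = torch.logsumexp(logits, dim=-1)            # 200.0, not inf
```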

Final takeaway. Numerical stability is invisible when it works and catastrophic when it doesn't. NaN losses, exploding gradients, mysterious training failures — a huge fraction of these trace back to a naive composition that should have been the fused stable form. The next time your loss goes nan, check your logs and your exps first. You'll usually find the culprit there.