Interactive Explainer

Numerical Tricks of the Trade

Floating point math has limits—$e^{1000}$ overflows to infinity, $\log(0)$ explodes to $-\infty$, and small differences vanish to zero. Modern deep learning would be impossible without a handful of tricks that keep the math honest. This page lets you watch the naive computation break, then watch the trick fix it.

Prelude

The problem with $e^{1000}$

Float64 (the default in NumPy) can represent numbers up to about $10^{308}$. Float32 (the default in PyTorch) only up to about $10^{38}$. That sounds enormous—until you realize that $e^{100} \approx 10^{43}$ already overflows float32, and $e^{1000} \approx 10^{434}$ overflows everything.

This is not a hypothetical concern. Logits in a real network easily reach magnitudes of 50 or 100. Cross-entropy losses divide by tiny probabilities. Sigmoids saturate at 0 or 1. Every one of these creates a numerical landmine.

Float64 limits: max $\approx 1.8 \times 10^{308}$, min positive normal $\approx 2.2 \times 10^{-308}$
Float32 limits: max $\approx 3.4 \times 10^{38}$, min positive normal $\approx 1.2 \times 10^{-38}$
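
You can query these limits and watch the overflow happen directly; a minimal NumPy sketch:

```python
import numpy as np

print(np.finfo(np.float64).max)   # ~1.8e308
print(np.finfo(np.float32).max)   # ~3.4e38

print(np.exp(np.float64(100)))    # about 2.7e43, comfortably inside float64
print(np.exp(np.float32(100)))    # inf: already past the float32 ceiling
print(np.exp(np.float64(1000)))   # inf: past everything
```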

The good news: every common pitfall has a well-known fix. By the end of this page you will have watched the four most important ones break and be rescued, on the same logits.

Trick 1

Log-sum-exp: the universal stabilizer

Many quantities in ML reduce to computing $\log\sum_i e^{x_i}$. Sounds harmless. Until any $x_i$ is large—then $e^{x_i}$ overflows, the sum is $\infty$, and $\log\infty = \infty$.

$$\mathrm{LSE}(\mathbf{x}) = \log \sum_{i=1}^n e^{x_i}$$

The trick is to factor out the maximum:

$$\log\sum_i e^{x_i} \;=\; \log\!\left[e^{m}\sum_i e^{x_i - m}\right] \;=\; m + \log\sum_i e^{x_i - m}, \qquad m = \max_i x_i$$

Now the largest argument inside the exp is exactly $0$, so $e^0 = 1$ is always representable, and every other term is in $[0, 1]$. Mathematically identical to the naive form, but never overflows.

Vector: $\mathbf{x} = [x_1,\ 1.5,\ 0.2,\ -2.0]$. Compute $\log\sum e^{x_i}$ both ways.

Naive — direct formula
Stable — log-sum-exp trick
Where you'll see this. Inside every softmax, log-softmax, and cross-entropy implementation in PyTorch, JAX, and TensorFlow. Inside the partition function of energy-based models. Inside beam search for sequence models. Inside the KL-divergence computation between two categorical distributions. If you remember one numerical trick, this is the one.
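
A minimal NumPy sketch of the two panels, with $x_1$ set to 1000 so that even float64 breaks in the naive form:

```python
import numpy as np

def logsumexp_naive(x):
    # Direct formula: exp(x_i) overflows as soon as any x_i is large.
    return np.log(np.sum(np.exp(x)))

def logsumexp_stable(x):
    # Factor out the max: the largest exponent becomes exp(0) = 1.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1.5, 0.2, -2.0])
print(logsumexp_naive(x))   # inf (e^1000 overflows even float64)
print(logsumexp_stable(x))  # 1000.0, since the max term dominates
```

SciPy ships the stable version as scipy.special.logsumexp, and torch.logsumexp computes it along a chosen dimension.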
Trick 2

Log-softmax: not (log ∘ softmax)

Cross-entropy needs $\log p_k$. The naive composition is:

$$\log p_k = \log\left(\frac{e^{z_k}}{\sum_j e^{z_j}}\right)$$

Computed naively (compute softmax, then take log), this fails twice: the exponentials inside the softmax can overflow for large logits, and the probability of an unlikely class can underflow to exactly zero, at which point $\log 0 = -\infty$.

The clean form uses log-sum-exp directly:

$$\log p_k = z_k - \log\sum_j e^{z_j} = z_k - \mathrm{LSE}(\mathbf{z}) = z_k - \max(\mathbf{z}) - \log\sum_j e^{z_j - \max(\mathbf{z})}$$

No logarithm of an underflowed probability is ever taken. PyTorch exposes this as F.log_softmax precisely because composing log(softmax(x)) is brittle.

Vector: $\mathbf{z} = [z_{\max},\ 0,\ z_{\min}]$. Compute $\log p_{\min}$ — the log-probability of the smallest class — which is the dangerous one.

Naive — log(softmax(z))
Stable — log_softmax(z)
Try the danger zone. Push $z_{\max}$ to 200 — the naive softmax overflows in float32 ($e^{200} = \infty$), so the normalizing sum is $\infty$: the largest class gets $\infty/\infty = \mathrm{NaN}$ and the smallest gets $0$, whose log is $-\infty$. Push $z_{\min}$ to -200 — naive computes $e^{-200}$, which underflows to 0, then $\log 0 = -\infty$. The stable version handles both: it just returns $z_{\min} - z_{\max}$ minus a small correction.
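
A minimal NumPy sketch of the same comparison, in float32 so that the danger zone above is reproducible:

```python
import numpy as np

def log_softmax_naive(z):
    # softmax first, then log: the sum overflows and tiny classes hit log(0)
    p = np.exp(z) / np.sum(np.exp(z))
    return np.log(p)

def log_softmax_stable(z):
    # z_k - LSE(z), with the max shifted out before exponentiating
    m = np.max(z)
    return z - (m + np.log(np.sum(np.exp(z - m))))

z = np.array([200.0, 0.0, -200.0], dtype=np.float32)
print(log_softmax_naive(z))   # [ nan -inf -inf]   (exp(200) is inf in float32)
print(log_softmax_stable(z))  # [   0. -200. -400.]
```

In PyTorch the stable panel is simply F.log_softmax(z, dim=-1).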
Trick 3

BCE-with-logits: avoid the log-of-sigmoid trap

Binary cross-entropy is:

$$\mathcal{L} = -y\log\sigma(z) - (1-y)\log(1 - \sigma(z))$$

Both $\log\sigma(z)$ and $\log(1 - \sigma(z))$ are time bombs. If $z$ is strongly negative, $\sigma(z) \to 0$, and $\log 0 = -\infty$. If $z$ is strongly positive, $1 - \sigma(z) \to 0$, same problem.

The fix uses the identity $\log\sigma(z) = -\log(1 + e^{-z}) = -\mathrm{softplus}(-z)$, and $\log(1 - \sigma(z)) = -z - \log(1 + e^{-z}) = -z - \mathrm{softplus}(-z)$. Substituting:

$$\mathcal{L} = \max(z, 0) - z\,y + \log\!\left(1 + e^{-\lvert z\rvert}\right)$$

This never overflows: the only exp is $e^{-|z|}$, which is in $(0, 1]$. PyTorch implements this as F.binary_cross_entropy_with_logits. Composing BCE(sigmoid(z)) by hand is the wrong way.

Naive — log(sigmoid(z))
Stable — BCE-with-logits
The pattern. Whenever you see "apply $\sigma$ then take $\log$" in your code, pause. There is almost always a single fused operation in the framework that does it stably. Same lesson as log_softmax: never compose unstable building blocks when a stable atomic exists.
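
A minimal NumPy sketch of the two panels, using a confidently wrong prediction ($z = 40$, label $y = 0$) as the stress test:

```python
import numpy as np

def bce_naive(z, y):
    # sigmoid first, then log: 1 - sigmoid(z) rounds to exactly 0 for large z
    p = 1.0 / (1.0 + np.exp(-z))
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

def bce_with_logits(z, y):
    # max(z, 0) - z*y + log(1 + exp(-|z|)): the only exp is bounded by 1
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

z, y = 40.0, 0.0
print(bce_naive(z, y))        # inf: log(1 - sigmoid(40)) is log(0)
print(bce_with_logits(z, y))  # 40.0, the correct (large but finite) loss
```

The framework version is F.binary_cross_entropy_with_logits, which also handles batching and optional class weights.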
Trick 4

log(1 - exp(-x)) and the catastrophic cancellation cousin

The function $\log(1 - e^{-x})$ shows up in survival analysis, in Beta-distribution log-densities, and in some KL divergences. For small $x$ (say $x = 10^{-5}$), $e^{-x} \approx 1 - x$, so $1 - e^{-x} \approx x$, but the subtraction $1 - e^{-x}$ cancels almost all the significant bits of the floating-point representation. You lose precision.

$$\log(1 - e^{-x}) = \begin{cases} \log(-\mathrm{expm1}(-x)) & x \le \log 2 \\[2pt] \mathrm{log1p}(-e^{-x}) & x > \log 2 \end{cases}$$

Two specialized library functions, expm1 ($e^x - 1$) and log1p ($\log(1 + x)$), preserve precision in exactly the regions where the naive formula loses it. They are in the C standard library, NumPy, and PyTorch for this single reason.

slider value is $\log_{10}(x)$, so the actual $x$ ranges from $10^{-12}$ to $10^3$
Naive — log(1 - exp(-x))
Stable — using expm1 / log1p

Drop $x$ to $10^{-7}$. The naive version returns either $-\infty$ (if $1 - e^{-x}$ rounds to exactly 0) or a value missing 6 digits of precision. The stable version is correct to machine epsilon.
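
A minimal NumPy sketch implementing the case split above, in float64:

```python
import numpy as np

def log1mexp_naive(x):
    # 1 - exp(-x) cancels catastrophically when x is small
    return np.log(1.0 - np.exp(-x))

def log1mexp_stable(x):
    # the case split from above: expm1 below log 2, log1p above it
    if x <= np.log(2.0):
        return np.log(-np.expm1(-x))
    return np.log1p(-np.exp(-x))

x = 1e-7
print(log1mexp_naive(x))   # close to -16.1180957, but the trailing digits are noise
print(log1mexp_stable(x))  # -16.1180957..., correct to machine precision
```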

Trick 5

The other usual suspects

A reference card of tricks you'll meet repeatedly:

| Quantity | Naive | Stable | Why |
| --- | --- | --- | --- |
| softmax | $e^{z_k} / \sum_j e^{z_j}$ | $e^{z_k - m} / \sum_j e^{z_j - m}$, $m = \max_j z_j$ | Avoids $e^{\text{huge}}$ |
| log_softmax | $\log(\mathrm{softmax}(z))$ | $z_k - \mathrm{LSE}(\mathbf{z})$ | Skips the $\log$ of an underflowed probability |
| BCE | $-y\log\sigma(z) - (1-y)\log(1-\sigma(z))$ | $\max(z,0) - zy + \log(1 + e^{-\lvert z\rvert})$ | Single $\log$, single $\exp$, both bounded |
| $\log(1+x)$ for tiny $x$ | $\log(1 + x)$ | $\mathrm{log1p}(x)$ | $1 + 10^{-10}$ rounds to exactly $1$ in float32, losing all the information in $x$ |
| $e^x - 1$ for tiny $x$ | $\exp(x) - 1$ | $\mathrm{expm1}(x)$ | Same cancellation: $e^x$ rounds to $1$, and $1 - 1 = 0$ loses everything |
| std of $n$ values | $\sqrt{\tfrac{1}{n}\sum x_i^2 - \bar{x}^2}$ | Welford's online algorithm | Two large, nearly equal numbers are subtracted → cancellation |
| $\log\Gamma(x)$ for large $x$ | $\log(\Gamma(x))$ | $\mathrm{lgamma}(x)$ (Stirling-based) | $\Gamma$ overflows float64 just past $x \approx 171$; $170! \approx 7 \times 10^{306}$ is the last representable factorial |
| normalizing by a batch std | $(x - \bar{x})/\sigma$ | $(x - \bar{x})/\sqrt{\sigma^2 + \epsilon}$ (as in BatchNorm) | A near-zero variance in some channel would otherwise divide by zero |
| division by softmax weights | $x / p_k$ | multiply by $e^{-\log p_k}$, with $\log p_k$ from log_softmax | Avoids dividing by an underflowed $p_k$ |
| L2 norm of a huge vector | $\sqrt{\sum x_i^2}$ | $m \cdot \sqrt{\sum (x_i / m)^2}$, $m = \max_i \lvert x_i\rvert$ | $x_i^2$ overflows once $\lvert x_i\rvert > \sqrt{\text{max float}}$ |
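
The only entry above that is not a one-line substitution is Welford's online algorithm; a minimal sketch of the idea:

```python
import numpy as np

def std_naive(xs):
    # E[x^2] - E[x]^2: two huge, nearly equal numbers cancel
    xs = np.asarray(xs)
    return np.sqrt(np.mean(xs**2) - np.mean(xs)**2)

def std_welford(xs):
    # Track the running mean and the sum of squared deviations from the
    # *current* mean, so no large intermediate quantities ever cancel.
    mean, m2 = 0.0, 0.0
    for n, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)   # note: uses the updated mean
    return np.sqrt(m2 / len(xs))

xs = 1e8 + np.random.randn(10_000)   # tiny spread around a huge offset
print(std_naive(xs))     # garbage, possibly even nan from sqrt of a negative
print(std_welford(xs))   # about 1.0
```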
Step 6

Three rules of thumb

Rule 1

Never compose log with something that can be tiny. log(softmax(x)), log(sigmoid(x)), and log(1 - sigmoid(x)) are all forbidden. Use the fused stable atomic instead.

Rule 2

Never compute exp of an unbounded quantity. Always shift first: subtract the max, normalize, or factor out the largest term. The mathematical answer is unchanged, the numerical answer is finite.

Rule 3

Trust the framework's *_with_logits functions. They exist because the naive composition is unstable. cross_entropy(logits, y), bce_with_logits(z, y), log_softmax(z), logsumexp(z) — use them.
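
A minimal PyTorch sketch of what that looks like in practice, with tensor values chosen to match the danger zones above:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[200.0, 0.0, -200.0]])   # one sample, three classes
target = torch.tensor([2])                       # the unlikely class

# instead of torch.log(torch.softmax(logits, dim=-1)):
log_probs = F.log_softmax(logits, dim=-1)        # finite: [0., -200., -400.]

# instead of -torch.log(probs[0, target]):
loss = F.cross_entropy(logits, target)           # about 400.0, not inf

# instead of BCE(sigmoid(z), y) composed by hand:
z = torch.tensor([40.0])
y = torch.tensor([0.0])
bce = F.binary_cross_entropy_with_logits(z, y)   # 40.0, not inf

# instead of torch.log(torch.exp(logits).sum(dim=-1)):
lse = torch.logsumexp(logits, dim=-1)            # 200.0, not inf
```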

Final takeaway. Numerical stability is invisible when it works and catastrophic when it doesn't. NaN losses, exploding gradients, mysterious training failures — a huge fraction of these trace back to a naive composition that should have been the fused stable form. The next time your loss goes nan, check your logs and your exps first. You'll usually find the culprit there.