
Interactive Explainer

Autograd, Seen

Automatic differentiation is often introduced as if it were just backprop magic. It is simpler than that. It is a way to compute exact derivatives of the program you actually ran by storing intermediate values and applying the chain rule locally.

Before we touch the interactive graph, we will separate autodiff from finite differences and symbolic algebra, then sketch the graph mechanics in plain view.


I. What autodiff is, and what it is not

There are three different ideas that often get blurred together. They all answer the question “how fast does the output change when the input changes?”, but they do it in very different ways.

Finite differences

Perturb the input and approximate the slope

\[ f'(x) \approx \frac{f(x+h)-f(x)}{h} \]

This is numerical estimation. You rerun the function with a tiny change in the input. It is useful for gradient checking, but it is slow, approximate, and sensitive to the choice of h.
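As a quick sanity check, the finite-difference estimate can be compared against an exact derivative. A minimal sketch, using the symbolic example from the next subsection; the test point and step size h are illustrative choices:

```javascript
// Central finite difference as a gradient check.
// f and its analytic derivative are the symbolic example below.
const f = (x) => Math.pow(x * x + 1, 5);
const analytic = (x) => 10 * x * Math.pow(x * x + 1, 4);

function finiteDiff(fn, x, h = 1e-5) {
  // Central difference is more accurate than the one-sided form for the same h.
  return (fn(x + h) - fn(x - h)) / (2 * h);
}

const approx = finiteDiff(f, 0.7);
// Agrees with analytic(0.7) to several decimal places, but the match degrades
// if h is made much smaller (round-off) or much larger (truncation error).
```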

Symbolic differentiation

Rewrite the expression algebraically

\[ \frac{d}{dx}\bigl((x^2+1)^5\bigr) = 10x(x^2+1)^4 \]

This manipulates formulas directly. It can be exact, but large real programs are not clean symbolic expressions. They include branching, reused values, and big composite operations.

Automatic differentiation

Differentiate the executed program by local rules

\[ \text{upstream gradient} \times \text{local derivative} \]

This is what deep-learning frameworks use. It follows the actual computation graph, stores the intermediate values from the forward pass, and produces exact derivatives up to normal floating-point arithmetic.

Practical takeaway. If you are training a model, autodiff is the workhorse. Finite differences are mainly a debugging tool. Symbolic differentiation is a different style of system.

II. A graph sketch before the interactive view

Here is the whole idea in still form. The forward pass creates intermediate values from left to right. The backward pass sends gradients back from the loss to the leaves. The graph matters because it tells us exactly who depends on whom.

[Figure: the forward pass builds values left to right from the leaves x, w, b through the logit z = wx + b and the sigmoid probability p to the scalar loss L = -log(p), caching z and p along the way; the backward pass starts from L and sends gradient messages such as ∂L/∂p right to left.]
Reverse-mode autodiff is especially useful in learning because many parameters feed one scalar loss. One backward sweep gives all parameter gradients together.

A node only needs a small amount of memory

// Sketch of a node record: a value, its parents, and a local backward rule.
// localDerivative stands in for the op-specific rule from the rulebook in section III.
node = {
  value,
  parents,
  backward(upstream) {
    // one contribution per parent: upstream gradient × local derivative
    return parents.map((parent) => upstream * localDerivative(parent));
  }
}
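To make the sketch concrete, here is the multiply case with the placeholder local derivative filled in. `mul` is a hypothetical constructor for illustration; a real engine registers many more op types:

```javascript
// Multiply node: for c = a * b, dc/da = b.value and dc/db = a.value.
function mul(a, b) {
  return {
    value: a.value * b.value,
    parents: [a, b],
    backward(upstream) {
      // one contribution per parent, in parent order [a, b]
      return [upstream * b.value, upstream * a.value];
    },
  };
}

const a = { value: 3, parents: [], backward: () => [] };
const b = { value: 4, parents: [], backward: () => [] };
const c = mul(a, b);
// c.value === 12; c.backward(1) returns [4, 3]
```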

III. The tiny rulebook every engine needs

The scalar version of autograd can go a long way with a very small lookup table. The exact list below matches the pedagogical core from the animation notes: multiplication, addition, exponentiation, reciprocal, and logarithm. Everything more elaborate is built from these or wrapped into a bigger module that knows how to backprop internally.

Local rule: multiply. For c = a·b, the local derivatives are ∂c/∂a = b and ∂c/∂b = a, so the backward rule sends upstream × b to a and upstream × a to b.

Good engines stay humble here: they store the output value, remember the parents, and attach a backward rule that converts an upstream gradient into contributions for each parent.
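The whole scalar rulebook named above fits in a handful of lines. A sketch, where each entry returns the local-derivative factors that multiply the upstream gradient, one per parent:

```javascript
// One local-derivative rule per primitive named in this section.
// Each returns the factors that multiply the upstream gradient, one per parent.
const localRules = {
  mul: (a, b) => [b, a],             // c = a*b   → [dc/da, dc/db]
  add: (a, b) => [1, 1],             // c = a+b
  exp: (a) => [Math.exp(a)],         // c = e^a
  reciprocal: (a) => [-1 / (a * a)], // c = 1/a
  log: (a) => [1 / a],               // c = log(a)
};

// Backward through c = 1/a at a = 2: an upstream gradient of 1 becomes -0.25.
```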

IV. One example, traced forward and backward

Now we can animate a concrete scalar graph. Start with a single positive example. We will compute a log-loss from a scalar logit z = wx + b. The expression is deliberately expanded into tiny operations so you can see what the engine is really doing under the hood.

\[ L = -\log\!\bigl(\sigma(wx + b)\bigr) = -\log\!\left(\frac{1}{1 + e^{-(wx+b)}}\right) \]
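Before the animation, the same trace can be written out by hand. A sketch with arbitrary illustrative values for w, x, and b; every backward line is an upstream gradient times a local derivative:

```javascript
// Illustrative values, not from the article.
const w = 0.5, x = 2.0, b = -0.3;

// Forward pass, caching every intermediate.
const z = w * x + b;    // logit
const e = Math.exp(-z); // e^{-z}
const p = 1 / (1 + e);  // sigmoid(z)
const L = -Math.log(p); // log-loss for a positive example

// Backward pass: each line is (upstream gradient) × (local derivative).
const dL_dp = -1 / p;                   // d(-log p)/dp
const dp_de = -1 / ((1 + e) * (1 + e)); // d(1/(1+e))/de
const de_dz = -e;                       // d(e^{-z})/dz
const dL_dz = dL_dp * dp_de * de_dz;    // chain the three
// For a positive example this collapses to the familiar p - 1.
const dL_dw = dL_dz * x; // z = wx + b, so dz/dw = x
const dL_db = dL_dz * 1; // and dz/db = 1
```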
The graph stays visually quiet. Exact derivatives and local rules live in the ledger, not as long labels colliding inside the diagram.

V. Why frameworks collapse graphs into modules

A teaching graph should show everything. A production graph should show just enough. That is why frameworks package chains of small operations into larger blocks. A linear layer hides the multiply and add. A sigmoid hides negation, exponentiation, addition, and reciprocal. A loss module can hide still more.

What changed?

Atomic graph: every local derivative is explicit.

This is ideal when you are learning or debugging the engine itself. Every node has a tiny backward rule and every edge is legible.

VI. Three examples, shared parameters, one update

The point of autograd is not a single example. It is repeated reuse. The same parameters show up in every example, so the backward pass must accumulate gradient contributions before the optimizer updates them. With only three points, you can already see batching as concrete gradient accumulation rather than an opaque abstraction.

\[ J(w,b) = \frac{1}{N} \sum_{i=1}^{N} \left[-y_i\log p_i - (1-y_i)\log(1-p_i)\right], \qquad p_i = \sigma(wx_i + b) \]
[Interactive readouts: mean loss, dJ/dw, and dJ/db, with an adjustable learning rate (default 0.20).]
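The accumulation loop behind those readouts can be sketched directly. The three (x, y) pairs below are arbitrary illustrative data; the learning rate matches the 0.20 default:

```javascript
const sigmoid = (z) => 1 / (1 + Math.exp(-z));
// Illustrative batch of three labeled points.
const data = [
  { x: 1.0, y: 1 },
  { x: -0.5, y: 0 },
  { x: 2.0, y: 1 },
];

let w = 0.0, b = 0.0;
const lr = 0.2;

// Accumulate gradient contributions across the batch before touching w, b.
let dJdw = 0, dJdb = 0, loss = 0;
for (const { x, y } of data) {
  const p = sigmoid(w * x + b);
  loss += -(y * Math.log(p) + (1 - y) * Math.log(1 - p));
  const dz = p - y; // dL_i/dz for log-loss through sigmoid
  dJdw += dz * x;   // same parameter, many gradient messages
  dJdb += dz;
}
loss /= data.length; dJdw /= data.length; dJdb /= data.length;

// Only now does the optimizer step happen.
w -= lr * dJdw;
b -= lr * dJdb;
```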

VII. What an autograd engine really stores

The finished story is smaller than the notation makes it seem. A node stores a value, a set of parents, and a local backward rule. The forward pass builds the graph and caches whatever the backward rule will need later. The backward pass starts from the scalar loss, walks the graph in reverse topological order, and accumulates contributions whenever multiple paths meet.

1. Local beats global

Every node only needs its own tiny derivative law.

The engine never differentiates the whole model symbolically. It composes many local facts.

2. Caching matters

Forward values are saved because backward rules need them.

For log(a) you need a; for 1/a you need a; for sigmoid you usually need the output p.

3. Modules are not cheating

They are just larger nodes with richer internal backward rules.

This is how nn.Linear, sigmoid, and cross-entropy stay ergonomic while remaining differentiable.

4. Batches only add accumulation

The same parameters receive many gradient messages before the update.

The optimizer step is simple only because autograd already summed everything that depended on those parameters.
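Point 2 above is easiest to see with sigmoid: caching the forward output makes the backward rule a one-liner, with no exp recomputed. A minimal sketch; `sigmoidNode` is a hypothetical helper:

```javascript
// Sigmoid caches its own output p so backward can use σ'(z) = p(1 - p).
function sigmoidNode(z) {
  const p = 1 / (1 + Math.exp(-z)); // forward value, cached in the closure
  return {
    value: p,
    backward: (upstream) => upstream * p * (1 - p),
  };
}

const node = sigmoidNode(0);
// node.value === 0.5 and node.backward(1) === 0.25
```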