Interactive Explainer
Autograd, Seen
Automatic differentiation is often introduced as if it were just backprop magic. It is simpler than that. It is a way to compute exact derivatives of the program you actually ran by storing intermediate values and applying the chain rule locally.
Before we touch the interactive graph, we will separate autodiff from finite differences and symbolic algebra, then sketch the graph mechanics in plain view.
I. What autodiff is, and what it is not
There are three different ideas that often get blurred together. They all answer the question “how fast does the output change when the input changes?”, but they do it in very different ways.
Finite differences
Perturb the input and approximate the slope
This is numerical estimation. You rerun the function with a tiny change in the input. It is useful for gradient checking, but it is slow, approximate, and sensitive to the choice of h.
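As a sketch of the idea, here is a central-difference check against a function whose exact derivative we know. The function and the step size are illustrative choices, not part of any framework:

```javascript
// Central-difference estimate of f'(x): (f(x + h) - f(x - h)) / (2h).
// The result depends on h: too large and truncation error dominates,
// too small and floating-point cancellation takes over.
function finiteDiff(f, x, h = 1e-5) {
  return (f(x + h) - f(x - h)) / (2 * h);
}

const f = (x) => x * x * Math.log(x);
// Exact derivative for comparison: 2x·log(x) + x.
const exact = (x) => 2 * x * Math.log(x) + x;

console.log(finiteDiff(f, 2.0)); // close to exact(2.0), but not equal
console.log(exact(2.0));
```

The two printed numbers agree to several decimal places, and the gap between them moves as you change h, which is exactly why gradient checking treats finite differences as a sanity check rather than a source of truth.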
Symbolic differentiation
Rewrite the expression algebraically
This manipulates formulas directly. It can be exact, but large real programs are not clean symbolic expressions. They include branching, reused values, and big composite operations.
Automatic differentiation
Differentiate the executed program by local rules
This is what deep-learning frameworks use. It follows the actual computation graph, stores the intermediate values from the forward pass, and produces exact derivatives up to normal floating-point arithmetic.
II. A graph sketch before the interactive view
Here is the whole idea in still form. The forward pass creates intermediate values from left to right. The backward pass sends gradients back from the loss to the leaves. The graph matters because it tells us exactly who depends on whom.
A node only needs a small amount of memory
node = {
  value,    // cached result of the forward pass
  parents,  // the nodes this value was computed from
  backward(upstream) {
    // chain rule, applied locally: scale the upstream gradient
    // by this node's derivative with respect to each parent
    return parents.map((parent) => upstream * localDerivative(parent));
  },
}
III. The tiny rulebook every engine needs
The scalar version of autograd can go a long way with a very small lookup table. The exact list below matches the pedagogical core from the animation notes: multiplication, addition, exponentiation, reciprocal, and logarithm. Everything more elaborate is built from these or wrapped into a bigger module that knows how to backprop internally.
The local rules:
Multiply: c = a*b, so dc/da = b and dc/db = a.
Add: c = a + b, so dc/da = dc/db = 1.
Exponentiate: c = e^a, so dc/da = e^a = c.
Reciprocal: c = 1/a, so dc/da = -1/a^2.
Logarithm: c = log(a), so dc/da = 1/a.
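That whole rulebook fits in a small lookup table. The shape below is an illustrative sketch, not any framework's API: each op maps parent values to a forward result `f` and an array `d` of local derivatives, one per parent.

```javascript
// Minimal local-rule table for the five pedagogical ops:
// multiply, add, exponentiate, reciprocal, logarithm.
const rules = {
  mul:   { f: (a, b) => a * b,     d: (a, b) => [b, a] },
  add:   { f: (a, b) => a + b,     d: (a, b) => [1, 1] },
  exp:   { f: (a) => Math.exp(a),  d: (a) => [Math.exp(a)] },
  recip: { f: (a) => 1 / a,        d: (a) => [-1 / (a * a)] },
  log:   { f: (a) => Math.log(a),  d: (a) => [1 / a] },
};

console.log(rules.mul.d(3, 4)); // [4, 3]: d(ab)/da = b, d(ab)/db = a
console.log(rules.log.d(2));    // [0.5]:  d(log a)/da = 1/a
```

Everything more elaborate in a real engine either composes these entries or replaces a chain of them with one fused rule.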
IV. One example, traced forward and backward
Now we can animate a concrete scalar graph. Start with a single positive example. We will compute a log-loss from a scalar logit z = wx + b. The expression is deliberately expanded into tiny operations so you can see what the engine is really doing under the hood.
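The same trace can be hand-unrolled in code. This is a sketch with illustrative values for w, x, and b: the loss is L = -log(sigmoid(wx + b)) for a positive example, expanded into one tiny op per line, with the backward pass walking the same chain in reverse.

```javascript
const w = 0.5, x = 2.0, b = -1.0;

// Forward pass: cache every intermediate value.
const z = w * x + b;     // multiply + add
const n = -z;            // negate
const e = Math.exp(n);   // exponentiate
const d = 1 + e;         // add
const s = 1 / d;         // reciprocal -> sigmoid(z)
const L = -Math.log(s);  // logarithm  -> log-loss

// Backward pass: one local rule per step, each multiplied
// by the upstream gradient arriving from the loss.
const dL_ds = -1 / s;                  // d(-log s)/ds
const dL_dd = dL_ds * (-1 / (d * d));  // d(1/d)/dd
const dL_de = dL_dd * 1;               // d(1 + e)/de
const dL_dn = dL_de * e;               // d(exp n)/dn
const dL_dz = dL_dn * -1;              // d(-z)/dz
const dL_dw = dL_dz * x;               // d(wx + b)/dw
const dL_db = dL_dz * 1;               // d(wx + b)/db

console.log(dL_dz, s - 1); // the classic identity: dL/dz = sigmoid(z) - 1
```

With these particular numbers z = 0, so sigmoid(z) = 0.5 and the chain of six local derivatives collapses to dL/dz = -0.5, matching the closed-form identity.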
V. Why frameworks collapse graphs into modules
A teaching graph should show everything. A production graph should show just enough. That is why frameworks package chains of small operations into larger blocks. A linear layer hides the multiply and add. A sigmoid hides negation, exponentiation, addition, and reciprocal. A loss module can hide still more.
What changed?
Atomic graph: every local derivative is explicit. This is ideal when you are learning or debugging the engine itself. Every node has a tiny backward rule and every edge is legible.
Module graph: chains of primitives collapse into single nodes. Each block hides its internals behind one richer backward rule, which keeps production graphs legible at scale.
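To make the collapse concrete, here is sigmoid as a single module node, a sketch rather than any framework's API. The negate, exponentiate, add, and reciprocal steps fuse into one forward expression and one fused backward rule:

```javascript
// Sigmoid as one "module" node: the four atomic ops are hidden,
// and the fused local rule d(sigmoid)/dz = s * (1 - s) reuses the
// cached forward output s instead of four intermediate values.
function sigmoidNode(z) {
  const s = 1 / (1 + Math.exp(-z)); // forward: one fused step
  return {
    value: s,
    backward: (upstream) => upstream * s * (1 - s),
  };
}

const node = sigmoidNode(0);
console.log(node.value);       // 0.5
console.log(node.backward(1)); // 0.25
```

Note what the fused rule caches: just the output s. That is the usual trade a module makes, a richer backward rule in exchange for storing fewer intermediates.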
VI. Three examples, shared parameters, one update
The point of autograd is not a single example; it is repeated reuse. The same parameters show up in every example, so the backward pass must accumulate gradient contributions before the optimizer updates them. With only three points, you can already watch that batching logic at work instead of treating it as an opaque abstraction.
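A sketch of that accumulation, with three illustrative data points and the logistic-loss gradient dL/dz = sigmoid(z) - y. Each example adds its contribution into the same two gradient slots before a single SGD step; none of the names here come from a real library:

```javascript
// Three examples sharing the same w and b: accumulate, then update once.
let w = 0.5, b = -1.0;
const data = [[2.0, 1], [-1.0, 0], [0.5, 1]]; // [x, label] pairs

let gradW = 0, gradB = 0;
for (const [x, y] of data) {
  const s = 1 / (1 + Math.exp(-(w * x + b))); // forward
  const dL_dz = s - y;                        // backward through the loss
  gradW += dL_dz * x;                         // same parameter, many
  gradB += dL_dz;                             // gradient messages
}

// One optimizer step, only after every contribution is summed.
const lr = 0.1;
w -= lr * gradW;
b -= lr * gradB;
console.log(w, b);
```

The `+=` lines are the entire batching story: the optimizer step at the bottom is one subtraction per parameter only because the loop above already summed everything that depended on w and b.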
VII. What an autograd engine really stores
The finished story is smaller than the notation makes it seem. A node stores a value, a set of parents, and a local backward rule. The forward pass builds the graph and caches whatever the backward rule will need later. The backward pass starts from the scalar loss, walks the graph in reverse topological order, and accumulates contributions whenever multiple paths meet.
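The whole story fits in a few dozen lines. This is a minimal sketch of such an engine, with illustrative names and only two ops; it stores exactly what the paragraph above lists, a value, parents, and a local backward rule, and its `backward` walks the graph in reverse topological order, accumulating with `+=` wherever paths meet:

```javascript
// Minimal scalar autograd engine: value + parents + local rule.
class Value {
  constructor(value, parents = []) {
    this.value = value;        // cached forward result
    this.parents = parents;    // nodes this one was computed from
    this.grad = 0;             // accumulated dLoss/dThis
    this.backwardFn = () => {}; // leaf nodes send nothing backward
  }
  static mul(a, b) {
    const out = new Value(a.value * b.value, [a, b]);
    out.backwardFn = () => {   // local rule for multiply
      a.grad += out.grad * b.value;
      b.grad += out.grad * a.value;
    };
    return out;
  }
  static add(a, b) {
    const out = new Value(a.value + b.value, [a, b]);
    out.backwardFn = () => {   // local rule for add
      a.grad += out.grad;
      b.grad += out.grad;
    };
    return out;
  }
  backward() {
    // Build a topological order, then fire the rules in reverse.
    const order = [], seen = new Set();
    const visit = (node) => {
      if (seen.has(node)) return;
      seen.add(node);
      node.parents.forEach(visit);
      order.push(node);
    };
    visit(this);
    this.grad = 1; // dLoss/dLoss
    order.reverse().forEach((node) => node.backwardFn());
  }
}

// Accumulation where paths meet: y = x*x + x reuses x twice.
const x = new Value(3);
const y = Value.add(Value.mul(x, x), x);
y.backward();
console.log(y.value, x.grad); // 12 and dy/dx = 2x + 1 = 7
```

The reused node x receives three gradient messages, one from each edge it feeds, and `+=` merges them into the single correct derivative.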
1. Local beats global
Every node only needs its own tiny derivative law.
The engine never differentiates the whole model symbolically. It composes many local facts.
2. Caching matters
Forward values are saved because backward rules need them.
For log(a) you need a; for 1/a you need a; for sigmoid you usually need the output.
3. Modules are not cheating
They are just larger nodes with richer internal backward rules.
This is how nn.Linear, sigmoid, and cross-entropy stay ergonomic while remaining differentiable.
4. Batches only add accumulation
The same parameters receive many gradient messages before the update.
The optimizer step is simple only because autograd already summed everything that depended on those parameters.