Interactive Explainer
Text Diffusion, Tiny
Language models usually predict one token at a time, left to right. Diffusion models take a different path: start with noise, and iteratively refine the whole sequence at once.
This page trains a character-level discrete diffusion model on Indian names in your browser and lets you watch it denoise from pure masks to readable text.
I. Not one token at a time
Most language models you have seen are autoregressive. They predict the next token given everything before it. GPT, LLaMA, Claude — all left-to-right, one token at a time. This works brilliantly, but it is not the only way to generate sequences.
Autoregressive
Predict each token from its left context
Generation is sequential: sample position 1, then 2 given 1, then 3 given 1–2, and so on. The model never looks to the right. Each step waits for the previous one to finish.
Discrete diffusion
Corrupt the whole sequence, then learn to undo it
The model predicts all positions at once. At generation time it starts from pure noise (all masks) and iteratively fills in characters. Positions can be revealed in any order — not just left to right.
Here is what these two approaches look like in practice, step by step, for the name rahul.
Autoregressive: left to right
Five sequential steps. Each step must wait for the previous one. The model never sees characters to the right of the cursor.
Discrete diffusion: denoise in parallel
Three parallel steps. The model predicts all positions at once. Characters can appear in any order — the middle before the start.
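The two reveal orders above can be sketched in a few lines of Python. This is an illustration only, not the page's actual implementation: the "model" here already knows the target, so the code only shows *which positions* each scheme fills in at each step, and the three-step reveal order for the diffusion case is an arbitrary choice.

```python
import random

TARGET = "rahul"
MASK = "?"

# Autoregressive: one position per step, strictly left to right.
def ar_steps(target: str) -> list[str]:
    return [target[: i + 1] + MASK * (len(target) - i - 1) for i in range(len(target))]

# Diffusion-style: reveal several positions per step, in any order.
def diffusion_steps(target: str, order: list[list[int]]) -> list[str]:
    seq, out = [MASK] * len(target), []
    for group in order:
        for i in group:
            seq[i] = target[i]  # reveal this position
        out.append("".join(seq))
    return out

print(ar_steps(TARGET))                                 # 5 steps: r????, ra???, ...
print(diffusion_steps(TARGET, [[2], [0, 4], [1, 3]]))   # 3 steps, middle first
```

Autoregressive generation needs as many steps as there are positions; the diffusion loop finishes in however many reveal rounds you choose.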
II. How corruption dissolves a name
The forward corruption process is dead simple. Given a clean name like
priya, pad it to a fixed length and replace each character with a
mask token ? independently with probability t.
At t = 0 the name is untouched.
At t = 1 every character is masked and all information
is destroyed.
This is the discrete analogue of adding Gaussian noise in image diffusion. Instead
of blurring pixel values, we erase characters. The noise level t plays
the same role as the diffusion timestep: it controls how much information the model
gets to work with.
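The corruption step fits in a few lines. A minimal sketch, assuming a fixed length of 7 and the article's `_` pad and `?` mask tokens (the constant names are mine):

```python
import random

MASK = "?"   # mask token
PAD = "_"    # padding token
LENGTH = 7   # fixed sequence length (an assumption for this sketch)

def corrupt(name: str, t: float, rng: random.Random) -> str:
    """Pad to LENGTH, then mask each character independently with probability t."""
    padded = name.ljust(LENGTH, PAD)
    return "".join(MASK if rng.random() < t else ch for ch in padded)

rng = random.Random(0)
print(corrupt("priya", 0.0, rng))  # t = 0: untouched, "priya__"
print(corrupt("priya", 1.0, rng))  # t = 1: fully masked, "???????"
```

At the extremes the behavior is deterministic: `t = 0` returns the padded name unchanged, `t = 1` returns all masks.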
Training dataset — 40 Indian names
III. A model that guesses the original
The model takes the corrupted sequence xt plus the noise
level t, and outputs logits for the clean character at every position.
We use the simplest possible architecture: embed each token, flatten everything,
concatenate the time, pass through a hidden layer, and reshape into per-position
predictions.
The output vocabulary covers the characters that appear in the names, plus a padding token _ and a mask
token ?.
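The forward pass described above can be sketched with NumPy. The sizes here (vocabulary 28, embedding 8, hidden 64) are placeholder assumptions, not the page's actual hyperparameters; the point is the shape flow: embed, flatten, append `t`, one hidden layer, reshape to per-position logits.

```python
import numpy as np

VOCAB = 28       # assumed vocabulary size for this sketch
LENGTH = 7       # fixed sequence length
EMB, HIDDEN = 8, 64

rng = np.random.default_rng(0)
params = {
    "emb": rng.normal(0, 0.1, (VOCAB, EMB)),               # token embeddings
    "w1": rng.normal(0, 0.1, (LENGTH * EMB + 1, HIDDEN)),  # +1 input for the time t
    "w2": rng.normal(0, 0.1, (HIDDEN, LENGTH * VOCAB)),    # per-position logits
}

def forward(tokens: np.ndarray, t: float, p=params) -> np.ndarray:
    """tokens: (LENGTH,) integer ids of the corrupted sequence x_t."""
    x = p["emb"][tokens].reshape(-1)              # embed each token, flatten
    x = np.concatenate([x, [t]])                  # concatenate the noise level
    h = np.tanh(x @ p["w1"])                      # one hidden layer
    return (h @ p["w2"]).reshape(LENGTH, VOCAB)   # clean-character logits per position

logits = forward(np.zeros(LENGTH, dtype=int), 0.5)
print(logits.shape)  # (7, 28)
```

One set of weights handles every noise level because `t` is just another input dimension.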
The loss is cross-entropy averaged over all positions. The model must predict every
character of the original name — whether that position was masked or not.
Positions it can see give easy signal; masked positions force real learning. Time
conditioning via t tells the model how noisy the input is, so it can
calibrate its confidence accordingly.
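The loss itself is ordinary cross-entropy over positions. A short sketch (function name mine), which also shows why an untrained model starts near ln of the vocabulary size, about 3.1 for the 22 characters mentioned later:

```python
import numpy as np

def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean cross-entropy over all positions, masked or not."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

# Uniform logits = random guessing: the loss is ln(vocab_size).
uniform = np.zeros((7, 22))
print(cross_entropy(uniform, np.zeros(7, dtype=int)))  # ~3.09 = ln(22)
```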
During training, each step picks a random name, a random noise level
t ~ Uniform(0, 1), corrupts the name, and backpropagates the
prediction error. Over time the model learns the structure of Indian names —
common character patterns, typical lengths, where vowels and consonants tend to fall.
IV. Train in your browser
Below is a live neural network. Click Train to initialize random weights and run stochastic gradient descent. Each training step picks a random name, a random noise level, corrupts the name, and backpropagates. The loss should drop from ~3.1 (random guessing over 22 characters) to near zero.
After training completes, click +500 more to continue training for extra epochs. You can adjust the learning rate and see how it affects convergence. A high learning rate trains fast but can overshoot. A low one is stable but slow.
V. Denoise step by step
Once trained, generation starts from a fully masked sequence
? ? ? ? ? ? ? and runs the model repeatedly. At each step the model
predicts clean characters for every position. We reveal a fraction of the masked
positions and repeat, gradually lowering the noise level until the sequence is
fully determined.
The temperature controls randomness: lower values make the model more confident (more likely to pick training names exactly), higher values introduce variation (sometimes producing novel combinations). The number of steps controls how gradually positions are revealed.
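The predict-reveal-repeat loop can be sketched as below. This is a simplification: the trained network is replaced by a stand-in `predict` callable, temperature sampling is omitted, and the reveal schedule (an even split of remaining masked positions across remaining steps) is one reasonable choice among many.

```python
import random

MASK = "?"

def sample(predict, length: int, steps: int, rng: random.Random) -> str:
    """Iterative denoising: start fully masked, reveal a share of the
    masked positions each step using the model's predictions."""
    seq = [MASK] * length
    for step in range(steps):
        guess = predict("".join(seq))              # clean prediction for every position
        masked = [i for i, ch in enumerate(seq) if ch == MASK]
        k = max(1, len(masked) // (steps - step))  # reveal enough to finish on time
        for i in rng.sample(masked, min(k, len(masked))):
            seq[i] = guess[i]
    return "".join(seq)

# A stand-in "model" that always predicts the same name (illustration only).
oracle = lambda x_t: "rahul__"
print(sample(oracle, 7, 3, random.Random(0)))  # "rahul__"
```

With a real model, `predict` would re-run the network on the partially revealed sequence each step, so later reveals are conditioned on earlier ones.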
Denoising timeline
VI. What we built
This toy model is BERT-style masked denoising + timestep conditioning + iterative sampling. It is the simplest practical diffusion-like model for character sequences, and it captures all the essential ingredients of discrete diffusion for text.
1. Corruption is the noise
Random masking plays the role of Gaussian noise in image diffusion.
Instead of adding continuous noise to pixel values, we replace discrete tokens with a mask symbol. The noise level t controls how much is hidden.
2. Whole-sequence prediction
The model predicts every position at once, not one at a time.
This is the key departure from autoregressive models. The model sees the full (corrupted) context and produces a complete clean prediction in parallel.
3. Time tells the model how noisy the input is
Conditioning on t lets a single model handle all noise levels.
At high t, the model must guess from almost nothing. At low t, it mostly confirms what it already sees. One set of weights serves the whole corruption spectrum.
4. Iterative refinement at generation
Sampling is a loop: predict, reveal some, repeat.
Starting from all masks, each step reveals a few more characters. This iterative structure — not the architecture — is what makes it "diffusion."
VII. From words to answers
The name generator creates sequences from nothing — unconditional generation. But diffusion can also do conditional generation: given a question, produce the answer. The only change is what gets masked.
We will train a second model on 30 English-to-Hindi word pairs. Each training
example looks like hello>namaste. The question and separator
> are never masked — only the answer gets corrupted.
The model learns to denoise the answer while reading the question as fixed context.
Training dataset — 30 English → Hindi pairs
Loss is computed only on answer positions. The model sees the question as context and learns what answer should follow. At inference, you type a question, mask the entire answer region, and iteratively denoise — exactly the same loop as before, but now the question steers the output.
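The only change from the unconditional corruption function is skipping the question and separator. A minimal sketch (function name mine):

```python
import random

MASK, SEP = "?", ">"

def corrupt_answer(pair: str, t: float, rng: random.Random) -> str:
    """Mask only the answer; the question and separator stay visible."""
    q, a = pair.split(SEP, 1)
    noisy = "".join(MASK if rng.random() < t else ch for ch in a)
    return q + SEP + noisy

print(corrupt_answer("hello>namaste", 1.0, random.Random(0)))  # "hello>???????"
print(corrupt_answer("hello>namaste", 0.0, random.Random(0)))  # unchanged
```

At inference, the same idea applies in reverse: the answer region starts as all masks and the question half is never touched by the denoising loop.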
Train the translator
Translate a word
Denoising timeline — question fixed, answer revealed
The question stays fixed while the answer positions (everything after the separator
>) get denoised step by step. This is conditional diffusion:
the same iterative reveal, but steered by a fixed input.
Unconditional vs conditional diffusion
| | Names (unconditional) | Q&A (conditional) |
|---|---|---|
| Corrupt | All positions | Answer only |
| Loss | All positions | Answer only |
| Sampling | Start from all masks | Fix question, mask answer |
| Output | Random name | Answer to a given question |