
Text Diffusion, Tiny

Language models usually predict one token at a time, left to right. Diffusion models take a different path: start with noise, and iteratively refine the whole sequence at once.

This page trains a character-level discrete diffusion model on Indian names in your browser and lets you watch it denoise from pure masks to readable text.

~15 min · Deep Learning · Diffusion · Generative Models

I. Not one token at a time

Most language models you have seen are autoregressive. They predict the next token given everything before it. GPT, LLaMA, Claude — all left-to-right, one token at a time. This works brilliantly, but it is not the only way to generate sequences.

Autoregressive

Predict each token from its left context

\[ p(x_1, \dots, x_L) = \prod_{i=1}^{L} p(x_i \mid x_1, \dots, x_{i-1}) \]

Generation is sequential: sample position 1, then 2 given 1, then 3 given 1–2, and so on. The model never looks to the right. Each step waits for the previous one to finish.
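The sequential loop above can be sketched in a few lines of JavaScript (the language the page's demos are written in). Here `nextTokenProbs` is a hypothetical stand-in for a trained model: it takes the tokens generated so far and returns a probability distribution over the vocabulary.

```javascript
// Autoregressive sampling sketch. `nextTokenProbs` is a hypothetical model
// call returning p(x_i | x_1..x_{i-1}) as an array of probabilities.
function sampleAutoregressive(nextTokenProbs, length) {
  const seq = [];
  for (let i = 0; i < length; i++) {
    const probs = nextTokenProbs(seq);  // conditioned only on the left context
    seq.push(sampleFrom(probs));        // step i must wait for step i-1
  }
  return seq;
}

// Draw one index from a discrete probability distribution.
function sampleFrom(probs) {
  let r = Math.random();
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i];
    if (r <= 0) return i;
  }
  return probs.length - 1;
}
```

Note that the loop body depends on `seq`, so the iterations cannot run in parallel; that data dependence is exactly what diffusion removes.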

Discrete diffusion

Corrupt the whole sequence, then learn to undo it

\[ \hat{x}_0 = f_\theta\!\bigl(\text{mask}(x_0,\; t),\; t\bigr) \]

The model predicts all positions at once. At generation time it starts from pure noise (all masks) and iteratively fills in characters. Positions can be revealed in any order — not just left to right.

Here is what these two approaches look like in practice, step by step, for the name rahul.

Autoregressive: left to right

Step 1: r · · · ·   (sample p(x1))
Step 2: r a · · ·   (sample p(x2 | r))
Step 3: r a h · ·   (sample p(x3 | r, a))
Step 4: r a h u ·   (sample p(x4 | r, a, h))
Step 5: r a h u l   (sample p(x5 | r, a, h, u))

Five sequential steps. Each step must wait for the previous one. The model never sees characters to the right of the cursor.

Discrete diffusion: denoise in parallel

 
Start:  ? ? ? ? ?   (all masked, t = 1)
Step 1: ? ? h ? ?   (reveal one, t = 0.7)
Step 2: ? a h u ?   (reveal two more, t = 0.4)
Step 3: r a h u l   (final clean pass, t = 0)

Three parallel steps. The model predicts all positions at once. Characters can appear in any order — the middle before the start.

Why this matters. Discrete diffusion for text is an active research area (MDLM, SEDD, UDLM). The core idea is simple enough to fit in a browser demo. This page builds the smallest useful version: a mask-diffusion model on fixed-length character sequences. We will train it on 40 Indian names and watch it learn to generate new ones.

II. How corruption dissolves a name

The forward corruption process is dead simple. Given a clean name like priya, pad it to a fixed length and replace each character with a mask token ? independently with probability t. At t = 0 the name is untouched. At t = 1 every character is masked and all information is destroyed.

\[ x_t[i] = \begin{cases} \texttt{?} & \text{with probability } t \\ x_0[i] & \text{otherwise} \end{cases} \]

This is the discrete analogue of adding Gaussian noise in image diffusion. Instead of blurring pixel values, we erase characters. The noise level t plays the same role as the diffusion timestep: it controls how much information the model gets to work with.
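The whole corruption process fits in one function. A minimal sketch, using the page's `?` mask token:

```javascript
// Forward corruption: mask each character independently with probability t.
// At t = 0 nothing changes; at t = 1 every character becomes "?".
const MASK = "?";

function corrupt(name, t) {
  return name
    .split("")
    .map(ch => (Math.random() < t ? MASK : ch))
    .join("");
}
```

For example, `corrupt("priya", 0.5)` masks roughly half the characters, while the two extremes are deterministic: `corrupt("priya", 0)` returns the name untouched and `corrupt("priya", 1)` returns all masks.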

Slide the noise level to watch characters dissolve into masks. Each name uses a fixed random order so the reveal is smooth as you drag. Notice how short names like dev lose all information faster than long ones like reyansh.

Training dataset — 40 Indian names

III. A model that guesses the original

The model takes the corrupted sequence xt plus the noise level t, and outputs logits for the clean character at every position. We use the simplest possible architecture: embed each token, flatten everything, concatenate the time, pass through a hidden layer, and reshape into per-position predictions.

Input (corrupted name, padded to length 7)
→ embed each token (8 dims)
→ flatten: 7 × 8 = 56 values; append time t → 57-dim vector
→ Linear (57 → 128) + ReLU (~28k learnable parameters total)
→ Linear (128 → 154) → reshape to 7 × 22
→ softmax per position → probability over 22 characters
Output (predicted clean name)

The model is a plain two-layer MLP — no attention, no recurrence. It sees all positions at once and predicts them all in parallel. The vocabulary has 22 symbols: 20 letters that appear in the names, plus a padding token _ and a mask token ?.
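The parameter count is easy to check by hand. A quick tally from the layer sizes above (embedding table plus two linear layers, weights and biases) lands near the page's ~28k figure:

```javascript
// Parameter tally for the two-layer MLP described above.
const vocab = 22, len = 7, embed = 8, hidden = 128;

const embedParams = vocab * embed;            // 22 × 8 = 176
const inDim = len * embed + 1;                // 56 flattened values + time t = 57
const layer1 = inDim * hidden + hidden;       // 57 × 128 weights + 128 biases = 7424
const outDim = len * vocab;                   // 7 × 22 = 154
const layer2 = hidden * outDim + outDim;      // 128 × 154 weights + 154 biases = 19866
const total = embedParams + layer1 + layer2;  // 27466, i.e. the "~28k" quoted above
```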
\[ \mathcal{L} = -\frac{1}{L}\sum_{i=1}^{L} \log\, p_\theta(x_{0,i} \mid x_t,\; t) \]

The loss is cross-entropy averaged over all positions. The model must predict every character of the original name — whether that position was masked or not. Positions it can see give easy signal; masked positions force real learning. Time conditioning via t tells the model how noisy the input is, so it can calibrate its confidence accordingly.
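Computed directly from per-position probabilities, the loss looks like this (a sketch; `probs` is a hypothetical per-position array of character → probability maps, as a trained model would output after softmax):

```javascript
// Cross-entropy averaged over all positions: for each position i, take
// -log of the probability assigned to the true clean character x0[i].
function crossEntropy(probs, clean) {
  let total = 0;
  for (let i = 0; i < clean.length; i++) {
    total += -Math.log(probs[i][clean[i]]);
  }
  return total / clean.length;
}
```

A uniform guess over the 22-character vocabulary gives a loss of ln 22 ≈ 3.09, which is why an untrained model starts near 3.1.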

During training, each step picks a random name, a random noise level t ~ Uniform(0, 1), corrupts the name, and backpropagates the prediction error. Over time the model learns the structure of Indian names — common character patterns, typical lengths, where vowels and consonants tend to fall.
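One training step, sketched. `model.forward` and `model.backward` are hypothetical stand-ins for the page's hand-rolled MLP and backprop, and `corruptFn` is the masking function from the previous section:

```javascript
// One SGD step: random name, random noise level, corrupt, predict, update.
function trainStep(model, names, corruptFn, lr) {
  const name = names[Math.floor(Math.random() * names.length)]; // random example
  const t = Math.random();                                      // t ~ Uniform(0, 1)
  const xt = corruptFn(name, t);                                // masked input
  const { loss, grads } = model.forward(xt, t, name);           // cross-entropy vs clean name
  model.backward(grads, lr);                                    // apply gradient update
  return loss;
}
```

Looping this a few thousand times is the entire training procedure; there is no batching or scheduling in the toy version.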

IV. Train in your browser

Below is a live neural network. Click Train to initialize random weights and run stochastic gradient descent. Each training step picks a random name, a random noise level, corrupts the name, and backpropagates. The loss should drop from ~3.1 (random guessing over 22 characters) to near zero.

After training completes, click +500 more to continue training for extra epochs. You can adjust the learning rate and see how it affects convergence. A high learning rate trains fast but can overshoot. A low one is stable but slow.

All ~28,000 parameters are trained here in vanilla JavaScript — no server, no GPU, no ML libraries. Forward pass, cross-entropy loss, and backpropagation are computed step by step with raw matrix arithmetic.

V. Denoise step by step

Once trained, generation starts from a fully masked sequence ? ? ? ? ? ? ? and runs the model repeatedly. At each step the model predicts clean characters for every position. We reveal a fraction of the masked positions and repeat, gradually lowering the noise level until the sequence is fully determined.

The temperature controls randomness: lower values make the model more confident (more likely to pick training names exactly), higher values introduce variation (sometimes producing novel combinations). The number of steps controls how gradually positions are revealed.
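The predict-reveal loop can be sketched as follows. `predict(seq, t)` is a hypothetical model call returning, per position, a character → probability map; the schedule that reveals roughly 1/s of the remaining masks per step is one simple choice, not the only one:

```javascript
// Iterative denoising: start fully masked, reveal a few positions per step.
function denoise(predict, length, steps, temperature) {
  const seq = Array(length).fill("?");
  for (let s = steps; s >= 1; s--) {
    const t = s / steps;                                       // current noise level
    const probs = predict(seq, t);
    const masked = seq.map((c, i) => (c === "?" ? i : -1)).filter(i => i >= 0);
    const toReveal = Math.ceil(masked.length / s);             // ~1/s of what's left
    for (const i of masked.slice(0, toReveal)) {
      seq[i] = sampleChar(probs[i], temperature);
    }
  }
  return seq.join("");
}

// Sample one character after applying temperature: p^(1/T), renormalized.
// T < 1 sharpens the distribution; T > 1 flattens it.
function sampleChar(dist, temp) {
  const entries = Object.entries(dist).map(([ch, p]) => [ch, Math.pow(p, 1 / temp)]);
  const z = entries.reduce((acc, [, p]) => acc + p, 0);
  let r = Math.random() * z;
  for (const [ch, p] of entries) {
    r -= p;
    if (r <= 0) return ch;
  }
  return entries[entries.length - 1][0];
}
```

At the final step `s = 1`, `toReveal` equals the number of remaining masks, so the output is always fully determined.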

Denoising timeline

Each row shows the sequence state at one denoising step. Newly revealed characters are highlighted in purple. The final row is a clean pass at t = 0 that polishes the result. Try different temperatures — low values reproduce training names, high values explore new combinations.

VI. What we built

This toy model is BERT-style masked denoising + timestep conditioning + iterative sampling. It is the simplest practical diffusion-like model for character sequences, and it captures all the essential ingredients of discrete diffusion for text.

1. Corruption is the noise

Random masking plays the role of Gaussian noise in image diffusion.

Instead of adding continuous noise to pixel values, we replace discrete tokens with a mask symbol. The noise level t controls how much is hidden.

2. Whole-sequence prediction

The model predicts every position at once, not one at a time.

This is the key departure from autoregressive models. The model sees the full (corrupted) context and produces a complete clean prediction in parallel.

3. Time tells the model how noisy the input is

Conditioning on t lets a single model handle all noise levels.

At high t, the model must guess from almost nothing. At low t, it mostly confirms what it already sees. One set of weights serves the whole corruption spectrum.

4. Iterative refinement at generation

Sampling is a loop: predict, reveal some, repeat.

Starting from all masks, each step reveals a few more characters. This iterative structure — not the architecture — is what makes it "diffusion."

From toy to real. The real versions of this idea (MDLM, SEDD, Plaid) use transformers, variable lengths, and much larger vocabularies. But the mechanism you just saw — mask, predict, reveal — is the same one at the core. The key insight is that you do not need Gaussian diffusion or continuous embeddings to get diffusion-style generation. Discrete masking is enough.

VII. From words to answers

The name generator creates sequences from nothing — unconditional generation. But diffusion can also do conditional generation: given a question, produce the answer. The only change is what gets masked.

We will train a second model on 30 English-to-Hindi word pairs. Each training example looks like hello>namaste. The question and separator > are never masked — only the answer gets corrupted. The model learns to denoise the answer while reading the question as fixed context.
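The only code change from the unconditional corruptor is skipping everything up to and including the separator. A minimal sketch:

```javascript
// Conditional corruption: the question and the ">" separator stay intact;
// only characters after ">" are masked with probability t.
function corruptAnswer(pair, t) {
  const sep = pair.indexOf(">");
  const question = pair.slice(0, sep + 1);   // e.g. "hello>" stays clean
  const answer = pair
    .slice(sep + 1)
    .split("")
    .map(ch => (Math.random() < t ? "?" : ch))
    .join("");
  return question + answer;
}
```

So `corruptAnswer("hello>namaste", 1)` yields `hello>???????`: the question survives at every noise level, which is what lets it steer the denoising.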


Training dataset — 30 English → Hindi pairs

Loss is computed only on answer positions. The model sees the question as context and learns what answer should follow. At inference, you type a question, mask the entire answer region, and iteratively denoise — exactly the same loop as before, but now the question steers the output.

\[ \mathcal{L}_{\text{cond}} = -\frac{1}{|\mathcal{A}|} \sum_{i \in \mathcal{A}} \log\, p_\theta(x_{0,i} \mid x_t,\; t) \qquad \mathcal{A} = \{\text{answer positions}\} \]

Train the translator

This model has ~43k parameters. It only needs to learn 30 mappings, so training converges quickly. The loss drops to zero when the model has memorized every pair.

Translate a word

Denoising timeline — question fixed, answer revealed

The question (blue cells) stays locked. Only the answer positions (right of >) get denoised step by step. This is conditional diffusion — the same iterative reveal, but steered by a fixed input.

Unconditional vs conditional diffusion

            Names (unconditional)    Q&A (conditional)
Corrupt     All positions            Answer only
Loss        All positions            Answer only
Sampling    Start from all masks     Fix question, mask answer
Output      Random name              Answer to a given question