---
pagetitle: "VLM from Scratch"
toc: false
format:
  html:
    page-layout: custom
    css: styles.css
    include-in-header:
      text: |
        <link rel="preconnect" href="https://fonts.googleapis.com">
        <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
        <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&display=swap" rel="stylesheet">
---

::: {.hero}
::: {.hero-content}
# VLM from Scratch

Build Vision-Language Models from the ground up.

::: {.hero-subtitle}
A 12-part journey from a simple image captioner to advanced multi-modal AI systems. No magic, no black boxes — just pure understanding.
:::

::: {.hero-cta}
[Start Learning](#the-journey){.btn-primary}
[View on GitHub](https://github.com/nipunbatra/vlm-from-scratch){.btn-secondary}
:::

:::
:::

::: {.section-dark}
::: {.container}
## What You'll Build

::: {.stats-grid}
::: {.stat-card}
### 12
Hands-on notebooks
:::

::: {.stat-card}
### 8+
Vision tasks mastered
:::

::: {.stat-card}
### ~700M
Parameters trained
:::

::: {.stat-card}
### 0
Black boxes
:::
:::

:::
:::


::: {.section-light}
::: {.container}
## The Journey {#the-journey}

::: {.journey-grid}

::: {.journey-card .foundation}
::: {.part-number}
01
:::
### Minimal VLM
Connect a Vision Transformer to a Language Model. Train on Flickr8k. Generate your first captions.

[Start here →](notebooks/part-01-minimal-vlm.ipynb)
:::

::: {.journey-card .instruction}
::: {.part-number}
02
:::
### Object Detection
Instruction-tune for detection. Output structured JSON. Predict bounding boxes.

[Continue →](notebooks/part-02-object-detection.ipynb)
:::

::: {.journey-card .instruction}
::: {.part-number}
03
:::
### Visual QA
Answer questions about images. Train on A-OKVQA. Build visual understanding.

[Continue →](notebooks/part-03-visual-qa.ipynb)
:::

::: {.journey-card .advanced}
::: {.part-number}
04
:::
### Multi-Task
One model, three tasks. Unified architecture for captioning, detection, and VQA.

[Continue →](notebooks/part-04-multi-task.ipynb)
:::

::: {.journey-card .advanced}
::: {.part-number}
05
:::
### Multi-Image
Process image sequences. Temporal reasoning. Inspired by TEOChat.

[Continue →](notebooks/part-05-multi-image.ipynb)
:::

::: {.journey-card .advanced}
::: {.part-number}
06
:::
### Task Routing
Auto-detect task from prompt. Intelligent routing. Seamless UX.

[Continue →](notebooks/part-06-task-routing.ipynb)
:::

::: {.journey-card .reasoning}
::: {.part-number}
07
:::
### Chain of Thought
Step-by-step visual reasoning. Think before answering. Improved accuracy.

[Continue →](notebooks/part-07-chain-of-thought.ipynb)
:::

::: {.journey-card .reasoning}
::: {.part-number}
08
:::
### Referring Segmentation
"The dog on the left" → polygon mask. Visual grounding meets segmentation.

[Continue →](notebooks/part-08-referring-segmentation.ipynb)
:::

::: {.journey-card .application}
::: {.part-number}
09
:::
### Image Editing
Natural language edits. "Make it brighter" → transformed image. VLM + PIL.

[Continue →](notebooks/part-09-image-editing.ipynb)
:::

::: {.journey-card .application}
::: {.part-number}
10
:::
### Compression
Quantization, pruning, distillation. 8x smaller. Production-ready.

[Continue →](notebooks/part-10-compression.ipynb)
:::

::: {.journey-card .application}
::: {.part-number}
11
:::
### Image Generation
VLM meets Stable Diffusion. Understand to create. Full circle.

[Continue →](notebooks/part-11-image-generation.ipynb)
:::

::: {.journey-card .application}
::: {.part-number}
12
:::
### OCR to LaTeX
Math equations in images → LaTeX code. TrOCR architecture.

[Continue →](notebooks/part-12-ocr-latex.ipynb)
:::

:::
:::
:::


::: {.section-dark}
::: {.container}
## The Architecture

::: {.architecture-visual}
```
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│    Image         Vision           Projection        Language        │
│    Input    →    Encoder     →    Layer       →     Model → Output  │
│                  (ViT)            (Linear)          (SmolLM)        │
│                                                                     │
│   224×224       768-dim           2048-dim          Text            │
│   patches       features          aligned           tokens          │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
:::

::: {.tech-stack}
**Vision Encoder**: ViT-Base/Large (86M-304M params)\
**Language Model**: SmolLM-135M/360M\
**Training**: PyTorch, HuggingFace Transformers\
**Datasets**: Flickr8k, A-OKVQA, RefCOCO, COCO
:::
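
To make the wiring concrete, here is a minimal sketch of the ViT → Linear projection → SmolLM pipeline. The Hugging Face model names (`google/vit-base-patch16-224-in21k`, `HuggingFaceTB/SmolLM-135M`) and the exact projection setup are illustrative assumptions, not the precise configuration used in the notebooks:

```python
# Minimal sketch of the pipeline above, not the exact code from the notebooks.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoModelForCausalLM


class TinyVLM(nn.Module):
    def __init__(self,
                 vision_name="google/vit-base-patch16-224-in21k",
                 lm_name="HuggingFaceTB/SmolLM-135M"):
        super().__init__()
        self.vision = AutoModel.from_pretrained(vision_name)     # ViT encoder
        self.lm = AutoModelForCausalLM.from_pretrained(lm_name)  # SmolLM decoder
        # Linear projection aligning vision features with the LM's hidden size.
        self.projector = nn.Linear(self.vision.config.hidden_size,
                                   self.lm.config.hidden_size)

    def forward(self, pixel_values, input_ids, attention_mask=None, labels=None):
        # 1. Encode the 224x224 image into patch features (768-dim for ViT-Base).
        patches = self.vision(pixel_values=pixel_values).last_hidden_state
        # 2. Project patch features into the language model's embedding space.
        image_embeds = self.projector(patches)
        # 3. Prepend image embeddings to the text token embeddings.
        text_embeds = self.lm.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)

        n_img = image_embeds.shape[1]
        if attention_mask is not None:
            img_mask = attention_mask.new_ones(attention_mask.shape[0], n_img)
            attention_mask = torch.cat([img_mask, attention_mask], dim=1)
        if labels is not None:
            # Image positions carry no text targets; mask them out of the loss.
            img_labels = labels.new_full((labels.shape[0], n_img), -100)
            labels = torch.cat([img_labels, labels], dim=1)

        # 4. The language model attends over [image, text] and produces tokens.
        return self.lm(inputs_embeds=inputs_embeds,
                       attention_mask=attention_mask,
                       labels=labels)
```

Part 1 builds this wiring step by step before training it on Flickr8k.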

:::
:::


::: {.section-light}
::: {.container}
## Prerequisites

::: {.prereq-grid}
::: {.prereq-card}
### Python
Comfortable with classes, functions, list comprehensions
:::

::: {.prereq-card}
### PyTorch
Tensors, autograd, nn.Module basics
:::

::: {.prereq-card}
### Deep Learning
Loss functions, backprop, training loops
:::

::: {.prereq-card}
### ~4GB VRAM
GPU recommended (Colab works great)
:::
:::

:::
:::


::: {.section-cta}
::: {.container}

## Ready to see how VLMs really work?

::: {.final-cta}
[Begin with Part 1 →](notebooks/part-01-minimal-vlm.ipynb){.btn-primary-large}
:::

:::
:::


::: {.footer}
::: {.container}
Built with curiosity. Every line explained.

[GitHub](https://github.com/nipunbatra/vlm-from-scratch) · [LLM from Scratch](https://github.com/nipunbatra/llm-from-scratch) · Made by [Nipun Batra](https://nipunbatra.github.io)
:::
:::