Vision Transformers from Scratch

Educational tutorials on building and understanding Vision Transformers

Welcome

This website contains educational notebooks that teach you how Vision Transformers (ViT) work, from high-level usage to implementation details.

Learning Path

This series of notebooks takes you on a journey from using Vision Transformers to implementing advanced self-supervised learning techniques. Follow them in order for the best learning experience.


1. Vision Transformer Usage

Start here! Learn how to use a pre-trained Vision Transformer for image classification.

Topics:

  • Loading and using pre-trained ViT models (timm library)
  • Understanding model architecture (patches, embeddings, attention heads)
  • Fine-tuning ViT on CIFAR-10
  • Visualizing predictions and confusion matrices

Level: Beginner · Time: 20-30 minutes
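
For a taste of the notebook's starting point, here is a minimal sketch of loading a pre-trained ViT with timm and running one forward pass. The checkpoint name is one common timm model, and the random input tensor is a stand-in for a real preprocessed image, not the notebook's exact code:

    import torch
    import timm

    # Load a pre-trained ViT; "vit_base_patch16_224" is one common checkpoint.
    model = timm.create_model("vit_base_patch16_224", pretrained=True)
    model.eval()

    # A real pipeline would preprocess an actual image; a random tensor with
    # the expected (batch, channels, height, width) shape stands in here.
    x = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        probs = model(x).softmax(dim=-1)    # (1, 1000) ImageNet probabilities

    print(probs.topk(5).indices)            # top-5 predicted class indices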


2. Create Segmentation Dataset

Build a custom synthetic dataset for segmentation tasks.

Topics:

  • Creating programmatic datasets with shapes
  • Generating image-mask pairs
  • Saving and loading datasets from disk
  • Dataset design for segmentation

Level: Beginner · Time: 15-20 minutes
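
A generator in the spirit of this notebook might look like the following sketch, which draws one random square and records it in a class-1 mask. The canvas size, colors, and class layout are illustrative choices:

    import numpy as np
    from PIL import Image, ImageDraw

    def make_pair(canvas=128):
        # Black canvas plus an all-background (class 0) mask.
        img = Image.new("RGB", (canvas, canvas), "black")
        mask = Image.new("L", (canvas, canvas), 0)
        di, dm = ImageDraw.Draw(img), ImageDraw.Draw(mask)

        # One random square; the mask marks its pixels as class 1.
        x0, y0 = np.random.randint(0, canvas // 2, size=2)
        box = [int(x0), int(y0), int(x0) + canvas // 4, int(y0) + canvas // 4]
        di.rectangle(box, fill="red")
        dm.rectangle(box, fill=1)
        return np.array(img), np.array(mask)

    image, mask = make_pair()
    print(image.shape, mask.shape)   # (128, 128, 3) (128, 128)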


3. ViT Segmentation

Adapt Vision Transformers for pixel-level segmentation tasks.

Topics:

  • Modifying ViT for dense prediction (segmentation)
  • Building decoder heads for patch-to-pixel upsampling
  • Training on custom segmentation datasets
  • Evaluating with IoU metrics

Level: Intermediate · Time: 30-40 minutes
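
As a rough sketch of the decoder-head idea (assumed token shapes, an illustrative 1x1-convolution head, and bilinear upsampling), patch tokens can be reshaped into a feature map and upsampled to per-pixel logits, with a quick binary IoU check at the end:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PatchDecoder(nn.Module):
        def __init__(self, embed_dim=768, num_classes=2, patch=16):
            super().__init__()
            self.patch = patch
            self.head = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

        def forward(self, tokens, hw):
            # tokens: (B, N, D) patch tokens with CLS dropped; hw: patch grid side
            B, N, D = tokens.shape
            feat = tokens.transpose(1, 2).reshape(B, D, hw, hw)
            logits = self.head(feat)             # (B, C, H/patch, W/patch)
            return F.interpolate(logits, scale_factor=self.patch,
                                 mode="bilinear", align_corners=False)

    tokens = torch.randn(2, 14 * 14, 768)        # e.g. 224x224 image, 16x16 patches
    masks = PatchDecoder()(tokens, hw=14)
    print(masks.shape)                           # torch.Size([2, 2, 224, 224])

    # IoU for the foreground class, with a stand-in ground truth:
    pred = masks.argmax(1)
    target = torch.randint(0, 2, pred.shape)
    inter = ((pred == 1) & (target == 1)).sum()
    union = ((pred == 1) | (target == 1)).sum()
    print((inter / union.clamp(min=1)).item())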


4. Vision Transformer from Scratch

Build a complete Vision Transformer from the ground up in PyTorch.

Topics:

  • Implementing patch embedding from scratch
  • Building self-attention mechanisms (single-head and multi-head)
  • Adding residual connections and layer normalization
  • Complete ViT architecture implementation
  • Training and evaluation on CIFAR-10

Level: Intermediate to Advanced · Time: 45-60 minutes
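
The first building block, patch embedding, can be sketched in a few lines. The hyperparameters below are illustrative, not the notebook's exact settings:

    import torch
    import torch.nn as nn

    class PatchEmbed(nn.Module):
        def __init__(self, img_size=32, patch=4, in_ch=3, dim=192):
            super().__init__()
            # A strided convolution both cuts the image into non-overlapping
            # patches and projects each patch to the embedding dimension.
            self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
            self.num_patches = (img_size // patch) ** 2

        def forward(self, x):
            x = self.proj(x)                      # (B, dim, H/patch, W/patch)
            return x.flatten(2).transpose(1, 2)   # (B, num_patches, dim)

    x = torch.randn(8, 3, 32, 32)                 # CIFAR-10-sized input
    print(PatchEmbed()(x).shape)                  # torch.Size([8, 64, 192])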


5. Masked Dataset & Autoencoder

Prepare for self-supervised pre-training: build the masked-reconstruction pipeline that leads into a Masked Autoencoder (MAE).

Topics:

  • Creating masked image datasets (75% masking)
  • Building convolutional autoencoders
  • Skip connections (U-Net style)
  • Reconstruction loss and training
  • Next step: Implement MAE with Vision Transformers

Level: Advanced · Time: 25-35 minutes
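
The core masking step can be sketched as follows: shuffle patch indices per image and keep only the visible 25%. The shapes and names here are illustrative assumptions:

    import torch

    def random_mask(patches, mask_ratio=0.75):
        # patches: (B, N, D) patch embeddings
        B, N, D = patches.shape
        n_keep = int(N * (1 - mask_ratio))
        noise = torch.rand(B, N)                  # per-patch random scores
        keep = noise.argsort(dim=1)[:, :n_keep]   # indices of visible patches
        visible = torch.gather(
            patches, 1, keep.unsqueeze(-1).expand(-1, -1, D)
        )
        mask = torch.ones(B, N, dtype=torch.bool)
        mask.scatter_(1, keep, False)             # True = masked out
        return visible, mask

    vis, mask = random_mask(torch.randn(2, 196, 768))
    print(vis.shape, mask.float().mean().item())  # (2, 49, 768) 0.75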


Prerequisites

All notebooks require:

  • Python 3.8+
  • PyTorch
  • torchvision
  • timm (for pre-trained models)
  • matplotlib

Install dependencies:

uv pip install torch torchvision timm matplotlib
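
To confirm the environment is ready, a quick import-and-version check works (the exact versions printed will vary by install):

    import torch, torchvision, timm, matplotlib
    print(torch.__version__, torchvision.__version__,
          timm.__version__, matplotlib.__version__)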

About

These notebooks are designed to be educational and hands-on. Each tutorial builds your understanding progressively, from using pre-trained models to implementing architectures from scratch.

Goals:

  • Understand how Vision Transformers process images
  • Learn the role of the CLS token in classification
  • Build intuition for self-attention in vision tasks
  • Implement ViT components yourself
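
As a quick illustration of the CLS-token goal above: a learnable token is prepended to the patch sequence, and its output embedding feeds the classifier head. The dimensions below are illustrative, and the encoder blocks are elided:

    import torch
    import torch.nn as nn

    dim, num_patches, num_classes = 192, 64, 10
    cls_token = nn.Parameter(torch.zeros(1, 1, dim))    # learnable CLS token
    head = nn.Linear(dim, num_classes)

    patches = torch.randn(8, num_patches, dim)          # embedded patches
    tokens = torch.cat([cls_token.expand(8, -1, -1), patches], dim=1)
    # ... transformer encoder blocks would run here ...
    logits = head(tokens[:, 0])                         # classify from CLS output
    print(logits.shape)                                 # torch.Size([8, 10])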

Happy learning!