Vision Language Models from Scratch
Educational tutorials on building and understanding Vision Transformers
Welcome
This website contains educational notebooks that teach you how Vision Transformers (ViT) work, from high-level usage to implementation details.
Learning Path
This series of notebooks takes you on a journey from using Vision Transformers to implementing advanced self-supervised learning techniques. Follow them in order for the best learning experience.
1. Vision Transformer Usage
Start here! Learn how to use a pre-trained Vision Transformer for image classification.
Topics:
- Loading and using pre-trained ViT models (timm library; sketched below)
- Understanding model architecture (patches, embeddings, attention heads)
- Fine-tuning ViT on CIFAR-10
- Visualizing predictions and confusion matrices
Level: Beginner · Time: 20-30 minutes
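To make the starting point concrete, here is a minimal inference sketch using timm. The model name `vit_base_patch16_224` is an illustrative choice and may differ from the variant the notebook uses.

```python
import torch
import timm

# Load a pre-trained ViT (the model name is an illustrative assumption).
model = timm.create_model("vit_base_patch16_224", pretrained=True)
model.eval()

# Recover the preprocessing the model expects and build the matching transform.
config = timm.data.resolve_data_config({}, model=model)
transform = timm.data.create_transform(**config)

# Dummy batch standing in for a real transformed image.
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = model(x)            # (1, 1000) ImageNet logits
print(logits.argmax(dim=-1))     # predicted class index
```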
2. Create Segmentation Dataset
Build a custom synthetic dataset for segmentation tasks.
Topics:
- Creating programmatic datasets with shapes
- Generating image-mask pairs (sketched below)
- Saving and loading datasets from disk
- Dataset design for segmentation
Level: Beginner · Time: 15-20 minutes
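As a rough sketch of what such a generator can look like, the function below draws a random rectangle into an image and records the same rectangle in a class-label mask. The shape type, canvas size, and class labels are all illustrative assumptions, not the notebook's exact recipe.

```python
import numpy as np
from PIL import Image, ImageDraw

def make_sample(size=64, rng=None):
    """Generate one synthetic image/mask pair: a white rectangle on black."""
    rng = rng or np.random.default_rng()
    img = Image.new("RGB", (size, size), "black")
    mask = Image.new("L", (size, size), 0)                    # 0 = background
    x0, y0 = (int(v) for v in rng.integers(0, size // 2, size=2))
    x1 = x0 + int(rng.integers(8, size // 2))
    y1 = y0 + int(rng.integers(8, size // 2))
    ImageDraw.Draw(img).rectangle([x0, y0, x1, y1], fill="white")
    ImageDraw.Draw(mask).rectangle([x0, y0, x1, y1], fill=1)  # 1 = shape class
    return np.asarray(img), np.asarray(mask)

image, mask = make_sample()
print(image.shape, mask.shape)  # (64, 64, 3) (64, 64)
```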
3. ViT Segmentation
Adapt Vision Transformers for pixel-level segmentation tasks.
Topics:
- Modifying ViT for dense prediction (segmentation)
- Building decoder heads for patch-to-pixel upsampling (sketched below)
- Training on custom segmentation datasets
- Evaluating with IoU metrics
Level: Intermediate · Time: 30-40 minutes
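One way to picture the decoder step: drop the CLS token, fold the remaining patch tokens back into their 2D grid, classify each grid cell, and upsample to pixel resolution. The sketch below assumes a 224x224 input with 16x16 patches (a 14x14 grid of 768-dim tokens); the notebook's actual head may be more elaborate.

```python
import torch
import torch.nn as nn

class SimpleSegHead(nn.Module):
    """Minimal decoder head: patch tokens -> spatial grid -> per-pixel logits."""
    def __init__(self, embed_dim=768, num_classes=2, grid=14, patch=16):
        super().__init__()
        self.grid, self.patch = grid, patch
        self.classify = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, tokens):                # tokens: (B, N, D), CLS removed
        B, N, D = tokens.shape
        x = tokens.transpose(1, 2).reshape(B, D, self.grid, self.grid)
        x = self.classify(x)                  # (B, num_classes, 14, 14)
        return nn.functional.interpolate(     # (B, num_classes, 224, 224)
            x, scale_factor=self.patch, mode="bilinear", align_corners=False)

head = SimpleSegHead()
out = head(torch.randn(2, 196, 768))
print(out.shape)  # torch.Size([2, 2, 224, 224])
```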
4. Vision Transformer from Scratch
Build a complete Vision Transformer from the ground up in PyTorch.
Topics:
- Implementing patch embedding from scratch (sketched below)
- Building self-attention mechanisms (single-head and multi-head)
- Adding residual connections and layer normalization
- Complete ViT architecture implementation
- Training and evaluation on CIFAR-10
Level: Intermediate to Advanced · Time: 45-60 minutes
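The first building block, patch embedding, is often implemented as a single strided convolution: a kernel size and stride equal to the patch size split the image into non-overlapping patches and project each one to an embedding vector. A minimal sketch with CIFAR-10-sized inputs (the hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into patches and project each to an embedding vector."""
    def __init__(self, img_size=32, patch=4, in_ch=3, dim=192):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.num_patches = (img_size // patch) ** 2

    def forward(self, x):                    # x: (B, 3, 32, 32)
        x = self.proj(x)                     # (B, dim, 8, 8)
        return x.flatten(2).transpose(1, 2)  # (B, 64, dim): one token per patch

emb = PatchEmbed()
tokens = emb(torch.randn(2, 3, 32, 32))
print(tokens.shape)  # torch.Size([2, 64, 192])
```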
5. Masked Dataset & Autoencoder
Prepare for self-supervised pre-training: build a masked-image dataset and a convolutional autoencoder as a stepping stone toward MAE (Masked Autoencoders).
Topics:
- Creating masked image datasets (75% masking; sketched below)
- Building convolutional autoencoders
- Skip connections (U-Net style)
- Reconstruction loss and training
- Next step: Implement MAE with Vision Transformers
Level: Advanced · Time: 25-35 minutes
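The masking step itself is small: sample a random score per patch, keep the lowest-scoring 25%, and mark the rest as masked. The sketch below assumes per-sample uniform random masking at the 75% ratio mentioned above; the function name is hypothetical.

```python
import torch

def random_masking(num_patches: int, batch: int, ratio: float = 0.75):
    """Randomly mask `ratio` of patches per sample (hypothetical helper)."""
    keep = int(num_patches * (1 - ratio))
    noise = torch.rand(batch, num_patches)      # random score per patch
    keep_idx = noise.argsort(dim=1)[:, :keep]   # lowest-score patches are kept
    mask = torch.ones(batch, num_patches, dtype=torch.bool)
    mask.scatter_(1, keep_idx, False)           # False = visible, True = masked
    return keep_idx, mask

keep_idx, mask = random_masking(num_patches=64, batch=2)
print(keep_idx.shape, mask.float().mean())  # torch.Size([2, 16]) tensor(0.7500)
```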
Prerequisites
All notebooks require:
- Python 3.8+
- PyTorch
- torchvision
- timm (for pre-trained models)
- matplotlib
Install dependencies:
uv pip install torch torchvision timm matplotlib
About
These notebooks are designed to be educational and hands-on. Each tutorial builds your understanding progressively, from using pre-trained models to implementing architectures from scratch.
Goals:
- Understand how Vision Transformers process images
- Learn the role of the CLS token in classification (sketched below)
- Build intuition for self-attention in vision tasks
- Implement ViT components yourself
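As a preview of the CLS-token goal: in a standard ViT, a learned token is prepended to the patch sequence, and after the transformer blocks its final hidden state alone feeds the classification head. A minimal sketch (dimensions are illustrative):

```python
import torch
import torch.nn as nn

# A learned CLS token is prepended to the patch tokens; after the transformer
# blocks (elided here), only its hidden state goes to the classifier.
cls_token = nn.Parameter(torch.zeros(1, 1, 192))
patch_tokens = torch.randn(2, 64, 192)                             # (B, N, D)
x = torch.cat([cls_token.expand(2, -1, -1), patch_tokens], dim=1)  # (B, 65, D)
# ... transformer blocks would process x here ...
logits = nn.Linear(192, 10)(x[:, 0])       # classify from the CLS token alone
print(logits.shape)  # torch.Size([2, 10])
```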
Happy learning!