DINOv2 vs CNN: Transfer Learning vs Training from Scratch
Comparing DINOv2 fine-tuning with training a CNN from scratch for binary image classification
Deep Learning
Computer Vision
Author
Nipun Batra
Published
August 17, 2025
This notebook compares two approaches for binary image classification:
Fine-tuning DINOv2: Using Meta’s self-supervised vision transformer
Training from Scratch: Building a CNN classifier from the ground up
We’ll examine:
- Training time and convergence differences
- Data efficiency with limited samples
- Final performance comparison
- Parameter efficiency
What is DINOv2?
DINOv2 is Meta’s self-supervised vision transformer that learns robust visual representations without labels. Key features:
Self-supervised learning: Trained on millions of images without human annotations
Strong representations: Captures semantic and geometric features
Transfer learning: Excellent for fine-tuning on downstream tasks
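To make the fine-tuning setup concrete, here is a minimal sketch of wrapping a pretrained DINOv2 backbone with a linear classification head. The facebook/dinov2-base checkpoint and the DINOv2Classifier wrapper are assumptions for illustration, not necessarily the exact model built later in the notebook.

# Sketch: DINOv2 backbone + linear head for binary classification.
# The checkpoint name and wrapper class are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel

class DINOv2Classifier(nn.Module):
    def __init__(self, num_classes=2, backbone_name="facebook/dinov2-base"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        hidden_size = self.backbone.config.hidden_size  # 768 for dinov2-base
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, pixel_values):
        outputs = self.backbone(pixel_values=pixel_values)
        cls_token = outputs.last_hidden_state[:, 0]  # CLS token embedding
        return self.head(cls_token)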
Installing scikit-learn...
All packages installed successfully!
# Core imports
import os
import time
import random
import warnings
import numpy as np
import pandas as pd
from PIL import Image
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings('ignore')
plt.style.use('default')
sns.set_palette("husl")
# PyTorch imports
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, Dataset

# Transformers and datasets
from transformers import AutoImageProcessor, AutoModel
from datasets import load_dataset

# Evaluation
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

# Check GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
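The warning above comes from loading the default (slow) image processor. If the minor output differences are acceptable, it can be silenced by requesting the fast processor explicitly; the checkpoint name below is an assumption, so adjust it to whichever DINOv2 variant is loaded.

# Request the fast image processor explicitly to avoid the deprecation warning.
# The checkpoint name is assumed for illustration.
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base", use_fast=True)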
# Training summary
print("\n" + "=" * 60)
print("TRAINING SUMMARY")
print("=" * 60)
print(f"DINOv2 - Best Val Accuracy: {dinov2_best_acc:.2f}% in {dinov2_time:.1f}s ({dinov2_epochs} epochs)")
print(f"CNN - Best Val Accuracy: {cnn_best_acc:.2f}% in {cnn_time:.1f}s ({cnn_epochs} epochs)")
print(f"\nTime per epoch:")
print(f"DINOv2: {dinov2_time/dinov2_epochs:.1f}s/epoch")
print(f"CNN: {cnn_time/cnn_epochs:.1f}s/epoch")
============================================================
TRAINING SUMMARY
============================================================
DINOv2 - Best Val Accuracy: 84.21% in 1312.2s (8 epochs)
CNN - Best Val Accuracy: 93.98% in 534.2s (15 epochs)
Time per epoch:
DINOv2: 164.0s/epoch
CNN: 35.6s/epoch
Test Set Evaluation
# Load best models
dinov2_model.load_state_dict(torch.load('best_dinov2_model.pth'))
cnn_model.load_state_dict(torch.load('best_cnn_model.pth'))

criterion = nn.CrossEntropyLoss()
print("Best models loaded!")
Best models loaded!
# Evaluate both models on test set
dinov2_test_loss, dinov2_test_acc, dinov2_preds, dinov2_labels = evaluate_model(
    dinov2_model, dinov2_test_loader, criterion, device
)
cnn_test_loss, cnn_test_acc, cnn_preds, cnn_labels = evaluate_model(
    cnn_model, cnn_test_loader, criterion, device
)

print("\n" + "=" * 60)
print("TEST SET EVALUATION")
print("=" * 60)
print(f"DINOv2 - Test Accuracy: {dinov2_test_acc:.2f}%, Test Loss: {dinov2_test_loss:.4f}")
print(f"CNN - Test Accuracy: {cnn_test_acc:.2f}%, Test Loss: {cnn_test_loss:.4f}")
============================================================
TEST SET EVALUATION
============================================================
DINOv2 - Test Accuracy: 83.59%, Test Loss: 0.3800
CNN - Test Accuracy: 91.41%, Test Loss: 0.2076
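The evaluate_model helper is defined earlier in the notebook; for reference, a minimal version consistent with how it is called here (returning average loss, accuracy in percent, predictions, and labels) could look like this sketch, assuming the test loaders yield (images, labels) batches.

# Sketch of an evaluation helper matching the call signature used above.
# Assumes each batch is a (images, labels) pair; adjust if the loader yields dicts.
import torch

def evaluate_model(model, loader, criterion, device):
    model.eval()
    total_loss, correct, total = 0.0, 0, 0
    all_preds, all_labels = [], []
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            total_loss += loss.item() * labels.size(0)
            preds = outputs.argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
            all_preds.extend(preds.cpu().tolist())
            all_labels.extend(labels.cpu().tolist())
    return total_loss / total, 100.0 * correct / total, all_preds, all_labels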
# Create comparison table
comparison_data = {
    'Metric': [
        'Model Parameters',
        'Training Epochs',
        'Learning Rate',
        'Total Training Time (s)',
        'Time per Epoch (s)',
        'Best Validation Accuracy (%)',
        'Test Accuracy (%)',
        'Test Loss'
    ],
    'DINOv2': [
        f'{dinov2_params:,}',
        dinov2_epochs,
        dinov2_lr,
        f'{dinov2_time:.1f}',
        f'{dinov2_time/dinov2_epochs:.1f}',
        f'{dinov2_best_acc:.2f}',
        f'{dinov2_test_acc:.2f}',
        f'{dinov2_test_loss:.4f}'
    ],
    'CNN from Scratch': [
        f'{cnn_params:,}',
        cnn_epochs,
        cnn_lr,
        f'{cnn_time:.1f}',
        f'{cnn_time/cnn_epochs:.1f}',
        f'{cnn_best_acc:.2f}',
        f'{cnn_test_acc:.2f}',
        f'{cnn_test_loss:.4f}'
    ]
}

comparison_df = pd.DataFrame(comparison_data)
print("\n" + "=" * 80)
print("COMPREHENSIVE MODEL COMPARISON")
print("=" * 80)
print(comparison_df.to_string(index=False))
================================================================================
COMPREHENSIVE MODEL COMPARISON
================================================================================
Metric DINOv2 CNN from Scratch
Model Parameters 86,679,170 6,813,442
Training Epochs 8 15
Learning Rate 0.0001 0.001
Total Training Time (s) 1312.2 534.2
Time per Epoch (s) 164.0 35.6
Best Validation Accuracy (%) 84.21 93.98
Test Accuracy (%) 83.59 91.41
Test Loss 0.3800 0.2076
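The parameter counts in the table (dinov2_params and cnn_params) are presumably computed with the standard idiom of summing tensor element counts; for example:

# Common idiom for counting model parameters (assumed, not shown in this section)
def count_parameters(model):
    return sum(p.numel() for p in model.parameters())

dinov2_params = count_parameters(dinov2_model)  # ~86.7M for the DINOv2 classifier
cnn_params = count_parameters(cnn_model)        # ~6.8M for the scratch CNN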
# Key insights
print("\n" + "=" * 80)
print("KEY INSIGHTS")
print("=" * 80)

accuracy_diff = dinov2_test_acc - cnn_test_acc
time_ratio = cnn_time / dinov2_time

print(f"Accuracy difference: {accuracy_diff:.1f} percentage points")
print(f"Training time ratio: {time_ratio:.1f}x")
print(f"Convergence: DINOv2 in {dinov2_epochs} epochs vs CNN in {cnn_epochs} epochs")
print(f"Parameter efficiency: DINOv2 has {dinov2_params/cnn_params:.1f}x more parameters")

if dinov2_test_acc > cnn_test_acc:
    print(f"Winner (Accuracy): DINOv2 ({dinov2_test_acc:.1f}% vs {cnn_test_acc:.1f}%)")
else:
    print(f"Winner (Accuracy): CNN ({cnn_test_acc:.1f}% vs {dinov2_test_acc:.1f}%)")

if dinov2_time < cnn_time:
    print(f"Winner (Speed): DINOv2 ({dinov2_time:.0f}s vs {cnn_time:.0f}s)")
else:
    print(f"Winner (Speed): CNN ({cnn_time:.0f}s vs {dinov2_time:.0f}s)")
================================================================================
KEY INSIGHTS
================================================================================
Accuracy difference: -7.8 percentage points
Training time ratio: 0.4x
Convergence: DINOv2 in 8 epochs vs CNN in 15 epochs
Parameter efficiency: DINOv2 has 12.7x more parameters
Winner (Accuracy): CNN (91.4% vs 83.6%)
Winner (Speed): CNN (534s vs 1312s)
Conclusions
This comparison between DINOv2 fine-tuning and training a CNN from scratch demonstrates:
Key Findings:
Transfer Learning: Pre-trained models can provide strong baselines with fewer training epochs (8 vs 15 here)
Parameter Efficiency: More parameters don’t always guarantee better performance; the 12.7x larger DINOv2 model trailed the CNN on test accuracy (83.6% vs 91.4%)
Training Time: Self-supervised models may converge in fewer epochs thanks to good initialization, but each epoch can be far slower (164.0s vs 35.6s per epoch here)
Data Efficiency: Pre-trained features can work well with limited training data