Notes on machine learning, VLMs, and building things.
W' = W + BA, where B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), and r ≪ min(d, k)
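To make the size argument behind the low-rank update concrete, here is a minimal sketch that counts trainable parameters for updating W directly versus training only the factors B and A. The function name and the example sizes (d = k = 4096, r = 8) are illustrative assumptions, not values taken from the post.

```python
def lowrank_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full, low_rank) trainable-parameter counts for a d x k weight."""
    full = d * k            # updating every entry of W
    low_rank = r * (d + k)  # B is d x r, A is r x k
    return full, low_rank


# Illustrative sizes only (assumed, not from the post).
full, low_rank = lowrank_param_counts(d=4096, k=4096, r=8)
print(f"full: {full}, low-rank: {low_rank}, ratio: {full / low_rank:.0f}x")
```

Because the low-rank count grows as r·(d + k) rather than d·k, the saving widens as the layer gets larger for a fixed r.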
walk
shop
clean
Two-stage paper-grade thermal SR finetune on multi-dataset HR-LR pairs (SF-TL54 + ThermEval-D, 831/180/348 train/val/test). v1 (L1-only, 60 ep, 128-px patches): val PSNR 36.39 → 38.60, but the downstream DWPose nostril error gets WORSE (the classic perception-distortion trade-off). v2 (L1 + 0.05·LPIPS + EMA, 200 ep, 192-px patches): test LPIPS drops from 0.094 to 0.029 (3.2× better), AND the downstream nostril error drops to 1.18 px median — BETTER than zero-shot. v2 trades a tiny PSNR drop (38.68 vs v1’s 39.16) for genuinely better deployment-relevant metrics. This is the recipe for a PBVS TISR submission that targets the downstream task, not just the leaderboard PSNR.
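The v2 recipe combines an L1 term with a 0.05-weighted LPIPS term and keeps an exponential moving average (EMA) of the weights. A minimal sketch of those two pieces follows, using plain Python lists in place of tensors. The function names and the EMA decay of 0.999 are assumptions for illustration; only the 0.05 LPIPS weight comes from the description above.

```python
def combined_loss(l1: float, lpips: float, lpips_weight: float = 0.05) -> float:
    """Total loss = L1 + lpips_weight * LPIPS (0.05 is the weight quoted above)."""
    return l1 + lpips_weight * lpips


def ema_update(ema_weights, weights, decay: float = 0.999):
    """Return new EMA weights; `decay` is an assumed value, not from the post."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Toy usage: one loss evaluation and one EMA step on a 2-parameter "model".
print(combined_loss(l1=0.02, lpips=0.10))
print(ema_update([1.0, 2.0], [0.0, 4.0], decay=0.9))
```

In a real training loop the EMA copy would be updated after every optimiser step and used for validation and export, while the raw weights keep receiving gradients.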
Ai2 released OlmoEarth v1.1 on May 19. This post unpacks what changed and runs a tiny embedding + k-NN demo on synthetic water / forest / urban Sentinel-2 chips using the 3.5 M-param Nano model. Runs in under 5 seconds on a laptop.
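The embedding + k-NN idea can be sketched without any model at all: given one embedding vector per chip, classify a query by majority vote among its nearest neighbours under cosine similarity. The 3-dimensional vectors below are made-up placeholders standing in for real model embeddings, and the helper names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def knn_predict(query, bank, labels, k=3):
    """Majority label among the k bank vectors most similar to `query`."""
    ranked = sorted(range(len(bank)), key=lambda i: cosine(query, bank[i]), reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Placeholder embeddings (assumed), two per class.
bank = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1],
        [0.0, 1.0, 0.1], [0.1, 0.9, 0.0],
        [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]
labels = ["water", "water", "forest", "forest", "urban", "urban"]
print(knn_predict([0.95, 0.05, 0.05], bank, labels))
```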
I ran Gemini 3.1 Flash Image Preview (Nano Banana 2) AND Gemini 3 Pro Image Preview (Nano Banana Pro) through four prompting strategies for generating LWIR thermal face images, then asked Gemini-3.5 to judge each against a real thermal photograph. All 14 outputs (7 per Gemini model) make the same physical errors. The bigger Pro averages +0.9 judge points, but the ceiling stays at 7/10: pure scaling doesn’t fix the modality gap. The prior work in the literature (T-FAKE CVPR 2025, ThermalGen NeurIPS 2025, ThermVision-DB 2026) all uses finetuned generators with physics-aware losses. So I ran ThermalGen’s pretrained RGB→thermal model on the same image, and it produces visibly more correct thermal physics (canthi-correct eyes, cool hair, no glowing pupils) than any Gemini output. Surprising kicker: the Gemini-3.5 judge scores ThermalGen LOWER (3-4/10) than Gemini’s own outputs (7/10), because the judge is biased toward Gemini-style ‘fancy-thermal-looking’ images. Don’t use LLMs to judge thermal physics.
Six super-resolution methods tested on a 4× thermal upscaling task (116×87 → 464×348): bicubic, EDSR, MSRN, A2N, DRLN (all RGB-trained CNNs), and Stable Diffusion x4 upscaler (diffusion). Classical CNN-based SR cleanly beats both bicubic and the diffusion upscaler on every metric and on downstream nostril localisation. DRLN tops PSNR/LPIPS; A2N achieves 0.7 px nostril error on a 256-px-wide face. The diffusion upscaler is the most visually sharp but the LEAST accurate downstream — it hallucinates plausible-looking detail that breaks pixel alignment, dropping nose localisation accuracy 7× vs bicubic.
Sapiens2-0.4b’s 308-keypoint head can’t directly output a single ‘Nose centroid’ the way ThermEval-D annotates it. So I freeze the backbone, replace the head with a 410k-parameter 1-keypoint heatmap regressor, and train on 40 ThermEval crops. Test PCK@10 = 93%, mean error = 5.5 px on the 80-crop test split. Does not beat zero-shot DWPose (99% PCK@10, 2.7 px) on this dataset — but produces a predictor pinned to YOUR anatomical convention, distillable to a tiny inference model, and reproducible on data DWPose wasn’t trained for. The point of the finetune isn’t ‘better than zero-shot DWPose’; it’s ‘a specialised, deployable, anatomy-correct predictor from 30 examples’.
Instead of trying to make MediaPipe FaceMesh (RGB-trained) work on thermal crops, train a tiny YOLOv8n directly to detect ‘nostril’ as an object class on ThermEval-D thermal frames. With 60 training crops, the YOLO hits 88% detection rate and 1.8 px median accuracy on 60 held-out frames — beating the best MediaPipe pipeline (23%/7.5 px) by 4x on detection and 4x on accuracy. The right hierarchical pipeline for thermal isn’t ‘face detector + RGB-trained FaceMesh’; it’s ‘face crop + a dedicated thermal-trained nostril detector’.
RGB has skin texture; thermal does not. A small dark blob inside a face on thermal looks like every other small dark blob — anywhere on the face, anywhere in the room. This post unpacks why MediaPipe and hierarchical pipelines fail on the ThermEval-D bake-off (per parts 1 and 3) and proposes four research-grade fixes: anatomical priors as a regulariser, paired-RGB label projection, T-FAKE synthetic pretraining, and using the breath-rate signal itself as self-supervision.
Off-the-shelf wholebody pose / face mesh models benchmarked on the ThermEval-D real-world thermal dataset (192x256, indoor multi-person scenes). On the smallest-face thermal regime that actually matches real deployment, DWPose wins decisively: 100% detection rate, 99% PCK@10px. Sapiens2-0.4b is accurate (2 px median on the noses it finds) but its single-person inference misses 32% of multi-person scenes. MediaPipe FaceMesh detects only 14% of GT noses because the 20-25 px faces in ThermEval are below BlazeFace short-range’s working scale. The pipeline lesson is that for thermal deployment you need a model that includes its own multi-person face detector, not just a strong landmark head.
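PCK@10px, the headline metric here, is the fraction of ground-truth keypoints whose prediction lands within 10 pixels. A minimal sketch is below. It assumes a missed detection (represented as `None`) counts as a failure, which may differ from how the post scores misses; the function name and the sample coordinates are made up for illustration.

```python
import math


def pck(preds, gts, threshold_px=10.0):
    """Fraction of ground-truth points with a prediction within threshold_px.

    `preds[i]` may be None for a missed detection; this sketch counts that
    as incorrect (an assumption, not necessarily the post's convention).
    """
    hits = sum(
        1 for p, g in zip(preds, gts)
        if p is not None and math.dist(p, g) <= threshold_px
    )
    return hits / len(gts)


# Made-up nose predictions vs ground truth, in pixels.
preds = [(10, 10), (30, 30), None, (50, 62)]
gts = [(12, 11), (30, 45), (5, 5), (50, 55)]
print(pck(preds, gts))
```

Reporting detection rate separately from PCK, as the description does, keeps "did not find the face" apart from "found it but localised poorly".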
Google announced Gemini 3.5 Flash and Gemini Omni at I/O 2026. I run 3.5 Flash against my April 3.1 baseline on the same Gemini API: thinking levels, tokens-per-second, a 500K-token needle test, agentic tool use, and a coding micro-benchmark. Omni is not yet on the developer API, so the second half is notes from the announcement rather than measurements.
A small translator that lowers Vega-Lite JSON specs to readable Matplotlib code. v2 reworks it as a visitor class and adds aggregations, log scales, size encoding, area marks, layered charts, faceting, and filter transforms.
Companion to the BharatEO MAE post: a pedagogical RS-CLIP-style model trained on the same ESA WorldCover ROIs. Tiny image and text encoders, InfoNCE loss, zero-shot land cover classification, image-text retrieval.
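For readers new to the InfoNCE loss mentioned above, here is a small pure-Python sketch of the symmetric image-text form, where image i is paired with text i in the batch. It assumes the embeddings are already L2-normalised and uses a temperature of 0.07 as a placeholder; neither detail is taken from the post.

```python
import math


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: mean of image-to-text and text-to-image cross-entropy."""
    n = len(image_embs)
    # Similarity logits: rows index images, columns index texts.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature
               for txt in text_embs] for img in image_embs]

    def mean_ce(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a stable log-sum-exp
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[i]  # the correct match sits on the diagonal
        return total / n

    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (mean_ce(logits) + mean_ce(columns))


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

The loss is near zero when each image is most similar to its own caption and large when the pairs are swapped, which is the behaviour zero-shot classification and retrieval both rely on.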
A pedagogical BharatEO-v0 experiment on real ESA WorldCover 2021 v200 pixels: 3-channel RGB MAE with L1 reconstruction, ESA patch classification, multi-ROI sampling, more chips, more epochs.
GitHub Pages was painfully slow on a campus network even without Tailscale. The root cause was partial reachability to GitHub Pages IPs, and the clean fix was split DNS with a local forwarder.
Two days after Meta’s Sapiens2 release, I run the 0.4B normal, segmentation, and 308-keypoint pose heads on an M2 Max via PyTorch MPS, then build six health-leaning downstream pipelines: joint-angle readout, body symmetry, tele-dermatology ROI extraction, foreground-only relighting, hair recolour, and a 12-frame gait analysis on video.
mx.compile
mx.fast.*
An educational walkthrough of MLX’s three kernel-fusion paths — graph compile, the mx.fast.* fused primitives, and custom Metal kernels — with measured speedups for training and inference on an M2 Max.
Google’s new Gemini 3.1 Flash TTS adds natural-language ‘audio tags’ like [excitement] or [like dracula] that steer voice style inline. I run it in English and Hindi and embed the results.
Hi-res ESRI imagery + Gemma 4 in the PR #926 agent loop on three brick-kiln classes. Twelve tiles, four Falcon queries each, three prompt variants — with per-tile mask grids and Gemma’s reasoning shown. 67% zero-shot, and the remaining 33% is instructive.
Two worked examples — a still photo and a drone video — running Falcon Perception and Gemma 4 locally on an M2 Max. Counts you can audit, from an agent that never does arithmetic in its head.
Gemma 4 sounds confident. Falcon Perception actually measures. Together they beat either alone — here is the evidence.
A hands-on guide to automating web browsing for AI agents using agent-browser CLI
Learn how to build MCP servers with FastMCP - enabling LLMs like Claude to interact with your custom tools, databases, and APIs.
Configure Brave browser to hide personal browsing history during classroom demos and presentations
A comprehensive terminal optimization covering security fixes, modern tooling, and performance improvements
Learn how to build neural networks that gracefully handle missing input channels by explicitly encoding missingness patterns
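One common way to encode missingness explicitly is to zero-fill the absent channels and append a binary mask so the network can tell a true zero from a missing value. The sketch below shows that idea on a flat list of channel values; it is an assumed encoding for illustration, and the post's actual scheme may differ.

```python
def encode_with_missingness(channels):
    """Return zero-filled values followed by a 1/0 presence mask.

    `channels` is a list of floats, with None marking a missing channel.
    """
    values = [0.0 if c is None else float(c) for c in channels]
    mask = [0.0 if c is None else 1.0 for c in channels]
    return values + mask


# Channel 2 is missing; channel 3 is present and genuinely zero.
print(encode_with_missingness([0.5, None, 0.0]))
```

Without the mask, the second and third channels above would be indistinguishable to the model.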
Understanding vector databases through text and image search
Learn how to call REST APIs from Google Sheets using a custom Apps Script function.
Comparing DINOv2 fine-tuning vs training CNN from scratch for binary image classification
Visualizing ERA5 data grid coverage over India
Exploring how temperature scaling affects the randomness and diversity of language model outputs through mathematical analysis and interactive visualizations
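As a quick illustration of the effect, the sketch below divides logits by a temperature before the softmax and reports the entropy of the result. The example logits are arbitrary placeholders.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.1]  # arbitrary example values
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

Lower temperatures concentrate probability on the top logit (lower entropy), and higher temperatures flatten the distribution toward uniform.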
Visualizing and filtering multi-image classification samples from the VOC dataset for improved training data quality
Demonstrating how logit masking enforces valid transitions in sleep stage classification
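Logit masking can be sketched in a few lines: set the logits of disallowed next stages to negative infinity before the softmax, so they receive exactly zero probability. The allowed-transition table below is a hypothetical example chosen for illustration, not the rule set used in the post.

```python
import math

STAGES = ["W", "N1", "N2", "N3", "REM"]

# Hypothetical allowed transitions (previous stage -> permitted next stages).
ALLOWED = {
    "W": {"W", "N1"},
    "N1": {"W", "N1", "N2"},
    "N2": {"W", "N1", "N2", "N3", "REM"},
    "N3": {"W", "N2", "N3"},
    "REM": {"W", "N1", "N2", "REM"},
}


def masked_probs(logits, prev_stage):
    """Softmax over logits with invalid transitions masked to -inf."""
    masked = [z if s in ALLOWED[prev_stage] else -math.inf
              for s, z in zip(STAGES, logits)]
    m = max(masked)
    exps = [math.exp(z - m) for z in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# The raw logits favour N3, but from Wake only W and N1 are permitted here.
probs = masked_probs([0.1, 0.2, 0.3, 5.0, 0.4], prev_stage="W")
print(dict(zip(STAGES, [round(p, 3) for p in probs])))
```

Masking before the softmax, rather than zeroing probabilities afterwards, keeps the remaining probabilities normalised without an extra renormalisation step.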
Setting up Python Environment on Linux Remote Servers with GPU Support
Tired of waiting 5+ minutes for Quarto to rebuild all 145 posts when you only changed one? Here’s how to make GitHub Actions only rebuild what actually changed.
Transferring large projects with thousands of small files over SSH can be painfully slow. Here’s how we solved it with parallel transfers.
Keyboard shortcuts on Mac
from ultralytics import YOLO, checks, hub import pandas as pd
import torch import torch.nn as nn import torch.optim as optim from torch.utils.data import DataLoader, TensorDataset import numpy as np import matplotlib.pyplot as plt
import numpy as np import matplotlib.pyplot as plt %matplotlib inline %config InlineBackend.figure_format = 'retina' import torch import torch.nn as nn import torch.nn…
import numpy as np import matplotlib.pyplot as plt %matplotlib inline %config InlineBackend.figure_format = 'retina'
import numpy as np import matplotlib.pyplot as plt %matplotlib inline import torch import torch.nn as nn import torch.nn.functional as F %config…
import torch import torch.nn as nn import matplotlib.pyplot as plt import numpy as np # Retina mode %config InlineBackend.figure_format = 'retina'
# Create…
import tiktoken
encoding = tiktoken.get_encoding("cl100k_base")
encoding.encode("Hello World! This is a simple notebook")
[9906, 4435, 0, 1115…
import numpy as np import time import matplotlib.pyplot as plt import pandas as pd # Retina display %config InlineBackend.figure_format = 'retina'
log_size …
import matplotlib.pyplot as plt import torch %matplotlib inline %config InlineBackend.figure_format='retina'
# Download some MNIST to demonstrate…
import networkx as nx import numpy as np import matplotlib.pyplot as plt import pandas as pd %matplotlib inline # Retina display %config InlineBackend.figure_format =…
import numpy as np import matplotlib.pyplot as plt import torch import seaborn as sns import pandas as pd dist =torch.distributions sns.reset_defaults() sns.set_co…
import torch import torch.nn as nn import torch.nn.functional as F import torch.optim as optim import numpy as np import matplotlib.pyplot as plt from torch.utils.da…
from jax import vmap, jit, grad, vmap import jax.numpy as jnp # Enable 64-bit mode from jax.config import config config.update("jax_enable_x64", True) import matplotl…
import jax.numpy as jnp import jax from jax import random import tensorflow_probability.substrates.jax as tfp tfd = tfp.distributions import pandas as pd import matpl…
Some useful tidbits in sympy
A programming introduction to Autoencoders in JAX
Probability Calibration
Multi-output Gaussian Process
import numpy as np import matplotlib.pyplot as plt import torch import seaborn as sns from functools import partial sns.reset_defaults() sns.set_context(context="ta…
import numpy as np import matplotlib.pyplot as plt import torch import seaborn as sns import pandas as pd import pyro dist =pyro.distributions sns.reset_defaults() …
import numpy as np import matplotlib.pyplot as plt import torch import seaborn as sns import pandas as pd t_dist =torch.distributions sns.reset_defaults() sns.set_…
import torch dist = torch.distributions import matplotlib.pyplot as plt import seaborn as sns import numpy as np %matplotlib inline
import torch from jax import grad import jax.numpy as jnp
learn
How to learn the parameters of a GP
using Plots theme(:default) using LinearAlgebra using LaTeXStrings
Blurring an image selectively using Affinity Photo
Audio filtering techniques and applications
Running Python scripts on server over ssh and getting back content
Some of my shortcuts on the iPad
My iPad computing setup
My Mac Setup
Implementation and visualization of Generative Adversarial Networks
Using GPy and some interactive visualisations for understanding GPR and applying on a real world data set
From the ground up!
A programming introduction to Active Learning with Bayesian Linear Regression.
A programming introduction to NNs.
Simple scripts for downloading weather data
A programming introduction to Bayesian Linear Regression.
A minimal example of using markdown with fastpages.
An interactive exploration of Gaussian processes.
HashMaps for programming interviews
How is the world changing over the years!
AQ sensing in India
A programming introduction to query by committee strategy for active learning
Denoising
Some personal reflections..
Neural networks to learn the embeddings, and how to combine them!
Adagrad optimizer for matrix factorisation
What if we start from some prior!
Exploring data in Matplotlib
Constrained NMF using CVXPY!
Out of tensor factorisation
What if we want to predict entries not within the matrix?!
Towards amazing plots in research papers!
Maximize based on what you know, re-estimate!
Simulating a continuous HMM
---
title: "Writing"
subtitle: "Notes on machine learning, VLMs, and building things."
listing:
  contents: posts
  sort: "date desc"
  type: default
  categories: true
  sort-ui: false
  filter-ui: false
  fields: [image, date, title, description, categories, author, reading-time]
page-layout: full
---