Train a YOLO for Nostril Detection on Thermal — Hierarchical Pipelines That Actually Work

Instead of trying to make MediaPipe FaceMesh (RGB-trained) work on thermal crops, train a tiny YOLOv8n directly to detect ‘nostril’ as an object class on ThermEval-D thermal frames. With 60 training crops, the YOLO hits 88% detection rate and 1.8 px median accuracy on 60 held-out frames — beating the best MediaPipe pipeline (23%/7.5 px) by 4x on detection and 4x on accuracy. The right hierarchical pipeline for thermal isn’t ‘face detector + RGB-trained FaceMesh’; it’s ‘face crop + a dedicated thermal-trained nostril detector’.
Tags: computer-vision, keypoint-detection, YOLO, MediaPipe, BlazeFace, hierarchical-detection, two-stage, thermal-imaging, ThermEval

Author: Nipun Batra
Published: May 20, 2026
Modified: May 21, 2026

The previous version of this post tried the textbook hierarchical pipeline for small-landmark detection — face detector → crop → MediaPipe FaceMesh → nostril keypoint — and showed it barely improves over single-stage MediaPipe on ThermEval-D thermal scenes. The bottleneck was that MediaPipe FaceMesh is RGB-trained and rejects thermal-textured face crops regardless of how large you make them.

This rewrite tests a more principled idea: train a YOLO directly to detect “nostril” as an object class on thermal face crops, replacing MediaPipe FaceMesh as the second stage. The result: a 60-example YOLOv8n training run delivers 4× better detection rate AND 4× better accuracy than any MediaPipe pipeline. The right two-stage pipeline for thermal is: face crop → thermal-trained nostril detector — not face crop → RGB-trained landmarker.

Code: posts/nostril-hierarchical/scripts/build_yolo_dataset.py, hier_yolo_thermeval.py, make_yolo_charts.py.

The five pipelines

To isolate what hierarchical adds vs what swapping the second stage adds, I compare all combinations:

(A) Single-stage MP        frame --> MediaPipe FaceMesh (built-in BlazeFace + mesh)
                                                  --> nostril keypoint

(B) Hierarchical Blaze+MP  frame --> BlazeFace full-range --> face bbox + 25% pad
                                                  --> crop 256x256
                                                  --> MediaPipe FaceMesh on crop
                                                  --> nostril keypoint

(C) Hierarchical Blaze+YOLO frame --> BlazeFace full-range --> face bbox
                                                  --> crop 256x256
                                                  --> YOLO-nostril on crop  ← swap-in here
                                                  --> nostril bbox center

(D) Hierarchical GT+YOLO   frame --> GT Person bbox       ← upper-bound face localiser
                                                  --> crop 256x256
                                                  --> YOLO-nostril on crop

(E) Raw YOLO no crop       frame --> YOLO-nostril directly on the full 192x256 frame
                                                  --> nostril bbox center

Pipelines C, D, and E use a YOLOv8n trained on 60 ThermEval crops with (face_crop_256x256, nostril_bbox) pairs — a one-class detector (names: ["nostril"]). Training time on a single RTX A5000 GPU: ~1 minute.

How I built the YOLO training set

For each ThermEval-D frame where a Person polygon contains a Nose polygon, I crop the Person bbox + 25% padding, resize to 256×256, and write a YOLO label file: 0 cx_norm cy_norm w_norm h_norm. Splits: 60 train / 20 val / 60 test (image-disjoint).
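The crop-and-label step can be sketched in a few lines. This is a minimal illustration, not the actual build_yolo_dataset.py: the function names are hypothetical, boxes are assumed to be (x_min, y_min, x_max, y_max) in frame pixels, and "25% padding" is assumed to mean 25% of the bbox size added on each side. Because YOLO labels are normalised to the crop, the later resize to 256×256 does not change them.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def padded_crop_box(person: Box, frame_w: int, frame_h: int, pad: float = 0.25) -> Box:
    """Expand the Person bbox by `pad` of its size on each side, clipped to the frame."""
    x0, y0, x1, y1 = person
    dx, dy = pad * (x1 - x0), pad * (y1 - y0)
    return (max(0.0, x0 - dx), max(0.0, y0 - dy),
            min(float(frame_w), x1 + dx), min(float(frame_h), y1 + dy))


def yolo_label(nose: Box, crop: Box, class_id: int = 0) -> str:
    """Express the nose bbox relative to the crop as 'cls cx cy w h', normalised to [0, 1]."""
    cx0, cy0, cx1, cy1 = crop
    crop_w, crop_h = cx1 - cx0, cy1 - cy0
    nx0, ny0, nx1, ny1 = nose
    cx = ((nx0 + nx1) / 2 - cx0) / crop_w
    cy = ((ny0 + ny1) / 2 - cy0) / crop_h
    w = (nx1 - nx0) / crop_w
    h = (ny1 - ny0) / crop_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Hypothetical boxes, for illustration only (frame size is an assumption too).
crop = padded_crop_box((100, 40, 140, 120), frame_w=256, frame_h=192)
print(crop)                                   # (90.0, 20.0, 150.0, 140.0)
print(yolo_label((116, 76, 124, 84), crop))   # 0 0.500000 0.500000 0.133333 0.066667
```

Clipping the padded box to the frame keeps the crop valid for subjects near the image border, at the cost of a slightly non-uniform aspect ratio after the resize.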

The training run, with ultralytics defaults apart from the flags shown in the command, takes ~1 minute and produces a 6.2 MB model:

yolo train data=/path/to/nostril-yolo/data.yaml \
          model=yolov8n.pt \
          epochs=80 imgsz=256 batch=16 device=0

That’s it. No frozen backbones, no custom heads, no clever tricks. Off-the-shelf ultralytics with the right data.
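For reference, the data.yaml that the command points at is a short dataset config. The sketch below generates one for a single-class detector. The images/train, images/val, and images/test subfolder names are an assumption based on the usual ultralytics layout (with label files in parallel labels/ folders), not something taken from the scripts for this post, and the helper name is hypothetical.

```python
def data_yaml_text(dataset_root: str) -> str:
    """Build the text of a one-class ultralytics dataset config."""
    return (
        f"path: {dataset_root}\n"      # dataset root directory
        "train: images/train\n"        # assumed subfolder names, relative to path
        "val: images/val\n"
        "test: images/test\n"
        "names:\n"
        "  0: nostril\n"               # the single object class
    )


text = data_yaml_text("/path/to/nostril-yolo")
print(text)
# To write it out: pathlib.Path("data.yaml").write_text(text)
```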

Headline result

60 ThermEval-D test frames, 60 ground-truth nose centroids. For each predicted nostril (centre of the YOLO bbox, or the average of the alae landmarks for MediaPipe), I match greedily to GT (matches capped at 80 px), then compute strict PCK@k over all GT: a GT nose with no matched prediction counts as a miss rather than being dropped from the denominator.
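A minimal sketch of this evaluation protocol follows. It assumes "greedy" means taking the globally closest prediction/GT pairs first, which is one reasonable reading; the actual script may order matches differently (for example by confidence). All names and sample points are hypothetical.

```python
import math
from statistics import median
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def greedy_match(preds: List[Point], gts: List[Point],
                 max_dist: float = 80.0) -> List[Optional[float]]:
    """For each GT point, return the distance to its matched prediction, or None."""
    pairs = sorted(
        (math.dist(p, g), pi, gi)
        for pi, p in enumerate(preds)
        for gi, g in enumerate(gts)
    )
    used_preds = set()
    matched: List[Optional[float]] = [None] * len(gts)
    for d, pi, gi in pairs:
        if d > max_dist:          # pairs are sorted, so every later pair is too far as well
            break
        if pi in used_preds or matched[gi] is not None:
            continue              # each prediction and each GT is used at most once
        used_preds.add(pi)
        matched[gi] = d
    return matched


def strict_pck(matched: List[Optional[float]], k: float) -> float:
    """Fraction of ALL GT points with a match within k px (unmatched GT count as misses)."""
    if not matched:
        return 0.0
    return sum(1 for d in matched if d is not None and d <= k) / len(matched)


# Hypothetical points, for illustration only.
gts = [(10, 10), (100, 100), (200, 50)]
preds = [(11, 10), (100, 112), (30, 180)]
matched = greedy_match(preds, gts)
print(matched)                                                # [1.0, 12.0, None]
print(sum(d is not None for d in matched) / len(gts))         # detection rate over all GT
print(strict_pck(matched, 10.0))                              # strict PCK@10px
print(median(d for d in matched if d is not None))            # 6.5, matched predictions only
```

Note the asymmetry: detection rate and PCK are divided by all GT, while the median error is computed on matched predictions only, so a pipeline with few detections can still show a small median error.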

Detection rate over all 60 GT noses. YOLO-based pipelines (D, E) hit 88%; the best MediaPipe pipeline (A) gets 23%. That’s a 4× gap, on the same data, with no other change.

Strict PCK@10px (over all GT, missed = 0). YOLO + GT-bbox hierarchical (D) lands at 85%, the practical ceiling on this dataset given annotation noise. Raw-YOLO without cropping (E) is at 78%. Both crush the MediaPipe pipelines.

Median nostril error in pixels (on matched predictions only). The YOLO pipelines hit 1.8–2.5 px — better than the Sapiens2 finetune in part 2 (5.0 px) and on par with DWPose zero-shot (part 1, 2.7 px).

The clean two-axis view:

ThermEval-D 5-pipeline comparison: upper-right = best (high detection + low error). The YOLO-on-thermal pipelines (D and E) sit in the upper-right corner; everything MediaPipe is in the lower-left.

What the cropping actually buys you

Compare pipelines D (hierarchical YOLO with GT face bbox) vs E (raw YOLO on full frame, no crop):

Pipeline                   Detection rate   Median err   Precision   Time
(D) Hierarchical GT+YOLO   88%              1.8 px       79%         19 ms
(E) Raw YOLO no crop       88%              2.5 px       51%         16 ms

Same recall (88% in both: every actually-detectable nostril is found by both), but the hierarchical version has far fewer false positives; at equal recall, 79% vs 51% precision works out to roughly a quarter as many. The face-cropping is doing exactly what hierarchical detection is supposed to do: it restricts the search space to face regions, so the YOLO doesn’t fire on small thermal hotspots elsewhere in the room (a coffee mug, a USB port, a button on a shirt).

Precision — fraction of predicted bboxes that hit a real nostril. Hierarchical (D) keeps 79% of predictions valid; raw YOLO (E) drops to 51% because half its predictions are background hotspots that vaguely resemble nostril shape on thermal.
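To see what these precision figures mean in absolute terms, precision = TP / (TP + FP) can be solved for FP. The sketch below assumes each matched GT nose corresponds to exactly one true-positive prediction, so both pipelines have the same TP count (88% of 60, rounded); the resulting counts are back-of-the-envelope estimates, not numbers reported by the evaluation script.

```python
def false_positives(true_positives: int, precision: float) -> float:
    """Solve precision = TP / (TP + FP) for FP."""
    return true_positives * (1.0 - precision) / precision


n_gt = 60
tp = round(0.88 * n_gt)               # 88% detection rate -> 53 matched nostrils (assumed TP)
fp_hier = false_positives(tp, 0.79)   # (D) hierarchical GT+YOLO
fp_raw = false_positives(tp, 0.51)    # (E) raw YOLO, no crop

print(f"(D) ~{fp_hier:.0f} false positives, (E) ~{fp_raw:.0f} false positives")
print(f"ratio D/E: {fp_hier / fp_raw:.2f}")
```

Under these assumptions the estimate is about 14 false positives for (D) against about 51 for (E).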

So the right reading of the YOLO/MediaPipe comparison is two separate effects:

  1. Swap MediaPipe for thermal-trained YOLO: detection rate goes from 10-23% to 88%, median error from 5-7 px to 2-3 px. This is the modality-matched-model effect — the dominant one.
  2. Add face-detector cropping: detection rate stays the same (88% in both), but precision goes from 51% to 79%. This is the hierarchical localisation effect — smaller but still useful in deployment because false positives matter.

What it looks like — two representative frames

Each “frame” panel shows the same input through all five pipelines (single MP, hier-Blaze MP, hier-Blaze YOLO, hier-GT YOLO, raw YOLO). Red crosses are GT nose centroids; coloured dots are predictions.

Two ThermEval-D frames × five pipelines. Top half (single subject, well-lit thermal): only (D) and (E) — the YOLO-nostril pipelines — find the nostril. (A) and (B) miss it entirely. Bottom half is similar. The two right-most columns are the empty (E) panels in this layout.

The dot colours per pipeline: (A) green, (B) orange, (C) yellow, (D) magenta, (E) white.

When does the face-cropping stage really matter?

The current ThermEval-D data has a fairly constrained background: indoor scenes with a relatively uniform thermal environment. With more cluttered backgrounds (warm radiators, lamps, electronics), I’d expect the raw-YOLO false-positive rate to climb sharply, since the model could fire on any small warm blob in the scene.

Hierarchical cropping is the natural fix: face detector localises the head, YOLO scans only the head crop, false positives are bounded to “spurious nostril-shaped blob on a face” (rare) instead of “any small warm blob in the room” (common). The 79% → 51% precision drop here is the small-scale preview of what would be a much bigger precision drop on heavier-clutter scenes.
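One practical detail of any crop-based second stage is that the detector's output lives in crop coordinates and has to be mapped back to the full frame before it can be compared with GT. A minimal sketch, assuming the crop box is (x_min, y_min, x_max, y_max) in frame pixels and the crop was resized to 256×256 without letterboxing; function names and sample values are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def bbox_center(box: Box) -> Tuple[float, float]:
    """Centre of an (x_min, y_min, x_max, y_max) box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def crop_to_frame(pt_crop: Tuple[float, float], crop_box: Box,
                  crop_size: int = 256) -> Tuple[float, float]:
    """Map a point from the resized crop back to full-frame pixel coordinates."""
    x0, y0, x1, y1 = crop_box
    sx = (x1 - x0) / crop_size   # frame pixels per crop pixel, horizontally
    sy = (y1 - y0) / crop_size   # and vertically (the resize may be anisotropic)
    return (x0 + pt_crop[0] * sx, y0 + pt_crop[1] * sy)


# Hypothetical detection at the middle of the crop.
nostril_in_crop = bbox_center((120, 124, 136, 132))
print(crop_to_frame(nostril_in_crop, (90, 20, 150, 140)))   # (120.0, 80.0)
```

Separate x and y scale factors matter here because a non-square face crop stretched to a square is scaled differently along each axis.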

What about the upstream face detector?

Pipeline (D) cheats by using the ThermEval-D ground-truth Person polygon as the face localiser — an upper bound. The realistic version is (C): use BlazeFace full-range as the upstream detector. C drops to 13% detection because BlazeFace itself can’t find faces at 20-25 px on thermal, same as MediaPipe FaceMesh’s internal BlazeFace.

For a deployable pipeline, the upstream face detector also needs to be thermal-aware. Two options:

  1. DWPose (part 1 winner). Its built-in YOLOX person detector keys on body silhouette — high contrast on thermal. Get a Person bbox → crop → YOLO-nostril. This is the strongest end-to-end pipeline.
  2. Train a second YOLO for face/head detection on thermal. Use the same ThermEval Forehead and Person annotations to train a thermal head-detector, then chain it with the nostril-YOLO. ~1 minute per training run; ~30 labels each.

Speed

Per-frame inference time on a single RTX A5000. Raw YOLO (E) is the fastest at 16 ms (single forward pass on the 192×256 frame). Hierarchical pipelines add the face detector cost (10-15 ms). Single-stage MediaPipe is 25 ms; hier-Blaze-MP is 34 ms because it runs both BlazeFace and FaceMesh.

All five run in under 35 ms per frame, i.e. about 29 fps or better on the RTX A5000, which is at or near video rate.
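Converting per-frame latency to throughput is just 1000 / ms. A quick sketch using the timings quoted in this post (pipeline C is left out because its latency is not stated); it assumes frames are processed one at a time with no batching or pipelining.

```python
# Per-frame latencies in milliseconds, as quoted in the text above.
latency_ms = {
    "(A) single-stage MP": 25,
    "(B) hier Blaze+MP": 34,
    "(D) hier GT+YOLO": 19,
    "(E) raw YOLO": 16,
}

# Sequential throughput: one frame at a time.
fps = {name: 1000.0 / ms for name, ms in latency_ms.items()}

for name, value in fps.items():
    print(f"{name}: {value:.1f} fps")
```

The slowest pipeline (B, 34 ms) comes out at about 29.4 fps and the fastest (E, 16 ms) at 62.5 fps.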

Caveats

  • The training set is small (60 crops). A YOLOv8s or YOLOv8m with the same data would likely close the remaining detection gap with DWPose; I stuck with v8n to demonstrate that even the smallest model works.
  • The validation set is even smaller (20 crops). The 58.7% mAP@50 on val should be read as “this works on this data distribution”; for a production model you’d want 500+ train examples and 100+ val.
  • Single-class detector (just “nostril”). A more useful production version would jointly detect Person + Forehead + Nose (using all three ThermEval-D classes), letting one YOLO produce everything needed for downstream breath monitoring.
  • n=60 test frames. Same protocol as parts 1-2.
  • GT bbox in (D) is from the polygon, not a learned detector. Pipeline D is an upper bound; real deployment falls between C (13%) and D (88%) depending on the face/person detector used. With DWPose as the upstream detector, you’d be much closer to D.

What’s next

  • Part 4 discusses the still-unsolved problems (multi-camera, video temporal smoothness, occlusion under bedding) — the parts that even the YOLO-nostril pipeline doesn’t address.
  • Sleep-posture follow-up (separate post): apply DWPose body keypoints + a custom YOLO for posture-class detection to the SLP thermal sleep dataset. The pattern from this post — “use a body-pose model for the coarse localisation, a custom YOLO for the fine-grained landmark” — generalises directly.