Why VLMs Lie About Numbers (and How to Fix It With Grounding)

Gemma 4 sounds confident. Falcon Perception actually measures. Together they beat either alone — here is the evidence.
LLM
Gemma-4
Falcon-Perception
MLX
local-inference
grounded-reasoning
multimodal
Author

Nipun Batra

Published

April 5, 2026

Ask a VLM “which side has more people?” and it will give you a confident answer. Often wrong.

The problem is not intelligence — it is that counting, measuring, and thresholding are not language tasks. A VLM that says “the bus fills almost the entire frame” has no idea whether that means 94% or 99.7%.

This post demonstrates a simple fix inspired by Yasser Dahou’s demo:

  1. Detect with Falcon Perception — get masks, boxes, coordinates
  2. Measure with Python — counts, widths, margins, areas (plain arithmetic)
  3. Reason with Gemma 4 — answer using only those measured facts

We run seven tasks head-to-head: raw Gemma vs. the grounded pipeline. The results are not close.

The Problem: VLMs Are Confidently Wrong About Numbers

The problem with VLMs and numbers

VLMs process images as visual impressions — they see a dense cluster of people and conclude that side “has more.” But counting, measuring distances, and checking thresholds are not visual tasks. They require enumeration and arithmetic, not pattern recognition. The result: answers that sound right but are based on heuristic guessing, not evidence.
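To make the distinction concrete, here is a toy example with made-up numbers: seven people in a 1000-pixel-wide frame, a tight cluster of three on the left and four scattered individuals on the right. A visual impression favors the cluster; enumeration does not.

```python
# Hypothetical centroid x-coordinates for seven detected people in a
# 1000-px-wide frame: a dense cluster of three on the left, four
# scattered individuals on the right (all values invented).
centroids_x = [120, 140, 155, 620, 700, 810, 940]

split_x = 1000 / 2
left = sum(x < split_x for x in centroids_x)
right = sum(x >= split_x for x in centroids_x)
print(left, right)  # 3 4 -- the visually sparse side actually has more people
```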

The Fix: Perceive, Measure, Then Reason

Grounded reasoning pipeline

The fix splits the job into three components that play to each model’s strengths:

| Stage | Model | What it does | Why it's needed |
| --- | --- | --- | --- |
| Perceive | Falcon Perception | Detects objects → bounding boxes, masks, coordinates | Converts pixels into structured geometry |
| Measure | Python | Counts, computes widths, margins, areas, ratios | Arithmetic on detections — no hallucination possible |
| Reason | Gemma 4 | Reads measurements → produces grounded verdict | Free to focus on logic and explanation, not visual estimation |

The key insight: the final answer is auditable. Every claim traces back to a detection and a measurement. If the answer is wrong, you can see exactly where in the pipeline it went wrong — bad detection? bad measurement logic? bad reasoning? With a raw VLM, the answer is a black box.
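As a sketch of what "auditable" means here, consider a hypothetical bus detection in a 1080-px-wide portrait photo (all numbers invented for illustration). Every field in the measured-facts dict traces to either the detection or a single arithmetic step:

```python
# Hypothetical detection: bounding box in pixel coordinates (made-up numbers).
detection = {"label": "bus", "xyxy": [2.0, 400.0, 1075.0, 1500.0]}
image_w = 1080

width_fraction = (detection["xyxy"][2] - detection["xyxy"][0]) / image_w
facts = {
    "analysis": "width_threshold",
    "width_fraction": round(width_fraction, 4),  # 0.9935
    "threshold": 0.90,
    "passes": width_fraction >= 0.90,
}
# If "passes" looks wrong, the audit trail is short: check xyxy (detection),
# then width_fraction (measurement), then the reasoner's reading of "facts".
print(facts)
```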

Setup

uv pip install -U mlx-vlm supervision pillow matplotlib numpy
Code
import os, json, re
from pathlib import Path

os.environ.setdefault("MPLCONFIGDIR", "/tmp/matplotlib")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import patches
from PIL import Image
import supervision as sv
from IPython.display import display

%config InlineBackend.figure_format = 'retina'
Code
def resolve_image_path(name):
    candidates = [Path(name), Path("posts") / name]
    for path in candidates:
        if path.exists():
            return path
    raise FileNotFoundError(name)


def load_image_local(name):
    return Image.open(resolve_image_path(name)).convert("RGB")


def xyxy_from_center_size(x, y, w, h, image_w, image_h):
    cx, cy = x * image_w, y * image_h
    bw, bh = w * image_w, h * image_h
    return [cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2]


def mask_to_measurements(mask):
    mask = np.asarray(mask).astype(bool)
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        raise ValueError("Empty mask")
    return {
        "xyxy": [float(xs.min()), float(ys.min()), float(xs.max()), float(ys.max())],
        "centroid": [float(xs.mean()), float(ys.mean())],
        "area_px": int(mask.sum()),
    }


def falcon_detection_to_object(det, label, image_size):
    width, height = image_size
    if "mask" in det and det["mask"] is not None:
        base = mask_to_measurements(det["mask"])
    else:
        xyxy = xyxy_from_center_size(
            det["xy"]["x"], det["xy"]["y"], det["hw"]["w"], det["hw"]["h"], width, height
        )
        base = {
            "xyxy": xyxy,
            "centroid": [float((xyxy[0] + xyxy[2]) / 2), float((xyxy[1] + xyxy[3]) / 2)],
            "area_px": int(max(0.0, xyxy[2] - xyxy[0]) * max(0.0, xyxy[3] - xyxy[1])),
        }
    return {**base, "label": label}


def choose_primary(objects, label):
    matches = [obj for obj in objects if obj["label"] == label]
    if not matches:
        raise ValueError(f"No objects found for {label!r}")
    return max(matches, key=lambda obj: obj["area_px"])


def objects_to_detections(objects):
    if not objects:
        return sv.Detections(
            xyxy=np.zeros((0, 4), dtype=np.float32),
            class_id=np.array([], dtype=int),
            data={"class_name": []},
        )

    label_to_id = {label: idx for idx, label in enumerate(sorted({obj["label"] for obj in objects}))}
    xyxy = np.array([obj["xyxy"] for obj in objects], dtype=np.float32)
    class_id = np.array([label_to_id[obj["label"]] for obj in objects], dtype=int)
    class_name = [obj["label"] for obj in objects]
    return sv.Detections(xyxy=xyxy, class_id=class_id, data={"class_name": class_name})


BOX_ANNOTATOR = sv.BoxAnnotator(thickness=2, color_lookup=sv.ColorLookup.CLASS)
LABEL_ANNOTATOR = sv.LabelAnnotator(
    text_scale=0.45,
    text_padding=4,
    color_lookup=sv.ColorLookup.CLASS,
)


def render_annotated_scene(image, objects, show_labels=True):
    scene = np.array(image.convert("RGB")).copy()
    if not objects:
        return scene
    detections = objects_to_detections(objects)
    scene = BOX_ANNOTATOR.annotate(scene=scene, detections=detections)
    if show_labels:
        labels = [obj["label"] for obj in objects]
        scene = LABEL_ANNOTATOR.annotate(scene=scene, detections=detections, labels=labels)
    return scene


def _overlay_text(ax, x, y, text, ha="left"):
    ax.text(
        x,
        y,
        text,
        ha=ha,
        va="top",
        color="white",
        fontsize=11,
        bbox={"facecolor": "black", "alpha": 0.65, "pad": 4},
    )


def plot_result_scene(result):
    measurement = result["measurement"]
    analysis = measurement["analysis"]
    focus_objects = result["focus_objects"]
    show_labels = len(focus_objects) <= 8
    scene = render_annotated_scene(result["image"], focus_objects, show_labels=show_labels)

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.imshow(scene)

    if analysis == "count_by_half":
        split_x = measurement["split_x"]
        ax.axvline(split_x, color="#00d4ff", linestyle="--", linewidth=2)
        _overlay_text(ax, split_x - 14, 26, f"left: {measurement['left_count']}", ha="right")
        _overlay_text(ax, split_x + 14, 26, f"right: {measurement['right_count']}")

    elif analysis == "width_threshold":
        _overlay_text(
            ax,
            18,
            26,
            f"width fraction = {measurement['width_fraction']:.4f} vs threshold {measurement['threshold']:.4f}",
        )

    elif analysis == "edge_margin_threshold":
        obj = focus_objects[0]
        x1, y1, x2, y2 = obj["xyxy"]
        edge = measurement["edge"]
        y_mid = (y1 + y2) / 2
        if edge == "right":
            ax.hlines(y_mid, x2, result["image"].size[0], colors="#ff4d6d", linewidth=3)
            _overlay_text(
                ax,
                result["image"].size[0] - 18,
                y_mid - 24,
                f"right margin {measurement['margin_px']} px vs threshold {measurement['threshold_px']} px",
                ha="right",
            )
        elif edge == "left":
            ax.hlines(y_mid, 0, x1, colors="#00d4ff", linewidth=3)
            _overlay_text(ax, 18, y_mid - 24, f"left margin {measurement['margin_px']} px vs threshold {measurement['threshold_px']} px")

    elif analysis == "centered_square_crop":
        crop_x1, crop_y1, crop_x2, crop_y2 = measurement["square_crop_xyxy"]
        rect = patches.Rectangle(
            (crop_x1, crop_y1),
            crop_x2 - crop_x1,
            crop_y2 - crop_y1,
            linewidth=3,
            edgecolor="#00d4ff",
            facecolor="none",
            linestyle="--",
        )
        ax.add_patch(rect)
        _overlay_text(ax, crop_x1 + 18, crop_y1 + 34, "centered square crop")

    ax.set_title(result["task"]["name"], fontsize=13)
    ax.axis("off")
    plt.tight_layout()
    plt.show()


def show_task_gallery(tasks):
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    for ax, task in zip(axes.flat, tasks):
        ax.imshow(load_image_local(task["image"]))
        ax.set_title(task["name"], fontsize=11)
        # axis("off") also hides axis labels, so draw the question as free text
        ax.text(0.5, -0.04, task["question"], transform=ax.transAxes,
                fontsize=9, ha="center", va="top")
        ax.axis("off")
    plt.tight_layout()
    plt.show()


def show_task_gallery_extended(tasks):
    """Show task gallery with question overlaid on each image."""
    import textwrap
    n = len(tasks)
    cols = min(n, 4)
    rows = (n + cols - 1) // cols
    fig, axes = plt.subplots(rows, cols, figsize=(15, 5 * rows))
    if rows == 1 and cols == 1:
        axes = [axes]
    elif rows == 1:
        axes = list(axes)
    else:
        axes = list(axes.flat)
    for i, task in enumerate(tasks):
        img = load_image_local(task['image'])
        axes[i].imshow(img)
        # Overlay question at the top of the image
        q = textwrap.fill(task['question'], width=35)
        axes[i].text(
            0.5, 0.97, q,
            transform=axes[i].transAxes, fontsize=9, fontweight='bold',
            color='white', ha='center', va='top',
            bbox=dict(boxstyle='round,pad=0.4', facecolor='black', alpha=0.75),
        )
        # Overlay expected answer at bottom
        exp = task.get('expected')
        if exp is not None:
            exp_str = 'yes' if exp is True else 'no' if exp is False else str(exp)
            axes[i].text(
                0.5, 0.03, f'Expected: {exp_str}',
                transform=axes[i].transAxes, fontsize=8,
                color='white', ha='center', va='bottom',
                bbox=dict(boxstyle='round,pad=0.3', facecolor='#166534', alpha=0.8),
            )
        axes[i].set_title(task['name'], fontsize=11, fontweight='bold', pad=8)
        axes[i].axis('off')
    for j in range(n, len(axes)):
        axes[j].axis('off')
    plt.tight_layout()
    plt.show()
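As a standalone sanity check of the mask-to-measurements idea above (a synthetic mask, not real Falcon output): a small block of True pixels yields a tight box, a centroid, and an area with plain NumPy indexing.

```python
import numpy as np

# Synthetic 10x10 mask with a 3-row by 4-column block of True pixels
# (rows 2-4, columns 3-6) -- no model involved.
mask = np.zeros((10, 10), dtype=bool)
mask[2:5, 3:7] = True

ys, xs = np.nonzero(mask)
xyxy = [int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())]
centroid = [float(xs.mean()), float(ys.mean())]
area_px = int(mask.sum())
print(xyxy, centroid, area_px)  # [3, 2, 6, 4] [4.5, 3.0] 12
```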
Code
def analyze_count_by_half(objects, image_size, target_label):
    width, _ = image_size
    matches = [obj for obj in objects if obj['label'] == target_label]
    split_x = width / 2
    left_count = sum(obj['centroid'][0] < split_x for obj in matches)
    right_count = sum(obj['centroid'][0] >= split_x for obj in matches)
    winner = 'left' if left_count > right_count else 'right' if right_count > left_count else 'tie'
    return {
        'analysis': 'count_by_half',
        'target_label': target_label,
        'detected_instances': int(len(matches)),
        'split_x': round(float(split_x), 1),
        'left_count': int(left_count),
        'right_count': int(right_count),
        'winner': winner,
    }, matches


def analyze_width_threshold(objects, image_size, target_label, threshold):
    width, _ = image_size
    obj = choose_primary(objects, target_label)
    width_frac = (obj['xyxy'][2] - obj['xyxy'][0]) / width
    return {
        'analysis': 'width_threshold',
        'target_label': target_label,
        'width_fraction': round(float(width_frac), 4),
        'threshold': float(threshold),
        'passes': bool(width_frac >= threshold),
    }, [obj]


def analyze_edge_margin_threshold(objects, image_size, target_label, edge, threshold_px):
    width, height = image_size
    obj = choose_primary(objects, target_label)
    x1, y1, x2, y2 = obj['xyxy']
    margins = {'left': x1, 'right': width - x2, 'top': y1, 'bottom': height - y2}
    margin_px = float(margins[edge])
    return {
        'analysis': 'edge_margin_threshold',
        'target_label': target_label,
        'edge': edge,
        'margin_px': round(margin_px, 1),
        'threshold_px': float(threshold_px),
        'passes': bool(margin_px >= threshold_px),
    }, [obj]


def analyze_centered_square_crop(objects, image_size, target_label):
    width, height = image_size
    obj = choose_primary(objects, target_label)
    crop_size = min(width, height)
    crop_x1 = (width - crop_size) / 2
    crop_y1 = (height - crop_size) / 2
    crop_x2 = crop_x1 + crop_size
    crop_y2 = crop_y1 + crop_size
    x1, y1, x2, y2 = obj['xyxy']
    fits = x1 >= crop_x1 and x2 <= crop_x2 and y1 >= crop_y1 and y2 <= crop_y2
    return {
        'analysis': 'centered_square_crop',
        'target_label': target_label,
        'object_xyxy': [round(float(v), 1) for v in obj['xyxy']],
        'square_crop_xyxy': [round(float(crop_x1), 1), round(float(crop_y1), 1), round(float(crop_x2), 1), round(float(crop_y2), 1)],
        'fits_inside_center_crop': bool(fits),
    }, [obj]


def analyze_total_count(objects, image_size, target_label):
    matches = [obj for obj in objects if obj['label'] == target_label]
    return {
        'analysis': 'total_count',
        'target_label': target_label,
        'count': len(matches),
    }, matches


def analyze_aspect_ratio(objects, image_size, target_label, min_ratio):
    obj = choose_primary(objects, target_label)
    x1, y1, x2, y2 = obj['xyxy']
    w = x2 - x1
    h = y2 - y1
    ratio = w / h if h > 0 else float('inf')
    return {
        'analysis': 'aspect_ratio',
        'target_label': target_label,
        'width_px': round(float(w), 1),
        'height_px': round(float(h), 1),
        'ratio_w_over_h': round(float(ratio), 3),
        'min_ratio': float(min_ratio),
        'passes': bool(ratio >= min_ratio),
    }, [obj]


def analyze_area_fraction(objects, image_size, target_label, threshold):
    width, height = image_size
    obj = choose_primary(objects, target_label)
    obj_area = obj['area_px']
    image_area = width * height
    fraction = obj_area / image_area
    return {
        'analysis': 'area_fraction',
        'target_label': target_label,
        'object_area_px': int(obj_area),
        'image_area_px': int(image_area),
        'fraction': round(float(fraction), 4),
        'threshold': float(threshold),
        'passes': bool(fraction >= threshold),
    }, [obj]




def analyze_vertical_position(objects, image_size, target_label, region):
    """Check if primary object's centroid is in upper or lower half."""
    _, height = image_size
    obj = choose_primary(objects, target_label)
    cy = obj['centroid'][1]
    midline = height / 2
    in_upper = cy < midline
    actual_region = 'upper' if in_upper else 'lower'
    return {
        'analysis': 'vertical_position',
        'target_label': target_label,
        'centroid_y': round(float(cy), 1),
        'midline_y': round(float(midline), 1),
        'actual_region': actual_region,
        'expected_region': region,
        'passes': actual_region == region,
    }, [obj]


def measure_task(objects, image_size, analysis_spec):
    kind = analysis_spec['type']
    if kind == 'count_by_half':
        return analyze_count_by_half(objects, image_size, analysis_spec['target_label'])
    if kind == 'width_threshold':
        return analyze_width_threshold(objects, image_size, analysis_spec['target_label'], analysis_spec['threshold'])
    if kind == 'edge_margin_threshold':
        return analyze_edge_margin_threshold(
            objects, image_size, analysis_spec['target_label'],
            analysis_spec['edge'], analysis_spec['threshold_px'],
        )
    if kind == 'centered_square_crop':
        return analyze_centered_square_crop(objects, image_size, analysis_spec['target_label'])
    if kind == 'total_count':
        return analyze_total_count(objects, image_size, analysis_spec['target_label'])
    if kind == 'aspect_ratio':
        return analyze_aspect_ratio(objects, image_size, analysis_spec['target_label'], analysis_spec['min_ratio'])
    if kind == 'area_fraction':
        return analyze_area_fraction(objects, image_size, analysis_spec['target_label'], analysis_spec['threshold'])
    if kind == 'vertical_position':
        return analyze_vertical_position(objects, image_size, analysis_spec['target_label'], analysis_spec['region'])
    raise ValueError(f'Unknown analysis type: {kind}')
Code
def run_lightweight_tests():
    people = [
        {'label': 'person', 'xyxy': [10, 10, 30, 70], 'centroid': [20.0, 40.0], 'area_px': 1200},
        {'label': 'person', 'xyxy': [40, 10, 60, 70], 'centroid': [50.0, 40.0], 'area_px': 1200},
        {'label': 'person', 'xyxy': [80, 10, 100, 70], 'centroid': [90.0, 40.0], 'area_px': 1200},
    ]
    m1, _ = analyze_count_by_half(people, (120, 80), 'person')
    assert m1['winner'] == 'left'

    bus = [{'label': 'bus', 'xyxy': [10, 5, 80, 75], 'centroid': [45.0, 40.0], 'area_px': 4900}]
    m2, _ = analyze_width_threshold(bus, (120, 80), 'bus', 0.5)
    assert m2['passes'] is True

    monkey = [{'label': 'monkey', 'xyxy': [70, 10, 118, 70], 'centroid': [94.0, 40.0], 'area_px': 2880}]
    m3, _ = analyze_edge_margin_threshold(monkey, (120, 80), 'monkey', 'right', 100)
    assert m3['passes'] is False

    dog = [{'label': 'dog', 'xyxy': [12, 2, 104, 78], 'centroid': [58.0, 40.0], 'area_px': 6992}]
    m4, _ = analyze_centered_square_crop(dog, (120, 80), 'dog')
    assert m4['fits_inside_center_crop'] is False

    # New analysis tests
    m5, _ = analyze_total_count(people, (120, 80), 'person')
    assert m5['count'] == 3

    wide_obj = [{'label': 'bus', 'xyxy': [5, 20, 115, 50], 'centroid': [60.0, 35.0], 'area_px': 3300}]
    m6, _ = analyze_aspect_ratio(wide_obj, (120, 80), 'bus', 3.0)
    assert m6['passes'] is True  # 110/30 = 3.67

    m7, _ = analyze_area_fraction(bus, (120, 80), 'bus', 0.5)
    assert m7['passes'] is True  # 4900 / 9600 = 0.51

    # vertical_position test
    upper_obj = [{'label': 'cat', 'xyxy': [10, 5, 50, 35], 'centroid': [30.0, 20.0], 'area_px': 1200}]
    m_vp, _ = analyze_vertical_position(upper_obj, (120, 80), 'cat', 'upper')
    assert m_vp['passes'] is True

    detections = objects_to_detections([*people, *bus])
    assert detections.xyxy.shape == (4, 4)


run_lightweight_tests()
print('All geometry tests passed.')
All geometry tests passed.

The Tasks

Seven questions that sound easy but require actual measurement to answer reliably. Four have verified expected answers from previous runs; three are left ungraded so you can inspect the measurements yourself.

Code
TASKS = [
    {
        'name': 'Crowd left vs right',
        'image': 'crowd1.jpg',
        'question': 'Which side of the frame contains more people, left or right?',
        'queries': [{'label': 'person', 'query': 'person'}],
        'analysis': {'type': 'count_by_half', 'target_label': 'person'},
        'expected': 'right',
        'skill': 'counting by region',
        'lesson': 'Dense clusters draw the eye. A VLM sees a tight group and calls it — but scattered individuals on the other side can add up to more.',
    },
    {
        'name': 'Times Square left vs right',
        'image': 'crowd2.jpg',
        'question': 'Which side of the frame contains more people, left or right?',
        'queries': [{'label': 'person', 'query': 'person'}],
        'analysis': {'type': 'count_by_half', 'target_label': 'person'},
        'expected': None,  # depends on Falcon detection count — check after first run
        'skill': 'counting by region',
        'lesson': 'In a visually busy scene with signs, cars, and lights, the VLM\'s attention is split. Grounding forces an actual count.',
    },
    {
        'name': 'Monkey far from right edge?',
        'image': 'monkey.jpg',
        'question': 'Is the monkey more than 500 pixels away from the right edge of the image?',
        'queries': [{'label': 'monkey', 'query': 'monkey'}],
        'analysis': {'type': 'edge_margin_threshold', 'target_label': 'monkey', 'edge': 'right', 'threshold_px': 500},
        'expected': False,
        'skill': 'edge-distance threshold',
        'lesson': 'The monkey LOOKS comfortably centered. But its bounding box extends almost to the right edge — only ~1 pixel of margin. A VLM has no ruler.',
    },
    {
        'name': 'Bus under 90% width?',
        'image': 'bus.jpg',
        'question': 'Does the bus occupy less than 90% of the image width?',
        'queries': [{'label': 'bus', 'query': 'bus'}],
        'analysis': {'type': 'width_threshold', 'target_label': 'bus', 'threshold': 0.90},
        'expected': False,
        'skill': 'near-threshold measurement',
        'lesson': 'In a portrait photo with visible sidewalk on both sides, it LOOKS like the bus leaves room. But at 99.6% width, it essentially fills the frame.',
    },
    {
        'name': 'Dog taller than wide?',
        'image': 'happy-doggy.jpg',
        'question': 'Is the dog\'s bounding box taller than it is wide?',
        'queries': [{'label': 'dog', 'query': 'dog'}],
        'analysis': {'type': 'aspect_ratio', 'target_label': 'dog', 'min_ratio': 1.0},
        'expected': None,  # depends on detected bbox — check after first run
        'skill': 'aspect ratio',
        'lesson': 'The portrait orientation of the photo biases the impression. But the dog\'s actual bounding box may tell a different story.',
    },
    {
        'name': 'Dog square crop',
        'image': 'happy-doggy.jpg',
        'question': 'Would a centered square crop keep the dog fully inside the crop?',
        'queries': [{'label': 'dog', 'query': 'dog'}],
        'analysis': {'type': 'centered_square_crop', 'target_label': 'dog'},
        'expected': False,
        'skill': 'crop geometry',
        'lesson': 'The dog is centered and fills the frame — so it seems like a square crop would work. But in a portrait image, the square crop cuts from top and bottom, clipping the dog.',
    },
    {
        'name': 'Monkey in upper half?',
        'image': 'monkey.jpg',
        'question': 'Is the monkey\'s center point in the upper half of the image?',
        'queries': [{'label': 'monkey', 'query': 'monkey'}],
        'analysis': {'type': 'vertical_position', 'target_label': 'monkey', 'region': 'upper'},
        'expected': None,  # depends on detected bbox centroid — check after first run
        'skill': 'vertical position',
        'lesson': 'The monkey\'s body extends to the bottom, pulling the visual impression downward. But the centroid of the detected box tells the real story.',
    },
]

show_task_gallery_extended(TASKS)

Code
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

GEMMA_MODEL = os.environ.get('GEMMA_MODEL', 'mlx-community/gemma-4-31b-it-4bit')
FALCON_MODEL = os.environ.get('FALCON_PERCEPTION_MODEL', 'tiiuae/Falcon-Perception')


def load_local_models(gemma_model=GEMMA_MODEL, falcon_model=FALCON_MODEL):
    gemma, gemma_processor = load(gemma_model)
    gemma_config = load_config(gemma_model)
    falcon, falcon_processor = load(falcon_model, trust_remote_code=True)
    return {
        'gemma': gemma, 'gemma_processor': gemma_processor, 'gemma_config': gemma_config,
        'falcon': falcon, 'falcon_processor': falcon_processor,
    }


def gemma_reply(bundle, prompt, image=None, max_tokens=300):
    formatted = apply_chat_template(
        bundle['gemma_processor'], bundle['gemma_config'], prompt,
        num_images=1 if image is not None else 0,
    )
    result = generate(
        bundle['gemma'], bundle['gemma_processor'], formatted,
        image=image, max_tokens=max_tokens, verbose=False,
    )
    return result.text.strip()
Code
def verdict_options(task):
    # count_by_half picks a side; every other analysis is a yes/no check
    if task['analysis']['type'] == 'count_by_half':
        return ['left', 'right', 'tie']
    return ['yes', 'no']


def answer_prompt(task, context, grounded=False):
    verdicts = ', '.join(verdict_options(task))
    if grounded:
        body = f"""Answer the question using only the measured facts.\n\nQuestion: {task['question']}\n\nMeasured facts:\n{json.dumps(context, indent=2)}\n"""
    else:
        body = f"""Look at the image and answer the question as best you can.\n\nQuestion: {task['question']}\n"""
    return body + f"""\nReturn exactly two lines:\nVerdict: <one of {verdicts}>\nReason: <one short sentence>\n"""


def run_falcon_queries(bundle, image_name, queries, segm_threshold=0.45):
    image = load_image_local(image_name)
    grounded = []
    for spec in queries:
        detections = bundle['falcon'].generate_perception(
            bundle['falcon_processor'], image=image, query=spec['query'],
            temperature=0.0, segm_threshold=segm_threshold,
        )
        grounded.extend(falcon_detection_to_object(det, spec['label'], image.size) for det in detections)
    return grounded


def raw_vlm_answer(bundle, task):
    image = load_image_local(task['image'])
    prompt = answer_prompt(task, None, grounded=False)
    return gemma_reply(bundle, prompt, image=image, max_tokens=120)


def grounded_explanation(bundle, task, measurement):
    prompt = answer_prompt(task, measurement, grounded=True)
    return gemma_reply(bundle, prompt, max_tokens=160)


def normalize_decision(answer, task):
    text = str(answer).strip().lower()
    match = re.search(r'^verdict:\s*(.+)$', text, flags=re.MULTILINE)
    if match:
        verdict = match.group(1).strip().split()[0].strip(' .,:;')
        if task['analysis']['type'] == 'count_by_half':
            if verdict in {'left', 'right', 'tie'}:
                return verdict
        else:
            if verdict in {'yes', 'true'}:
                return True
            if verdict in {'no', 'false'}:
                return False
    tail = text[-250:]
    if task['analysis']['type'] == 'count_by_half':
        if 'right side' in tail and ('more people' in tail or 'contains more' in tail):
            return 'right'
        if 'left side' in tail and ('more people' in tail or 'contains more' in tail):
            return 'left'
        return None
    if re.search(r'\b(no|false)\b', tail):
        return False
    if re.search(r'\b(yes|true)\b', tail):
        return True
    return None


def format_expected(value):
    if isinstance(value, bool):
        return 'yes' if value else 'no'
    return str(value)


def benefit_label(raw_correct, grounded_correct):
    if raw_correct is False and grounded_correct is True:
        return 'corrected'
    if raw_correct is True and grounded_correct is True:
        return 'both correct'
    if raw_correct is False and grounded_correct is False:
        return 'still wrong'
    if raw_correct is True and grounded_correct is False:
        return 'regressed'
    return 'ungraded'


def evidence_string(result):
    m = result['measurement']
    a = m['analysis']
    if a == 'count_by_half':
        return f"{m['left_count']}L vs {m['right_count']}R"
    if a == 'width_threshold':
        return f"{m['width_fraction']:.4f} vs {m['threshold']:.4f}"
    if a == 'edge_margin_threshold':
        return f"{m['margin_px']}px vs {m['threshold_px']}px"
    if a == 'centered_square_crop':
        return 'outside crop' if not m['fits_inside_center_crop'] else 'fits crop'
    if a == 'total_count':
        return f"{m['count']} detected"
    if a == 'aspect_ratio':
        return f"{m['ratio_w_over_h']:.2f} vs {m['min_ratio']:.1f}"
    if a == 'area_fraction':
        return f"{m['fraction']:.4f} vs {m['threshold']:.4f}"
    if a == 'vertical_position':
        return f"centroid y={m['centroid_y']} vs midline={m['midline_y']} -> {m['actual_region']}"
    return ''


def run_grounded_task(bundle, task, segm_threshold=0.45):
    image = load_image_local(task['image'])
    objects = run_falcon_queries(bundle, task['image'], task['queries'], segm_threshold=segm_threshold)
    measurement, focus_objects = measure_task(objects, image.size, task['analysis'])
    answer = grounded_explanation(bundle, task, measurement)
    return {
        'task': task, 'image': image, 'objects': objects,
        'focus_objects': focus_objects, 'measurement': measurement, 'answer': answer,
    }


def compare_raw_vs_grounded(bundle, task, segm_threshold=0.45):
    raw_answer = raw_vlm_answer(bundle, task)
    grounded = run_grounded_task(bundle, task, segm_threshold=segm_threshold)
    expected = task.get('expected')
    raw_prediction = normalize_decision(raw_answer, task)
    grounded_prediction = normalize_decision(grounded['answer'], task)
    raw_correct = None if expected is None or raw_prediction is None else raw_prediction == expected
    grounded_correct = None if expected is None or grounded_prediction is None else grounded_prediction == expected
    return {
        'raw_answer': raw_answer,
        'raw_prediction': raw_prediction,
        'grounded_prediction': grounded_prediction,
        'expected': expected,
        'raw_correct': raw_correct,
        'grounded_correct': grounded_correct,
        'benefit': benefit_label(raw_correct, grounded_correct),
        **grounded,
    }


def summarize_demo_suite(results):
    rows = []
    for r in results:
        rows.append({
            'Task': r['task']['name'],
            'Skill': r['task']['skill'],
            'Evidence': evidence_string(r),
            'Expected': format_expected(r['expected']),
            'Raw': format_expected(r['raw_prediction']) if r['raw_prediction'] is not None else '?',
            'Grounded': format_expected(r['grounded_prediction']) if r['grounded_prediction'] is not None else '?',
            'Benefit': r['benefit'],
        })
    return pd.DataFrame(rows)
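The verdict parsing in `normalize_decision` hinges on a single regex. A minimal standalone sketch of just that step (`parse_verdict` is an illustrative name, not part of the pipeline):

```python
import re

def parse_verdict(answer):
    '''Grab the first token after a "Verdict:" line, case-insensitively.'''
    match = re.search(r'^verdict:\s*(.+)$', str(answer).strip().lower(),
                      flags=re.MULTILINE)
    if not match:
        return None
    # keep only the first word and strip trailing punctuation
    return match.group(1).strip().split()[0].strip(' .,:;')

print(parse_verdict('Verdict: Yes.\nReason: the margin is under threshold.'))  # yes
print(parse_verdict('The left side seems busier.'))  # None
```

The `re.MULTILINE` flag is what lets `^verdict:` match the verdict line even when a reason line follows it.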
Code
def plot_result_scene(result):
    measurement = result['measurement']
    analysis = measurement['analysis']
    focus_objects = result['focus_objects']
    show_labels = len(focus_objects) <= 8
    scene = render_annotated_scene(result['image'], focus_objects, show_labels=show_labels)

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.imshow(scene)

    if analysis == 'count_by_half':
        split_x = measurement['split_x']
        ax.axvline(split_x, color='#00d4ff', linestyle='--', linewidth=2)
        _overlay_text(ax, split_x - 14, 26, f"left: {measurement['left_count']}", ha='right')
        _overlay_text(ax, split_x + 14, 26, f"right: {measurement['right_count']}")
    elif analysis == 'width_threshold':
        _overlay_text(ax, 18, 26, f"width = {measurement['width_fraction']:.4f} vs threshold {measurement['threshold']:.4f}")
    elif analysis == 'edge_margin_threshold':
        obj = focus_objects[0]
        x1, y1, x2, y2 = obj['xyxy']
        y_mid = (y1 + y2) / 2
        if measurement['edge'] == 'right':
            ax.hlines(y_mid, x2, result['image'].size[0], colors='#ff4d6d', linewidth=3)
            _overlay_text(ax, result['image'].size[0] - 18, y_mid - 24,
                          f"margin {measurement['margin_px']}px vs {measurement['threshold_px']}px", ha='right')
    elif analysis == 'centered_square_crop':
        crop_x1, crop_y1, crop_x2, crop_y2 = measurement['square_crop_xyxy']
        rect = patches.Rectangle((crop_x1, crop_y1), crop_x2 - crop_x1, crop_y2 - crop_y1,
                                 linewidth=3, edgecolor='#00d4ff', facecolor='none', linestyle='--')
        ax.add_patch(rect)
        _overlay_text(ax, crop_x1 + 18, crop_y1 + 34, 'centered square crop')
    elif analysis == 'total_count':
        _overlay_text(ax, 18, 26, f"detected: {measurement['count']} {measurement['target_label']}(s)")
    elif analysis == 'aspect_ratio':
        _overlay_text(ax, 18, 26, f"W/H = {measurement['ratio_w_over_h']:.2f} vs threshold {measurement['min_ratio']:.1f}")
    elif analysis == 'area_fraction':
        _overlay_text(ax, 18, 26, f"area = {measurement['fraction']:.4f} vs threshold {measurement['threshold']:.4f}")
    elif analysis == 'vertical_position':
        mid_y = measurement['midline_y']
        ax.axhline(mid_y, color='#00d4ff', linestyle='--', linewidth=2)
        _overlay_text(ax, 18, 26, f"centroid y={measurement['centroid_y']:.0f} | midline={mid_y:.0f} -> {measurement['actual_region']}")

    ax.set_title(result['task']['name'], fontsize=13)
    ax.axis('off')
    plt.tight_layout()
    plt.show()


def show_grounded_result(result, raw_answer=None):
    plot_result_scene(result)
    task = result['task']
    print(f"Question: {task['question']}")
    print(f"Insight: {task['lesson']}")
    if raw_answer is not None:
        print(f"\nRaw Gemma: {raw_answer}")
        print(f"  -> parsed: {format_expected(result['raw_prediction']) if result['raw_prediction'] is not None else '?'}")
    print(f"\nMeasured: {json.dumps(result['measurement'], indent=2)}")
    print(f"\nGrounded Gemma: {result['answer']}")
    print(f"  -> parsed: {format_expected(result['grounded_prediction']) if result['grounded_prediction'] is not None else '?'}")
    print(f"Expected: {format_expected(result['expected'])} | Benefit: {result['benefit']}")

Head-to-Head: Raw Gemma vs. Grounded Pipeline

Code
bundle = load_local_models()
demo_results = [compare_raw_vs_grounded(bundle, task) for task in TASKS]
suite_summary = summarize_demo_suite(demo_results)

raw_hits = int(sum(r['raw_correct'] is True for r in demo_results))
grounded_hits = int(sum(r['grounded_correct'] is True for r in demo_results))
n = len(demo_results)

print(f'Raw Gemma: {raw_hits}/{n} correct')
print(f'Grounded:  {grounded_hits}/{n} correct')
display(suite_summary)
Raw Gemma: 2/7 correct
Grounded:  4/7 correct
| Task | Skill | Evidence | Expected | Raw | Grounded | Benefit |
|---|---|---|---|---|---|---|
| Crowd left vs right | counting by region | 16L vs 19R | right | left | right | corrected |
| Times Square left vs right | counting by region | 0L vs 170R | None | right | right | ungraded |
| Monkey far from right edge? | edge-distance threshold | 1.0px vs 500.0px | no | yes | no | corrected |
| Bus under 90% width? | near-threshold measurement | 0.9963 vs 0.9000 | no | no | no | both correct |
| Dog taller than wide? | aspect ratio | 0.88 vs 1.0 | None | yes | yes | ungraded |
| Dog square crop | crop geometry | outside crop | no | no | no | both correct |
| Monkey in upper half? | vertical position | centroid y=3679.6 vs midline=2592.0 -> lower | None | yes | no | ungraded |
Code
# Show measurements for tasks with no expected value
# Use this to set expected values after the first run
print('Tasks with unset expected values — inspect measurements to set them:\n')
for r in demo_results:
    if r['expected'] is None:
        m = r['measurement']
        print(f"  {r['task']['name']}:")
        print(f"    Measurement: {json.dumps({k: v for k, v in m.items() if k != 'analysis'}, indent=6)}")
        print(f"    Raw Gemma said: {r['raw_answer'][:100]}")
        print(f"    Grounded said:  {r['answer'][:100]}")
        print()
Tasks with unset expected values — inspect measurements to set them:

  Times Square left vs right:
    Measurement: {
      "target_label": "person",
      "detected_instances": 170,
      "split_x": 600.0,
      "left_count": 0,
      "right_count": 170,
      "winner": "right"
}
    Raw Gemma said: Verdict: right
Reason: There is a denser crowd of people gathered on the right side of the street.
    Grounded said:  Verdict: right
Reason: There are 170 people on the right and 0 on the left.

  Dog taller than wide?:
    Measurement: {
      "target_label": "dog",
      "width_px": 3992.0,
      "height_px": 4530.0,
      "ratio_w_over_h": 0.881,
      "min_ratio": 1.0,
      "passes": false
}
    Raw Gemma said: Verdict: yes
Reason: The dog's head and neck extend vertically more than they do horizontally in the
    Grounded said:  Verdict: yes
Reason: The height of 4530.0 px is greater than the width of 3992.0 px.

  Monkey in upper half?:
    Measurement: {
      "target_label": "monkey",
      "centroid_y": 3679.6,
      "midline_y": 2592.0,
      "actual_region": "lower",
      "expected_region": "upper",
      "passes": false
}
    Raw Gemma said: Verdict: yes
Reason: The monkey's head and torso are positioned primarily in the top half of the fra
    Grounded said:  Verdict: no
Reason: The monkey's actual region is lower, as its centroid_y (3679.6) is greater than 

Scorecard

Two views: (1) how many tasks each approach got right, and (2) what grounding actually changed — did it correct a wrong answer, justify a right one, or regress?

Code
benefit_order = ['corrected', 'both correct', 'still wrong', 'regressed', 'ungraded']
benefit_colors_map = {
    'corrected': '#22c55e', 'both correct': '#60a5fa',
    'still wrong': '#ef4444', 'regressed': '#a855f7', 'ungraded': '#94a3b8',
}
benefit_counts = (
    pd.Series([r['benefit'] for r in demo_results])
    .value_counts().reindex(benefit_order, fill_value=0)
)

fig, axes = plt.subplots(1, 3, figsize=(16, 4.5))

# 1: Accuracy comparison
bars = axes[0].bar(['Raw Gemma', 'Grounded'], [raw_hits, grounded_hits],
                    color=['#f97316', '#06b6d4'], width=0.5)
axes[0].set_ylim(0, n + 0.5)
axes[0].set_ylabel('Correct answers')
axes[0].set_title(f'Accuracy ({n} tasks)', fontsize=13)
for bar, val in zip(bars, [raw_hits, grounded_hits]):
    axes[0].text(bar.get_x() + bar.get_width() / 2, val + 0.1, f'{val}/{n}',
                 ha='center', fontsize=14, fontweight='bold')

# 2: Benefit breakdown
present = [(k, v) for k, v in zip(benefit_order, benefit_counts) if v > 0]
if present:
    labels, vals = zip(*present)
    colors = [benefit_colors_map[k] for k in labels]
    bars2 = axes[1].bar(labels, vals, color=colors)
    for bar, val in zip(bars2, vals):
        axes[1].text(bar.get_x() + bar.get_width() / 2, val + 0.05, str(int(val)),
                     ha='center', fontsize=12)
axes[1].set_title('What grounding changed', fontsize=13)
axes[1].tick_params(axis='x', rotation=15)

# 3: Per-task heatmap-style comparison
task_names = [r['task']['name'] for r in demo_results]
raw_scores = [1 if r['raw_correct'] else 0 for r in demo_results]
grounded_scores = [1 if r['grounded_correct'] else 0 for r in demo_results]

x = np.arange(len(task_names))
w = 0.35
b1 = axes[2].barh(x - w/2, raw_scores, w, label='Raw Gemma', color='#f97316', alpha=0.8)
b2 = axes[2].barh(x + w/2, grounded_scores, w, label='Grounded', color='#06b6d4', alpha=0.8)
axes[2].set_yticks(x)
axes[2].set_yticklabels(task_names, fontsize=8)
axes[2].set_xlabel('Correct (1) / Wrong (0)')
axes[2].set_title('Per-task comparison', fontsize=13)
axes[2].legend(loc='lower right', fontsize=8)
axes[2].set_xlim(-0.1, 1.3)

plt.suptitle('Raw VLM vs. Grounded Pipeline', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

Per-Task Audit Trail

For each task: what Falcon detected, what Python measured, what raw Gemma said, and what grounded Gemma said. The answer is no longer a black box.

Code
for result in demo_results:
    task = result['task']
    benefit = result['benefit']

    # Color-coded benefit badge
    badge_colors = {
        'corrected': '\033[92m CORRECTED \033[0m',
        'both correct': '\033[94m BOTH CORRECT \033[0m',
        'still wrong': '\033[91m STILL WRONG \033[0m',
        'regressed': '\033[95m REGRESSED \033[0m',
        'ungraded': '\033[90m UNGRADED \033[0m',
    }

    print('=' * 80)
    print(f"  {task['name']}  [{badge_colors.get(benefit, benefit)}]")
    print('=' * 80)
    plot_result_scene(result)
    print(f"Question: {task['question']}")
    print(f"Expected: {format_expected(result['expected'])}")
    print()
    print(f"Raw Gemma:      {result['raw_answer']}")
    print(f"  -> verdict:   {format_expected(result['raw_prediction']) if result['raw_prediction'] is not None else '?'}")
    correct_mark = '  ✓' if result['raw_correct'] else '  ✗' if result['raw_correct'] is False else '  ?'
    print(f"  -> correct?  {correct_mark}")
    print()
    print(f"Grounded Gemma: {result['answer']}")
    print(f"  -> verdict:   {format_expected(result['grounded_prediction']) if result['grounded_prediction'] is not None else '?'}")
    correct_mark = '  ✓' if result['grounded_correct'] else '  ✗' if result['grounded_correct'] is False else '  ?'
    print(f"  -> correct?  {correct_mark}")
    print()
    print(f"Evidence: {evidence_string(result)}")
    print(f"Insight: {task['lesson']}")
    print()
================================================================================
  Crowd left vs right  [ CORRECTED ]
================================================================================
Question: Which side of the frame contains more people, left or right?
Expected: right

Raw Gemma:      Verdict: left
Reason: There is a denser crowd of people gathered on the left side of the image.
  -> verdict:   left
  -> correct?    ✗

Grounded Gemma: Verdict: right
Reason: The right side has 19 people, while the left side has 16.
  -> verdict:   right
  -> correct?    ✓

Evidence: 16L vs 19R
Insight: Dense clusters draw the eye. A VLM sees a tight group and calls it — but scattered individuals on the other side can add up to more.

================================================================================
  Times Square left vs right  [ UNGRADED ]
================================================================================
Question: Which side of the frame contains more people, left or right?
Expected: None

Raw Gemma:      Verdict: right
Reason: There is a denser crowd of people gathered on the right side of the street.
  -> verdict:   right
  -> correct?    ?

Grounded Gemma: Verdict: right
Reason: There are 170 people on the right and 0 on the left.
  -> verdict:   right
  -> correct?    ?

Evidence: 0L vs 170R
Insight: In a visually busy scene with signs, cars, and lights, the VLM's attention is split. Grounding forces an actual count.

================================================================================
  Monkey far from right edge?  [ CORRECTED ]
================================================================================
Question: Is the monkey more than 500 pixels away from the right edge of the image?
Expected: no

Raw Gemma:      Verdict: yes
Reason: The monkey is positioned centrally and to the left, leaving a significant gap of more than 500 pixels from the right edge.
  -> verdict:   yes
  -> correct?    ✗

Grounded Gemma: Verdict: no
Reason: The monkey is not more than 500 pixels away from the right edge, as the check failed.
  -> verdict:   no
  -> correct?    ✓

Evidence: 1.0px vs 500.0px
Insight: The monkey LOOKS comfortably centered. But its bounding box extends almost to the right edge — only ~1 pixel of margin. A VLM has no ruler.

================================================================================
  Bus under 90% width?  [ BOTH CORRECT ]
================================================================================
Question: Does the bus occupy less than 90% of the image width?
Expected: no

Raw Gemma:      Verdict: no
Reason: The bus spans almost the entire width of the image.
  -> verdict:   no
  -> correct?    ✓

Grounded Gemma: Verdict: no
Reason: The bus occupies 99.63% of the image width, which is not less than 90%.
  -> verdict:   no
  -> correct?    ✓

Evidence: 0.9963 vs 0.9000
Insight: In a portrait photo with visible sidewalk on both sides, it LOOKS like the bus leaves room. But at 99.6% width, it essentially fills the frame.

================================================================================
  Dog taller than wide?  [ UNGRADED ]
================================================================================
Question: Is the dog's bounding box taller than it is wide?
Expected: None

Raw Gemma:      Verdict: yes
Reason: The dog's head and neck extend vertically more than they do horizontally in the frame.
  -> verdict:   yes
  -> correct?    ?

Grounded Gemma: Verdict: yes
Reason: The height of 4530.0 px is greater than the width of 3992.0 px.
  -> verdict:   yes
  -> correct?    ?

Evidence: 0.88 vs 1.0
Insight: The portrait orientation of the photo biases the impression. But the dog's actual bounding box may tell a different story.

================================================================================
  Dog square crop  [ BOTH CORRECT ]
================================================================================
Question: Would a centered square crop keep the dog fully inside the crop?
Expected: no

Raw Gemma:      Verdict: no
Reason: The dog's ears and nose extend beyond a centered square area.
  -> verdict:   no
  -> correct?    ✓

Grounded Gemma: Verdict: no
Reason: The fits_inside_center_crop value is false.
  -> verdict:   no
  -> correct?    ✓

Evidence: outside crop
Insight: The dog is centered and fills the frame — so it seems like a square crop would work. But in a portrait image, the square crop cuts from top and bottom, clipping the dog.

================================================================================
  Monkey in upper half?  [ UNGRADED ]
================================================================================
Question: Is the monkey's center point in the upper half of the image?
Expected: None

Raw Gemma:      Verdict: yes
Reason: The monkey's head and torso are positioned primarily in the top half of the frame.
  -> verdict:   yes
  -> correct?    ?

Grounded Gemma: Verdict: no
Reason: The monkey's actual region is lower, as its centroid_y (3679.6) is greater than the midline_y (2592.0).
  -> verdict:   no
  -> correct?    ?

Evidence: centroid y=3679.6 vs midline=2592.0 -> lower
Insight: The monkey's body extends to the bottom, pulling the visual impression downward. But the centroid of the detected box tells the real story.

Experiment: Why Does Raw Gemma Fail?

Let’s categorize the failure modes. When raw Gemma gets it wrong, is it because it:

  • Miscounted (counting tasks)
  • Misjudged a threshold (near-boundary tasks)
  • Hallucinated spatial facts (geometry tasks)

This matters because it tells us which tasks benefit most from grounding.

Code
# Categorize tasks by skill type
skill_categories = {
    'counting': ['counting by region', 'counting in clutter'],
    'threshold': ['near-threshold measurement', 'edge-distance threshold', 'aspect ratio estimation', 'area estimation', 'aspect ratio'],
    'spatial': ['crop geometry', 'vertical position'],
}

def categorize_skill(skill):
    for cat, skills in skill_categories.items():
        if skill in skills:
            return cat
    return 'other'

# Ungraded tasks (expected=None) are scored as incorrect for both pipelines
# here, which deflates both accuracy columns equally.
error_data = []
for r in demo_results:
    cat = categorize_skill(r['task']['skill'])
    error_data.append({
        'Task': r['task']['name'],
        'Category': cat,
        'Raw correct': bool(r['raw_correct']) if r['raw_correct'] is not None else False,
        'Grounded correct': bool(r['grounded_correct']) if r['grounded_correct'] is not None else False,
        'Benefit': r['benefit'],
    })

error_df = pd.DataFrame(error_data)

# Per-category accuracy
cat_summary = error_df.groupby('Category').agg(
    n=('Task', 'count'),
    raw_correct=('Raw correct', 'sum'),
    grounded_correct=('Grounded correct', 'sum'),
).reset_index()
cat_summary['Raw %'] = (cat_summary['raw_correct'] / cat_summary['n'] * 100).round(0).astype(int)
cat_summary['Grounded %'] = (cat_summary['grounded_correct'] / cat_summary['n'] * 100).round(0).astype(int)

print('Accuracy by task category:')
display(cat_summary[['Category', 'n', 'Raw %', 'Grounded %']])

# Grouped bar chart
fig, ax = plt.subplots(figsize=(8, 4))
x = range(len(cat_summary))
w = 0.35
ax.bar([i - w/2 for i in x], cat_summary['Raw %'], w, label='Raw Gemma', color='#f97316')
ax.bar([i + w/2 for i in x], cat_summary['Grounded %'], w, label='Grounded', color='#06b6d4')
ax.set_xticks(list(x))
ax.set_xticklabels(cat_summary['Category'])
ax.set_ylabel('Accuracy %')
ax.set_title('Where does grounding help most?', fontsize=13)
ax.legend()
ax.set_ylim(0, 110)
plt.tight_layout()
plt.show()
Accuracy by task category:
| Category | n | Raw % | Grounded % |
|---|---|---|---|
| counting | 2 | 0 | 50 |
| spatial | 2 | 50 | 50 |
| threshold | 3 | 33 | 67 |

Experiment: The Confidence Problem

Here is the sneaky part: raw Gemma never says “I’m not sure.” It gives a confident, plausible-sounding answer that happens to be wrong. Let’s look at the raw answers for the tasks it got wrong and note how assertive the language is.

Code
print('Tasks where raw Gemma was WRONG but sounded CONFIDENT:\n')
for r in demo_results:
    if r['raw_correct'] is False:
        print(f"--- {r['task']['name']} ---")
        print(f"Question: {r['task']['question']}")
        print(f"Raw answer: {r['raw_answer']}")
        print(f"Expected: {format_expected(r['expected'])}")
        print(f"Evidence from grounding: {evidence_string(r)}")
        print()

wrong_count = sum(r['raw_correct'] is False for r in demo_results)
print(f'\nTotal: {wrong_count}/{len(demo_results)} tasks where Gemma was confidently wrong.')
print('This is the core argument for grounding: confidence without measurement is dangerous.')
Tasks where raw Gemma was WRONG but sounded CONFIDENT:

--- Crowd left vs right ---
Question: Which side of the frame contains more people, left or right?
Raw answer: Verdict: left
Reason: There is a denser crowd of people gathered on the left side of the image.
Expected: right
Evidence from grounding: 16L vs 19R

--- Monkey far from right edge? ---
Question: Is the monkey more than 500 pixels away from the right edge of the image?
Raw answer: Verdict: yes
Reason: The monkey is positioned centrally and to the left, leaving a significant gap of more than 500 pixels from the right edge.
Expected: no
Evidence from grounding: 1.0px vs 500.0px


Total: 2/7 tasks where Gemma was confidently wrong.
This is the core argument for grounding: confidence without measurement is dangerous.
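The printout above makes the point qualitatively. One way to make the “no uncertainty signaled” claim measurable is a quick hedging-word scan over the raw answers. This is a minimal sketch with an illustrative (not exhaustive) hedge list; the two sample strings are the confidently-wrong answers from this run:

```python
import re

# Hedging vocabulary: a small illustrative list, not exhaustive.
HEDGES = {"might", "maybe", "possibly", "appears", "seems", "roughly",
          "approximately", "likely", "unsure", "uncertain", "probably"}

def hedge_count(answer: str) -> int:
    """Count hedging words in a model answer (case-insensitive)."""
    words = re.findall(r"[a-z']+", answer.lower())
    return sum(w in HEDGES for w in words)

# The two confidently-wrong raw answers from this run:
answers = [
    "Verdict: left\nReason: There is a denser crowd of people gathered "
    "on the left side of the image.",
    "Verdict: yes\nReason: The monkey is positioned centrally and to the "
    "left, leaving a significant gap of more than 500 pixels from the right edge.",
]
for a in answers:
    print(hedge_count(a))  # prints 0 for both: no uncertainty signaled
```

Both wrong answers score zero hedges, which is exactly the failure mode: certainty without measurement.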

Exercises for Students

  1. Add your own task: Pick an image, write a question that requires measurement, define the expected answer and analysis type. Run it through both pipelines. Did grounding help?

  2. Break the grounding: Find a case where Falcon Perception grounds the wrong object. What happens to the pipeline? (Hint: try ambiguous queries like “animal” on an image with multiple animals.)

  3. Threshold sensitivity: Take the bus width task and vary the threshold from 0.90 to 0.999. At what point does raw Gemma start getting it wrong? What does this tell you about VLM precision?

  4. Count scaling: How many objects can Falcon reliably detect before counts start drifting? Test with crowd images of increasing density.

  5. Pipeline cost: Measure the wall-clock time for raw vs. grounded. Is the extra latency worth the accuracy gain? When would you choose raw VLM over grounding?
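For exercise 5, a minimal timing harness is enough. The `run_raw` / `run_grounded` functions below are hypothetical stand-ins so the sketch runs on its own; in the notebook you would replace them with the raw Gemma call and `compare_raw_vs_grounded` respectively:

```python
import time

def timed(fn, *args, repeats=3):
    """Return (best wall-clock seconds over several runs, last result)."""
    best, result = float("inf"), None
    for _ in range(repeats):
        t0 = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best, result

# Hypothetical stand-ins; swap in the real pipeline calls.
def run_raw(task):       return "Verdict: left"
def run_grounded(task):  return "Verdict: right"

raw_s, _ = timed(run_raw, {"name": "demo"})
grd_s, _ = timed(run_grounded, {"name": "demo"})
print(f"raw: {raw_s:.4f}s  grounded: {grd_s:.4f}s  overhead: {grd_s - raw_s:+.4f}s")
```

Taking the best of several runs reduces noise from caching and background load, which matters when the per-call cost is dominated by model inference.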

Takeaways

  1. VLMs are confidently wrong about numbers. Counting, thresholds, and spatial measurements are not language tasks. Grounding converts them into arithmetic.

  2. The win is auditability. Even when both approaches get the right answer, the grounded pipeline shows why — you can inspect the masks, the measurements, and the reasoning chain.

  3. Grounding helps most on threshold and counting tasks. These are exactly the tasks where “close enough” is not good enough.

  4. The pattern is general. Falcon + Gemma is one instance. Any detector + any reasoner works. The insight is: measure first, then reason over measurements.
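The measure-first pattern in takeaway 4 fits in a few lines. A minimal sketch with a stubbed detector: any real detector that emits `xyxy` boxes slots into `detect`, and the `reason` stage here is plain string templating standing in for the LLM call:

```python
def detect(image):
    # Stub: a real detector (Falcon Perception, YOLO, ...) goes here.
    # Each detection is {"label": str, "xyxy": (x1, y1, x2, y2)}.
    return [
        {"label": "person", "xyxy": (10, 40, 50, 200)},
        {"label": "person", "xyxy": (400, 30, 450, 210)},
        {"label": "person", "xyxy": (460, 50, 500, 190)},
    ]

def measure_count_by_half(detections, image_width):
    # Plain arithmetic over box centroids: nothing to hallucinate.
    split = image_width / 2
    centers = [(b["xyxy"][0] + b["xyxy"][2]) / 2 for b in detections]
    left = sum(c < split for c in centers)
    right = len(centers) - left
    return {"left_count": left, "right_count": right,
            "winner": "left" if left > right else "right"}

def reason(measurement):
    # Stand-in for the LLM: answer strictly from the measured facts.
    return (f"Verdict: {measurement['winner']}\n"
            f"Reason: {measurement['right_count']} on the right vs "
            f"{measurement['left_count']} on the left.")

m = measure_count_by_half(detect(None), image_width=600)
print(reason(m))  # Verdict: right (2 on the right vs 1 on the left)
```

Swapping the detector or the reasoner changes nothing about the contract between stages: geometry in, measurements out, verdict from measurements only.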

Limitations

  • If Falcon grounds the wrong object, the measurement is wrong too. Garbage in, garbage out.
  • Query design matters — “person” works, “the tall person on the left” might not.
  • This is not a detector benchmark. It is a workflow demonstration.
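One cheap mitigation for the first limitation is a sanity filter before measuring: reject detections whose boxes are degenerate, fall outside the image, or score poorly. A minimal sketch; the `score` field is a hypothetical confidence value, so adapt the check to whatever your detector actually returns:

```python
def valid_detections(detections, image_w, image_h, min_score=0.3):
    """Drop boxes that are degenerate, out of bounds, or low-confidence."""
    keep = []
    for d in detections:
        x1, y1, x2, y2 = d["xyxy"]
        if x2 <= x1 or y2 <= y1:
            continue                      # degenerate box
        if x1 < 0 or y1 < 0 or x2 > image_w or y2 > image_h:
            continue                      # outside the image
        if d.get("score", 1.0) < min_score:
            continue                      # low-confidence detection
        keep.append(d)
    return keep

dets = [
    {"xyxy": (10, 10, 50, 50), "score": 0.9},    # fine
    {"xyxy": (60, 60, 40, 90), "score": 0.9},    # degenerate (x2 < x1)
    {"xyxy": (10, 10, 700, 50), "score": 0.9},   # out of bounds for 600x400
    {"xyxy": (10, 10, 50, 50), "score": 0.1},    # low confidence
]
print(len(valid_detections(dets, 600, 400)))  # prints 1
```

This does not fix wrong-object grounding, but it keeps obviously broken detections from silently becoming "measurements" downstream.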