Gemini 3.5 Flash and Gemini Omni: What’s New, What’s Real, What I Could Measure

Google announced Gemini 3.5 Flash and Gemini Omni at I/O 2026. I run 3.5 Flash against my April 3.1 baseline on the same Gemini API: thinking levels, tokens-per-second, a 500K-token needle test, agentic tool use, and a coding micro-benchmark. Omni is not yet on the developer API, so the second half is notes from the announcement rather than measurements.
LLM
Gemini
benchmarks
latency
thinking
multimodal
Author

Nipun Batra

Published

May 20, 2026

Yesterday’s Google I/O 2026 keynote shipped two things worth poking at:

  1. gemini-3.5-flash — already GA on the developer API. Google’s claim: a Flash-tier model that beats Gemini 3.1 Pro on agentic and coding benchmarks at a fraction of the latency and price. The headline change in the API is the thinking_budget integer being replaced by a thinking_level enum (minimal | low | medium | high).
  2. Gemini Omni — a unified multimodal model that takes (text | image | audio | video) and emits video. Live for AI Pro/Ultra subscribers in the Gemini app and Flow; not on the developer API yet (“coming in weeks”).

Since I already have a Gemini 3.1 comparison post from April, the natural thing is to keep the same harness and just swap in 3.5 Flash. That’s the first half of this post. The second half is a short, honest writeup of what Omni does and why I couldn’t actually test it.

All measurements below were taken on 2026-05-20 against the public Gemini API with google-genai==1.56.0.

What changed in the API

From the migration guide, the user-visible changes in 3.5 Flash are:

  • thinking_budget: intthinking_level: "minimal" | "low" | "medium" | "high".
  • Default level moved from high (in 3.1) → medium (in 3.5).
  • Thought preservation is on by default: the reasoning trace from one turn is carried into the next, which improves multi-turn quality but inflates input tokens.
  • Same 1,048,576-token input window, same 65,536-token output cap as gemini-3.1-pro-preview.

The model is also listed at GA, no -preview suffix, which is the first time a Flash-tier Gemini has launched without a preview gate.

Code
import os
from google import genai

client = genai.Client(api_key=os.environ['GEMINI_API_KEY'])
print('SDK ready')

# Confirm 3.5 Flash is visible to this key
for m in client.models.list():
    n = m.name.lower()
    if 'gemini-3.5' in n or 'omni' in n:
        print(f'{m.name}  in={m.input_token_limit:,}  out={m.output_token_limit:,}')
models/gemini-3.5-flash  in=1,048,576  out=65,536

That is the only 3.5 entry I see. No omni anywhere — confirmed below.

Thinking levels: what actually changes per level

The thinking_level enum is the most user-facing change. To measure what each level costs, I asked the same toy puzzle five times — one per level on 3.5, plus gemini-3.1-flash-lite as a “no thinking” baseline.

Code
import time, asyncio
from google.genai import types

PUZZLE = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. "
          "How much does the ball cost? Reply with just the number in cents.")

async def ask(model, level=None):
    cfg = None
    if level is not None:
        cfg = types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level=level),
        )
    t0 = time.time()
    kw = {'model': model, 'contents': PUZZLE}
    if cfg: kw['config'] = cfg
    r = await client.aio.models.generate_content(**kw)
    u = r.usage_metadata
    return dict(
        config=f'{model}/{level or "-"}',
        dt=round(time.time()-t0, 2),
        answer=r.text.strip(),
        thoughts=getattr(u, 'thoughts_token_count', None) or 0,
        out=u.candidates_token_count,
    )

rows = await asyncio.gather(
    ask('gemini-3.1-flash-lite'),
    ask('gemini-3.5-flash', 'minimal'),
    ask('gemini-3.5-flash', 'low'),
    ask('gemini-3.5-flash', 'medium'),
    ask('gemini-3.5-flash', 'high'),
)
import pandas as pd
pd.DataFrame(rows)

A representative run on my key:

config dt (s) answer thoughts out
gemini-3.1-flash-lite/- 0.93 5 0 1
gemini-3.5-flash/minimal 0.93 5 0 1
gemini-3.5-flash/low 1.97 5 245 1
gemini-3.5-flash/medium 2.21 5 302 1
gemini-3.5-flash/high 2.42 5 387 1

Two things worth flagging:

  • minimal is genuinely a no-think mode. thoughts_token_count is None (not zero), latency matches flash-lite, and the SDK doesn’t bill for reasoning at all. If you were using 3.1 Flash for cheap classification, 3.5-flash at minimal is the drop-in replacement.
  • Each level up costs ~50-100 reasoning tokens on this trivial puzzle. That’s the floor. On harder problems the model spends much more (see the coding section below).

Output throughput: tokens per second

Google’s headline claim is “~4x output speed vs other frontier models” (DeepMind page). To check, I asked each model to write a fixed-length 400-word essay and divided output tokens by wall-clock seconds. I’m comparing two API endpoints (3.5 Flash and 3.1 Flash Lite) plus 3.1 Pro for reference. Single-call, no streaming; this measures end-to-end throughput, not steady-state generation rate.

Code
PROMPT = "Write a 400-word essay explaining backpropagation to a CS undergraduate. No markdown."

async def measure(model, level=None):
    cfg = None
    if level is not None:
        cfg = types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level=level))
    t0 = time.time()
    kw = {'model': model, 'contents': PROMPT}
    if cfg: kw['config'] = cfg
    r = await client.aio.models.generate_content(**kw)
    dt = time.time() - t0
    out = r.usage_metadata.candidates_token_count
    return dict(model=f"{model}/{level or '-'}",
                dt=round(dt, 2), out=out,
                tps=round(out/dt, 1) if dt else 0,
                thoughts=getattr(r.usage_metadata, 'thoughts_token_count', None) or 0)

rows = await asyncio.gather(
    measure('gemini-3.5-flash', 'minimal'),
    measure('gemini-3.5-flash', 'medium'),
    measure('gemini-3.1-flash-lite'),
    measure('gemini-3.1-pro-preview'),
)
pd.DataFrame(rows)

My run (single sample, your network will vary):

model dt (s) out tok tok/s thoughts
gemini-3.5-flash/minimal 3.84 468 121.9 0
gemini-3.5-flash/medium 9.77 443 45.3 1616
gemini-3.1-flash-lite 3.01 472 156.7 0
gemini-3.1-pro-preview 28.45 465 16.3 2875

Read carefully:

  • At minimal, 3.5 Flash hits ~122 tok/s end-to-end. 3.1 Flash Lite is faster at ~157 tok/s on this prompt — Lite is the smaller model, so this is expected. Google’s “4x” figure compares against frontier models like Pro or Opus, not against their own Lite tier.
  • At medium thinking, 3.5 Flash spends 1,616 reasoning tokens on a creative-writing task that arguably needs none, which drops effective throughput by 2.7x. Defaulting to medium is a footgun for streaming UIs. For chatbot-style latency, override to minimal or low.
  • vs 3.1 Pro: 3.5 Flash at medium is 2.9x faster wall-clock, and at minimal it’s 7.4x faster, for outputs of comparable length.

Streaming time-to-first-token (TTFT) is a more honest UX number. I re-ran the same prompt as a streaming call:

Code
async def stream_ttft(model, level=None):
    cfg = None
    if level is not None:
        cfg = types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level=level))
    t0 = time.time()
    first = None
    kw = {'model': model, 'contents': PROMPT}
    if cfg: kw['config'] = cfg
    async for chunk in await client.aio.models.generate_content_stream(**kw):
        if chunk.text and first is None:
            first = time.time() - t0
    return dict(model=f"{model}/{level or '-'}",
                ttft_s=round(first or -1, 2),
                total_s=round(time.time()-t0, 2))

rows = []
for spec in [('gemini-3.1-flash-lite', None),
             ('gemini-3.5-flash', 'minimal'),
             ('gemini-3.5-flash', 'medium'),
             ('gemini-3.1-pro-preview', None)]:
    rows.append(await stream_ttft(*spec))
pd.DataFrame(rows)
model TTFT (s) total (s)
gemini-3.1-flash-lite 0.62 2.00
gemini-3.5-flash/minimal 1.20 2.95
gemini-3.5-flash/medium 5.65 6.93
gemini-3.1-pro-preview 14.29 16.25

This is the chart that changes how I’d deploy 3.5 Flash. With medium thinking (the default!), the first token doesn’t arrive for almost 6 seconds because the model burns reasoning tokens you can’t display. If you’re building a chat UI, set thinking_level="minimal" or "low" for free-form generation, and only bump it for tasks that actually need deliberation.

500K-token needle: long context still holds

3.1 Flash already had a 1M-token window. The interesting question is whether 3.5 Flash regresses anywhere when the input gets large. I built a ~524K-token haystack of plausible filler (“Compliance audits should be filed before the close of the fiscal year.” etc.), inserted a single needle 78% of the way through, and asked for it:

Code
import random
random.seed(7)
filler = ["The quarterly revenue grew by 3.4 percent across the EMEA region.",
          "Maintenance windows are scheduled every second Tuesday of the month.",
          "Compliance audits should be filed before the close of the fiscal year.",
          "The latency budget for the checkout flow is 250 milliseconds.",
          "Customer onboarding requires KYC verification within 48 hours.",
          "Backups are retained for ninety days in cold storage.",
          "Feature flags are managed through the central configuration service.",
          "The internal mailing list is rate-limited to twenty messages per hour.",
          "Engineering managers review the on-call rotation at every retrospective.",
          "Code reviews require two approvals from the core team."]
random.shuffle(filler)
parts = (filler * 4000)
NEEDLE = "The secret access code for the Hyderabad facility is BANYAN-7421."
parts.insert(int(len(parts) * 0.78), NEEDLE)
HAYSTACK = "\n".join(parts)
QUESTION = "What is the secret access code for the Hyderabad facility? Answer with just the code."

async def find_needle(model, level=None):
    cfg = None
    if level is not None:
        cfg = types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level=level))
    t0 = time.time()
    kw = {'model': model, 'contents': HAYSTACK + "\n\n" + QUESTION}
    if cfg: kw['config'] = cfg
    r = await client.aio.models.generate_content(**kw)
    u = r.usage_metadata
    return dict(model=f"{model}/{level or '-'}",
                found='BANYAN-7421' in r.text,
                answer=r.text.strip()[:40],
                in_tok=u.prompt_token_count, dt=round(time.time()-t0, 2))

rows = await asyncio.gather(
    find_needle('gemini-3.1-flash-lite'),
    find_needle('gemini-3.5-flash', 'minimal'),
    find_needle('gemini-3.5-flash', 'medium'),
)
pd.DataFrame(rows)
model found answer in_tok dt (s)
gemini-3.1-flash-lite True BANYAN-7421 524,037 5.24
gemini-3.5-flash/minimal True BANYAN-7421 524,037 5.11
gemini-3.5-flash/medium True BANYAN-7421 524,037 7.12

All three retrieve correctly. The interesting datapoint is the latency: at half-a-million input tokens, 3.5 Flash at minimal is the same speed as 3.1 Flash Lite, while 3.1 Pro on the same prompt costs about 5x more time. This is a single needle, so it doesn’t say anything about multi-hop reasoning over long context — but the baseline retrieval is intact at the new model.

A tiny coding micro-benchmark

The benchmark Google leads with is agentic coding — Terminal-Bench 2.1, MCP Atlas, OSWorld. I can’t run those in a notebook, so I built a stripped-down honest version: three Python functions of increasing difficulty, ask each model to write them, then execute the result against hidden test cases.

Tasks:

  1. reverse_words(s) — easy
  2. spiral_order(matrix) — medium (lots of edge cases: empty, single row, single column, non-square)
  3. word_break(s, words) — medium-hard (needs DP, not greedy)

I run each task at each thinking level and compare to gemini-3.1-flash-lite.

Code
import re, subprocess, json

TASKS = [
    {'name': 'spiral_order',
     'prompt': 'Write a Python function spiral_order(matrix: list[list[int]]) -> list[int] that returns all elements of an m x n matrix in spiral (clockwise) order starting from the top-left. Handle empty matrices and non-square shapes. ONLY output a fenced python block.',
     'tests': [
         ("[[1,2,3],[4,5,6],[7,8,9]]", "[1, 2, 3, 6, 9, 8, 7, 4, 5]"),
         ("[[1,2,3,4],[5,6,7,8],[9,10,11,12]]", "[1, 2, 3, 4, 8, 12, 11, 10, 9, 5, 6, 7]"),
         ("[]", "[]"),
         ("[[1]]", "[1]"),
         ("[[1,2,3]]", "[1, 2, 3]"),
         ("[[1],[2],[3]]", "[1, 2, 3]"),
     ]},
    {'name': 'word_break',
     'prompt': 'Write a Python function word_break(s: str, words: list[str]) -> bool that returns True iff s can be segmented into a sequence of one or more words from the list (words may be reused). ONLY output a fenced python block.',
     'tests': [
         ("'leetcode', ['leet','code']", "True"),
         ("'applepenapple', ['apple','pen']", "True"),
         ("'catsandog', ['cats','dog','sand','and','cat']", "False"),
         ("'', ['a']", "True"),
         ("'aaaaaaa', ['aaa','aaaa']", "True"),
     ]},
]

def extract(text):
    m = re.search(r'```(?:python)?\s*(.*?)```', text, re.DOTALL)
    return m.group(1).strip() if m else text.strip()

def run_tests(code, fn, tests):
    body = code + "\nimport json\nr=[]\n"
    for args, expected in tests:
        body += f"try:\n r.append(repr({fn}({args}))=={expected!r})\nexcept: r.append(False)\n"
    body += "print(json.dumps(r))"
    p = subprocess.run(['python3', '-c', body], capture_output=True, text=True, timeout=6)
    if p.returncode != 0: return 0, len(tests)
    res = json.loads(p.stdout.strip().split('\n')[-1])
    return sum(res), len(res)

async def ask(model, prompt, level=None):
    cfg = types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level=level)) if level else None
    t0 = time.time()
    kw = {'model': model, 'contents': prompt}
    if cfg: kw['config'] = cfg
    r = await client.aio.models.generate_content(**kw)
    return (r.text, round(time.time()-t0, 2),
            getattr(r.usage_metadata, 'thoughts_token_count', None) or 0)

rows = []
configs = [('gemini-3.1-flash-lite', None),
           ('gemini-3.5-flash', 'minimal'),
           ('gemini-3.5-flash', 'low'),
           ('gemini-3.5-flash', 'medium'),
           ('gemini-3.5-flash', 'high')]
for task in TASKS:
    rs = await asyncio.gather(*[ask(m, task['prompt'], l) for m, l in configs])
    for (m, l), (text, dt, th) in zip(configs, rs):
        p, t = run_tests(extract(text), task['name'], task['tests'])
        rows.append({'task': task['name'],
                     'label': 'lite' if 'lite' in m else f'flash/{l}',
                     'dt_s': dt, 'thoughts': th, 'pass': f'{p}/{t}'})
pd.DataFrame(rows)

Results from my run (single sample per cell, so noisy on close calls):

spiral_order (everyone passed all 6 tests):

label dt (s) thoughts
lite 1.88 0
flash/minimal 1.99 0
flash/low 4.11 408
flash/medium 7.29 1,241
flash/high 11.44 2,192

word_break:

label dt (s) thoughts pass
lite 1.16 0 5/5
flash/minimal 1.98 0 5/5
flash/low 3.40 298 5/5
flash/medium 8.43 1,619 4/5
flash/high 6.57 1,050 5/5

The flash/medium regression on word_break (4/5 vs 5/5 for both flanking levels) is the kind of thing that worries me. On a single sample it could be noise — medium thoughts overshot and the model produced an over-engineered solution that broke on the aaaaaaa case. With n=1 I can’t claim a real regression, but it does demonstrate that more thinking ≠ better and that the bug-shaped middle of the thinking range is real enough to land in a single trial.

What I’d say with conviction:

  • For these size-of-leetcode-easy problems, minimal is enough and is the cheapest defensible default.
  • The medium default is calibrated for harder tasks (Terminal-Bench, multi-file refactors). On toy puzzles it spends 8-10x more compute for the same answer.

Agentic tool use: same accuracy, lower latency

Same harness, slightly more interesting task: give the model two Python functions over an in-memory employee DB, ask it to compute total engineering-department salary. The optimal trace is list_employee_ids() + 3 get_employee() calls for the 3 eng employees. I logged every call.

Code
EMPLOYEES = {
    'E101': {'name': 'Asha Rao',   'dept': 'eng',     'salary': 95000},
    'E102': {'name': 'Ben Kumar',  'dept': 'eng',     'salary': 105000},
    'E103': {'name': 'Cara Singh', 'dept': 'sales',   'salary': 80000},
    'E104': {'name': 'Devi Iyer',  'dept': 'eng',     'salary': 120000},
    'E105': {'name': 'Eli Bose',   'dept': 'support', 'salary': 60000},
}
LOG = []
def list_employee_ids() -> dict:
    """List all employee IDs in the database."""
    LOG.append('list_employee_ids')
    return {'ids': list(EMPLOYEES.keys())}

def get_employee(employee_id: str) -> dict:
    """Look up an employee by their ID. Returns name, department, and salary."""
    LOG.append(f'get_employee({employee_id})')
    return EMPLOYEES.get(employee_id, {'error': 'not found'})

PROMPT = ("Using the tools, find the total salary spend across the engineering "
          "department. Report just the total amount in USD.")

async def run_agent(model, level=None):
    LOG.clear()
    cfg_kw = {'tools': [list_employee_ids, get_employee]}
    if level is not None:
        cfg_kw['thinking_config'] = types.ThinkingConfig(thinking_level=level)
    cfg = types.GenerateContentConfig(**cfg_kw)
    t0 = time.time()
    r = await client.aio.models.generate_content(
        model=model, contents=PROMPT, config=cfg)
    return dict(model=f'{model}/{level or "-"}',
                answer=r.text.strip()[:20],
                n_calls=len(LOG), trace=list(LOG),
                dt=round(time.time()-t0, 2))

for spec in [('gemini-3.1-flash-lite', None),
             ('gemini-3.1-pro-preview', None),
             ('gemini-3.5-flash', 'minimal'),
             ('gemini-3.5-flash', 'low'),
             ('gemini-3.5-flash', 'medium')]:
    print(await run_agent(*spec))
model answer n_calls dt (s)
gemini-3.1-flash-lite 320000 6 6.98
gemini-3.1-pro-preview $320,000 6 9.34
gemini-3.5-flash/minimal $320,000 6 3.05
gemini-3.5-flash/low $320,000 6 4.63
gemini-3.5-flash/medium $320,000 6 4.88

Two things to notice:

  • All five models get the right answer ($95K + $105K + $120K = $320K), but none of them find the optimal trace. Every model calls get_employee for all 5 employees instead of filtering on dept == 'eng' after the list. That’s not a 3.5 vs 3.1 distinction — it’s a property of how these models do tool use (“fetch everything, reason later”) that hasn’t changed.
  • 3.5-flash/minimal is ~2.3x faster than 3.1-flash-lite here. That’s the surprising data point: at the cheapest thinking setting, 3.5 Flash is faster than 3.1 Flash Lite on a tool-using task, even though Lite is the smaller architecture. My guess is the per-step latency of the tool-call loop has been reduced in 3.5 (fewer round-trips, smarter batching). I can’t verify this from the outside, but the wall-clock is real.

Putting it together: when to reach for 3.5 Flash

Based on the runs above, here’s the policy I’m adopting on my own projects:

use case model thinking_level
chatbot/freeform generation gemini-3.5-flash minimal
classification / extraction gemini-3.5-flash minimal
code-edit suggestions gemini-3.5-flash low
agentic tool use gemini-3.5-flash low
novel reasoning / math gemini-3.5-flash medium (default)
my old 3.1-pro-preview slot gemini-3.5-flash high, falls back to 3.1-pro if scores drop

The two things to remember:

  1. medium is the new default and it costs you ~1,500 reasoning tokens per call even for easy prompts. If you were on 3.1-flash-lite, the drop-in is 3.5-flash/minimal, not the bare 3.5-flash.
  2. TTFT under default settings is multi-second. If you’re streaming to a UI, drop the level or pre-warm with a probe call.

Gemini Omni — what I would have measured

Gemini Omni was announced at I/O as a unified multimodal model: text + image + audio + video in, and (eventually) text + image + audio + video out. The first product surface is video generation — submit images, audio clips, a script, and get a video that respects all of them.

Per the I/O announcement and follow-up posts (Decrypt coverage, 9to5Google liveblog):

  • Where it lives today: Gemini app + Google Flow for AI Plus / Pro / Ultra subscribers. Not on the public Gemini API as of 2026-05-20.
  • Output format: ~10-second video clips, 1280x720, async generation (submit, poll).
  • Input modalities: text, images, audio, video — combined in one prompt.
  • Notable hold-back: Google explicitly delayed the riskiest feature — generating a video of a specific named person from a single reference photo — citing safety review.

I tried the obvious model IDs against the developer API:

for m in ['gemini-omni', 'gemini-omni-preview', 'gemini-omni-flash',
          'gemini-3.5-omni', 'gemini-3-omni']:
    try:
        client.models.generate_content(model=m, contents='ping')
    except Exception as e:
        print(f'{m}: {str(e).splitlines()[0][:80]}')
gemini-omni: 404 NOT_FOUND. models/gemini-omni is not found for API version v1beta
gemini-omni-preview: 404 NOT_FOUND.
gemini-omni-flash: 404 NOT_FOUND.
gemini-3.5-omni: 404 NOT_FOUND.
gemini-3-omni: 404 NOT_FOUND.

Every plausible model ID returns 404. The closest available endpoints in client.models.list() are:

  • veo-3.1-generate-preview / veo-3.1-fast-generate-preview — the existing text-and-image-to-video model
  • gemini-3.1-flash-live-preview — the existing real-time multimodal streaming model (audio + video in, audio out)

Neither is Omni. Veo doesn’t take audio input or accept reasoning instructions; the live preview doesn’t generate video.

If/when Omni hits the API, the experiments I’d run — and that I’ll come back and add to this post — are:

  1. Audio-grounded video. Upload a 5-second voice memo + a single still photo; ask Omni to animate the photo such that the lip-sync matches the audio. This is the test that distinguishes Omni from Veo (text→video) and from the existing live preview (audio in, audio out).
  2. Style transfer with a reference image. Provide a photo of a teaching whiteboard + a 30-second explanation script + a “make this a Khan Academy style chalkboard walkthrough” instruction. Measure whether the output respects all three inputs or just the script.
  3. Edit-by-instruction. Submit a 10-second clip + a textual edit (“remove the person in the background, brighten the foreground”). This is the test that justifies the “world understanding” framing.
  4. Round-trip latency. End-to-end time from API submit to first video frame, at the documented “10-second clip” length. Anything under 30s is genuinely new; the 1-3 minute number from the API leaks would put Omni in the same UX bracket as today’s Veo.

For now, the honest summary is: Omni is a consumer release. Treat all “Omni does X” claims about API behaviour, pricing, and quality with skepticism until Google actually publishes the model card and pricing. I’ll update this post when the API rolls out.

Takeaways

  • gemini-3.5-flash is a real upgrade over 3.1-flash-lite for any task that benefits from light reasoning, but the new medium default makes it slower than 3.1-flash-lite on prompts that don’t need thinking. Set thinking_level="minimal" unless you have a reason not to.
  • The thinking_level enum is more honest than the old integer budget — it makes “this prompt didn’t need thinking” representable instead of guessable. But because the default moved from high (3.1) to medium (3.5), your old code will now spend ~1.5x more thinking tokens on average without warning.
  • Streaming TTFT is the metric that bit me. At default settings, 3.5 Flash spends 5+ seconds reasoning before emitting the first token. For chat UIs this matters more than total throughput.
  • Long context (524K tokens) works the same as on 3.1. No measurable regression on single-needle retrieval.
  • Omni is announced, not released (for developers). The Gemini app and Flow have it; the API doesn’t. Plan accordingly.

Reproducing this

Everything above runs against the public Gemini API with google-genai ≥ 1.56.0 and a single env var (GEMINI_API_KEY). The five probes are small enough to drop into one notebook — total token spend for all the measurements in this post was a few hundred thousand input tokens (mostly the haystack) plus a few thousand output tokens, well under $1 at posted pricing.