Gemini 3.1 Flash TTS: Audio Tags, English and Hindi in One Notebook

Google’s new Gemini 3.1 Flash TTS adds natural-language ‘audio tags’ like [excitement] or [like dracula] that steer voice style inline. I run it in English and Hindi and embed the results.
LLM
Gemini
TTS
audio
Hindi
multilingual
audio-tags
Author

Nipun Batra

Published

April 16, 2026

Google shipped Gemini 3.1 Flash TTS today, their most expressive and controllable text-to-speech model to date. The headline feature is audio tags: bracketed natural-language directions like [excitement], [whispers], [like dracula] or [storyteller, slow] that you embed inline with the text to steer delivery. Style needs no separate configuration schema; only the voice itself is still picked through a voice config.

It supports 70+ languages, with 24 evaluated as high-quality — including Hindi, Japanese and Arabic. This post runs the model end-to-end against the public Gemini API and embeds the resulting audio for English, Hindi and a code-mixed Hinglish prompt.

Setup

Code
# One-time install
# !uv pip install -U google-genai
Code
import os
import wave
from pathlib import Path
from IPython.display import Audio, display, Markdown
from google import genai
from google.genai import types

# Read the key explicitly: a missing GEMINI_API_KEY raises KeyError here rather than failing later.
client = genai.Client(api_key=os.environ['GEMINI_API_KEY'])
MODEL = "gemini-3.1-flash-tts-preview"
OUT = Path("gemini_tts")
OUT.mkdir(exist_ok=True)  # directory for the generated WAV files
print("SDK ready. Using", MODEL)
Both GOOGLE_API_KEY and GEMINI_API_KEY are set. Using GOOGLE_API_KEY.
SDK ready. Using gemini-3.1-flash-tts-preview

A minimal synthesis helper

The TTS endpoint is a normal generateContent call with response_modalities=["AUDIO"] and a voice config. The model returns raw 16-bit little-endian PCM (audio/l16) at 24 kHz, so we wrap it in a WAV header to keep browsers happy.

Code
def synth(text, out_path, voice="Kore"):
    """Synthesize `text` with `voice` and save as a 24 kHz mono WAV."""
    resp = client.models.generate_content(
        model=MODEL,
        contents=text,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name=voice)
                )
            ),
        ),
    )
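    # The audio arrives as a single inline-data part holding raw PCM bytes.
    part = resp.candidates[0].content.parts[0]
    data = part.inline_data.data
    # The MIME type carries the sample rate as a parameter (e.g. "...;rate=24000");
    # fall back to 24 kHz if it is absent.
    mime = part.inline_data.mime_type or ""
    rates = [p.split("=", 1)[1] for p in mime.split(";") if p.strip().startswith("rate=")]
    rate = int(rates[0]) if rates else 24000
    with wave.open(str(out_path), "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(rate)
        wf.writeframes(data)
    return out_path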
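
The cells below comment out their synth calls because the audio is pre-generated. One way to get the same effect without commenting anything out is a small wrapper that only calls the API when the file is missing. This is a sketch, not what the notebook ran: synth_cached is a hypothetical name, and the synthesis function is passed in as a parameter (assumed to share the signature of synth above) so the caching logic can be exercised without an API key.

```python
from pathlib import Path
from typing import Callable


def synth_cached(text: str, out_path: Path, synth_fn: Callable[..., Path],
                 voice: str = "Kore", force: bool = False) -> Path:
    """Return `out_path`, calling `synth_fn` only if the file is missing or empty.

    `synth_fn` is assumed to behave like `synth` above:
    synth_fn(text, out_path, voice=...) writes a WAV and returns its path.
    """
    out_path = Path(out_path)
    if not force and out_path.exists() and out_path.stat().st_size > 0:
        return out_path  # cache hit: no API call is made
    out_path.parent.mkdir(parents=True, exist_ok=True)
    return synth_fn(text, out_path, voice=voice)
```

A call would look like synth_cached(text, OUT / 'en_plain.wav', synth). The cache key is the file path alone, so editing the prompt without renaming the file returns stale audio unless force=True is passed.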

English — no tags vs audio tags

The tag-free version is a neutral read. Swap in [excitedly] and [whispers] and the same sentence acquires pacing, amplitude changes, and a conspiratorial drop on the second clause.

1. Plain English

Code
text = 'Today Google launched Gemini 3.1 Flash TTS, their most expressive and controllable text to speech model yet.'
# synth(text, OUT / 'en_plain.wav', voice='Kore')   # pre-generated; cached in repo
display(Markdown(f'**Voice:** `Kore`  \n**Prompt:** `{text}`'))
Audio('gemini_tts/en_plain.wav')

Voice: Kore
Prompt: Today Google launched Gemini 3.1 Flash TTS, their most expressive and controllable text to speech model yet.

2. English with [excitedly] and [whispers]

Code
text = '[excitedly] Today Google launched Gemini 3.1 Flash TTS! [whispers] Their most expressive and controllable text to speech model, yet.'
# synth(text, OUT / 'en_tagged.wav', voice='Kore')   # pre-generated; cached in repo
display(Markdown(f'**Voice:** `Kore`  \n**Prompt:** `{text}`'))
Audio('gemini_tts/en_tagged.wav')

Voice: Kore
Prompt: [excitedly] Today Google launched Gemini 3.1 Flash TTS! [whispers] Their most expressive and controllable text to speech model, yet.

3. [like dracula] persona tag

Code
text = '[like dracula] I vant to tell you about audio tags. [laughs] They are, how you say, deliciously flexible.'
# synth(text, OUT / 'en_dracula.wav', voice='Puck')   # pre-generated; cached in repo
display(Markdown(f'**Voice:** `Puck`  \n**Prompt:** `{text}`'))
Audio('gemini_tts/en_dracula.wav')

Voice: Puck
Prompt: [like dracula] I vant to tell you about audio tags. [laughs] They are, how you say, deliciously flexible.

The [like dracula] tag is worth dwelling on: it is not a fixed enum value, it is a free-text style instruction. The model parses the bracketed phrase as a direction and colours the delivery accordingly. This is the same shape as DALL·E-style textual style control, transplanted onto speech.
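
Because a tag is just bracketed text inside the prompt, prompts can be assembled and inspected with ordinary string handling. The sketch below uses two hypothetical helpers, tagged and split_tags, which are not part of the google-genai SDK. The regex assumes tags never contain nested brackets.

```python
import re

# A tag is any run of non-bracket characters between [ and ].
TAG_RE = re.compile(r"\[([^\[\]]+)\]")


def tagged(*segments: tuple[str, str]) -> str:
    """Build a prompt from (direction, text) pairs; an empty direction adds no tag."""
    parts = []
    for direction, text in segments:
        parts.append(f"[{direction}] {text}" if direction else text)
    return " ".join(parts)


def split_tags(prompt: str) -> tuple[list[str], str]:
    """Return (directions, plain_text) for a tagged prompt."""
    directions = TAG_RE.findall(prompt)
    # Remove the tags, then collapse the doubled spaces they leave behind.
    plain = re.sub(r"\s{2,}", " ", TAG_RE.sub("", prompt)).strip()
    return directions, plain


prompt = tagged(
    ("like dracula", "I vant to tell you about audio tags."),
    ("laughs", "They are, how you say, deliciously flexible."),
)
print(prompt)
print(split_tags(prompt))
```

The first print reproduces the Dracula prompt used above. The second separates the two directions from the untagged text, which is handy for logging which directions a clip was generated with.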

Hindi — same tags, different language

Hindi is one of the 24 languages evaluated as high-quality. The audio tags travel: they are written in English (the control surface) but steer the delivery of Devanagari text.

4. Plain Hindi

Code
text = 'नमस्ते! आज गूगल ने जेमिनाई तीन पॉइंट वन फ़्लैश टी-टी-एस लॉन्च किया है।'
# synth(text, OUT / 'hi_plain.wav', voice='Kore')   # pre-generated; cached in repo
display(Markdown(f'**Voice:** `Kore`  \n**Prompt:** `{text}`'))
Audio('gemini_tts/hi_plain.wav')

Voice: Kore
Prompt: नमस्ते! आज गूगल ने जेमिनाई तीन पॉइंट वन फ़्लैश टी-टी-एस लॉन्च किया है।

5. Hindi with [excited] and [whispers]

Code
text = '[excited] नमस्ते! आज गूगल ने जेमिनाई तीन पॉइंट वन फ़्लैश टी-टी-एस लॉन्च किया है। [whispers] यह उनका अब तक का सबसे अभिव्यंजक मॉडल है।'
# synth(text, OUT / 'hi_tagged.wav', voice='Kore')   # pre-generated; cached in repo
display(Markdown(f'**Voice:** `Kore`  \n**Prompt:** `{text}`'))
Audio('gemini_tts/hi_tagged.wav')

Voice: Kore
Prompt: [excited] नमस्ते! आज गूगल ने जेमिनाई तीन पॉइंट वन फ़्लैश टी-टी-एस लॉन्च किया है। [whispers] यह उनका अब तक का सबसे अभिव्यंजक मॉडल है।

6. Hindi [storyteller, slow] + [mysterious]

Code
text = '[storyteller, slow] एक समय की बात है, एक छोटे से गाँव में एक बुद्धिमान उल्लू रहता था। [mysterious] लोग कहते थे कि वह सपनों की भाषा जानता था।'
# synth(text, OUT / 'hi_story.wav', voice='Charon')   # pre-generated; cached in repo
display(Markdown(f'**Voice:** `Charon`  \n**Prompt:** `{text}`'))
Audio('gemini_tts/hi_story.wav')

Voice: Charon
Prompt: [storyteller, slow] एक समय की बात है, एक छोटे से गाँव में एक बुद्धिमान उल्लू रहता था। [mysterious] लोग कहते थे कि वह सपनों की भाषा जानता था।

Bonus: Hinglish code-mix

The more useful Indian-language test is code-mix: Hindi-English in a single utterance, which is how people actually talk. The model handles it in one pass — no per-segment splitting.
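
For contrast, here is roughly what per-segment splitting would involve in a pipeline that needed it. This is an illustrative sketch only: script_of and script_runs are hypothetical helpers, and the split is by Unicode script (the Devanagari block U+0900 to U+097F versus ASCII letters), which is the simplest rule one could use.

```python
from itertools import groupby


def script_of(ch: str) -> str:
    """Classify a character as 'deva' (Devanagari block), 'latin', or 'other'."""
    if "\u0900" <= ch <= "\u097f":
        return "deva"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return "other"


def word_script(word: str) -> str:
    """Script of a word, ignoring punctuation; 'other' if mixed or undetermined."""
    scripts = {script_of(c) for c in word} - {"other"}
    return scripts.pop() if len(scripts) == 1 else "other"


def script_runs(text: str) -> list[tuple[str, str]]:
    """Group whitespace-separated words into consecutive runs sharing a script."""
    return [(script, " ".join(group))
            for script, group in groupby(text.split(), key=word_script)]


# A fragment of the Hinglish prompt used in the next cell.
print(script_runs("audio tags kitne flexible hain? Pretty wild, ना?"))
```

The fragment splits into one Latin run and one Devanagari run, and the Latin run contains "kitne" and "hain", which are romanised Hindi. A script-based splitter would hand those words to an English voice, which is one reason single-pass handling matters for code-mixed text.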

7. Hinglish with [conversational] and [impressed]

Code
text = '[conversational] So basically, मैं सोच रहा था — audio tags kitne flexible hain? [impressed] Pretty wild, ना?'
# synth(text, OUT / 'codemix.wav', voice='Kore')   # pre-generated; cached in repo
display(Markdown(f'**Voice:** `Kore`  \n**Prompt:** `{text}`'))
Audio('gemini_tts/codemix.wav')

Voice: Kore
Prompt: [conversational] So basically, मैं सोच रहा था — audio tags kitne flexible hain? [impressed] Pretty wild, ना?

What surprised me

  • Tags compose mid-sentence. [whispers] attached to a trailing clause produced an actual drop in amplitude, not just a pause. The tag is not a global style knob, it is a phrase-level instruction.
  • Persona tags generalise. [like dracula] and [storyteller, slow] are not on any enum — the model treats the bracketed phrase as a free-text style. That makes the control surface indefinitely extensible.
  • Hindi tag handling works with English tag names. Using [excited] inside a Devanagari sentence steered delivery correctly without translating the tag.
  • Code-mix is handled as one utterance. I heard no phoneme-level artefacts at the language boundary in the Hinglish sample.
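
The amplitude observation in the first bullet can be checked numerically rather than by ear. The sketch below computes the RMS level of consecutive fixed-length windows of a 16-bit mono WAV using only the standard library. rms_windows is a hypothetical helper and the 0.25 s window is an arbitrary choice. No measurements from the clips above are reported here.

```python
import array
import math
import sys
import wave


def rms_windows(path: str, window_s: float = 0.25) -> list[float]:
    """Return RMS amplitude (0..1 scale) for consecutive windows of a 16-bit mono WAV."""
    with wave.open(str(path), "rb") as wf:
        if wf.getsampwidth() != 2 or wf.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        rate = wf.getframerate()
        samples = array.array("h", wf.readframes(wf.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    step = max(1, int(rate * window_s))
    levels = []
    for start in range(0, len(samples), step):
        chunk = samples[start:start + step]
        mean_square = sum(s * s for s in chunk) / len(chunk)
        levels.append(math.sqrt(mean_square) / 32768.0)
    return levels
```

Running rms_windows('gemini_tts/en_tagged.wav') and comparing the windows before and after the [whispers] tag would show whether the level actually drops or the delivery merely pauses.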

Practical notes

  • Model id: gemini-3.1-flash-tts-preview (as of 2026-04-16).
  • Output: 24 kHz 16-bit mono PCM (audio/l16). Wrap in a WAV header for playback.
  • Voices I used above: Kore, Puck, Charon — the prebuilt voice pool is shared with the 2.5 TTS models.
  • Tag syntax is [free-text direction]. Combine with commas ([storyteller, slow]). Place inline, per phrase.
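
The output format above also fixes the arithmetic for clip length: 24,000 samples per second at 2 bytes per sample is 48,000 bytes per second of mono audio. A minimal check of that arithmetic and of a saved file's header, using the hypothetical helper names pcm_seconds and wav_info:

```python
import wave


def pcm_seconds(n_bytes: int, rate: int = 24000,
                sample_width: int = 2, channels: int = 1) -> float:
    """Duration in seconds of raw PCM data of the given size."""
    return n_bytes / (rate * sample_width * channels)


def wav_info(path: str) -> tuple[int, int, int, float]:
    """Return (channels, sample_width_bytes, rate, seconds) from a WAV header."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        return wf.getnchannels(), wf.getsampwidth(), rate, wf.getnframes() / rate


print(pcm_seconds(48_000))   # 1.0
print(pcm_seconds(240_000))  # 5.0
```

For a file written by synth, wav_info should report 1 channel and a 2-byte sample width, with the rate taken from the response's MIME type.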