```python
# One-time install
# !uv pip install -U google-genai
```

Nipun Batra
April 16, 2026
Google shipped Gemini 3.1 Flash TTS today — their most expressive and controllable text-to-speech model to date. The headline feature is audio tags: bracketed natural-language directions like [excitement], [whispers], [like dracula] or [storyteller, slow] that you embed inline with the text to steer delivery without a separate voice-config schema.
It supports 70+ languages, with 24 evaluated as high-quality — including Hindi, Japanese and Arabic. This post runs the model end-to-end against the public Gemini API and embeds the resulting audio for English, Hindi and a code-mixed Hinglish prompt.
```python
import os, wave
from pathlib import Path
from IPython.display import Audio, display, Markdown
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ['GEMINI_API_KEY'])
MODEL = "gemini-3.1-flash-tts-preview"
OUT = Path("gemini_tts"); OUT.mkdir(exist_ok=True)
print("SDK ready. Using", MODEL)
```

```
Both GOOGLE_API_KEY and GEMINI_API_KEY are set. Using GOOGLE_API_KEY.
SDK ready. Using gemini-3.1-flash-tts-preview
```
The TTS endpoint is a normal generateContent call with response_modalities=["AUDIO"] and a voice config. The model returns raw 16-bit little-endian PCM (audio/l16) at 24 kHz, so we wrap it in a WAV header to keep browsers happy.
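The wrapping step is worth seeing in isolation before the full helper. This is a minimal sketch, assuming only what the post states: mono 16-bit PCM, and a mime string that carries a `rate=` token in the shape of `audio/l16;codec=pcm;rate=24000`.

```python
import io
import wave


def pcm16_to_wav(pcm: bytes, mime: str = "audio/L16;codec=pcm;rate=24000") -> bytes:
    """Wrap raw 16-bit little-endian mono PCM bytes in a WAV container."""
    # Pull the sample rate out of the mime type; default to 24 kHz if absent.
    rate = 24000
    for token in mime.split(";"):
        if token.strip().startswith("rate="):
            rate = int(token.split("=")[1])
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wf.setframerate(rate)
        wf.writeframes(pcm)
    return buf.getvalue()
```

The same three `wave` calls appear inside `synth` below; the only moving part is the sample rate parsed out of the mime type.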
```python
def synth(text, out_path, voice="Kore"):
    """Synthesize `text` with `voice` and save as a 24 kHz mono WAV."""
    resp = client.models.generate_content(
        model=MODEL,
        contents=text,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name=voice)
                )
            ),
        ),
    )
    part = resp.candidates[0].content.parts[0]
    data = part.inline_data.data
    # mime_type looks like "audio/L16;codec=pcm;rate=24000" — extract the rate
    rate = int([t.split("=")[1] for t in part.inline_data.mime_type.split(";") if "rate=" in t][0])
    with wave.open(str(out_path), "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit samples
        wf.setframerate(rate)
        wf.writeframes(data)
    return out_path
```

The tag-free version is a neutral read. Swap in [excitedly] and [whispers] and the same sentence acquires pacing, amplitude changes, and a conspiratorial drop on the second clause.
Voice: Kore
Prompt: Today Google launched Gemini 3.1 Flash TTS, their most expressive and controllable text to speech model yet.
[excitedly] and [whispers]

```python
text = '[excitedly] Today Google launched Gemini 3.1 Flash TTS! [whispers] Their most expressive and controllable text to speech model, yet.'
# synth(text, OUT / 'en_tagged.wav', voice='Kore')  # pre-generated; cached in repo
display(Markdown(f'**Voice:** `Kore` \n**Prompt:** `{text}`'))
Audio('gemini_tts/en_tagged.wav')
```

Voice: Kore
Prompt: [excitedly] Today Google launched Gemini 3.1 Flash TTS! [whispers] Their most expressive and controllable text to speech model, yet.
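The amplitude claim is measurable, not just audible. A stdlib-only sketch compares RMS energy between the first and second halves of a clip; the midpoint split is a rough stand-in for the clause boundary, good enough to confirm a level drop.

```python
import struct
import wave


def rms(frames: bytes) -> float:
    """RMS level of 16-bit little-endian mono PCM frames."""
    n = len(frames) // 2
    samples = struct.unpack("<%dh" % n, frames[: 2 * n])
    return (sum(s * s for s in samples) / max(n, 1)) ** 0.5


def halves_rms(path: str) -> tuple:
    """Return (RMS of first half, RMS of second half) of a mono WAV file."""
    with wave.open(path, "rb") as wf:
        frames = wf.readframes(wf.getnframes())
    mid = (len(frames) // 4) * 2  # halfway point, kept on a sample boundary
    return rms(frames[:mid]), rms(frames[mid:])
```

On `gemini_tts/en_tagged.wav` the second number should come out well below the first; exact values depend on the generation.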
[like dracula] persona tag

Voice: Puck
Prompt: [like dracula] I vant to tell you about audio tags. [laughs] They are, how you say, deliciously flexible.
The [like dracula] tag is worth dwelling on: it is not a fixed enum value, it is a free-text style instruction. The model parses the bracketed phrase as a direction and colours the delivery accordingly. This is the same shape as DALL·E-style textual style control, transplanted onto speech.
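Because the bracket contents are free text, tags compose mechanically. A tiny helper (hypothetical, not part of the SDK) keeps the convention in one place instead of hand-formatting brackets per prompt:

```python
def tagged(text: str, *directions: str) -> str:
    """Prefix `text` with an inline audio tag built from free-text directions.

    Multiple directions join with commas, matching the [storyteller, slow]
    pattern used in this post.
    """
    if not directions:
        return text
    return "[{}] {}".format(", ".join(directions), text)
```

So `tagged("Once upon a time...", "storyteller", "slow")` yields `"[storyteller, slow] Once upon a time..."`, ready to pass straight to `synth`.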
Hindi is one of the 24 high-quality evaluated languages. The audio tags travel: they are written in English (the control surface) but steer delivery of Devanagari text.
Voice: Kore
Prompt: नमस्ते! आज गूगल ने जेमिनाई तीन पॉइंट वन फ़्लैश टी-टी-एस लॉन्च किया है।
[excited] and [whispers]

```python
text = '[excited] नमस्ते! आज गूगल ने जेमिनाई तीन पॉइंट वन फ़्लैश टी-टी-एस लॉन्च किया है। [whispers] यह उनका अब तक का सबसे अभिव्यंजक मॉडल है।'
# synth(text, OUT / 'hi_tagged.wav', voice='Kore')  # pre-generated; cached in repo
display(Markdown(f'**Voice:** `Kore` \n**Prompt:** `{text}`'))
Audio('gemini_tts/hi_tagged.wav')
```

Voice: Kore
Prompt: [excited] नमस्ते! आज गूगल ने जेमिनाई तीन पॉइंट वन फ़्लैश टी-टी-एस लॉन्च किया है। [whispers] यह उनका अब तक का सबसे अभिव्यंजक मॉडल है।
[storyteller, slow] + [mysterious]

```python
text = '[storyteller, slow] एक समय की बात है, एक छोटे से गाँव में एक बुद्धिमान उल्लू रहता था। [mysterious] लोग कहते थे कि वह सपनों की भाषा जानता था।'
# synth(text, OUT / 'hi_story.wav', voice='Charon')  # pre-generated; cached in repo
display(Markdown(f'**Voice:** `Charon` \n**Prompt:** `{text}`'))
Audio('gemini_tts/hi_story.wav')
```

Voice: Charon
Prompt: [storyteller, slow] एक समय की बात है, एक छोटे से गाँव में एक बुद्धिमान उल्लू रहता था। [mysterious] लोग कहते थे कि वह सपनों की भाषा जानता था।
The more useful Indian-language test is code-mixing: Hindi and English in a single utterance, which is how people actually talk. The model handles it in one pass — no per-segment splitting.
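For contrast, here is the step a per-segment pipeline would have needed first: splitting the utterance into script runs (Devanagari occupies U+0900 to U+097F; everything else is lumped together in this sketch).

```python
import re

# Matches maximal runs of Devanagari characters.
DEVANAGARI = re.compile(r"[\u0900-\u097F]+")


def script_runs(text: str) -> list:
    """Split code-mixed text into ('devanagari' | 'other', run) pairs."""
    runs, pos = [], 0
    for m in DEVANAGARI.finditer(text):
        if m.start() > pos:
            runs.append(("other", text[pos:m.start()]))
        runs.append(("devanagari", m.group()))
        pos = m.end()
    if pos < len(text):
        runs.append(("other", text[pos:]))
    return runs
```

With Gemini TTS this stage disappears entirely: the mixed string goes out in one call, tags and all.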
[conversational] and [impressed]

Voice: Kore
Prompt: [conversational] So basically, मैं सोच रहा था — audio tags kitne flexible hain? [impressed] Pretty wild, ना?
- [whispers] attached to a trailing clause produced an actual drop in amplitude, not just a pause. The tag is not a global style knob, it is a phrase-level instruction.
- [like dracula] and [storyteller, slow] are not on any enum — the model treats the bracketed phrase as a free-text style. That makes the control surface indefinitely extensible.
- [excited] inside a Devanagari sentence steered delivery correctly without translating the tag.
- Model: gemini-3.1-flash-tts-preview (as of 2026-04-16).
- Output: raw 16-bit little-endian PCM (audio/l16). Wrap in a WAV header for playback.
- Voices: Kore, Puck, Charon — the prebuilt voice pool is shared with the 2.5 TTS models.
- Tag syntax: [free-text direction]. Combine with commas ([storyteller, slow]). Place inline, per phrase.
- Links: google-genai · Gemini TTS docs