Gemini 3.5 Flash shipped May 19, 2026 at Google I/O. It runs everywhere Google sells AI — Gemini app, Search AI Mode, AI Studio, Antigravity 2.0, Vertex AI, Workspace — and it's the default model for nearly a billion monthly active users. It beats Gemini 3.1 Pro on coding and agentic benchmarks at roughly 4× the speed and 25% lower cost. The headline announcement was supposed to be 3.5 Pro, which was delayed to June — the audience reportedly groaned. This masterclass covers what Flash actually does, where it ships, and the 20 prompts that get the most out of it.
Section 01
What Actually Shipped at I/O 2026
Google's I/O 2026 keynote was unusually focused. Instead of dropping a dozen models across tiers, Sundar Pichai's team led with two announcements that share the same underlying model. Everything else — AI Mode in Search updates, Workspace AI features, the Intelligent Eyewear preview — orbited those two.
| What shipped | What it is | Where you find it |
|---|---|---|
| Gemini 3.5 Flash | The new flagship Flash-tier model. API ID gemini-3.5-flash. 1M token context, $1.50/$9.00 pricing, 4 thinking levels. | Gemini app, Search AI Mode, AI Studio, Android Studio, Antigravity 2.0, Vertex AI, Workspace, Gemini Enterprise. |
| Gemini Spark | 24/7 persistent personal AI agent powered by 3.5 Flash + the Antigravity harness. Runs on dedicated Google Cloud VMs. | Beta to Google AI Ultra subscribers in the US ($100/mo). |
| Antigravity 2.0 | Agent-first desktop IDE. Multi-agent orchestration, artifacts (versioned outputs), one-click export from AI Studio, Android integration. | Free download. Up to 12× faster than other surfaces when running 3.5 Flash, per Google. |
| Managed Agents API | Single API call gives you an agent + hosted Linux sandbox with Bash, Python, Node.js, file management, browsing, and markdown-defined skills. | Public preview in the Gemini API. Includes the antigravity-preview-05-2026 general-purpose agent. |
| Gemini Omni | Separate multimodal generation model — any output from any input, starting with video. | Different model from 3.5 Flash. Mentioned here so you don't confuse them. |
| Google Pics | Image generation and editing tool inside Workspace. | Gmail, Docs, Slides, Keep. |
Google's framing for the whole family: “frontier intelligence with action.” The action part is what matters. 3.5 Flash isn't just a faster model — it's designed to execute multi-step workflows under supervision, the way an employee would, rather than answer one question at a time.
3.5 Flash delivers intelligence that rivals large flagship models on multiple dimensions, at the speeds you have come to expect from the Flash series. It is our strongest agentic and coding model yet, outperforming Gemini 3.1 Pro on challenging coding and agentic benchmarks like Terminal-Bench 2.1, GDPval-AA, and MCP Atlas.
Google DeepMind, Gemini 3.5: frontier intelligence with action
Section 02
The 30-Second Demo (Proof of Speed)
Before the deep dive, one before-and-after that shows what “4× faster” actually means in practice. Same task, identical inputs, two different models.
The task: a vibe coder asks for a fully playable Pong implementation in HTML, single file, with paddle physics and a score counter.
Before: Gemini 3.1 Pro
Time to first token: ~2.4s
Time to complete (1,200 LOC): ~38s
Tokens/sec output: ~32
Output: complete, runs correctly, paddle physics work,
score counter functional. Reasoning depth visible in the
comments. Production-ready.
After: Gemini 3.5 Flash
Time to first token: ~0.9s
Time to complete (1,180 LOC): ~9s
Tokens/sec output: ~131
Output: complete, runs correctly, paddle physics work,
score counter functional. Slightly leaner comments
(default thinking_level is now medium, not high).
Production-ready.
In Antigravity 2.0 specifically: ~6s total. ~210 tok/sec.
That's the headline. Same quality, roughly 4× the output speed, and the price per million output tokens dropped from $12 to $9. The 12× figure in Antigravity is what Google quoted on stage — it includes IDE-side optimizations that aren't available when running the API through Cursor or VS Code.
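The arithmetic behind those headline figures is easy to check. A minimal sketch that uses only the numbers quoted in the demo above; the dictionary layout and function names are illustrative and not part of any Google tooling:

```python
# Figures copied from the demo above: the slower run first, the faster run second.
slow_run = {"tokens_per_sec": 32, "seconds_total": 38, "usd_per_m_output": 12.00}
fast_run = {"tokens_per_sec": 131, "seconds_total": 9, "usd_per_m_output": 9.00}


def percent_drop(before: float, after: float) -> float:
    """Percentage decrease going from `before` to `after`."""
    return (before - after) / before * 100


# Output rate: how many times more tokens per second the faster run produced.
throughput_gain = fast_run["tokens_per_sec"] / slow_run["tokens_per_sec"]
# Wall clock: how many times shorter the faster run was end to end.
wall_clock_gain = slow_run["seconds_total"] / fast_run["seconds_total"]
price_drop = percent_drop(slow_run["usd_per_m_output"], fast_run["usd_per_m_output"])

print(f"throughput: {throughput_gain:.1f}x")
print(f"wall clock: {wall_clock_gain:.1f}x")
print(f"output price: {price_drop:.0f}% lower")
```

Both ratios land slightly above 4, which matches the "roughly 4×" claim, and the output price drop is exactly 25%. By the same calculation, the ~210 tok/sec Antigravity run is about 6.6× the slower run's output rate.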
Why this matters for vibe coders
Speed at this magnitude changes what's economically rational. Running 20 subagents in parallel was an experiment with previous-generation models; with 3.5 Flash it's a normal Tuesday. The Bun-rewrite-style workflows (hundreds of agents, days of runtime) that Anthropic showcased with Opus 4.8 are equally tractable with 3.5 Flash, and at a third of the cost. This is the unlock.
Section 03
Did Google Deliver What They Promised?
Honest answer: partially.
The I/O 2026 keynote was built around the Gemini 3.5 family, and the headline was supposed to be Gemini 3.5 Pro. That's the model Google has been hinting at since Gemini 3.1 Pro shipped in February. Pichai's exact line on stage, reported by Business Insider: “I know you can't wait to get your hands on it. Give us until next month to get it to you.” The audience reportedly groaned. 3.5 Pro was delayed to June 2026.
What did ship is 3.5 Flash, which Google positioned as a frontier-grade Flash-tier model that beats the previous generation's Pro on most coding and agentic benchmarks. That is genuinely true. It's also a different story from what was promised.
Where Flash is a step forward
Coding and agentic work. 3.5 Flash beats 3.1 Pro on Terminal-Bench 2.1 (76.2% vs 70.3%), MCP Atlas (83.6% vs 78.2%), and Finance Agent v2 (57.9% vs 43.0%). These are the benchmarks vibe coders actually care about for production work.
Speed and price. Roughly 4× the output token rate, 25% cheaper input, 25% cheaper output. The cost-quality curve genuinely shifted in your favor.
Multimodal reasoning. 84.2% on CharXiv (visual reasoning across charts and figures), built on the strong multimodal foundation of Gemini 3. The model handles PDFs, video, and audio natively without conversion.
Tool use reliability. The 83.6% on MCP Atlas is the metric that matters for anyone deploying agents. It still means roughly 1 in 6 tool calls misfires under adversarial conditions, but that is an improvement on roughly 1 in 5 for 3.1 Pro (78.2%). Build retries and validation.
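"Build retries and validation" can be sketched in a model-agnostic way. The wrapper below is one possible shape, not a Gemini SDK feature: `call_with_retries`, `ToolCallFailed`, and the `lookup_order` tool are all hypothetical names invented for this illustration.

```python
import time


class ToolCallFailed(RuntimeError):
    """Raised when a tool call has no valid result after every attempt."""


def call_with_retries(tool, args, is_valid, max_attempts=3, backoff_seconds=0.0):
    """Call `tool(**args)` until `is_valid(result)` is true or attempts run out."""
    failures = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = tool(**args)
        except Exception as exc:  # a real agent would catch narrower error types
            failures.append(f"attempt {attempt}: raised {exc!r}")
        else:
            if is_valid(result):
                return result
            failures.append(f"attempt {attempt}: invalid result {result!r}")
        if attempt < max_attempts:
            time.sleep(backoff_seconds * attempt)  # linear backoff between tries
    # Fail loudly with the full history instead of returning a partial result.
    raise ToolCallFailed(f"{tool.__name__} failed: " + "; ".join(failures))


# Hypothetical flaky tool: returns an incomplete payload twice, then a usable one.
attempts_seen = []


def lookup_order(order_id):
    attempts_seen.append(order_id)
    if len(attempts_seen) < 3:
        return {"order_id": order_id}  # missing the "status" field
    return {"order_id": order_id, "status": "shipped"}


result = call_with_retries(lookup_order, {"order_id": "A17"}, lambda r: "status" in r)
print(result)
```

The validator is passed in separately so that a structurally wrong response counts as a failure in the same way an exception does.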
Where Flash regressed (and where 3.1 Pro still wins)
The masterclass has to be honest about this. Flash is not a clean replacement for Pro across the board.
Long-context retrieval at scale. 3.1 Pro still beats 3.5 Flash on MRCR v2 at 128K tokens. If your work is needle-in-haystack analysis across very long documents, stay on 3.1 Pro until 3.5 Pro ships.
Hard reasoning. 3.1 Pro still wins on Humanity's Last Exam and ARC-AGI-2. Multi-step graduate-level science and math reasoning hasn't crossed the Flash-tier threshold yet.
Multilingual nuance. The official Gemini 3.5 documentation notes that Gemini 3.1 Pro retains stronger multilingual performance for languages outside the top tier.
The honest take for your stack
Use 3.5 Flash as the default for coding loops, tool use, agentic workflows, and multimodal work. Keep 3.1 Pro available for long-context document analysis and hard reasoning. Plan to revisit when 3.5 Pro ships in June — it's likely going to close most of these gaps.
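That routing rule is small enough to encode directly. A minimal sketch, assuming you tag each request with a task type yourself; the task-type labels are invented for this example, and the 3.1 Pro model ID is a placeholder because this guide only quotes the API ID for 3.5 Flash:

```python
FLASH = "gemini-3.5-flash"       # API ID quoted earlier in this guide
PRO_FALLBACK = "gemini-3.1-pro"  # placeholder: check the real 3.1 Pro ID before use

# Hypothetical task labels mapped to the cases where 3.1 Pro still wins.
PRO_TASKS = {"long_context", "hard_reasoning"}


def pick_model(task_type: str) -> str:
    """Route long-context and hard-reasoning work to Pro, everything else to Flash."""
    return PRO_FALLBACK if task_type in PRO_TASKS else FLASH


for task in ("coding", "tool_use", "long_context", "hard_reasoning", "multimodal"):
    print(f"{task:>15} -> {pick_model(task)}")
```

Keeping the exceptions in one set means the June revisit is a one-line change if 3.5 Pro closes the gaps.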
Section 04
The thinking_level Dial You Now Have
Gemini 3.5 Flash introduces a thinking_level parameter that replaces the older thinking_budget approach. Four levels, set per request. This is the developer-side equivalent of Claude's effort slider — the single biggest cost-quality lever in the model.
The four levels
| Level | What Gemini does | When to pick it |
|---|---|---|
| minimal | Smallest reasoning budget. Fastest response. Lowest token spend. | High-volume background processing, simple classification, formatting tasks, lookups where you'll verify. |
| low | Some extended thinking on hard sub-steps. Cheaper than default. | Triage, summaries, routine analysis. Default for cost-sensitive agent loops. |
| medium (default) | Balanced reasoning. Google judges this the best overall balance for 3.5 Flash. | Most everyday work. This is the new default, lowered from high in the preview to medium at GA. |
| high | Deepest thinking. Highest token spend per response. | Hard coding tasks, complex multi-step reasoning, anything where the cost of being wrong exceeds the cost of more tokens. |
What changed in the migration from preview
Three things vibe coders carrying over from gemini-3-flash-preview need to know:
Default lowered from high to medium. If you were running the preview at default settings and getting acceptable quality, the GA version will run faster and cheaper — but may need an explicit bump back to high if quality drops on hard tasks. Test your prompts.
Thought preservation is now on by default. Reasoning context carries forward across turns automatically. This improves performance on multi-turn agentic loops, but may increase token usage. If you're running tight token budgets, set thought_preservation: false in your config.
Computer Use is not supported on 3.5 Flash at launch. If you were using browser-agent workflows on 3.1 Pro, you cannot port them directly to 3.5 Flash. Wait for 3.5 Pro or keep that subgraph on 3.1 Pro.
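The three migration points can be turned into a small pre-flight check. This is a sketch under stated assumptions: the flat config dictionary, the `computer_use` tool label, and the `migrate_preview_config` helper are hypothetical, and only the `thinking_level` and `thought_preservation` names come from the text above.

```python
def migrate_preview_config(old: dict, keep_thoughts: bool = True):
    """Return (new_config, notes) for a config carried over from the preview model."""
    new = dict(old)  # shallow copy so the caller's config is left untouched
    notes = []
    new["model"] = "gemini-3.5-flash"

    # 1. The default dropped from high to medium, so pin the old behavior explicitly.
    if "thinking_level" not in old:
        new["thinking_level"] = "high"
        notes.append("pinned thinking_level=high to match the preview default")

    # 2. Thought preservation is on by default; make the choice visible either way.
    new["thought_preservation"] = keep_thoughts
    if not keep_thoughts:
        notes.append("disabled thought_preservation to limit token usage")

    # 3. Computer Use is not available on 3.5 Flash at launch.
    tools = old.get("tools", [])
    if "computer_use" in tools:
        new["tools"] = [tool for tool in tools if tool != "computer_use"]
        notes.append("removed computer_use: keep that workflow on 3.1 Pro")

    return new, notes


config, notes = migrate_preview_config(
    {"model": "gemini-3-flash-preview", "tools": ["google_search", "computer_use"]},
    keep_thoughts=False,
)
print(config)
for note in notes:
    print("-", note)
```

Pinning high here is a conservative starting point that keeps behavior stable while you test; lowering it to medium afterwards is where the speed and cost savings come from.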
Where the slider lives
In the Gemini app: a Fast/Thinking toggle in the model picker. That's the only knob exposed to consumers.
In AI Studio: under the model config panel on the right, a dropdown for thinking_level with all four named options.
In Antigravity 2.0: in the agent settings, plus per-task overrides when you spawn subagents.
In the API: pass thinking_level: "high" (or any of the four) in the request config. Do not pass both thinking_level and the deprecated thinking_budget — the latter is being removed.
The single habit that matters
Pick the thinking_level deliberately, per task type. The default medium is rarely wrong, but the cost-quality math shifts dramatically when you tune it: minimal for high-volume background work, low for routine triage, medium as the default, high for hard reasoning. Decide once per task type and you get the quality where it counts while token spend stays controlled.
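One way to make that habit stick is to decide the level in a single place in your code. A minimal sketch, assuming you label requests by task type; the task labels and the `thinking_config` helper are hypothetical, and the returned dictionary is only the fragment you would merge into your request config:

```python
LEVELS = ("minimal", "low", "medium", "high")

# Hypothetical task labels mapped to the levels recommended in the table above.
LEVEL_BY_TASK = {
    "background_batch": "minimal",
    "classification": "minimal",
    "triage": "low",
    "summary": "low",
    "hard_coding": "high",
    "multi_step_reasoning": "high",
}


def thinking_config(task_type, overrides=None):
    """Pick a thinking_level for a task type, refusing the deprecated budget knob."""
    overrides = overrides or {}
    if "thinking_budget" in overrides:
        raise ValueError("thinking_budget is deprecated; set thinking_level instead")
    # Unknown task types fall back to the model default, medium.
    level = overrides.get("thinking_level", LEVEL_BY_TASK.get(task_type, "medium"))
    if level not in LEVELS:
        raise ValueError(f"unknown thinking_level {level!r}; expected one of {LEVELS}")
    return {"thinking_level": level}


print(thinking_config("triage"))
print(thinking_config("hard_coding"))
print(thinking_config("email_draft"))
```

Rejecting `thinking_budget` outright, rather than silently ignoring it, surfaces leftover preview-era settings the first time the code runs.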
Section 05
Vibe Coder Patterns in the Gemini App + Search AI Mode
Six paste-ready prompts for the most common Gemini app workflows. These are the free surfaces. The Gemini app and AI Mode in Search together reach over 900 million monthly active users globally — nearly all of whom are running 3.5 Flash without knowing the model name. The prompts below are equally useful for writers, analysts, lawyers, students, and developers who use Gemini as their daily driver.
1. Run a deep research thread in the Gemini app
Run a deep research thread on this topic.
OUTCOME: a structured briefing I can paste into a decision doc
covering what's known, what's contested, and what's worth
verifying before I act.
CONSTRAINTS:
- 600-900 words total
- Three sections: Current consensus · Where credible sources
disagree · What's still uncertain
- Cite specific sources for every concrete claim (publication +
date)
- Skip generic background ("X has been growing rapidly")
- Use Google Search grounding throughout
CONTEXT: the topic is [your topic]. My background level on this
is [novice / intermediate / expert]. The decision I'm trying to
make: [what you'll do with the answer].
TEST: a good briefing lets me walk into the decision tomorrow
with a defensible position. Every claim should be traceable to
a source I could check.
GATE: if the topic has changed materially in the last 90 days,
prioritize recent sources. If I'm asking about something
contested (politics, health, finance), surface the disagreement
explicitly rather than picking a side. Flag any claim where
you can't find a credible source.
Why this works: the “cite specific sources for every concrete claim” line activates Gemini's Search grounding behavior, which is the main quality lever for research in the Gemini app. Watch for: if you don't name your background level and the decision context, the briefing reads as generic. The personalization context is what makes the depth right.
2. Summarize a long PDF or video into a brief
Summarize this attached document/video into a brief I can use.
OUTCOME: a structured summary that captures what actually matters,
not what the author emphasized.
CONSTRAINTS:
- Three sections: TL;DR (3 sentences) · Key Points (5-8 bullets) ·
What's Not Said (3 bullets — assumptions, gaps, missing context)
- Numbers over adjectives. Every claim cites a specific figure
or timestamp.
- Skip the boilerplate intro and the conclusion section
- For video: include rough timestamps for the key points
CONTEXT: the attached file is [PDF / video / audio]. My purpose
in summarizing: [what you'll do with the summary]. The audience
for the summary: [me / my team / my exec].
TEST: a good summary lets the audience above answer "so what?"
in 30 seconds. The "What's Not Said" section should surface
the things you'd ask about if you were genuinely interested.
GATE: if the document is partly outside your expertise
(highly technical, legal, medical), flag the sections where you
hedged. If the document references other documents I haven't
shared, note what you're missing.
Why this works: “What's Not Said” is the section that separates a real summary from a regurgitation. It's the section 3.5 Flash does well that most other models hedge on. Watch for: on videos longer than 30 minutes, 3.5 Flash sometimes drops the middle. Ask explicitly for coverage across the full timeline if completeness matters.
3. Triage a Gmail inbox into a daily digest
Triage my inbox from the last 24 hours into a daily digest.
OUTCOME: a single-page digest sorted by what needs my attention
today, what's informational, and what I can safely archive.
CONSTRAINTS:
- Three sections: Needs my attention today · Informational (FYI) ·
Safe to archive
- For "needs attention": include sender, subject, the specific
ask, and a draft 2-sentence response if the action is "reply"
- For "informational": one-line summary only
- For "safe to archive": just a count and a category breakdown
(newsletters, automated, social, etc.)
- Skip anything I sent or replied to myself
CONTEXT: my role is [your role]. My top 3 priorities this week:
[list three]. People I should always treat as priority regardless
of subject: [list names].
TEST: a good digest lets me act on everything in the "Needs
attention" list in under 30 minutes. If that list has more than
10 items, you've been too generous — re-sort.
GATE: if anything looks like a security threat (suspicious
sender, unusual urgency, financial request from unverified
source), surface it at the top with a clear warning instead of
slotting it into the regular triage. Don't help me phish myself.
Why this works: the three-bucket structure (attention/FYI/archive) is the single biggest unlock for inbox sanity. The draft response in the attention bucket compounds: half the day's replies are already written. Watch for: Gemini needs Workspace integration turned on in app settings. Without it, the prompt produces a hypothetical digest, not a real one.
4. Draft a Doc in your voice with Workspace context
Draft this Doc in my voice, using context from my Drive.
OUTCOME: a Google Doc draft I can finish in 10 minutes of light
editing, not a generic template I have to rewrite.
CONSTRAINTS:
- Match the voice of [reference Doc title in my Drive]
- Same structure as [reference Doc title], adapted to this topic
- Inline citations to source Docs in my Drive using the [Doc Title]
format so I can verify each claim
- Length: roughly [target word count]
- Tone: [formal / conversational / technical]
CONTEXT: the topic is [your topic]. The audience is [audience].
The decision or action this Doc supports: [outcome]. Reference
materials in my Drive folder: [folder name].
TEST: a good draft sounds like me, cites what I'd cite, and
makes the same argument I'd make. The 10-minute edit should be
polish, not rewriting.
GATE: if the reference Doc I named doesn't exist in my Drive or
isn't accessible, surface that before drafting (don't fabricate
a "similar" voice). If the source folder is empty or doesn't
match the topic, ask me where the right source materials live.
Why this works: the named reference Doc is what teaches voice in seconds. Gemini in Docs reads the reference and pattern-matches structure + tone simultaneously. Watch for: if your reference Doc is too short or generic, you get a generic draft. Pick reference Docs that contain something distinctive only you would write.
5. Build a Spark routine that runs daily
Set up a Spark routine that runs every weekday at 7am.
OUTCOME: a persistent agent that handles a repeatable task for me
in the background, delivering output where I want it.
CONSTRAINTS:
- The task to repeat: [describe what you want it to do]
- The trigger: [time / event / condition]
- The output location: [send to my inbox / append to a Doc / Slack me]
- The data sources to access: [Gmail / Calendar / specific Drive
folder / external URL]
- The stop condition: [when to skip a day]
CONTEXT: I want this off my plate so I can stop thinking about
it. Currently I do this manually: [describe the manual process].
What it looks like when done right: [describe success].
TEST: a good routine runs reliably for 30 days without my
intervention. The output should look like what I'd produce
manually, but consistently.
GATE: before activating the routine, walk me through exactly what
data it will access, what it will do with that data, and where
it will send the output. Confirm I want to grant the integrations
listed. Run a dry-run on yesterday's data before going live.
Why this works: the explicit dry-run gate is critical. Spark has access to your real Gmail, Drive, and Calendar, so an agent that misunderstands the task can cause real damage. The dry-run lets you verify before the daily run goes live. Watch for: Spark requires Google AI Ultra ($100/mo) and is a US-only beta at the time of writing. Routines that try to take destructive action (delete, send external email) require explicit per-action confirmation by default.
6. Generate a generative-UI experience in Search AI Mode
Build me an interactive visual that explains [concept] for
my [audience].
OUTCOME: an interactive in-search experience I can use to
understand the concept hands-on, not just read about it.
CONSTRAINTS:
- The concept must be visually decomposable (not just text)
- Allow me to manipulate at least one variable and see the
output change
- Cite at least three reliable sources alongside the interactive
- Mobile-friendly (most people will see this on a phone)
CONTEXT: the concept I want to understand: [your concept]. My
background level: [novice / intermediate / expert]. What I'll
do after understanding it: [decision / explanation / project].
TEST: a good experience teaches me the concept faster than a
3-minute video would. After interacting with it for 60 seconds
I should be able to explain the concept to someone else.
GATE: if the concept isn't well-suited to interactive visual
explanation (it's abstract, ethical, or contested), say so and
suggest an alternative format. Don't force a bad visual onto
a concept that needs prose.
Why this works: Search AI Mode's generative-UI capability is one of the most under-used features in the Gemini ecosystem. Asking for an interactive visual unlocks it. Watch for: the feature only works on concepts that genuinely decompose visually. Asking for an interactive on “the philosophy of free will” produces something forced. Asking for an interactive on “how gyroid patterns work” produces something brilliant.
Section 06
Vibe Coder Patterns in Google AI Studio
AI Studio is where vibe coders prototype before committing to production. Six prompts for the patterns that pay off here: structured JSON outputs, function calling chains, search-grounded queries, multimodal vision audits, feature spec prototyping, and the one-click export to Antigravity 2.0 for production deployment. These run at developer-grade thinking_level settings — high for hard work, medium for the rest.
7. Prototype a feature spec in AI Studio
Prototype this feature as a spec I can hand to engineering.
OUTCOME: a one-page spec covering the user story, the data model,
the API surface, the failure modes, and the acceptance criteria.
Engineering should be able to estimate from it.
CONSTRAINTS:
- Markdown structure, fits on roughly one screen when rendered
- Five sections: User story · Data model · API surface (endpoint
shapes, not implementation) · Failure modes · Acceptance criteria
- For each failure mode: how it manifests, what the user sees,
what the system should do
- For acceptance criteria: testable conditions, not aspirations
CONTEXT: the feature in plain language: [describe]. The system
it lives in: [describe stack]. The users: [describe]. The
constraint that's bounded similar features in this system:
[the gotcha that keeps biting].
TEST: a good spec gives engineering everything they need to start
without coming back to me with clarifying questions. Test it: if
you had to estimate this in story points right now, could you?
GATE: if the feature can be cleanly decomposed into smaller
shippable units, propose the split before drafting one big spec.
If the data model has implications for an existing table or
service, flag the migration concern explicitly.
Why this works: the “could you estimate this in story points right now?” test in the TEST block is what separates a real spec from a wishlist. Watch for: if you don't name the constraint that's bounded similar features, the spec ignores your real gotchas and reads as boilerplate.
8. Generate a structured-output JSON pipeline
Build a structured-output JSON pipeline that processes
[input type] into [output type].
OUTCOME: a tested, schema-valid extraction pipeline I can call
from production code. JSON only — no surrounding prose.
CONSTRAINTS:
- Use response_mime_type: "application/json" and a response_schema
- Schema-valid output every time. No prose, no markdown fences.
- Required fields fail loudly; optional fields are explicitly
nullable
- Include 3 test inputs (happy path, edge case, adversarial)
and the expected output schema for each
CONTEXT: the input type is [describe — natural language, scraped
HTML, OCR text, etc.]. The downstream consumer needs [describe
what the JSON feeds into]. The current state: [no extraction
yet / hand-rolled regex / different LLM].
TEST: schema-valid on 100% of the test inputs. The adversarial
input should produce a clearly-marked error response in the
same schema, not a hallucinated success.
GATE: if the response schema has ambiguous fields (free-text
strings where I should be using enums, dates without timezone),
propose tighter typing before generating the pipeline. If the
input is sometimes too messy for reliable extraction, surface
that as a recommendation to preprocess upstream.
Why this works: 3.5 Flash's native structured-output support is one of its strongest production capabilities. The explicit adversarial test forces the pipeline to fail loudly instead of fabricating. Watch for: the schema validation only catches structural problems. Semantic errors (right shape, wrong content) still need downstream validation.
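The "fail loudly" contract can be enforced on the consumer side with nothing but the standard library. A minimal sketch, assuming a hypothetical invoice-extraction schema; the field names and the `parse_model_output` helper are invented for illustration, and this check complements the model-side response_schema rather than replacing it:

```python
import json

# Hypothetical invoice schema: field name -> (expected type, nullable?).
FIELDS = {
    "status": (str, False),      # "ok" or "error"
    "vendor": (str, True),
    "total_cents": (int, True),
    "error": (str, True),
}


def parse_model_output(raw: str) -> dict:
    """Parse and validate one model response; raise ValueError on any violation."""
    data = json.loads(raw)  # prose or markdown fences raise ValueError here
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")
    if data.keys() != FIELDS.keys():
        missing = sorted(FIELDS.keys() - data.keys())
        extra = sorted(data.keys() - FIELDS.keys())
        raise ValueError(f"missing fields {missing}, unexpected fields {extra}")
    for name, (expected_type, nullable) in FIELDS.items():
        value = data[name]
        if value is None:
            if not nullable:
                raise ValueError(f"{name} must not be null")
        elif isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"{name} must be {expected_type.__name__}")
    if data["status"] not in ("ok", "error"):
        raise ValueError("status must be 'ok' or 'error'")
    if data["status"] == "error" and data["error"] is None:
        raise ValueError("error responses must explain what went wrong")
    return data


happy = parse_model_output(
    '{"status": "ok", "vendor": "Acme", "total_cents": 1250, "error": null}'
)
failed = parse_model_output(
    '{"status": "error", "vendor": null, "total_cents": null, "error": "no total found"}'
)
print(happy["total_cents"], failed["error"])
```

Every field must be present, with optional ones set to null explicitly, so an adversarial input yields an error object in the same shape instead of a half-filled success.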
9. Build a function-calling tool chain
Build a function-calling agent that completes [task] using
these tools.
OUTCOME: a working agent that plans, calls tools in the right
order, verifies results, and reports back. Production-deployable.
CONSTRAINTS:
- The tools available: [list with signatures]
- The success condition: [how we know we're done]
- The retry policy: [N retries, then fail loudly with context]
- Use parallel function calling where the calls are independent
- Validate every function response before passing to the next
tool (id, name, response count must match per Gemini 3.5 spec)
CONTEXT: the task is [describe]. Existing systems the agent must
not touch: [list]. The cost ceiling per task: [token or dollar
budget]. The user who'll run this: [who, with what permissions].
TEST: the agent completes the happy-path task in under [time
budget] with under [token budget] tokens. The retry path
is exercised under adversarial conditions.
GATE: this is tool-use territory. MCP Atlas benchmark says
roughly 1 in 6 tool calls in adversarial conditions misfires
on 3.5 Flash. Build retries, validate every response, and fail
loudly with full context on the third retry. Do not silently
succeed when the tool call failed.
Why this works: Gemini 3.5 introduced strict matching for function call responses (id, name, and response count must match). The explicit validation in CONSTRAINTS prevents the most common production bug. Watch for: the roughly 16% tool-call failure rate (100% minus the 83.6% MCP Atlas score) is real. Plan for it. The agent that silently succeeds on a failed tool call is the agent that ships incidents.
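The id, name, and count rule can be checked locally before a turn is sent back to the model. A minimal sketch, assuming each call and response is a plain dictionary with `id` and `name` keys; that shape and the `check_function_responses` helper are assumptions for illustration, not the SDK's actual types:

```python
def check_function_responses(calls, responses):
    """Return a list of mismatches; an empty list means the turn is safe to send."""
    problems = []
    if len(calls) != len(responses):
        problems.append(f"expected {len(calls)} responses, got {len(responses)}")
    expected_names = {call["id"]: call["name"] for call in calls}
    answered = set()
    for response in responses:
        call_id = response.get("id")
        if call_id not in expected_names:
            problems.append(f"response references unknown call id {call_id!r}")
        elif call_id in answered:
            problems.append(f"call id {call_id!r} answered more than once")
        elif response.get("name") != expected_names[call_id]:
            problems.append(
                f"call {call_id!r} expected name {expected_names[call_id]!r}, "
                f"got {response.get('name')!r}"
            )
        answered.add(call_id)
    for call_id in sorted(set(expected_names) - answered):
        problems.append(f"call id {call_id!r} has no response")
    return problems


calls = [{"id": "c1", "name": "get_weather"}, {"id": "c2", "name": "get_time"}]
good = [{"id": "c1", "name": "get_weather", "response": {"temp_c": 18}},
        {"id": "c2", "name": "get_time", "response": {"utc": "09:00"}}]
bad = [{"id": "c1", "name": "get_time", "response": {"utc": "09:00"}}]

print(check_function_responses(calls, good))
print(check_function_responses(calls, bad))
```

Returning every mismatch at once, rather than stopping at the first, gives the retry path the full context it needs to fail loudly.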
10. Ground a query with Google Search-as-a-tool
Answer this question using Google Search grounding.
OUTCOME: a researched answer with inline citations to the
sources Search returned. Every concrete claim is traceable to
a specific URL.
CONSTRAINTS:
- Enable google_search as a tool
- Inline citations in the format [Source: domain.com] after
every factual claim
- If sources disagree, surface the disagreement explicitly
(don't pick one and hide the other)
- Prioritize primary sources over aggregators when both are
available
- Skip claims you can't ground in a source (better to say "I
couldn't find evidence for X" than fabricate)
CONTEXT: the question is [your question]. The recency requirement:
[any answer / past 90 days / past week]. The trust level needed:
[casual / professional / regulatory].
TEST: a good answer is defensible — every concrete claim is
traceable to a credible source. The citations should be real
URLs, not fabricated.
GATE: if the question is contested (politics, health, finance,
science with active debate), present the spectrum of credible
positions rather than synthesizing them into a single answer.
If recent reporting contradicts older consensus, weight by date
and explain.
Why this works: Search-as-a-tool is one of 3.5 Flash's killer features in AI Studio — native Google Search grounding without external infrastructure. The inline citation format makes verification trivial. Watch for: grounding metadata occasionally returns broken URLs. Validate citations downstream before publishing.
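A first pass at validating citations downstream can be purely syntactic. A minimal sketch using only the standard library; `split_citations` is a hypothetical helper, and it checks that each URL is well-formed, not that the page exists or supports the claim:

```python
from urllib.parse import urlparse


def split_citations(urls):
    """Separate well-formed http(s) URLs from ones that need a human look."""
    usable, rejected = [], []
    for url in urls:
        candidate = url.strip()
        try:
            parts = urlparse(candidate)
            host = parts.hostname or ""
        except ValueError:  # raised for malformed input such as broken IPv6 brackets
            parts, host = None, ""
        # Require an http(s) scheme and a dotted hostname.
        if parts is not None and parts.scheme in ("http", "https") and "." in host:
            usable.append(candidate)
        else:
            rejected.append(candidate)
    return usable, rejected


usable, rejected = split_citations([
    "https://example.com/report-2026",
    "example.com/report-2026",   # no scheme
    "https://",                  # no host
    "ftp://example.com/file",    # wrong scheme
])
print(usable)
print(rejected)
```

A liveness check (an HTTP request per URL) is the natural second pass, but it needs network access and rate limiting, so it is left out of this sketch.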
11. Run a multimodal vision audit on screenshots
Audit these screenshots against [criteria] and report
findings.
OUTCOME: a structured list of findings, each tied to a specific
screenshot, with the offending element identified by position
(approximate bounding region) or text content.
CONSTRAINTS:
- Process each screenshot in order
- For each finding: screenshot #, finding, severity, suggested fix
- Severity scale: critical (blocks user / breaks experience), high
(visible bug), medium (polish issue), low (nit)
- Group findings by criterion at the end
- Skip findings that aren't material (don't carpet-bomb me with
pixel-level nits)
CONTEXT: the criteria to audit against: [paste criteria — could
be brand guidelines, accessibility standards, UX heuristics,
spec compliance]. The artifact being audited: [describe — web
app, mobile app, print design].
TEST: a good audit catches the violations that exist and doesn't
flag things that look like violations but aren't. If a single
category has 20+ findings, you're being too generous — re-rank.
GATE: this is multimodal vision work where 3.5 Flash scores
84.2% on CharXiv reasoning. That's strong but not perfect.
Distinguish between findings you're confident about (severity:
high/critical) and findings that need a human to verify
(severity: medium/low). Be explicit about which is which.
Why this works: 3.5 Flash's CharXiv strength makes it genuinely useful for visual audits — brand compliance, accessibility checks, design reviews. The severity rubric prevents the most common failure (drowning in nits). Watch for: the model is excellent at finding text and UI elements but weaker on subtle color/contrast nuance. Pair with an actual accessibility scanner for ADA compliance work.
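If you ask for the audit as structured data, the severity rubric and the confident-versus-verify split from the GATE can be applied in post-processing. A minimal sketch, assuming each finding arrives as a dictionary with `screenshot`, `finding`, and `severity` keys; that shape and the `triage` helper are hypothetical:

```python
SEVERITY_ORDER = ["critical", "high", "medium", "low"]


def triage(findings):
    """Sort findings by severity, then split confident ones from those needing review."""
    rank = {severity: index for index, severity in enumerate(SEVERITY_ORDER)}
    ordered = sorted(findings, key=lambda f: (rank[f["severity"]], f["screenshot"]))
    confident = [f for f in ordered if f["severity"] in ("critical", "high")]
    needs_review = [f for f in ordered if f["severity"] in ("medium", "low")]
    return confident, needs_review


findings = [
    {"screenshot": 2, "finding": "Body text low contrast", "severity": "medium"},
    {"screenshot": 1, "finding": "Checkout button off-screen", "severity": "critical"},
    {"screenshot": 3, "finding": "Logo uses old wordmark", "severity": "high"},
    {"screenshot": 1, "finding": "Icon 1px misaligned", "severity": "low"},
]

confident, needs_review = triage(findings)
for item in confident:
    print("ACT:   ", item["severity"], "-", item["finding"])
for item in needs_review:
    print("VERIFY:", item["severity"], "-", item["finding"])
```

An unknown severity label raises a KeyError here, which is deliberate: it signals that the model drifted from the rubric.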
12. Export AI Studio to Antigravity for production
This prompt is tested. Export to Antigravity for production
deployment.
OUTCOME: a working Antigravity project I can run locally,
extend, and deploy. Same behavior as the AI Studio prototype
but with proper structure (config, secrets, error handling,
logging).
CONSTRAINTS:
- One-click export from AI Studio (use the Export to Antigravity
button)
- Verify the project runs on my machine before I extend it
- Production essentials: environment-variable config (not
hardcoded keys), structured logging at INFO/WARN/ERROR,
retry logic on tool calls, a README with run instructions
- Same model + thinking_level + tools as the prototype
CONTEXT: the AI Studio prompt I'm exporting from: [paste or
reference]. The runtime target: [local dev / cloud / Antigravity
managed agent]. The team that'll maintain this: [solo / small
team / production].
TEST: clone the exported project, install deps, set env vars,
run the example invocation. Output matches what AI Studio
produced in the prototype.
GATE: if any AI Studio feature I used doesn't have a clean
production equivalent (interactive vision feedback, ad-hoc
prompts), flag it before export. Don't silently drop
functionality in translation.
Why this works: the one-click AI Studio → Antigravity export is one of the most underrated workflow features Google shipped in Antigravity 2.0. It removes the friction that used to cost vibe coders a day of plumbing per prototype. Watch for: the export is great for code + config but doesn't preserve interactive prompt iteration history. Save your AI Studio session URL alongside the exported project.
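For reference, the "production essentials" in the prompt's CONSTRAINTS look roughly like this in practice. This is a sketch of the general pattern, not what the export actually generates: the environment variable names, the logger name, and the `load_settings` helper are all assumptions.

```python
import logging
import os

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("exported_agent")  # hypothetical logger name


def load_settings(env=None):
    """Read config from environment variables instead of hardcoding it."""
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY")  # assumed variable name
    if not api_key:
        log.error("GEMINI_API_KEY is not set")
        raise RuntimeError("set GEMINI_API_KEY before running")
    settings = {
        "api_key": api_key,
        "model": env.get("GEMINI_MODEL", "gemini-3.5-flash"),
        "thinking_level": env.get("GEMINI_THINKING_LEVEL", "medium"),
    }
    # Log the non-secret settings only; the key itself never reaches the logs.
    log.info("model=%s thinking_level=%s", settings["model"], settings["thinking_level"])
    return settings


settings = load_settings({"GEMINI_API_KEY": "example-key", "GEMINI_THINKING_LEVEL": "high"})
print(settings["model"], settings["thinking_level"])
```

Passing the environment in as an argument keeps the function testable without touching the real process environment.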
Section 07
Vibe Coder Patterns in Antigravity 2.0
Antigravity 2.0 is where Gemini 3.5 Flash does its hardest work. Eight prompts for the developer patterns where the IDE's multi-agent orchestration pays off the most: scaffolding, refactoring, parallel codebase analysis, builder-player game loops, adversarial PR review, Android app generation, Managed Agents API, and codebase audits. Most run with subagents because that's the unique unlock. The 12× speed multiplier in Antigravity comes from IDE-side optimizations the API alone doesn't get.
- Scaffold a new app with conventions
- Refactor a legacy codebase to Next.js
- Deploy parallel subagents for codebase analysis
- Build a builder-player game loop
- Adversarial review a pull request
- Generate an Android app skeleton
- Run a Managed Agent for an autonomous task
- Audit a Workspace tenant with structured outputs
13. Scaffold a new app with conventions
Scaffold a new [type of app] following our team conventions.
OUTCOME: a working project skeleton I can `cd` into and run
within 60 seconds, structured to match how we build things here.
CONSTRAINTS:
- Match the directory structure of [reference project path]
- Reuse existing tsconfig / pyproject / package.json patterns
- Include the same test runner setup we use elsewhere
- Add an AGENTS.md and CLAUDE.md / GEMINI.md tuned to this app
- Do NOT install new dependencies unless absolutely required;
prefer what's already in the lockfile
CONTEXT: read AGENTS.md and GEMINI.md first for team conventions.
Look at [reference project path] for the pattern to mirror.
The app I want to scaffold: [describe what it does].
TEST: after scaffolding, I should be able to run the test command
from the README and get a passing empty test suite. The project
should pass our linter and type checker with zero errors.
GATE: before creating files, list every file you're about to
create and wait for my OK. Do NOT modify any existing file in
this repo without explicit permission. If the existing
conventions disagree across the codebase, ask which to follow.
Why this works: the “list every file first, wait for OK” line in the GATE is what saves you from an unwanted 200-file scaffold. Antigravity 2.0 honors this reliably because the IDE shows you the planned file tree before execution. Watch for: if your GEMINI.md is empty or weak, the scaffolded project won't match your conventions. Fix the project memory first if the scaffold misses repeatedly.
14. Refactor a legacy codebase to Next.js
Refactor [legacy framework] codebase at [path] to Next.js.
OUTCOME: the codebase is converted to Next.js with the existing
test suite still passing, the existing user-facing behavior
preserved, and the migration broken into reviewable chunks.
CONSTRAINTS:
- Deploy subagents in parallel across independent files
- File-by-file migration with two reviewers per file (one for
correctness, one for adversarial behavior preservation)
- Each migrated file gets reviewed before being marked complete
- Use the existing project conventions for Next.js (read
GEMINI.md, look at any pilot files already migrated)
- Save progress as you go — interrupted runs should resume from
the last checkpoint
CONTEXT: the legacy framework: [Vue / Angular / Backbone / raw
HTML]. The path: [path]. The test suite: [path to tests]. Pilot
files already migrated: [paths or "none yet"]. The deployment
target: [Vercel / self-hosted / etc].
TEST: the migration is complete when every source file is in
Next.js, the existing test suite passes on the migrated
codebase, the linter is clean, and the build produces equivalent
output.
GATE: before fanning out subagents, propose the migration plan:
the dependency graph, the wave order, the per-file budget. Wait
for my approval. If during the run a wave reveals coupling that
prevents parallel work, STOP that wave, surface it, and ask
whether to redesign. Do not ship a half-migrated codebase as
"making progress."
Why this works: Google demoed exactly this workflow at I/O 2026 — 3.5 Flash refactoring a legacy codebase to Next.js using parallel subagents in Antigravity. It works because the test suite is the bar. Watch for: Antigravity's subagent runs consume meaningfully more tokens than single-agent sessions. Start scoped (one subsystem) before going codebase-wide.
15. Deploy parallel subagents for codebase analysis
Deploy parallel subagents to analyze this codebase against
[criterion].
OUTCOME: a consolidated report covering every part of the
codebase, with findings organized by severity and category.
Independent verification on every finding before it reaches
the report.
CONSTRAINTS:
- Split the codebase by directory; one subagent per top-level
directory or service
- Each subagent reports findings in a structured format
(file:line, severity, category, suggested fix)
- Adversarial verification subagent reviews each finding —
false positives get removed before the final report
- Aggregate report deduplicates findings that span multiple
subagents
CONTEXT: the codebase root: [path]. The criterion to analyze
against: [security / performance / dead code / convention
compliance / etc]. Examples of what counts as a finding: [paste
2-3 examples]. Examples of false positives: [paste 1-2 examples].
TEST: a good analysis finds the real issues and doesn't fill
the report with style nits. If a category has 100+ findings,
it's probably a false-positive class — flag it and ask whether
to refine.
GATE: before deploying subagents, show me the directory split
plan and the per-subagent budget. If two directories share so
much coupling that a parallel analysis will produce duplicates,
say so and suggest a different split.
Why this works: this is the prompt that maps directly to Google's demoed multi-agent orchestration pattern. The adversarial verification step is what keeps the report signal-to-noise high. Watch for: if your codebase has heavy circular dependencies between top-level directories, the per-directory split breaks down. Ask for a service-level or domain-level split instead.
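The aggregation step in this prompt (structured findings, deduplicated across subagents) can be sketched in a few lines. This is a minimal illustration, not Antigravity's own implementation: the `Finding` fields mirror the prompt's "file:line, severity, category, suggested fix" format, and the severity ranking is an assumption borrowed from the PR-review prompt's scale.

```python
from dataclasses import dataclass

# Assumed severity scale (taken from the PR-review prompt, not from prompt 15).
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    severity: str
    category: str
    suggested_fix: str


def merge_findings(reports: list[list[Finding]]) -> list[Finding]:
    """Deduplicate findings that several subagents reported for the same spot."""
    best: dict[tuple[str, int, str], Finding] = {}
    for report in reports:
        for finding in report:
            key = (finding.file, finding.line, finding.category)
            kept = best.get(key)
            # When two subagents disagree, keep the more severe version.
            if kept is None or SEVERITY_RANK[finding.severity] < SEVERITY_RANK[kept.severity]:
                best[key] = finding
    # Most severe first, then by location so the report reads predictably.
    return sorted(best.values(), key=lambda f: (SEVERITY_RANK[f.severity], f.file, f.line))
```

Keying on file, line, and category is one reasonable choice; a looser key (file and category only) would merge more aggressively at the risk of hiding distinct issues.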
16. Build a builder-player game loop with two agents
Build [game / interactive app] using a builder-player loop
with two collaborating agents.
OUTCOME: a working playable artifact developed through rapid
self-improvement. The builder agent writes code; the player
agent plays / tests / breaks it. They iterate until convergence.
CONSTRAINTS:
- Agent A (builder): writes code to satisfy the spec
- Agent B (player): plays the game / exercises the interface,
finds bugs, edge cases, and unsatisfying mechanics; reports
back to the builder
- Cycle until the player can't find new issues or the budget
is exhausted
- Final output: working artifact + a log of the iteration cycle
so I can see what was discovered and fixed
CONTEXT: the game/app to build: [describe in detail — mechanics,
controls, win condition, tech stack]. The target experience:
[describe what good looks like]. The token / time budget:
[bounds].
TEST: a good final artifact is genuinely fun or genuinely useful,
not just functionally complete. The iteration log should show
the player agent finding real issues and the builder agent
fixing them.
GATE: before starting the loop, both agents share their
understanding of the spec. If they disagree on what's being
built, resolve that before iterating. Stop the loop if the
player runs out of meaningful findings — don't iterate for
the sake of token spend.
Why this works: Google demoed this exact pattern at I/O 2026 — two agents (builder + player) collaborating in a self-improvement loop to develop a game. It's the simplest non-trivial multi-agent pattern and translates well beyond games (UI/UX with reviewer agent, API design with consumer agent, docs with reader agent). Watch for: if the player agent is too forgiving, you get a working-but-bland artifact. Tell the player to be adversarial and specific.
17. Adversarial review a pull request
Adversarially review the diff in [PR URL or branch name].
Find what's wrong with it.
OUTCOME: a findings list ranked by severity, each finding tied
to a specific line, with the fix described in one sentence.
Don't praise. Findings only.
CONSTRAINTS:
- Severity scale: critical (production breakage), high (latent
bug), medium (maintainability), low (style/nit). Don't pad
with low-severity items.
- For each finding: file:line, severity, one-sentence issue,
one-sentence fix
- End with one of two verdicts: READY TO MERGE or DO NOT MERGE:
[reason]. No middle ground.
CONTEXT: the PR description is [paste description]. The
specification it was supposed to satisfy: [paste spec or link].
Existing code patterns: read GEMINI.md and the files adjacent
to the diff.
TEST: a good review finds the thing the author missed. If the
diff genuinely has no issues, say so — but don't invent issues
to fill a quota.
GATE: this is adversarial mode. You did not write this code.
Your job is to find what's wrong with it. If the diff matches
the spec and looks correct, your finding should be "DO NOT
MERGE: the spec itself is insufficient" or "READY TO MERGE."
Don't soften findings to be polite. Don't hedge. Speak in
observations, not suggestions.
Why this works: the explicit role assignment (“you did not write this code”) flips 3.5 Flash from helpful-collaborator mode into reviewer mode. Same model, very different behavior. Watch for: run this in a fresh Antigravity session, not the one that wrote the diff. Cross-session adversarial review catches more than same-session review.
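If you pipe the review into CI, the two-verdict rule is easy to enforce mechanically. The sketch below assumes the review arrives as plain text with the verdict on its last non-empty line; `extract_verdict` is a hypothetical helper name, not part of any Google tooling.

```python
import re

# Either the bare approval, or a rejection that carries a non-empty reason.
VERDICT_PATTERN = re.compile(r"^(READY TO MERGE|DO NOT MERGE: \S.*)$")


def extract_verdict(review_text: str) -> str:
    """Return the final verdict line, or raise if the review ends without one."""
    lines = [line.strip() for line in review_text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty review")
    last = lines[-1]
    if not VERDICT_PATTERN.match(last):
        raise ValueError(f"review does not end with a valid verdict: {last!r}")
    return last
```

Rejecting anything else (including a "DO NOT MERGE:" with no reason) keeps the "no middle ground" constraint honest.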
18. Generate an Android app skeleton with Android Studio
Generate a Kotlin Android app skeleton for [app concept].
OUTCOME: a buildable Android Studio project with the core
screens scaffolded, a working build, and a runnable APK on the
emulator.
CONSTRAINTS:
- Kotlin + Jetpack Compose (no XML layouts unless I ask)
- Material 3 design system, light + dark mode support
- Single-module architecture for now (don't pre-modularize)
- Include navigation between the core screens
- Include one example unit test and one example UI test
- Min SDK 24, target latest stable
CONTEXT: the app concept: [describe]. The 3-5 core screens:
[list]. The single most important user flow: [describe end
to end]. My experience level with Android: [novice /
intermediate / experienced].
TEST: clone the project, open in Android Studio, sync Gradle,
run on emulator. The core flow above works end-to-end. Light
and dark mode both render correctly.
GATE: if the app concept needs platform-specific permissions
(camera, location, microphone, contacts), surface them with the
manifest entries before generating. If the concept fundamentally
needs a backend (auth, sync, storage), scope this prompt to the
client only and propose the backend separately.
Why this works: Antigravity 2.0's native Android Studio integration is one of the most under-marketed features Google shipped. The model knows Android conventions deeply because Android Studio is the dogfooding surface. Watch for: the generated project will compile but you'll likely need to update Gradle dependencies to the latest stable as of your generation date.
19. Run a Managed Agent for an autonomous task
Spin up a Managed Agent to autonomously complete [task] in
its sandbox.
OUTCOME: the agent executes the task to completion, reports
back with the artifacts produced, and shuts down cleanly. No
manual intervention required.
CONSTRAINTS:
- Use the antigravity-preview-05-2026 managed agent model
- Sandbox environment: Linux with Bash, Python, Node.js,
browsing, file management
- Mount the necessary repo or GCS bucket so the agent can read
source materials
- Define custom skills as markdown files for any task-specific
patterns
- Budget cap: [tokens or time] — fail the run if exceeded
CONTEXT: the task: [describe in detail]. Source materials the
agent needs access to: [repo / bucket / URLs]. Output destination:
[where the artifacts go when complete]. Success criteria: [how
we know it worked].
TEST: a good run produces the expected artifacts at the output
destination and a clean exit log. The agent should not have
asked for human intervention beyond the initial spec.
GATE: this agent runs autonomously in its sandbox. Before
launching, walk me through every tool it will use, every
resource it will access, and every output location it can
write to. Confirm I've granted the necessary IAM permissions
on the mounted resources. If the task isn't well-suited to
autonomous execution (requires judgment calls a human should
make), say so before launching.
Why this works: Managed Agents is one of the most consequential developer-facing launches at I/O 2026 — a single API call gives you an agent + hosted sandbox without provisioning anything. The explicit permission walkthrough in the GATE prevents the most common operational failure. Watch for: managed agents in public preview have generous-but-finite resource limits. Don't plan production workloads on them until GA.
20. Audit a Workspace tenant with structured outputs
Audit this Google Workspace tenant against [policy or criterion]
and produce a structured findings report.
OUTCOME: a JSON report (schema-valid) covering every account /
Drive / file / setting that violates the policy, with severity,
remediation step, and the API call needed to fix.
CONSTRAINTS:
- Use the Admin SDK and Drive API via function calling
- Output as JSON conforming to a defined schema (no prose)
- For each finding: account/resource ID, finding category,
severity, remediation step, the exact API call to remediate
- Group by severity at the report root, then by category
- Skip user accounts that are dormant (last activity > 90 days)
unless the policy explicitly covers them
CONTEXT: the tenant domain: [domain]. The policy or criterion:
[paste — could be security baseline, compliance requirement,
license audit, etc]. Examples of violations: [paste 2-3]. Things
that look like violations but aren't: [paste 1-2].
TEST: schema-valid JSON, every finding traceable to a specific
resource, the remediation steps are real API calls that would
work if executed. After the audit, I will run a sample remediation
myself on one low-severity finding to verify the API call format.
GATE: this audit reads sensitive tenant data. Confirm my OAuth
scopes are correct before starting. Do NOT execute any
remediation API calls during the audit — the report is read-only.
If the audit surfaces something that looks like an active
security incident (data exfiltration, credential exposure),
elevate it to the top of the report with a clear warning.
Why this works: this is where 3.5 Flash's structured-output strength + function calling reliability + Workspace integration all compound. The strict separation between audit (read) and remediation (write) is non-negotiable for any real tenant work. Watch for: the Admin SDK has rate limits. A full tenant audit of 1,000+ accounts should be paginated and resumable.
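The report shape the prompt asks for (grouped by severity at the root, then by category) can be checked or rebuilt on your side with the standard library. This is a sketch under assumptions: the field names in `REQUIRED_KEYS` are hypothetical stand-ins for whatever schema you define, and the function never calls any Workspace API.

```python
import json
from collections import defaultdict

# Hypothetical field names; align them with your own report schema.
REQUIRED_KEYS = {"resource_id", "category", "severity", "remediation", "api_call"}


def build_report(findings: list[dict]) -> str:
    """Group findings by severity, then category, and return a JSON string."""
    grouped: dict[str, dict[str, list[dict]]] = defaultdict(lambda: defaultdict(list))
    for finding in findings:
        missing = REQUIRED_KEYS - finding.keys()
        if missing:
            raise ValueError(f"finding is missing keys: {sorted(missing)}")
        grouped[finding["severity"]][finding["category"]].append(finding)
    # sort_keys orders groups alphabetically, not by severity rank.
    return json.dumps(grouped, indent=2, sort_keys=True)
```

Failing on a missing key, rather than silently skipping the finding, matches the prompt's requirement that every finding be traceable to a specific resource.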
Section 08
Gemini Spark: The 24/7 Personal Agent
Gemini Spark is the most consumer-facing thing Google shipped at I/O 2026, yet you can't use it unless you're a US Google AI Ultra subscriber at $100/month. It's also the clearest signal of where Google thinks personal AI is heading. Unlike a chatbot you open and close, Spark runs on dedicated virtual machines on Google Cloud and keeps working in the background even when you close your laptop or lock your phone.
The three primitives
Spark's entire mental model rests on three concepts. Once these click, the whole product makes sense.
| Primitive | What it is | Example |
|---|---|---|
| Tasks | Discrete units of work Spark performs by connecting to Workspace tools — Gmail, Calendar, Docs, Sheets, Slides. | “Parse this month's credit card statements and flag new or hidden subscription fees.” |
| Skills | Reusable behaviors you teach Spark once, then call by name. Skills define how the model should approach a class of task. | “The school-update digest skill: check inbox for updates from kids' school, extract deadlines, send daily summary to my partner and me.” |
| Schedules | Time or event triggers that fire Tasks or Skills automatically. Cron-like, but with natural language conditions. | “Every weekday at 7am” or “When a new email from the bank arrives.” |
What Spark connects to (off by default)
Spark's integrations are explicitly opt-in. You turn each one on in settings, and Spark only acts on the surfaces you authorize. The integrations you can turn on:
- Gmail — read, search, summarize, draft, send (with confirmation gates for outbound)
- Calendar — read events, propose times, create or modify events
- Drive — read files in folders you grant access to, create new files
- Docs — read, draft, edit, comment
- Sheets — read, draft, edit, formulas, charts
- Slides — read, draft, edit, image generation via Google Pics
- YouTube — search, summarize, queue, history
- Google Maps — routes, places, location-based triggers
What Spark genuinely does well
Repeatable knowledge work. Things you do every Monday, every month-end, every time a specific email arrives. Spark turns that recurring cognitive load into a routine you maintain rather than execute.
Background processing while you sleep. The dedicated VM means Spark can churn through long-running work overnight: research compilations, calendar prep for tomorrow, weekly digest creation, financial reconciliation across statements.
Multi-step Workspace workflows. Read from Gmail, summarize, populate a Sheet, draft a Slide deck, send a Doc to the team. Each step is a Workspace API call; Spark orchestrates the chain reliably because the integrations are structured, not screen-reading.
What Spark genuinely can't do yet
Three honest limits the masterclass won't hide:
Anything outside the Workspace + Maps + YouTube ecosystem. No Slack, no Notion, no GitHub, no Linear, no your-bank-of-choice. The MCP integration story for third-party apps is coming but isn't real yet.
Anything requiring real-time interactive intelligence. Spark is designed for asynchronous, scheduled, or triggered work. Live conversation with you happens in the regular Gemini app, not in Spark.
Anything destructive without explicit confirmation. Delete, send-external-email, transfer-funds — all require per-action approval. This is a feature, not a bug, but it means Spark can't fully replace a human assistant who has standing authority to act.
The Spark price reality
Spark requires Google AI Ultra at $100/month and is US-only at beta. For most vibe coders, that's a meaningful commitment. The math works if you currently pay for: ChatGPT Plus ($20), Claude Pro ($20), and at least one automation tool. The math doesn't work if Spark is your only personal AI spend — the regular Gemini app at $20/month (Advanced) covers most knowledge worker needs.
Section 09
Antigravity 2.0 + Subagents: The Multi-Agent Story
Antigravity is Google's agent-first desktop IDE. Version 2.0 shipped at I/O 2026 alongside 3.5 Flash and is the surface where the model runs fastest — Google quotes up to 12× the speed of other surfaces, which comes from IDE-side optimizations that aren't available when running the API through a third-party tool. It's a free download. If you're serious about vibe coding with Gemini, this is your daily driver.
What changed from 1.x
Three architectural shifts that matter for how you work:
Multi-agent orchestration as a first-class primitive. You can spawn subagents directly from the main agent context. Each subagent has its own conversation, its own context window, its own working files. The parent agent coordinates and aggregates results. This is the same pattern Anthropic ships in Claude Code — Google's implementation is just as good and notably faster.
Artifacts. Versioned outputs you can iterate on without polluting your repo. Generate a doc, refine it, fork it, compare versions, promote the winning one. Artifacts solve the “the agent wrote 12 things and I don't know which one is current” problem.
One-click export from AI Studio. Prototype a prompt or agent in AI Studio, hit the Export to Antigravity button, get a structured project with the same model, the same tools, the same thinking_level, but with production scaffolding (env vars, error handling, logging). This used to be a day of plumbing. Now it's a click.
The multi-agent patterns Google demoed at I/O
Three demos directly from Google's I/O 2026 keynote, all using 3.5 Flash + Antigravity 2.0:
| Demo | Pattern | What ran |
|---|---|---|
| AlphaZero in 6 hours | Two collaborating agents — one synthesizing the AlphaZero paper, one coding the game from the synthesis | Builder + reader pair. Full playable game with self-play in 6 hours of agent time. |
| Legacy → Next.js migration | Parallel file-by-file subagents with adversarial review | Mid-size codebase converted with the existing test suite still passing. |
| Builder + player game loop | Two-agent self-improvement loop — builder writes code, player exercises it, builder iterates | Game developed through rapid self-improvement until the player ran out of meaningful findings. |
The patterns that actually pay off in your work
The demos are flashy. The patterns that pay off most often in real vibe coding work are simpler than the demos suggest:
Read + summarize subagents. Before the main agent writes anything, spawn a subagent to read the codebase, the docs, and the relevant tickets. The summary becomes the main agent's context. Better than RAG because the summarization is task-aware.
Adversarial review subagent on every PR. Same model as the builder, but spawned as a fresh subagent with the explicit instruction to find what's wrong with the diff. Catches more than same-session review.
Parallel exploration when the path forward is unclear. Spawn three subagents to try three different approaches to a problem. Compare results, promote the best, discard the others. Faster than serial trial-and-error because you compress wall-clock time; you do not save tokens.
Long-running background subagents. Spawn a subagent to keep monitoring a build, a test run, or an API. The main agent stays interactive while the background subagent reports in when something happens.
The rule of thumb that works
Spawn subagents when (a) the work decomposes naturally, (b) the pieces are independent enough to run in parallel without coordination, and (c) the time saved by running them in parallel outweighs the cost of coordinating them. Three to five subagents is the sweet spot. Twenty subagents is a demo, not a practice.
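The "three to five" ceiling can be enforced in an orchestration script rather than left to discipline. The sketch below uses Python's `asyncio` with a semaphore; `run_subagent` is a hypothetical stand-in that only echoes its task, so swap in whatever call actually launches a subagent in your setup.

```python
import asyncio

MAX_SUBAGENTS = 5  # upper end of the "three to five" rule of thumb


async def run_subagent(task: str) -> str:
    """Hypothetical stand-in for a real subagent call; it only echoes the task."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"done: {task}"


async def fan_out(tasks: list[str], limit: int = MAX_SUBAGENTS) -> list[str]:
    """Run independent tasks concurrently, never more than `limit` at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(task: str) -> str:
        async with semaphore:
            return await run_subagent(task)

    # gather preserves input order, so results line up with tasks.
    return await asyncio.gather(*(bounded(t) for t in tasks))
```

Calling `asyncio.run(fan_out(["auth", "billing", "search"]))` returns the three results in the same order as the inputs. The cap bounds concurrency only; each subagent still spends its own tokens.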
Section 10
The 5-Block Intent Recipe
Every paste-ready prompt in this masterclass follows the same recipe. Five blocks, in order, every time. The recipe matches how Gemini 3.5 Flash actually reasons — it orients around the goal first, then attends to constraints, then to context, then verifies against the test, then surfaces uncertainty at the gate. The same recipe works for Claude Opus 4.8, GPT-5.5, and any other frontier model.
The 5 blocks
1. State the outcome. Open with what you want when this is done. One sentence. The model orients around the goal first.
2. Name the constraints. List anything that bounds the answer: stack, audience, length, tone, files in scope, things to avoid. Bullets work fine.
3. Point to context. Reference the files, URLs, prior decisions, or attached documents the model should treat as authoritative. Less is sharper.
4. Declare the test. How will you know the answer is correct? Sometimes this is a literal test to pass. Sometimes it's a checklist.
5. Set the verification gate. Tell the model to flag uncertainty, ask before assuming, and stop at a clear checkpoint.
Why each block matters specifically for 3.5 Flash
The outcome block matters more on 3.5 Flash than on 3.1 Pro because the default thinking_level dropped from high to medium. With less default thinking, the model spends less time figuring out what you want. Telling it upfront saves the tokens.
The constraint block matters because Flash defaults to brevity. Without explicit bounds, you get a competent-but-generic answer. The constraints are where you turn generic into yours.
The context block matters because Flash's 1M-token window is real. Don't hesitate to dump entire files in the context. The model handles it — though long-context retrieval past 200K is where 3.1 Pro still wins.
The test block matters because Flash's 83.6% on MCP Atlas still leaves roughly 1 in 6 multi-tool tasks failing on that benchmark. The test is what catches a failed call before it ships as a real-world incident.
The gate block matters because Flash will quietly do what you asked. If you didn't ask the right thing, the gate is your last chance to find out before the work is done.
The same recipe in 30 seconds
OUTCOME: [one sentence on what done looks like]
CONSTRAINTS:
- [bound 1]
- [bound 2]
- [bound 3]
CONTEXT: [reference the files, URLs, prior decisions, or docs to
treat as authoritative]
TEST: [how I'll know the answer is correct]
GATE: [where to stop and flag uncertainty; what to ask before
assuming; the line you don't cross without my OK]
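If you generate prompts from code, a small builder keeps the five blocks present and in order. This is a minimal sketch; `build_prompt` is a hypothetical helper, and the only rule it enforces is the one the recipe states: all five blocks, in the same order, every time.

```python
def build_prompt(outcome: str, constraints: list[str], context: str,
                 test: str, gate: str) -> str:
    """Assemble the five blocks in the fixed order the recipe prescribes."""
    if not constraints:
        raise ValueError("CONSTRAINTS needs at least one bullet")
    for name, text in (("OUTCOME", outcome), ("CONTEXT", context),
                       ("TEST", test), ("GATE", gate)):
        if not text.strip():
            raise ValueError(f"{name} block is empty")
    bullet_lines = "\n".join(f"- {c}" for c in constraints)
    return (
        f"OUTCOME: {outcome}\n"
        f"CONSTRAINTS:\n{bullet_lines}\n"
        f"CONTEXT: {context}\n"
        f"TEST: {test}\n"
        f"GATE: {gate}"
    )
```

Raising on an empty block is deliberate: a prompt that silently drops its GATE is the failure this recipe exists to prevent.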
Section 11
Pricing & Cost Reality
The marketing line is “cheap, fast Flash.” The reality is more nuanced. 3.5 Flash is the most expensive Flash-tier model Google has ever shipped — about 3× the price of Gemini 3 Flash Preview and 6× the price of Gemini 3.1 Flash-Lite. But it's also roughly 25% cheaper than Gemini 3.1 Pro on both input and output. Net: this is a Flash tier with frontier intelligence priced as a premium Flash, not a budget Flash.
The official pricing
| Token type | Price (USD per million) | Note |
|---|---|---|
| Input | $1.50 | Non-global regions: $1.65 |
| Cached input | $0.15 | 90% discount on prompts the model has seen recently |
| Output | $9.00 | Non-global regions: $9.90 |
| Batch input | $0.75 | 50% discount for non-interactive batch workloads |
| Batch output | $4.50 | Same 50% discount on the batch tier |
Real-world workload math
Three concrete scenarios with actual numbers, not vibes.
Scenario A: solo developer doing 100K input / 50K output per day. Daily cost: (100K × $1.50/M) + (50K × $9.00/M) = $0.15 + $0.45 = $0.60. Monthly: ~$18. With heavy prompt caching on repeated system prompts (say 70% cache hit rate), monthly drops to ~$15, because caching discounts only the input side and output dominates this bill.
Scenario B: team using Antigravity heavily, 2M input / 500K output per day. Daily cost: $3.00 + $4.50 = $7.50. Monthly: ~$225. With caching at 50%, ~$185. With batch for non-interactive work moved off the interactive tier, lower still.
Scenario C: agent loop running 10M input / 2M output per day in production. Daily cost: $15.00 + $18.00 = $33. Monthly: ~$990. With caching (essential at this volume), ~$700, which corresponds to roughly a 70% hit rate. The same workload on Claude Opus 4.8 at $5/$25 would be (10M × $5/M) + (2M × $25/M) = $50 + $50 = $100/day before caching, or ~$3,000/month. Net: ~$2,000/month savings on the Gemini side before caching on either model. This is where the price-performance crossover compounds.
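To rerun this math for your own volumes, a few lines of Python are enough. The prices come from the table above; the 30-day month and the assumption that caching discounts only input tokens are simplifications, and the function names are illustrative.

```python
# USD per million tokens, from the pricing table (global region).
INPUT_PER_M = 1.50
CACHED_INPUT_PER_M = 0.15
OUTPUT_PER_M = 9.00


def daily_cost(input_tokens: int, output_tokens: int, cache_hit_rate: float = 0.0) -> float:
    """Daily spend in USD; the cache discount applies to input tokens only."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    cached = input_tokens * cache_hit_rate
    fresh = input_tokens - cached
    total = fresh * INPUT_PER_M + cached * CACHED_INPUT_PER_M + output_tokens * OUTPUT_PER_M
    return total / 1_000_000


def monthly_cost(daily: float, days: int = 30) -> float:
    """Scale a daily figure to a month (30 days by assumption)."""
    return daily * days


print(f"{daily_cost(100_000, 50_000):.2f}")       # 0.60  (Scenario A)
print(f"{daily_cost(2_000_000, 500_000):.2f}")    # 7.50  (Scenario B)
print(f"{daily_cost(10_000_000, 2_000_000):.2f}") # 33.00 (Scenario C)
```

Pass `cache_hit_rate` to see how much caching actually moves your bill; on output-heavy workloads the effect is smaller than the 90% headline discount suggests.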
When Flash beats Opus on cost
The honest comparison:
- High-volume agent loops: Flash wins by ~3.3× on input, ~2.8× on output. At 10M+ tokens/day this matters.
- Heavy prompt caching workloads: Flash's $0.15/M cached input is ~6.7× cheaper than Opus's $1.00/M cached input. Caching compounds the savings.
- Multimodal-heavy work: Flash handles audio/video/PDF natively at the standard input price. Opus charges premium rates for vision and doesn't support audio or video input at all.
- Batch-friendly workloads: Flash's batch tier at $0.75/$4.50 is one of the cheapest frontier-grade options available.
When Opus still wins on value (even at 3× the price)
- Code review and pair programming where honesty matters. Opus 4.8's 0% misreporting on flawed data and 4× fewer missed code flaws is worth the price premium when the cost of a wrong recommendation is high.
- One-shot critical work. If the model gets exactly one chance to be right, the extra reasoning depth pays for itself.
- Long-context retrieval past 200K tokens. Flash regresses here vs 3.1 Pro and isn't close to Opus 4.8 on long-context fidelity.
- Hard math and science reasoning. Opus 4.8 and 3.1 Pro both still beat 3.5 Flash on Humanity's Last Exam and ARC-AGI-2. Wait for 3.5 Pro if Gemini is your preferred ecosystem.
Section 12
The Numbers (Brief)
The benchmarks vibe coders actually care about for production work. Sources: Google's official model card on the Gemini 3.5 announcement, the Artificial Analysis intelligence index, and the independent benchmarks at OpenRouter.
| Benchmark | 3.5 Flash | 3.1 Pro | What it measures |
|---|---|---|---|
| Terminal-Bench 2.1 | 76.2% | 70.3% | Multi-tool CLI agent loops |
| MCP Atlas | 83.6% | 78.2% | Multi-tool coordination reliability |
| Finance Agent v2 | 57.9% | 43.0% | Finance reasoning + tool use |
| GDPval-AA | 1656 Elo | 1314 Elo | Real-world economic value tasks |
| CharXiv Reasoning | 84.2% | n/r | Multimodal visual reasoning |
| SWE-bench Verified | ~71% | ~73% | Software engineering on real GitHub issues |
| Humanity's Last Exam | n/r | (higher) | Hard expert-level reasoning — 3.1 Pro still wins |
| ARC-AGI-2 | n/r | 77.1% | Abstract reasoning — 3.1 Pro still wins |
| MRCR v2 (128K) | n/r | (higher) | Long-context retrieval — 3.1 Pro still wins |
Reading the table honestly
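Flash beats Pro on agentic coding (Terminal-Bench 2.1), tool use, finance, and real-world task value (GDPval-AA). The table has no 3.1 Pro figure for CharXiv, so the multimodal lead is not shown here, and Pro keeps a slight edge on SWE-bench Verified. Pro still wins on long-context retrieval, hard math/science reasoning, and abstract reasoning (ARC-AGI-2). The headline “Flash beats Pro” is true for the benchmarks most vibe coders care about. It is not true across the board. Plan accordingly.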
Section 13
The Architect's Practice
The migration from Gemini 3.1 (Pro or Flash Preview) to 3.5 Flash isn't automatic. Five concrete changes you make, plus the two routing patterns that make the new model sustainable in a multi-model practice.
The five-step migration from 3.1 / 3 Flash Preview
Step 1: Update the model ID. From gemini-3-flash-preview (or gemini-3.1-pro) to gemini-3.5-flash. The model ID is stable GA — no preview suffix, no version pinning required.
Step 2: Replace thinking_budget with thinking_level. The old parameter is being removed. Map your existing budget values to the four named levels. If you were running at high budget, try medium first — that's the new default and usually enough.
Step 3: Remove temperature, top_p, and top_k from your config. Google's explicit guidance: don't change these on 3.5. The defaults are tuned. Setting them manually now hurts quality more than it helps.
Step 4: Update function-calling response matching. Every FunctionResponse part must include id and a name that matches the preceding call. Mismatched responses now fail loudly instead of being silently accepted. Audit your tool-use code paths.
Step 5: Test for default-effort changes. The default dropped from high to medium. Some workloads will get faster and cheaper at the new default. Others will need an explicit bump back to high. Run your eval suite before flipping the switch in production.
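Steps 1 through 4 can be rehearsed on plain dictionaries before you touch real SDK calls. Everything here is a sketch under assumptions: the flat config shape is hypothetical (it is not the Gemini SDK's request format), and the budget thresholds in `budget_to_level` are illustrative guesses, since Google's exact mapping isn't given here.

```python
SAMPLING_KEYS = ("temperature", "top_p", "top_k")


def budget_to_level(thinking_budget: int) -> str:
    """Map a numeric budget to a named level. Thresholds are illustrative guesses."""
    if thinking_budget <= 0:
        return "minimal"
    if thinking_budget <= 1024:
        return "low"
    if thinking_budget <= 8192:
        return "medium"
    return "high"  # per Step 2, consider trying "medium" first


def migrate_config(old: dict) -> dict:
    """Apply Steps 1-3: new model ID, thinking_level, no sampling overrides."""
    new = {k: v for k, v in old.items()
           if k not in SAMPLING_KEYS and k != "thinking_budget"}
    new["model"] = "gemini-3.5-flash"
    if "thinking_budget" in old:
        new["thinking_level"] = budget_to_level(old["thinking_budget"])
    return new


def unmatched_responses(calls: list[dict], responses: list[dict]) -> list[dict]:
    """Step 4 audit: responses whose (id, name) pair matches no preceding call."""
    expected = {(c["id"], c["name"]) for c in calls}
    return [r for r in responses if (r.get("id"), r.get("name")) not in expected]
```

Running `unmatched_responses` over recorded tool-use traces is a cheap way to find the code paths that would now fail loudly.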
The two routing patterns for a multi-model practice
Cost-first routing (default daily driver): 3.5 Flash for everything. Escalate to Opus 4.8 or 3.5 Pro (when it ships) only for the specific tasks where Flash regresses — long-context retrieval, hard reasoning, code review where honesty matters.
Honesty-first routing (production-critical work): Opus 4.8 for code review, adversarial review, and one-shot critical decisions. 3.5 Flash for everything else — scaffolding, refactoring, parallel exploration, agent loops, multimodal work. The price differential makes the “Opus for review, Flash for build” pattern economically rational at any scale.
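Both routing patterns reduce to a lookup on task type. The sketch below is one way to encode them; the task-type labels are made up for illustration, and `claude-opus-4.8` is a hypothetical model ID (only `gemini-3.5-flash` is given in this guide).

```python
FLASH = "gemini-3.5-flash"    # model ID from this guide
REVIEWER = "claude-opus-4.8"  # hypothetical ID for the escalation model

# Illustrative task labels, derived from the two patterns described above.
FLASH_WEAK_SPOTS = {"long_context_retrieval", "hard_reasoning", "code_review"}
HONESTY_CRITICAL = {"code_review", "adversarial_review", "one_shot_critical"}


def pick_model(task_type: str, policy: str = "cost_first") -> str:
    """Return the model to call for a task under the chosen routing policy."""
    if policy == "cost_first":
        # Flash by default; escalate only where Flash regresses.
        return REVIEWER if task_type in FLASH_WEAK_SPOTS else FLASH
    if policy == "honesty_first":
        # Reviewer for review and one-shot decisions; Flash for the build work.
        return REVIEWER if task_type in HONESTY_CRITICAL else FLASH
    raise ValueError(f"unknown policy: {policy}")
```

Keeping the sets explicit makes the routing auditable: when 3.5 Pro ships, changing the escalation target is a one-line edit.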
The two anti-patterns to avoid
Anti-pattern 1: Defaulting to high thinking_level on every call. The token cost compounds invisibly. Pick the level per task type.
Anti-pattern 2: Treating subagents as free. Each subagent has its own context window and consumes its own tokens. Five subagents running in parallel can consume roughly 5× the tokens of a single sequential agent, since each one loads its own context. The wall-clock savings can be huge, but the dollar cost can also be huge. Budget accordingly.
Section 14
What's Next: 3.5 Pro Coming June 2026
The most consequential thing Google didn't ship at I/O 2026 is the one most worth tracking. Gemini 3.5 Pro was supposed to be the headline announcement and is delayed to June. Pichai's stage line was “Give us until next month to get it to you,” and the audience reportedly groaned. Google's blog confirms 3.5 Pro is in internal use now.
What we know (limited)
- Confirmed: 3.5 Pro is in internal use at Google and rolling out next month.
- Confirmed: It's being designed to close the gaps 3.5 Flash regressed on — long-context retrieval, hard reasoning, ARC-AGI-2-class abstract reasoning.
- Not confirmed: pricing, benchmarks, model card. Google has shared none of this publicly.
- Reasonable expectation: 3.5 Pro will close most of the Flash regressions and likely take the Artificial Analysis Intelligence Index lead on launch, the way 3.1 Pro did in February.
What this means for your practice today
Don't rebuild for 3.5 Pro yet. Build on 3.5 Flash with the routing pattern that keeps 3.1 Pro available for the regression cases. When 3.5 Pro ships, the swap will be cheap because the API is consistent across the 3.5 family.
Watch the pricing. The single biggest unknown is whether 3.5 Pro will be priced higher than 3.1 Pro (likely, given the capability claims) or held flat (possible, if Google is competing harder on cost-per-intelligence). If it ships at $2.00/$12.00 like 3.1 Pro, the cost-quality argument for the Flash + Pro split gets stronger. If it ships at $3.00/$15.00, more workloads stay on Flash.
The next masterclass. When 3.5 Pro ships, this masterclass gets a sibling. The pattern continues: Google ships a frontier model, the DDS Vibe Academy ships the vibe coder's guide. The 5-block intent recipe stays the same. The model gets stronger. Your practice gets sharper.
The honest closing thought
Gemini 3.5 Flash is the most consequential consumer AI launch of 2026 so far — not because it's the smartest model on every benchmark, but because it's the model that's already running for the most people. The Gemini app and Search AI Mode put 3.5 Flash in front of nearly a billion monthly active users. Most of those users will never know the model name. They'll just know that Gemini got faster, more useful, and more reliable in May 2026. That's the real story. Spark is the future. Antigravity is the daily driver. The 5-block intent recipe is the practice. Build accordingly.
Section 15
Frequently Asked Questions
What is Gemini 3.5 Flash?
Gemini 3.5 Flash is Google DeepMind's newest Flash-tier AI model, released generally available on May 19, 2026 at Google I/O. The API model ID is gemini-3.5-flash. It features a 1 million token context window, multimodal input (text, image, audio, video, PDF), text output up to 65,536 tokens, dynamic thinking on by default with a new thinking_level parameter, and pricing of $1.50 per million input tokens and $9.00 per million output tokens. Google claims it outperforms Gemini 3.1 Pro on most coding and agentic benchmarks while running about 4 times faster.
Where does Gemini 3.5 Flash run?
Everywhere Google sells Gemini. Consumer: the Gemini app and AI Mode in Google Search (default model globally). Developer: Gemini API in Google AI Studio, Android Studio, Google Antigravity 2.0 IDE, and Vertex AI. Enterprise: Gemini Enterprise, Gemini Enterprise Agent Platform, and Google Workspace (powering AI features in Gmail, Docs, Sheets, Slides, Keep). Partner integrations announced at launch include Shopify, Macquarie Bank, Salesforce Agentforce, Ramp, Xero, and Databricks.
What is Gemini Spark?
Gemini Spark is Google's 24/7 persistent personal AI agent, powered by Gemini 3.5 Flash and the Antigravity harness. Unlike a chatbot you open and close, Spark runs on dedicated virtual machines on Google Cloud and keeps working in the background even when you close your laptop. It connects natively to Gmail, Calendar, Drive, Docs, Sheets, Slides, YouTube, and Google Maps (off by default; you turn each one on in settings). Spark is in beta to Google AI Ultra subscribers in the US ($100/month tier).
What are the thinking levels in Gemini 3.5 Flash?
Gemini 3.5 Flash introduces a thinking_level parameter that replaces the older thinking_budget approach. Four levels: minimal (fastest, lowest cost), low, medium (default), and high (deepest reasoning, slowest, most tokens). Important migration note: the default was lowered from high (in Gemini 3 Flash Preview) to medium in 3.5 Flash, so workloads carried over from preview may need adjustment. Google's guidance: lower thinking levels reduce unnecessary tool calls in agentic loops.
Is Gemini 3.5 Flash free to use?
It is free in the Gemini app and AI Mode in Google Search globally (the consumer surfaces). For developers and enterprises via the API, pricing is $1.50 per million input tokens and $9.00 per million output tokens, with a 90% discount on cached input ($0.15 per million). Gemini Spark requires the Google AI Ultra plan at $100 per month. Free tier API access is available through Google AI Studio for testing and prototyping.
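The cached-input discount is easiest to see with arithmetic. The sketch below uses only the three prices quoted above; the function name and the example token counts are illustrative assumptions, and it ignores anything the text does not mention (such as cache storage fees, if any apply).

```python
# USD per million tokens, as quoted in the text.
INPUT_PER_M = 1.50
CACHED_INPUT_PER_M = 0.15  # 90% off the standard input price
OUTPUT_PER_M = 9.00


def request_cost(input_tokens, output_tokens, cached_tokens=0):
    """Estimate the USD cost of one request.

    cached_tokens is the portion of input_tokens served from cache.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * INPUT_PER_M
        + cached_tokens * CACHED_INPUT_PER_M
        + output_tokens * OUTPUT_PER_M
    )
    return total / 1_000_000


# Hypothetical request: 100K input tokens (80K of them cached), 2K output tokens.
uncached = request_cost(100_000, 2_000)
cached = request_cost(100_000, 2_000, cached_tokens=80_000)
print(f"without cache: ${uncached:.3f}")
print(f"with cache:    ${cached:.3f}")
```

For this assumed request the cost falls from about $0.168 to about $0.060, which is why long, repeated system prompts and shared documents are the first things worth caching.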
Did Google deliver what they promised at I/O 2026?
Partially. The headline announcement was supposed to be Gemini 3.5 Pro, which was delayed to June 2026. Sundar Pichai's exact stage line was “Give us until next month to get it to you,” and the audience reportedly groaned. What did ship is 3.5 Flash, which Google says outperforms the previous-generation 3.1 Pro on most coding and agentic benchmarks. It is a real capability shift, but it is not the model that was promised. This masterclass is honest about that gap.
Should I use Gemini 3.5 Flash or Claude Opus 4.8 for coding?
Different tools for different work. Gemini 3.5 Flash wins on price (about 3 times cheaper than Opus 4.8), output speed (Google claims 4 times faster), and integration with the Google ecosystem (Workspace, Search, Android). Claude Opus 4.8 wins on honesty improvements (0% misreporting on flawed data, 4 times fewer missed code flaws), the Dynamic Workflows feature, and code-review quality. For high-volume agent loops at scale, the cost math favors Flash. For one-shot critical work, Opus 4.8's honesty matters more.
What is Antigravity 2.0?
Antigravity is Google's agent-first desktop IDE, updated to version 2.0 at I/O 2026. It runs Gemini 3.5 Flash natively (up to 12 times faster than other surfaces, per Google) and adds: artifacts (versioned outputs you can iterate on), multi-agent orchestration (deploy parallel subagents), one-click export from AI Studio, native Android Studio integration, and the new Managed Agents API for hosted Linux sandboxes. It is free to download. It is a separate product from Anthropic's Cowork and Claude Code: Antigravity is Google's answer to the agent-IDE category.
What is the 5-block intent template?
A prompting recipe that works across the Gemini app, AI Studio, Antigravity, and the Gemini API. Five blocks, in order: state the outcome, name the constraints, point to context, declare the test, set the verification gate. The recipe matches how Gemini 3.5 Flash reasons: it orients around the goal first, then attends to constraints, then to context, then verifies against the test, then surfaces uncertainty at the gate. Every paste-ready prompt in this masterclass follows the template. The same recipe works for Claude Opus 4.8, GPT-5.5, and any other frontier model.
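One way to keep the five blocks in order is to assemble prompts from a small structure rather than free text. This is a minimal sketch, not the masterclass's own tooling: the class name, the block labels, and the example content are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class IntentPrompt:
    """The five blocks, in the order the recipe lists them."""
    outcome: str
    constraints: str
    context: str
    test: str
    verification_gate: str

    def render(self) -> str:
        # Labels are illustrative; the recipe fixes the order, not the wording.
        blocks = [
            ("Outcome", self.outcome),
            ("Constraints", self.constraints),
            ("Context", self.context),
            ("Test", self.test),
            ("Verification gate", self.verification_gate),
        ]
        missing = [label for label, text in blocks if not text.strip()]
        if missing:
            raise ValueError(f"empty blocks: {', '.join(missing)}")
        return "\n\n".join(f"{label}: {text.strip()}" for label, text in blocks)


# Hypothetical example content.
prompt = IntentPrompt(
    outcome="Add retry with backoff to the upload client.",
    constraints="No new dependencies; keep the public function signatures.",
    context="See upload_client.py and the existing tests in test_upload.py.",
    test="All existing tests pass, plus a new test that simulates two failures.",
    verification_gate="Before finishing, list anything you could not verify.",
)
print(prompt.render())
```

Rejecting empty blocks is a deliberate choice here: a prompt with no stated test or gate silently degrades into an ordinary request, and the structure makes that omission visible.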
How does Gemini 3.5 Flash compare to Gemini 3.1 Pro?
3.5 Flash beats 3.1 Pro on Terminal-Bench 2.1 (76.2% vs 70.3%), MCP Atlas (83.6% vs 78.2%), Finance Agent v2 (57.9% vs 43.0%), GDPval-AA Elo (1656 vs 1314), and CharXiv multimodal reasoning (84.2%). 3.1 Pro still wins on Humanity's Last Exam, ARC-AGI-2, and MRCR v2 128K long-context retrieval. Practical rule: use 3.5 Flash for coding, tool use, finance, and multimodal. Stay on 3.1 Pro for hard math and science reasoning and long-context analysis past 200K tokens until 3.5 Pro ships in June.
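The practical rule above can be written down as a small routing function. Treat this as a sketch under stated assumptions: the task labels are invented for illustration, the 3.1 Pro model ID is a placeholder (only gemini-3.5-flash is confirmed in this masterclass), and sending unknown task types to Flash is a design choice, not Google guidance.

```python
FLASH_MODEL = "gemini-3.5-flash"
PRO_MODEL = "gemini-3.1-pro"  # placeholder ID; check the model list for the real one

# Hypothetical task labels for the cases where the text says 3.1 Pro still wins.
PRO_TASKS = {"hard_math", "science_reasoning"}

# The text recommends staying on 3.1 Pro for long-context analysis past 200K tokens.
LONG_CONTEXT_THRESHOLD = 200_000


def pick_model(task_type: str, context_tokens: int) -> str:
    """Route a request to Flash unless it hits a known regression case."""
    if context_tokens > LONG_CONTEXT_THRESHOLD:
        return PRO_MODEL
    if task_type in PRO_TASKS:
        return PRO_MODEL
    # Coding, tool use, finance, multimodal, and anything unlisted go to Flash.
    return FLASH_MODEL


for task, tokens in [
    ("coding", 30_000),
    ("hard_math", 5_000),
    ("finance", 450_000),
]:
    print(f"{task:10s} {tokens:>8,d} tokens -> {pick_model(task, tokens)}")
```

Keeping the model IDs in two constants is what makes a later swap cheap: when 3.5 Pro ships, only the fallback constant should need to change.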
What Workspace apps does Gemini 3.5 Flash power?
Gmail (smart reply, summarization, draft generation, search), Docs (writing assistance, citations, formatting), Sheets (formula generation, data analysis, chart suggestions), Slides (deck generation, layout suggestions, image generation via Google Pics), Keep (voice notes, smart organization), Drive (file search, content extraction), Calendar (scheduling agents via Spark), and Meet (transcription, summarization). 3.5 Flash is the default model behind most Workspace AI features as of late May 2026.
What is the Managed Agents API?
Managed Agents is a new Gemini API feature launched in public preview at I/O 2026. A single API call gives you an agent plus a hosted Linux sandbox environment that supports Bash, Python, and Node.js, with file management, web browsing, custom markdown-defined skills, and the ability to mount Git repos or GCS buckets. The launch includes the Antigravity Agent (antigravity-preview-05-2026), a general-purpose managed agent that can autonomously plan, reason, write and execute code, manage files, and browse the web inside its sandbox container.
What is the single most important thing to do with Gemini 3.5 Flash?
Pick the right thinking_level deliberately, per task. The default dropped from high in preview to medium in GA, which means routine work is faster and cheaper but hard reasoning may need an explicit bump back to high. For coding loops and agentic work where the cost of a wrong tool call is high, set thinking_level to high. For drafting, summarizing, and routine analysis, medium or low is fine. For high-volume background processing where speed matters more than depth, minimal saves tokens dramatically.
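A lookup table is often enough to make that choice explicit instead of relying on the default. The sketch below is illustrative only: the task labels are hypothetical, the mapping picks one option where the text allows "medium or low", and the settings dictionary is a plain Python structure rather than a real SDK request object, so where thinking_level actually goes in your client library is an assumption to check against the documentation.

```python
THINKING_LEVELS = ("minimal", "low", "medium", "high")
GA_DEFAULT = "medium"  # the 3.5 Flash default, per the text

# Hypothetical task labels mapped to the levels the guidance suggests.
TASK_LEVELS = {
    "coding_loop": "high",
    "agentic": "high",
    "drafting": "medium",
    "summarizing": "low",
    "routine_analysis": "medium",
    "background_batch": "minimal",
}


def thinking_level_for(task_type: str) -> str:
    """Return an explicit thinking level, falling back to the GA default."""
    level = TASK_LEVELS.get(task_type, GA_DEFAULT)
    if level not in THINKING_LEVELS:
        raise ValueError(f"unknown thinking level: {level}")
    return level


def request_settings(task_type: str) -> dict:
    """Build illustrative request settings with the level always spelled out."""
    return {
        "model": "gemini-3.5-flash",
        "thinking_level": thinking_level_for(task_type),
    }


print(request_settings("coding_loop"))
print(request_settings("background_batch"))
print(request_settings("something_new"))
```

Always writing the level into the settings, even when it matches the default, protects carried-over workloads from silent behavior changes if the default moves again.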
Where can I learn more about Gemini 3.5 Flash?
Google's official announcement is at blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/. The developer documentation lives at ai.google.dev/gemini-api/docs, with the specific 3.5 migration guide at ai.google.dev/gemini-api/docs/interactions/whats-new-gemini-3.5. Google Cloud's enterprise rollout post is at cloud.google.com/blog. The Gemini Spark product page is at gemini.google/overview/agent/spark/. This masterclass is sourced from those Google primary sources plus a dozen independent technical analyses published in the two weeks after launch.
