Specifications

Model ID gemini-3-pro-preview
Provider Google
Architecture transformer
Context Window 1M tokens
Max Input 1M tokens
Max Output 64K tokens
Knowledge Cutoff 2025-01-31
License proprietary
Open Weights No

Capabilities

Modalities

Input: text, images, video, audio, code, pdf
Output: text, images

Reasoning

Reasoning Model, Agentic

Features

function-calling, streaming-function-calling, multimodal-function-responses, json-mode, structured-outputs, vision, video-understanding, audio-understanding, streaming, parallel-tool-calling, google-search-grounding, code-execution, thought-signatures, dynamic-thinking

Variants

VARIANT | API ID | DESCRIPTION
Gemini 3 Pro | gemini-3-pro-preview | Flagship reasoning model with dynamic thinking
Gemini 3 Pro Image (Nano Banana Pro) | gemini-3-pro-image-preview | Native 4K image generation with advanced text rendering
Gemini 3 Deep Think (Consumer) | gemini-3-deep-think | Extended parallel hypothesis exploration for complex reasoning

API Pricing

Input $2 per 1M tokens
Output $12 per 1M tokens
Cached Input $0.20 per 1M tokens
Batch Input $1 per 1M tokens
Batch Output $6 per 1M tokens

Pricing doubles for prompts exceeding 200K tokens. No free API tier—only paid access. Batch API offers 50% discount with 24-hour turnaround. Google Search grounding billing starts January 5, 2026.

Gemini Access

TIER | PRICE | RATE LIMIT
Free | Free | Limited
Google AI Pro | $19.99/mo | Expanded limits
Google AI Ultra | $249.99/mo | Highest limits

Benchmarks

Coding

SWE-bench Verified 76.2%
Terminal-Bench 2.0 54.2%

Reasoning

GPQA Diamond 91.9%
MMLU 91.8%
MMMU-Pro 81%
ARC-AGI-2 31.1%

Math

AIME 2025 95%

Rankings

Artificial Analysis #1
LMArena Elo 1501

First model to break 1500 LMArena Elo. ARC-AGI-2 scores represent breakthrough in abstract reasoning—45.1% with Deep Think is nearly 3x GPT-5.1's 17.6%. MathArena Apex at 23.4% is >20x improvement over all competitors. SWE-bench trails Claude Opus 4.5 by 4.7pp. 88% hallucination rate when incorrect remains a concern.

Released November 18, 2025, Gemini 3 Pro gives Google its first undisputed claim to the frontier AI crown. The model achieved an unprecedented 1501 Elo on LMArena, the first to break the 1500 barrier, while delivering breakthrough performance on abstract reasoning that leaves competitors in the dust. At 45.1% on ARC-AGI-2 (with Deep Think), it nearly triples GPT-5.1's 17.6% and represents a genuine leap in AI reasoning capability.

The model maintains Google’s long-context advantage with a 1 million token input window and introduces native multimodal capabilities unmatched by competitors—including video understanding and screen comprehension where it scores 72.7% versus GPT-5.1’s 3.5%. However, an 88% hallucination rate when incorrect and a coding gap behind Claude Opus 4.5 (76.2% vs 80.9% on SWE-bench) temper the enthusiasm. At $2-4 per million input tokens, it commands a premium but delivers exceptional value for tasks requiring its reasoning and multimodal strengths.

Quick specs

Provider Google
Released November 18, 2025
Context window 1M tokens input / 64K output
Knowledge cutoff January 2025
Input price $2.00 / MTok (≤200K); $4.00 / MTok (>200K)
Output price $12.00 / MTok (≤200K); $18.00 / MTok (>200K)
Cached input $0.20 / MTok (90% discount)
SWE-bench Verified 76.2%
GPQA Diamond 91.9%
ARC-AGI-2 31.1% (45.1% with Deep Think)
LMArena Elo 1501 (#1)
Best for Abstract reasoning, multimodal tasks, long-context processing, web development
Limitations Coding trails Claude, 88% hallucination rate when wrong


What’s new in Gemini 3 Pro

Gemini 3 Pro represents a generational leap rather than incremental improvement. The gains from Gemini 2.5 Pro are substantial across every benchmark category—MathArena Apex improved by 4,580%, ARC-AGI-2 by 535%, and SWE-bench by 28%.

Dynamic thinking architecture

The model introduces dynamic thinking through a new thinking_level parameter that controls reasoning depth. Three levels are available: low (minimises latency and cost), medium (currently unsupported), and high (default, maximises reasoning capability). Unlike GPT-5.1’s binary thinking/instant split, Gemini 3 Pro applies reasoning adaptively based on query complexity.

Thought signatures represent a significant innovation—encrypted reasoning context preserved across API calls. This enables maintaining complex reasoning chains in multi-turn conversations without re-processing the entire context, dramatically improving efficiency for agentic workflows.
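A minimal sketch of how this plays out with the Python SDK used elsewhere on this page (the prompts are hypothetical; the wire format of thought signatures is handled by the API and not shown here):

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3-pro-preview")

# A chat session replays the model's previous turns verbatim on each request,
# so any thought signatures the API attached to those turns are returned to it
# unchanged rather than the reasoning being rebuilt from scratch.
chat = model.start_chat()
plan = chat.send_message("Outline a three-step refactor of this parser module.")
step2 = chat.send_message("Apply only step 2 and explain the trade-offs.")

# chat.history holds both user and model turns, including the parts that carry
# the preserved reasoning context across the two API calls above.
print(len(chat.history))  # 4 entries: two user turns, two model turns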

Native multimodal processing

Gemini 3 Pro processes text, images, video, audio, code, and PDFs as first-class inputs. The ScreenSpot-Pro benchmark demonstrates the multimodal advantage starkly: 72.7% versus GPT-5.1’s 3.5%—a 20x performance gap in screen understanding tasks. Video-MMMU scores 87.6%, with no competitor within 7 percentage points.

Media token consumption varies by resolution: images consume 280-1,120 tokens depending on detail level, while video frames cost approximately 70 tokens each at standard resolution.
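As a rough back-of-the-envelope check (the per-frame and per-image figures above are approximations, and the 1 frame-per-second sampling rate is an assumption):

INPUT_PRICE_PER_MTOK = 2.00             # standard tier, prompts <= 200K tokens

frames = 10 * 60                        # a 10-minute video sampled at ~1 fps
video_tokens = frames * 70              # ~70 tokens per frame -> 42,000 tokens
image_tokens = 3 * 1_120                # three high-detail images, worst case
text_tokens = 500                       # a short accompanying text prompt

prompt_tokens = video_tokens + image_tokens + text_tokens
cost = prompt_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
print(f"{prompt_tokens:,} input tokens ≈ ${cost:.3f}")   # 45,860 tokens ≈ $0.092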

Advanced tool capabilities

Two new tool features target agentic workflows. Streaming function calling streams partial function call arguments as they’re generated, enabling faster response times for complex tool chains. Multimodal function responses allow functions to return images and PDFs directly, enabling sophisticated document processing pipelines.
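For orientation, here is a plain function-calling sketch with the google-generativeai SDK (the local function and its values are hypothetical; the two newer Gemini 3 features above, streaming partial arguments and image/PDF function responses, are not shown because their exact SDK surface may differ):

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

def get_invoice_total(invoice_id: str) -> float:
    """Hypothetical local lookup the model can call as a tool."""
    return {"INV-001": 1234.56}.get(invoice_id, 0.0)

# Passing Python callables as tools enables automatic function calling: the SDK
# executes the call the model requests and feeds the result back to the model.
model = genai.GenerativeModel("gemini-3-pro-preview", tools=[get_invoice_total])
chat = model.start_chat(enable_automatic_function_calling=True)
response = chat.send_message("What is the total of invoice INV-001?")
print(response.text)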

Google Search grounding provides real-time information access beyond the January 2025 knowledge cutoff. Billing for this feature begins January 5, 2026.
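Enabling grounding is essentially a one-line tool addition. The sketch below assumes the newer google-genai client, where Search is exposed as a built-in tool; parameter names may shift while the model is in preview:

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Attach the built-in Google Search tool so answers can cite post-cutoff facts.
response = client.models.generate_content(
    model="gemini-3-pro-preview",
    contents="Summarise this week's changes to the Gemini API release notes.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)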

The Gemini 3 model family

Google released Gemini 3 Pro as the flagship model, with additional variants for specialised use cases:

Variant | API Identifier | Purpose | Notes
Gemini 3 Pro | gemini-3-pro-preview | Flagship reasoning model | Preview status, GA expected Q1 2026
Gemini 3 Pro Image | gemini-3-pro-image-preview | Native 4K image generation | Also called “Nano Banana Pro”
Gemini 3 Deep Think | gemini-3-deep-think | Extended parallel reasoning | AI Ultra subscribers only

Gemini 3 Deep Think deserves special attention. Released December 4, 2025, exclusively to Google AI Ultra subscribers ($249.99/month), it enables extended parallel hypothesis exploration for complex reasoning tasks. On ARC-AGI-2, Deep Think pushes performance from 31.1% to 45.1%—a 45% improvement that represents the model’s ceiling on abstract reasoning. Google describes it as exploring “multiple reasoning paths simultaneously before converging on the best solution.”

Per official documentation: “We plan to release additional models to the Gemini 3 series soon.”

Benchmark performance

Gemini 3 Pro achieves state-of-the-art results on reasoning and multimodal benchmarks while trailing Claude Opus 4.5 on real-world coding tasks. The breakthrough performances on abstract reasoning and mathematics represent genuine advances rather than incremental gains.

Reasoning benchmarks

Benchmark | Gemini 3 Pro | GPT-5.1 | Claude Opus 4.5 | Notes
LMArena Elo | 1501 (#1) | ~1480 | ~1485 | First model to break 1500
GPQA Diamond | 91.9% | 87.0% | 87.0% | Surpasses human experts (89.8%)
Humanity’s Last Exam | 37.5% | 26.5% | ~14% | New benchmark for frontier models
ARC-AGI-2 | 31.1% (45.1% DT) | 17.6% | 37.6% | Deep Think nearly 3x GPT-5.1
MMLU | 91.8% | 91.5% | 87.4% | Saturated benchmark

The ARC-AGI-2 result is the headline number. This benchmark tests abstract reasoning ability on novel visual puzzles—the kind of pattern recognition that was thought to require human-like generalisation. Gemini 3 Pro’s 45.1% with Deep Think represents a nearly 3x improvement over GPT-5.1’s 17.6%, and even the base 31.1% score substantially outperforms most competitors.

Coding benchmarks

Benchmark | Gemini 3 Pro | GPT-5.1 | Claude Opus 4.5 | Notes
SWE-bench Verified | 76.2% | 76.3% | 80.9% | Claude leads by 4.7pp
Terminal-Bench 2.0 | 54.2% | 47.6% | 59.3% | Claude leads
WebDev Arena Elo | 1487 (#1) | | | Best for frontend/UI
LiveCodeBench Elo | 2,439 | 2,243 | ~2,100 | Real-time coding

SWE-bench Verified tests models on real GitHub pull requests. Gemini 3 Pro’s 76.2% essentially ties GPT-5.1’s 76.3% but trails Claude Opus 4.5’s industry-leading 80.9% by a meaningful 4.7 percentage points. However, on WebDev Arena—which tests frontend and UI development specifically—Gemini 3 Pro leads at 1487 Elo. Developers report reaching for Gemini “for UI/frontend tasks” where it excels.

Mathematics benchmarks

Benchmark | Gemini 3 Pro | GPT-5.1 | Claude Opus 4.5 | Notes
AIME 2025 | 95% | 94% | 87% | Competition math
AIME 2025 (with code) | 100% | 100% | 100% | All models achieve perfect
MathArena Apex | 23.4% | 1.0% | 1.6% | >20x improvement over competitors

The MathArena Apex score of 23.4%—versus under 2% for all competitors—represents a genuine breakthrough in mathematical reasoning. This is a >20x improvement over both GPT-5.1 (1.0%) and Gemini 2.5 Pro (0.5%).

Multimodal benchmarks

Benchmark | Gemini 3 Pro | GPT-5.1 | Claude | Notes
MMMU-Pro | 81% | 76% | 68% | Visual reasoning
Video-MMMU | 87.6% | 80.4% | 77.8% | Video understanding
ScreenSpot-Pro | 72.7% | 3.5% | | 20x advantage in screen understanding
SimpleQA Verified | 72.1% | 34.9% | 29.3% | Factual accuracy

The multimodal results are where Gemini 3 Pro dominates decisively. ScreenSpot-Pro measures the ability to understand and interact with desktop and mobile interfaces—Gemini’s 72.7% versus GPT-5.1’s 3.5% represents a 20x performance gap. This makes Gemini 3 Pro the clear choice for computer use and screen automation tasks.

Pricing breakdown

Gemini 3 Pro commands a 20-60% premium over Gemini 2.5 Pro, with pricing that doubles for prompts exceeding 200,000 tokens.

Tier | Input (per MTok) | Output (per MTok)
Standard (≤200K) | $2.00 | $12.00
Long context (>200K) | $4.00 | $18.00
Cached input (≤200K) | $0.20 | n/a
Cached input (>200K) | $0.40 | n/a
Cache storage | $4.50/hour | n/a
Batch API (≤200K) | $1.00 | $6.00
Batch API (>200K) | $2.00 | $9.00
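A small helper makes the tiering concrete (a simplified sketch, not official billing logic: it applies the long-context rates to the whole request once the prompt exceeds 200K tokens, and assumes cached tokens follow the cache rates in the table in both modes):

def gemini_3_pro_cost(input_tokens: int, output_tokens: int,
                      cached_tokens: int = 0, batch: bool = False) -> float:
    """Estimate one request's cost in USD from the rates in the table above."""
    long_ctx = (input_tokens + cached_tokens) > 200_000
    if batch:
        in_rate, out_rate = (2.00, 9.00) if long_ctx else (1.00, 6.00)
    else:
        in_rate, out_rate = (4.00, 18.00) if long_ctx else (2.00, 12.00)
    cache_rate = 0.40 if long_ctx else 0.20   # assumed to apply in both modes
    return (input_tokens * in_rate
            + cached_tokens * cache_rate
            + output_tokens * out_rate) / 1_000_000

# The 500K-token document discussed below costs roughly $2 of input alone:
print(round(gemini_3_pro_cost(input_tokens=500_000, output_tokens=2_000), 3))  # 2.036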

Cost comparison with competitors

Model | Input | Output | Context | Notes
Gemini 3 Pro | $2-4 | $12-18 | 1M | Best for long-context reasoning
Claude Opus 4.5 | $5.00 | $25.00 | 200K | Best coding quality
Claude Sonnet 4.5 | $3.00 | $15.00 | 200K | Best coding value
GPT-5.1 | $1.25 | $10.00 | 400K | Cheapest frontier model

Gemini 3 Pro offers the best price-performance for tasks requiring large context windows. Processing a 500K-token document costs approximately $2.00 with Gemini; it is impossible with Claude (200K limit) and unreliable with GPT-5.1, where quality reportedly degrades beyond roughly 120K tokens despite the 400K window.

No free API tier

Unlike previous Gemini models, gemini-3-pro-preview has no free API tier—only paid access is available. However, Google AI Studio offers free interactive sessions with rate limits of 10-50 requests per minute for experimentation.

How to access Gemini 3 Pro

Via API

Gemini 3 Pro is available through both Google AI Studio and Vertex AI with no waitlist. Basic usage:

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-3-pro-preview")

response = model.generate_content(
    "Your prompt here",
    generation_config={
        "thinking_level": "high"  # low, medium (unsupported), high
    }
)

Enable context caching for repeated prompts:

import datetime

from google.generativeai import caching

cache = caching.CachedContent.create(
    model="gemini-3-pro-preview",
    contents=["Your system prompt or reference content"],
    ttl=datetime.timedelta(hours=1)
)

model = genai.GenerativeModel.from_cached_content(cache)
response = model.generate_content("Your query")

Rate limits scale with usage tier: 10 requests/minute on free tier, 50+ requests/minute on paid tiers. Vertex AI offers provisioned throughput options from $1,200/week to $2,000/month (annual commitment).
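On the lower requests-per-minute tiers, a simple backoff loop keeps scripts from failing on quota errors (a sketch assuming the SDK surfaces 429-style quota errors as google.api_core ResourceExhausted exceptions):

import time

import google.generativeai as genai
from google.api_core import exceptions

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3-pro-preview")

def generate_with_backoff(prompt: str, max_retries: int = 5):
    """Retry on quota errors with exponential backoff to stay under the RPM cap."""
    for attempt in range(max_retries):
        try:
            return model.generate_content(prompt)
        except exceptions.ResourceExhausted:
            # Quota exceeded; wait 2, 4, 8, ... seconds before retrying.
            time.sleep(2 ** (attempt + 1))
    raise RuntimeError("Rate limit retries exhausted")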

Via Gemini app

Access varies by subscription tier:

Tier | Price | Gemini 3 Pro | Deep Think | Key features
Free | $0 | ✓ (limited) | ✗ | Via “Thinking” dropdown
Google AI Pro | $19.99/mo | ✓ (expanded) | ✗ | + Deep Research, Nano Banana Pro, Workspace integration
Google AI Ultra | $249.99/mo | ✓ (highest) | ✓ | + Deep Think, Project Mariner, 30TB storage, YouTube Premium

Google AI Ultra is the only tier with access to Gemini 3 Deep Think, the extended reasoning mode that achieves 45.1% on ARC-AGI-2. Currently offered at 50% off for the first 3 months ($124.99/month).

AI Pro is available in 150+ countries; AI Ultra in 140+ countries.

How Gemini 3 Pro compares

vs Claude Opus 4.5

Claude leads on raw coding capability (80.9% vs 76.2% SWE-bench) and sustained autonomous engineering work. Multiple developers report Claude maintains better coherence in 30+ minute coding sessions. However, Gemini dominates on abstract reasoning (45.1% vs 37.6% ARC-AGI-2 with Deep Think) and offers 5x larger context windows.

Choose Claude for: complex refactoring, multi-file codebases, sustained autonomous coding, when code quality matters more than cost.

Choose Gemini for: abstract reasoning tasks, massive document analysis, UI/frontend development, multimodal workflows, computer use automation.

vs GPT-5.1

Gemini substantially outperforms GPT-5.1 on abstract reasoning (45.1% vs 17.6% ARC-AGI-2—nearly 3x better) and multimodal tasks (72.7% vs 3.5% on ScreenSpot-Pro—20x better). GPT-5.1 is 40% cheaper per input token and has better ecosystem integration.

Choose GPT-5.1 for: cost-sensitive production deployments, rapid iteration, ChatGPT ecosystem, when “good enough” ships faster.

Choose Gemini for: abstract reasoning, mathematical problems, screen understanding, video/audio processing, 1M token context needs.

The practical consensus

From community feedback: developers use Gemini 3 Pro for abstract reasoning, UI/frontend tasks, and long-context processing; Claude for complex coding work; and GPT-5.1 for speed and cost efficiency. Ben Tayloser from Creative Strategies noted after 5 days of testing: “Gemini 3 Pro was able to get to that 80% mark, other coding agents like Claude Code or Codex get 50-60% of the way there, then require a bunch of follow up.”

Known limitations

Independent testing and community reports reveal several significant weaknesses:

Hallucination behaviour is the primary concern. The AA-Omniscience benchmark reveals an 88% hallucination rate when the model is incorrect, matching Gemini 2.5's rate despite accuracy improvements. The model posts the highest accuracy of any model tested (53%) but remains overconfident when wrong, preferring fabrication over “I don’t know.” Zvi Mowshowitz summarised: “Gemini 3 is the most likely model to give you the right answer, but it’ll be damned before it answers ‘I don’t know’ and would rather make something up.”

Context degradation manifests at 300K+ tokens despite the 1M theoretical limit. Some users report issues with codebases under 10,000 lines. Multiple reports indicate the model “flakes out” with complex existing codebases.

Temporal confusion emerged as an early bug. Andrej Karpathy famously encountered the model refusing to believe the current date—when shown evidence, “the LLM accused Karpathy of gaslighting it—of uploading AI-generated fakes.”

Benchmark optimisation concerns persist. The model doesn’t know its own identity or version number and was trained on benchmark data (including BIG-bench canary strings). One user noted: “It feels oddly benchmarkmaxed. You can definitely feel the higher hallucination rate vs GPT.”

Content filtering is described as “tighter than expected,” refusing image generations with public figures and requiring rewording for some non-controversial historical questions.

Community reception

Overall sentiment runs approximately 75% positive, 25% negative, with praise focused on benchmark achievements and criticism centred on hallucination tendencies.

The positives

Enterprise partners report concrete productivity gains:

  • Equifax trial: 97% of 1,500 employees requested to keep licences; 90% reported measurable productivity gains averaging 1+ hours saved daily
  • Pinnacol Assurance: 96% time savings, 90% satisfaction
  • AdVon Commerce: Processed a 93,673-product catalogue in under a month (a task that previously took a full year)

Developer tool makers praise the model’s agentic capabilities:

  • Zach Lloyd, CEO of Warp: “Gemini 3 shows significant gains in reasoning, reliability in multi-step agent workflows, and an ability to debug tough development tasks with high-quality fixes. In early evaluations, it improved Warp’s Terminal Bench state of the art score by 20%.”
  • JetBrains: Reported “more than a 50% improvement over Gemini 2.5 Pro in the number of solved benchmark tasks”

The negatives

The LessWrong and AI safety community has been sharply critical. Zvi Mowshowitz’s review titled “Gemini 3 Pro Is a Vast Intelligence With No Spine” captured the concern: the model’s overconfidence when wrong undermines its utility for critical applications.

A persistent criticism describes the model as “benchmarkmaxxed”—optimised for evaluation performance while exhibiting quality inconsistency in real-world scenarios. API reliability issues include CLI crashes reported “at least half the time” by some users, though speed when functional is praised.

The verdict

The consensus: Gemini 3 Pro is the best model for abstract reasoning and multimodal tasks, but its hallucination behaviour requires verification workflows. Claude remains the choice for coding work where correctness matters.

Version history

Version | Released | Key changes
Gemini 3 Deep Think | Dec 4, 2025 | Extended reasoning for AI Ultra subscribers
Gemini 3 Pro | Nov 18, 2025 | First 1500+ LMArena Elo, dynamic thinking, 45.1% ARC-AGI-2
Gemini 2.5 Pro | Mar 2025 | #1 on LMArena, adaptive thinking
Gemini 2.5 Flash | Mar 2025 | Fast, cost-efficient reasoning
Gemini 2.0 Flash | Dec 2024 | Agentic foundations
Gemini 1.5 Pro | Feb 2024 | 1M context window

Improvements from Gemini 2.5 Pro

The performance gains from Gemini 2.5 Pro are substantial:

Benchmark | Gemini 3 Pro | Gemini 2.5 Pro | Improvement
LMArena Elo | 1501 | 1451 | +50 points
SWE-bench Verified | 76.2% | 59.6% | +28%
Humanity’s Last Exam | 37.5% | 21.6% | +74%
GPQA Diamond | 91.9% | 86.4% | +6.4%
MMMU-Pro | 81% | 68% | +19%
Terminal-Bench 2.0 | 54.2% | 32.6% | +66%
MathArena Apex | 23.4% | 0.5% | +4,580%
ARC-AGI-2 | 31.1% | 4.9% | +535%

FAQ

Is Gemini 3 Pro better than Claude Opus 4.5?

For coding, no—Claude leads with 80.9% vs 76.2% on SWE-bench. For abstract reasoning, yes—Gemini achieves 45.1% vs 37.6% on ARC-AGI-2. Choose based on your primary use case: Claude for coding, Gemini for reasoning and multimodal tasks.

How much does Gemini 3 Pro cost?

$2.00 per million input tokens (doubling to $4.00 for prompts exceeding 200K tokens), $12.00-18.00 per million output tokens. Cached inputs receive a 90% discount at $0.20/MTok. A typical conversation costs $0.02-0.05.

Can I use Gemini 3 Pro for free?

Limited free access is available through the Gemini app via the “Thinking” mode dropdown. API access requires payment—there is no free API tier for gemini-3-pro-preview.

What’s the difference between Gemini 3 Pro and Deep Think?

Deep Think is an extended reasoning mode that explores multiple reasoning paths simultaneously before converging. It improves ARC-AGI-2 performance from 31.1% to 45.1%—a 45% improvement—but is only available to Google AI Ultra subscribers ($249.99/month).

Should I upgrade from Gemini 2.5 Pro?

If you need frontier reasoning capability, yes—the improvements are substantial (28% better on SWE-bench, 535% better on ARC-AGI-2). If cost is a concern, note that pricing increased 20-60% from 2.5 Pro.

When will Gemini 3 Pro reach general availability?

Google expects GA in Q1 2026. The current gemini-3-pro-preview designation indicates preview status with potential API changes before stable release.

Does Gemini 3 Pro have a hallucination problem?

Yes. Independent testing shows an 88% hallucination rate when the model is incorrect—it prefers fabrication over admitting uncertainty. Always verify critical outputs, especially for factual claims outside its training data.
