Gemini 3 Pro
Specifications
| SPEC | VALUE |
|---|---|
| Model ID | gemini-3-pro-preview |
| Provider | Google |
| Architecture | transformer |
| Context Window | 1M tokens |
| Max Input | 1M tokens |
| Max Output | 64K tokens |
| Knowledge Cutoff | 2025-01-31 |
| License | proprietary |
| Open Weights | No |
Capabilities
| CAPABILITY | DETAILS |
|---|---|
| Modalities | Text, image, audio, video, code, and PDF inputs; text output |
| Reasoning | Configurable depth via the thinking_level parameter; Deep Think mode for extended reasoning |
| Features | Streaming function calling, multimodal function responses, context caching, Google Search grounding |
Variants
| VARIANT | API ID |
|---|---|
| Gemini 3 Pro | gemini-3-pro-preview |
| Gemini 3 Pro Image (Nano Banana Pro) | gemini-3-pro-image-preview |
| Gemini 3 Deep Think Consumer | gemini-3-deep-think |
API Pricing
Pricing doubles for prompts exceeding 200K tokens. No free API tier—only paid access. Batch API offers 50% discount with 24-hour turnaround. Google Search grounding billing starts January 5, 2026.
Gemini Access
| TIER | PRICE |
|---|---|
| Free | Free |
| Google AI Pro | $19.99/mo |
| Google AI Ultra | $249.99/mo |
Benchmarks
Coding
| BENCHMARK | SCORE |
|---|---|
| SWE-bench Verified | 76.2% |
| Terminal-Bench 2.0 | 54.2% |
Reasoning
| BENCHMARK | SCORE |
|---|---|
| GPQA Diamond | 91.9% |
| MMLU | 91.8% |
| MMLU-Pro | 81% |
| ARC-AGI-2 | 31.1% |
Math
| BENCHMARK | SCORE |
|---|---|
| AIME 2025 | 95% |
Rankings
| LEADERBOARD | RANK |
|---|---|
| Artificial Analysis | #1 |
| LMArena Elo | 1501 |
Gemini 3 Pro is the first model to break 1500 Elo on LMArena. Its ARC-AGI-2 scores mark a breakthrough in abstract reasoning: 45.1% with Deep Think is nearly 3x GPT-5.1's 17.6%. MathArena Apex at 23.4% is more than 10x any competitor's score. On SWE-bench it trails Claude Opus 4.5 by 4.7 percentage points, and an 88% hallucination rate when it answers incorrectly remains a concern.
Gemini 3 Pro is Google’s first undisputed claim to the frontier AI crown, released November 18, 2025. The model achieved an unprecedented 1501 Elo on LMArena—the first to break the 1500 barrier—while delivering breakthrough performance on abstract reasoning that leaves competitors in the dust. At 45.1% on ARC-AGI-2 (with Deep Think), it nearly triples GPT-5.1’s 17.6% and represents a genuine leap in AI reasoning capability.
The model maintains Google’s long-context advantage with a 1 million token input window and introduces native multimodal capabilities unmatched by competitors—including video understanding and screen comprehension where it scores 72.7% versus GPT-5.1’s 3.5%. However, an 88% hallucination rate when incorrect and a coding gap behind Claude Opus 4.5 (76.2% vs 80.9% on SWE-bench) temper the enthusiasm. At $2-4 per million input tokens, it commands a premium but delivers exceptional value for tasks requiring its reasoning and multimodal strengths.
Quick specs
| SPEC | VALUE |
|---|---|
| Provider | Google |
| Released | November 18, 2025 |
| Context window | 1M tokens input / 64K output |
| Knowledge cutoff | January 2025 |
| Input price | $2.00 / MTok (≤200K) / $4.00 / MTok (>200K) |
| Output price | $12.00 / MTok (≤200K) / $18.00 / MTok (>200K) |
| Cached input | $0.20 / MTok (90% discount) |
| SWE-bench Verified | 76.2% |
| GPQA Diamond | 91.9% |
| ARC-AGI-2 | 31.1% (45.1% with Deep Think) |
| LMArena Elo | 1501 (#1) |
| Best for | Abstract reasoning, multimodal tasks, long-context processing, web development |
| Limitations | Coding trails Claude, 88% hallucination rate when wrong |
What’s new in Gemini 3 Pro
Gemini 3 Pro represents a generational leap rather than incremental improvement. The gains from Gemini 2.5 Pro are substantial across every benchmark category—MathArena Apex improved by 4,580%, ARC-AGI-2 by 535%, and SWE-bench by 28%.
Dynamic thinking architecture
The model introduces dynamic thinking through a new thinking_level parameter that controls reasoning depth. Three levels are defined: low (minimises latency and cost), medium (not yet supported), and high (the default, which maximises reasoning capability). Unlike GPT-5.1's binary thinking/instant split, Gemini 3 Pro applies reasoning adaptively based on query complexity.
Thought signatures represent a significant innovation—encrypted reasoning context preserved across API calls. This enables maintaining complex reasoning chains in multi-turn conversations without re-processing the entire context, dramatically improving efficiency for agentic workflows.
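As an illustration of what this enables, here is a minimal, hedged sketch of a multi-turn loop that keeps the model's returned content (and therefore any thought-signature parts) in the conversation history, rather than rebuilding each turn from plain text. It assumes a `model` object created as in the API section further down this page; the exact response fields carrying signatures are not spelled out here, so treat this as a pattern sketch and check the current API reference.

```python
# Sketch: carrying reasoning context across turns by preserving the model's
# returned content verbatim in the history, instead of re-serialising only
# response.text (which would drop any thought-signature parts).

history = []

def ask(model, prompt):
    # Add the user turn.
    history.append({"role": "user", "parts": [{"text": prompt}]})
    response = model.generate_content(history)
    # Append the full returned content object; it includes every part the
    # model produced, so any encrypted reasoning context is sent back on
    # the next call without re-processing the whole conversation.
    history.append(response.candidates[0].content)
    return response.text
```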
Native multimodal processing
Gemini 3 Pro processes text, images, video, audio, code, and PDFs as first-class inputs. The ScreenSpot-Pro benchmark demonstrates the multimodal advantage starkly: 72.7% versus GPT-5.1’s 3.5%—a 20x performance gap in screen understanding tasks. Video-MMMU scores 87.6%, with no competitor within 7 percentage points.
Media token consumption varies by resolution: images consume 280-1,120 tokens depending on detail level, while video frames cost approximately 70 tokens each at standard resolution.
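Using the figures above, a rough budget for a media-heavy prompt can be sketched as follows. The per-image and per-frame token counts come from this section; the helper is purely illustrative, not an official tokenizer.

```python
# Rough media token/cost estimator based on the figures quoted above.
# Assumptions: 280-1,120 tokens per image depending on detail level,
# ~70 tokens per video frame, $2.00 per million input tokens (<=200K tier).

IMAGE_TOKENS = {"low": 280, "high": 1120}   # detail level -> approximate tokens
VIDEO_FRAME_TOKENS = 70                     # per frame at standard resolution
INPUT_PRICE_PER_MTOK = 2.00                 # USD, standard (<=200K) tier

def estimate_media_tokens(images_high=0, images_low=0, video_seconds=0, fps=1):
    """Approximate input tokens consumed by media attachments."""
    tokens = images_high * IMAGE_TOKENS["high"] + images_low * IMAGE_TOKENS["low"]
    tokens += int(video_seconds * fps) * VIDEO_FRAME_TOKENS
    return tokens

tokens = estimate_media_tokens(images_high=4, video_seconds=120, fps=1)
cost = tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
print(f"~{tokens} media tokens, ~${cost:.4f} of input spend")
# 4 high-detail images (4,480) + 120 frames (8,400) ≈ 12,880 tokens ≈ $0.026
```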
Advanced tool capabilities
Two new tool features target agentic workflows. Streaming function calling streams partial function call arguments as they’re generated, enabling faster response times for complex tool chains. Multimodal function responses allow functions to return images and PDFs directly, enabling sophisticated document processing pipelines.
Google Search grounding provides real-time information access beyond the January 2025 knowledge cutoff. Billing for this feature begins January 5, 2026.
The Gemini 3 model family
Google released Gemini 3 Pro as the flagship model, with additional variants for specialised use cases:
| Variant | API Identifier | Purpose | Notes |
|---|---|---|---|
| Gemini 3 Pro | gemini-3-pro-preview | Flagship reasoning model | Preview status, GA expected Q1 2026 |
| Gemini 3 Pro Image | gemini-3-pro-image-preview | Native 4K image generation | Also called “Nano Banana Pro” |
| Gemini 3 Deep Think | — | Extended parallel reasoning | AI Ultra subscribers only |
Gemini 3 Deep Think deserves special attention. Released December 4, 2025, exclusively to Google AI Ultra subscribers ($249.99/month), it enables extended parallel hypothesis exploration for complex reasoning tasks. On ARC-AGI-2, Deep Think pushes performance from 31.1% to 45.1%—a 45% improvement that represents the model’s ceiling on abstract reasoning. Google describes it as exploring “multiple reasoning paths simultaneously before converging on the best solution.”
Per official documentation: “We plan to release additional models to the Gemini 3 series soon.”
Benchmark performance
Gemini 3 Pro achieves state-of-the-art results on reasoning and multimodal benchmarks while trailing Claude Opus 4.5 on real-world coding tasks. The breakthrough performances on abstract reasoning and mathematics represent genuine advances rather than incremental gains.
Reasoning benchmarks
| Benchmark | Gemini 3 Pro | GPT-5.1 | Claude Opus 4.5 | Notes |
|---|---|---|---|---|
| LMArena Elo | 1501 (#1) | ~1480 | ~1485 | First model to break 1500 |
| GPQA Diamond | 91.9% | 87.0% | 87.0% | Surpasses human experts (89.8%) |
| Humanity’s Last Exam | 37.5% | 26.5% | ~14% | New benchmark for frontier models |
| ARC-AGI-2 | 31.1% (45.1% DT) | 17.6% | 37.6% | Deep Think nearly 3x GPT-5.1 |
| MMLU | 91.8% | 91.5% | 87.4% | Saturated benchmark |
The ARC-AGI-2 result is the headline number. This benchmark tests abstract reasoning ability on novel visual puzzles—the kind of pattern recognition that was thought to require human-like generalisation. Gemini 3 Pro’s 45.1% with Deep Think represents a nearly 3x improvement over GPT-5.1’s 17.6%, and even the base 31.1% score substantially outperforms most competitors.
Coding benchmarks
| Benchmark | Gemini 3 Pro | GPT-5.1 | Claude Opus 4.5 | Notes |
|---|---|---|---|---|
| SWE-bench Verified | 76.2% | 76.3% | 80.9% | Claude leads by 4.7pp |
| Terminal-Bench 2.0 | 54.2% | 47.6% | 59.3% | Claude leads |
| WebDev Arena Elo | 1487 (#1) | — | — | Best for frontend/UI |
| LiveCodeBench Elo | 2,439 | 2,243 | ~2,100 | Real-time coding |
SWE-bench Verified tests models on real GitHub pull requests. Gemini 3 Pro’s 76.2% essentially ties GPT-5.1’s 76.3% but trails Claude Opus 4.5’s industry-leading 80.9% by a meaningful 4.7 percentage points. However, on WebDev Arena—which tests frontend and UI development specifically—Gemini 3 Pro leads at 1487 Elo. Developers report reaching for Gemini “for UI/frontend tasks” where it excels.
Mathematics benchmarks
| Benchmark | Gemini 3 Pro | GPT-5.1 | Claude Opus 4.5 | Notes |
|---|---|---|---|---|
| AIME 2025 | 95% | 94% | 87% | Competition math |
| AIME 2025 (with code) | 100% | 100% | 100% | All models achieve perfect |
| MathArena Apex | 23.4% | 1.0% | 1.6% | More than 10x every competitor |
The MathArena Apex score of 23.4%—versus under 2% for all competitors—represents a genuine breakthrough in mathematical reasoning. This is a >20x improvement over both GPT-5.1 (1.0%) and Gemini 2.5 Pro (0.5%).
Multimodal benchmarks
| Benchmark | Gemini 3 Pro | GPT-5.1 | Claude | Notes |
|---|---|---|---|---|
| MMMU-Pro | 81% | 76% | 68% | Visual reasoning |
| Video-MMMU | 87.6% | 80.4% | 77.8% | Video understanding |
| ScreenSpot-Pro | 72.7% | 3.5% | — | 20x advantage in screen understanding |
| SimpleQA Verified | 72.1% | 34.9% | 29.3% | Factual accuracy |
The multimodal results are where Gemini 3 Pro dominates decisively. ScreenSpot-Pro measures the ability to understand and interact with desktop and mobile interfaces—Gemini’s 72.7% versus GPT-5.1’s 3.5% represents a 20x performance gap. This makes Gemini 3 Pro the clear choice for computer use and screen automation tasks.
Pricing breakdown
Gemini 3 Pro commands a 20-60% premium over Gemini 2.5 Pro, with pricing that doubles for prompts exceeding 200,000 tokens.
| Tier | Input (per MTok) | Output (per MTok) |
|---|---|---|
| Standard (≤200K) | $2.00 | $12.00 |
| Long context (>200K) | $4.00 | $18.00 |
| Cached input (≤200K) | $0.20 | — |
| Cached input (>200K) | $0.40 | — |
| Cache storage | $4.50 / MTok / hour | — |
| Batch API (≤200K) | $1.00 | $6.00 |
| Batch API (>200K) | $2.00 | $9.00 |
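For quick budgeting, the schedule above can be turned into a small calculator. This is a sketch based solely on the table; it assumes the long-context rate applies to the whole request once the prompt exceeds 200K tokens, which is consistent with the 500K-token document example in the next subsection.

```python
# Cost sketch for Gemini 3 Pro based on the pricing table above.
# Assumption: prompts over 200K tokens are billed entirely at the
# long-context rate; the Batch API is modelled as a flat 50% discount.

PRICING = {
    "standard":     {"input": 2.00, "output": 12.00},   # <= 200K prompt tokens
    "long_context": {"input": 4.00, "output": 18.00},   # >  200K prompt tokens
}

def estimate_cost(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    tier = "long_context" if input_tokens > 200_000 else "standard"
    rate = PRICING[tier]
    cost = (input_tokens / 1e6) * rate["input"] + (output_tokens / 1e6) * rate["output"]
    return cost * 0.5 if batch else cost

# A 500K-token document plus a short 5K-token summary comes out around $2.09:
print(f"${estimate_cost(500_000, 5_000):.2f}")
```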
Cost comparison with competitors
| Model | Input | Output | Context | Notes |
|---|---|---|---|---|
| Gemini 3 Pro | $2-4 | $12-18 | 1M | Best for long-context reasoning |
| Claude Opus 4.5 | $5.00 | $25.00 | 200K | Best coding quality |
| Claude Sonnet 4.5 | $3.00 | $15.00 | 200K | Best coding value |
| GPT-5.1 | $1.25 | $10.00 | 400K | Cheapest frontier model |
Gemini 3 Pro offers the best price-performance for tasks requiring large context windows. Processing a 500K token document costs approximately $2.00 with Gemini; the same document exceeds Claude's 200K window outright and also exceeds GPT-5.1's 400K window, which in any case reportedly degrades in quality beyond roughly 120K tokens.
No free API tier
Unlike previous Gemini models, gemini-3-pro-preview has no free API tier—only paid access is available. However, Google AI Studio offers free interactive sessions with rate limits of 10-50 requests per minute for experimentation.
How to access Gemini 3 Pro
Via API
Gemini 3 Pro is available through both Google AI Studio and Vertex AI with no waitlist. Basic usage:
```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3-pro-preview")
response = model.generate_content(
    "Your prompt here",
    generation_config={
        "thinking_level": "high"  # low, medium (unsupported), high
    },
)
```
Enable context caching for repeated prompts:
```python
import datetime
from google.generativeai import caching

cache = caching.CachedContent.create(
    model="gemini-3-pro-preview",
    contents=["Your system prompt or reference content"],
    ttl=datetime.timedelta(hours=1),
)
model = genai.GenerativeModel.from_cached_content(cache)
response = model.generate_content("Your query")
```
Rate limits scale with usage tier: 10 requests/minute on free tier, 50+ requests/minute on paid tiers. Vertex AI offers provisioned throughput options from $1,200/week to $2,000/month (annual commitment).
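Given those per-minute caps, it is worth wrapping calls in simple retry logic. The sketch below is generic Python under an assumed exponential-backoff policy, not an official SDK feature.

```python
import random
import time

def generate_with_backoff(model, prompt, max_retries=5):
    """Call generate_content, backing off exponentially on rate-limit errors."""
    for attempt in range(max_retries):
        try:
            return model.generate_content(prompt)
        except Exception:  # ideally catch the SDK's specific rate-limit exception
            if attempt == max_retries - 1:
                raise
            # Waits 1s, 2s, 4s, 8s... plus jitter to avoid synchronised retries.
            time.sleep(2 ** attempt + random.random())
```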
Via Gemini app
Access varies by subscription tier:
| Tier | Price | Gemini 3 Pro | Deep Think | Key features |
|---|---|---|---|---|
| Free | $0 | ✓ (limited) | ✗ | Via “Thinking” dropdown |
| Google AI Pro | $19.99/mo | ✓ (expanded) | ✗ | + Deep Research, Nano Banana Pro, Workspace integration |
| Google AI Ultra | $249.99/mo | ✓ (highest) | ✓ | + Deep Think, Project Mariner, 30TB storage, YouTube Premium |
Google AI Ultra is the only tier with access to Gemini 3 Deep Think, the extended reasoning mode that achieves 45.1% on ARC-AGI-2. Currently offered at 50% off for the first 3 months ($124.99/month).
AI Pro is available in 150+ countries; AI Ultra in 140+ countries.
How Gemini 3 Pro compares
vs Claude Opus 4.5
Claude leads on raw coding capability (80.9% vs 76.2% SWE-bench) and sustained autonomous engineering work. Multiple developers report Claude maintains better coherence in 30+ minute coding sessions. However, Gemini dominates on abstract reasoning (45.1% vs 37.6% ARC-AGI-2 with Deep Think) and offers 5x larger context windows.
Choose Claude for: complex refactoring, multi-file codebases, sustained autonomous coding, when code quality matters more than cost.
Choose Gemini for: abstract reasoning tasks, massive document analysis, UI/frontend development, multimodal workflows, computer use automation.
vs GPT-5.1
Gemini substantially outperforms GPT-5.1 on abstract reasoning (45.1% vs 17.6% ARC-AGI-2—nearly 3x better) and multimodal tasks (72.7% vs 3.5% on ScreenSpot-Pro—20x better). GPT-5.1 is 40% cheaper per input token and has better ecosystem integration.
Choose GPT-5.1 for: cost-sensitive production deployments, rapid iteration, ChatGPT ecosystem, when “good enough” ships faster.
Choose Gemini for: abstract reasoning, mathematical problems, screen understanding, video/audio processing, 1M token context needs.
The practical consensus
From community feedback: developers use Gemini 3 Pro for abstract reasoning, UI/frontend tasks, and long-context processing; Claude for complex coding work; and GPT-5.1 for speed and cost efficiency. Ben Tayloser from Creative Strategies noted after 5 days of testing: “Gemini 3 Pro was able to get to that 80% mark, other coding agents like Claude Code or Codex get 50-60% of the way there, then require a bunch of follow up.”
Known limitations
Independent testing and community reports reveal several significant weaknesses:
Hallucination behaviour is the primary concern. The AA-Omniscience benchmark reveals an 88% hallucination rate when the model is incorrect, matching Gemini 2.5's rate despite the accuracy gains. The model posts the highest accuracy of any model tested (53%) but remains overconfident when wrong, preferring fabrication over "I don't know." Zvi Mowshowitz summarised: "Gemini 3 is the most likely model to give you the right answer, but it'll be damned before it answers 'I don't know' and would rather make something up."
Context degradation manifests at 300K+ tokens despite the 1M theoretical limit. Some users report issues with codebases under 10,000 lines. Multiple reports indicate the model “flakes out” with complex existing codebases.
Temporal confusion emerged as an early bug. Andrej Karpathy famously encountered the model refusing to believe the current date—when shown evidence, “the LLM accused Karpathy of gaslighting it—of uploading AI-generated fakes.”
Benchmark optimisation concerns persist. The model doesn’t know its own identity or version number and was trained on benchmark data (including BIG-bench canary strings). One user noted: “It feels oddly benchmarkmaxed. You can definitely feel the higher hallucination rate vs GPT.”
Content filtering is described as “tighter than expected,” refusing image generations with public figures and requiring rewording for some non-controversial historical questions.
Community reception
Overall sentiment runs approximately 75% positive, 25% negative, with praise focused on benchmark achievements and criticism centred on hallucination tendencies.
The positives
Enterprise partners report concrete productivity gains:
- Equifax trial: 97% of 1,500 employees requested to keep licences; 90% reported measurable productivity gains averaging 1+ hours saved daily
- Pinnacol Assurance: 96% time savings, 90% satisfaction
- AdVon Commerce: Processed 93,673-product catalogue in under a month (previously took a full year)
Developer tool makers praise the model’s agentic capabilities:
- Zach Lloyd, CEO of Warp: “Gemini 3 shows significant gains in reasoning, reliability in multi-step agent workflows, and an ability to debug tough development tasks with high-quality fixes. In early evaluations, it improved Warp’s Terminal Bench state of the art score by 20%.”
- JetBrains: Reported “more than a 50% improvement over Gemini 2.5 Pro in the number of solved benchmark tasks”
The negatives
The LessWrong and AI safety community has been sharply critical. Zvi Mowshowitz’s review titled “Gemini 3 Pro Is a Vast Intelligence With No Spine” captured the concern: the model’s overconfidence when wrong undermines its utility for critical applications.
A persistent criticism describes the model as “benchmarkmaxxed”—optimised for evaluation performance while exhibiting quality inconsistency in real-world scenarios. API reliability issues include CLI crashes reported “at least half the time” by some users, though speed when functional is praised.
The verdict
The consensus: Gemini 3 Pro is the best model for abstract reasoning and multimodal tasks, but its hallucination behaviour requires verification workflows. Claude remains the choice for coding work where correctness matters.
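One lightweight form such a verification workflow can take is a second pass that asks the model to audit its own draft before anything is trusted. The sketch below is a generic pattern, not an official feature; the `generate` helper is a hypothetical stand-in for whichever client call you use (for example, `model.generate_content(prompt).text`).

```python
def generate(prompt: str) -> str:
    """Hypothetical stand-in for your model call."""
    raise NotImplementedError

def answer_with_verification(question: str) -> dict:
    """Two-pass pattern: draft an answer, then ask for an explicit audit."""
    draft = generate(question)
    audit = generate(
        "Review the answer below for factual errors. "
        "List every claim you cannot verify, and reply UNSURE if any remain.\n\n"
        f"Question: {question}\nAnswer: {draft}"
    )
    return {"answer": draft, "audit": audit, "needs_human_review": "UNSURE" in audit}
```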
Version history
| Version | Released | Key changes |
|---|---|---|
| Gemini 3 Deep Think | Dec 4, 2025 | Extended reasoning for AI Ultra subscribers |
| Gemini 3 Pro | Nov 18, 2025 | First 1500+ LMArena Elo, dynamic thinking, 45.1% ARC-AGI-2 |
| Gemini 2.5 Pro | Mar 2025 | #1 on LMArena, adaptive thinking |
| Gemini 2.5 Flash | Mar 2025 | Fast, cost-efficient reasoning |
| Gemini 2.0 Flash | Dec 2024 | Agentic foundations |
| Gemini 1.5 Pro | Feb 2024 | 1M context window |
Improvements from Gemini 2.5 Pro
The performance gains from Gemini 2.5 Pro are substantial:
| Benchmark | Gemini 3 Pro | Gemini 2.5 Pro | Improvement |
|---|---|---|---|
| LMArena Elo | 1501 | 1451 | +50 points |
| SWE-bench Verified | 76.2% | 59.6% | +28% |
| Humanity’s Last Exam | 37.5% | 21.6% | +74% |
| GPQA Diamond | 91.9% | 86.4% | +6.4% |
| MMMU-Pro | 81% | 68% | +19% |
| Terminal-Bench 2.0 | 54.2% | 32.6% | +66% |
| MathArena Apex | 23.4% | 0.5% | +4,580% |
| ARC-AGI-2 | 31.1% | 4.9% | +535% |
FAQ
Is Gemini 3 Pro better than Claude Opus 4.5?
For coding, no—Claude leads with 80.9% vs 76.2% on SWE-bench. For abstract reasoning, yes—Gemini achieves 45.1% vs 37.6% on ARC-AGI-2. Choose based on your primary use case: Claude for coding, Gemini for reasoning and multimodal tasks.
How much does Gemini 3 Pro cost?
$2.00 per million input tokens (doubling to $4.00 for prompts exceeding 200K tokens), $12.00-18.00 per million output tokens. Cached inputs receive a 90% discount at $0.20/MTok. A typical conversation costs $0.02-0.05.
Can I use Gemini 3 Pro for free?
Limited free access is available through the Gemini app via the “Thinking” mode dropdown. API access requires payment—there is no free API tier for gemini-3-pro-preview.
What’s the difference between Gemini 3 Pro and Deep Think?
Deep Think is an extended reasoning mode that explores multiple reasoning paths simultaneously before converging. It improves ARC-AGI-2 performance from 31.1% to 45.1%—a 45% improvement—but is only available to Google AI Ultra subscribers ($249.99/month).
Should I upgrade from Gemini 2.5 Pro?
If you need frontier reasoning capability, yes—the improvements are substantial (28% better on SWE-bench, 535% better on ARC-AGI-2). If cost is a concern, note that pricing increased 20-60% from 2.5 Pro.
When will Gemini 3 Pro reach general availability?
Google expects GA in Q1 2026. The current gemini-3-pro-preview designation indicates preview status with potential API changes before stable release.
Does Gemini 3 Pro have a hallucination problem?
Yes. Independent testing shows an 88% hallucination rate when the model is incorrect—it prefers fabrication over admitting uncertainty. Always verify critical outputs, especially for factual claims outside its training data.