GPT-5.1
- Provider
- OpenAI
- Status
- Current
- Context
- 400,000 tok
- SWE-bench
- 76.3%
- Price
- $1.25 / $10 /MTok
- Knowledge
- 2024-09-30
GPT-5.1 is OpenAI’s mid-cycle refinement to the GPT-5 family, released November 12, 2025—three months after GPT-5’s August debut. Rather than a capability leap, it’s an optimisation release focused on speed, efficiency, and developer experience. The model introduces adaptive reasoning that dynamically allocates compute based on query complexity, delivering 2-3x faster responses on straightforward tasks while using 88% fewer tokens on easy queries.
At 76.3% on SWE-bench Verified, GPT-5.1 trails Claude Opus 4.5’s industry-leading 80.9% but costs 4x less per input token. The release also introduced GPT-5.1-Codex-Max, OpenAI’s most ambitious coding model—the first trained to operate coherently across millions of tokens and sustain autonomous work for 24+ hours.
Quick specs
| Provider | OpenAI |
| Released | November 12, 2025 (ChatGPT) / November 13, 2025 (API) |
| Context window | 400K tokens (272K input / 128K output) |
| Knowledge cutoff | September 30, 2024 |
| Input price | $1.25 / MTok |
| Output price | $10.00 / MTok |
| Cached input | $0.125 / MTok (90% discount) |
| SWE-bench Verified | 76.3% (77.9% Codex-Max) |
| GPQA Diamond | ~87% |
| MMMU | 85.4% |
| Best for | Fast iteration, production apps, price-sensitive deployments |
| Limitations | Coding trails Claude, abstract reasoning trails Gemini |
What’s new in GPT-5.1
GPT-5.1 addresses GPT-5’s most criticised shortcomings: speed and cost efficiency. The improvements fall into three categories.
Adaptive reasoning architecture
The model now operates in two modes. Instant mode (gpt-5.1-chat-latest) handles everyday queries with a warmer, more conversational tone and 2-3x faster response times. Thinking mode (gpt-5.1) engages deeper reasoning for complex problems, allocating twice the compute on difficult tasks while using 88% fewer tokens on easy ones. An automatic routing layer selects between modes based on query complexity.
A new reasoning_effort parameter gives developers fine-grained control with four levels: none, low, medium, and high. The none setting effectively transforms GPT-5.1 into a non-reasoning model for latency-sensitive applications.
Extended prompt caching
The standout infrastructure improvement is 24-hour prompt caching (enabled via prompt_cache_retention='24h'), a massive upgrade from the previous few-minute retention. Early adopters report 73% average cache hit rates, making the 90% input discount practical for production workloads rather than theoretical.
New developer tools
Two tools target agentic coding workflows. The apply_patch tool enables surgical code editing using structured diffs instead of full file rewrites—dramatically improving accuracy for large codebases. The shell tool provides controlled command-line access for autonomous agents.
The GPT-5.1 model family
OpenAI released seven variants designed for different use cases:
| Variant | API Identifier | Purpose | Notes |
|---|---|---|---|
| GPT-5.1 Instant | gpt-5.1-chat-latest | Fast conversational responses | 2-3x faster than GPT-5 |
| GPT-5.1 Thinking | gpt-5.1 | Complex multi-step reasoning | Adaptive compute allocation |
| GPT-5.1 Auto | — | Automatic mode routing | Selects Instant vs Thinking |
| GPT-5.1 Pro | — | Research-grade intelligence | Pro subscribers only ($200/mo) |
| GPT-5.1-Codex | gpt-5.1-codex | Standard coding tasks | — |
| GPT-5.1-Codex-Mini | gpt-5.1-codex-mini | Lightweight coding | — |
| GPT-5.1-Codex-Max | gpt-5.1-codex-max | Long-running agentic coding | 24+ hour tasks, 77.9% SWE-bench |
GPT-5.1-Codex-Max deserves special attention. Released November 19, 2025, it’s OpenAI’s first model trained to operate coherently across millions of tokens in single tasks. It uses 30% fewer thinking tokens than standard Codex while achieving 77.9% on SWE-bench Verified. OpenAI claims it completed a 24-hour internal coding task autonomously.
Benchmark performance
GPT-5.1 shows incremental rather than dramatic gains over GPT-5. Independent testing from Vals.ai reveals smaller improvements than OpenAI’s marketing suggests.
Coding benchmarks
| Model | SWE-bench Verified | Notes |
|---|---|---|
| Claude Opus 4.5 | 80.9% | Industry leader |
| Claude Sonnet 4.5 | 77.2% | Best value for coding |
| GPT-5.1-Codex-Max | 77.9% | OpenAI’s best |
| GPT-5.1 | 76.3% | +1.4pp over GPT-5 |
| GPT-5 | 74.9% | — |
| Gemini 3 Pro Preview | 76.2% | — |
SWE-bench Verified tests models on real GitHub pull requests. A 76.3% score means GPT-5.1 can resolve roughly 3 out of 4 actual bug reports without human intervention—though real-world IDE performance typically runs 40-60% lower due to latency constraints.
Reasoning benchmarks
| Benchmark | GPT-5.1 | GPT-5 | Best in class |
|---|---|---|---|
| GPQA Diamond | ~87% | 87.3% | Claude Opus 4.5 (87%) |
| AIME 2025 | 94.0% | 94.6% | Virtually tied |
| ARC-AGI-2 | 17.6% | — | Gemini 3 Pro (45.1%) |
| MMMU | 85.4% | 84.2% | GPT-5.1 leads |
The AIME 2025 result is notable: GPT-5.1 actually scores marginally lower than GPT-5 on this math competition benchmark. Abstract reasoning (ARC-AGI-2) significantly trails Gemini 3 Pro. GPT-5.1’s strength is visual reasoning (MMMU 85.4%).
Industry rankings
Artificial Analysis ranks GPT-5.1 as the second most intelligent LLM as of November 2025, with the GPT-5 family scoring 68 on their composite Intelligence Index at high reasoning effort.
Pricing breakdown
GPT-5.1 matches GPT-5’s pricing while introducing more aggressive caching discounts.
| Tier | Input (per MTok) | Output (per MTok) |
|---|---|---|
| Standard | $1.25 | $10.00 |
| Cached input | $0.125 | — |
| Batch API | $0.625 | $5.00 |
| Priority | $2.50 | $20.00 |
Cost comparison with competitors
| Model | Input | Output | Notes |
|---|---|---|---|
| GPT-5.1 | $1.25 | $10.00 | Best price-performance |
| Claude Opus 4.5 | $5.00 | $25.00 | 4x more expensive input |
| Claude Sonnet 4.5 | $3.00 | $15.00 | Best coding value |
| Gemini 3 Pro | ~$1.25 | ~$5.00 | Competitive |
One developer reported spending $2,220 on GPT-5.1 versus $6,020 on Claude Opus 4.5 for identical workloads—though Claude’s code quality rated higher. The trade-off is real: GPT-5.1 wins on cost, Claude wins on capability.
Hidden cost: reasoning tokens
Be aware that hidden reasoning tokens can inflate costs 4-5x compared to GPT-4.1 for complex tasks. The model generates internal “thinking” tokens that count against your output quota but aren’t visible in responses.
How to access GPT-5.1
Via API
GPT-5.1 is generally available with no waitlist. Basic usage:
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-5.1",
messages=[{"role": "user", "content": "Your prompt here"}],
reasoning_effort="medium" # none, low, medium, high
)
Enable 24-hour prompt caching:
response = client.chat.completions.create(
model="gpt-5.1",
messages=[{"role": "user", "content": "Your prompt"}],
extra_body={"prompt_cache_retention": "24h"}
)
Rate limits scale with spend: 500K tokens/minute at Tier 1 ($5 spent), up to 40M tokens/minute at Tier 5 ($1,000+ spent).
Via ChatGPT
Access varies dramatically by subscription:
| Tier | Context | Rate limit | Cost |
|---|---|---|---|
| Free | 8K tokens | 10 msgs / 5 hours | $0 |
| Plus | 32K tokens | 160 msgs / 3 hours | $20/mo |
| Pro | 196K tokens | Unlimited | $200/mo |
| Enterprise | 196K tokens | Unlimited | Custom |
Free and Plus users fall back to GPT-5.1-mini when limits are reached. Pro subscribers get access to GPT-5.1 Pro, the research-grade variant.
How GPT-5.1 compares
vs Claude Opus 4.5
Claude leads on raw coding capability (80.9% vs 76.3% SWE-bench) and sustained engineering work. Developer Mckay Wrigley declared “Claude Code with Opus is still king and frankly it’s not close.” However, Claude costs 4x more for input tokens and 2.5x more for output.
Choose Claude for: complex refactoring, multi-file codebases, when code quality matters more than cost.
Choose GPT-5.1 for: rapid iteration, cost-sensitive production deployments, when “good enough” code ships faster.
vs Gemini 3 Pro
Gemini dominates abstract reasoning (45.1% vs 17.6% on ARC-AGI-2) and offers the largest context window at 1M tokens. Developers report reaching for Gemini “for UI/frontend tasks” where it excels.
Choose Gemini for: abstract reasoning tasks, UI/UX design, when you need massive context.
Choose GPT-5.1 for: most general-purpose work, better ecosystem integration.
The practical consensus
From community feedback: developers use GPT-5.1 for speed and daily iteration, Gemini for UI tasks, and reserve Claude or GPT-5.1 Pro “when they cannot afford to be wrong.”
Known limitations
Independent testing and community reports reveal several weaknesses:
Context degradation begins noticeably around 120K tokens despite the 272K theoretical input limit. Performance degrades on very long inputs.
Hidden reasoning costs can inflate bills 4-5x. The model generates invisible “thinking” tokens that count against output quotas.
Creative writing feels “pre-packaged” according to professional writers. The model resists style customisation and tends toward generic responses.
Settings drift over extended conversations. Personal preferences like tone and punctuation fade as context grows.
Frontend/UX design remains “far worse than Gemini 3” according to multiple reviewers.
Safety routing silently redirects some users to stricter model variants without notification—a major source of community frustration.
Community reception
The developer community responded more favourably to GPT-5.1 than GPT-5’s mixed August launch, though significant frustrations persist.
The positives
Enterprise partners report concrete wins:
- Balyasny Asset Management: “Outperformed both GPT-4.1 and GPT-5 while running 2-3x faster”
- Pace AI: “50% faster agents while exceeding accuracy”
- Sierra: “20% improvement on low-latency tool calling”
AI influencer Matt Shumer called GPT-5.1 Pro “ridiculously smart… feels like a better reasoner than most humans.” JetBrains’ Denis Shiryaev described it as “genuinely agentic, the most naturally autonomous model I’ve ever tested.”
The negatives
OpenAI’s Reddit AMA about GPT-5.1 became a “karma massacre” with 1,300+ downvotes and 1,200+ comments—most highly critical. Primary complaints:
- Safety router silently redirecting to restricted models
- Overzealous content filters triggering on benign content (D&D roleplay, work venting)
- Loss of GPT-4o’s personality that users had formed attachments to
- The warmer default personality dividing users between those relieved it “finally feels less like a PDF” and those frustrated by “talking to a life coach”
The verdict
The consensus: Claude excels for engineers who need peak code quality; GPT-5.1 wins for builders focused on speed, cost, and iteration.
Version history
| Version | Released | Key changes |
|---|---|---|
| GPT-5.1-Codex-Max | Nov 19, 2025 | Flagship agentic coding, 24+ hour tasks |
| GPT-5.1 | Nov 12-13, 2025 | Adaptive reasoning, 24h caching, 2-3x speed |
| GPT-5 | Aug 2025 | Initial GPT-5 release |
| GPT-4.1 | Apr 2025 | 1M context, agentic focus |
| GPT-4o | Nov 2024 | Multimodal, 128K context |
GPT-5 remains available in the ChatGPT legacy dropdown for approximately 90 days post-launch.
FAQ
Is GPT-5.1 better than Claude Opus 4.5?
Not for coding—Claude leads with 80.9% vs 76.3% on SWE-bench. GPT-5.1 wins on price (4x cheaper input) and speed. Choose based on whether you prioritise capability or cost.
How much does GPT-5.1 cost?
$1.25 per million input tokens, $10.00 per million output tokens. Cached inputs drop to $0.125/MTok (90% off). A typical 1,000-word conversation costs roughly $0.01-0.03.
Can I use GPT-5.1 for free?
Yes, but with tight limits: 10 messages per 5 hours on ChatGPT Free, then you fall back to GPT-5.1-mini. API access requires payment.
What’s the difference between GPT-5.1 and GPT-5.1-Codex-Max?
Codex-Max is optimised for long-running agentic coding tasks—it can work autonomously for 24+ hours across millions of tokens. Standard GPT-5.1 is for general-purpose use.
Is GPT-5.1 worth upgrading from GPT-5?
If you use the API regularly, yes—the 2-3x speed improvement and extended caching offer real value. If you’re on ChatGPT Plus, the upgrade is automatic.
When will fine-tuning be available?
OpenAI hasn’t announced fine-tuning for GPT-5.1. Currently only GPT-4o and older models support fine-tuning.