GPT-5.1

Provider: OpenAI
Status: Current
Context: 400,000 tok
SWE-bench: 76.3%
Price: $1.25 / $10 /MTok
Knowledge: 2024-09-30

GPT-5.1 is OpenAI’s mid-cycle refinement to the GPT-5 family, released November 12, 2025, three months after GPT-5’s August debut. Rather than a capability leap, it is an optimisation release focused on speed, efficiency, and developer experience. The model introduces adaptive reasoning that allocates compute according to query complexity, responding 2-3x faster and using 88% fewer tokens on easy queries.

At 76.3% on SWE-bench Verified, GPT-5.1 trails Claude Opus 4.5’s industry-leading 80.9% but costs a quarter as much per input token. A week later, on November 19, OpenAI followed with GPT-5.1-Codex-Max, its most ambitious coding model and the first trained to operate coherently across millions of tokens and sustain autonomous work for 24+ hours.

Quick specs


Provider	OpenAI
Released	November 12, 2025 (ChatGPT) / November 13, 2025 (API)
Context window	400K tokens (272K input / 128K output)
Knowledge cutoff	September 30, 2024
Input price	$1.25 / MTok
Output price	$10.00 / MTok
Cached input	$0.125 / MTok (90% discount)
SWE-bench Verified	76.3% (77.9% Codex-Max)
GPQA Diamond	~87%
MMMU	85.4%
Best for	Fast iteration, production apps, price-sensitive deployments
Limitations	Coding trails Claude, abstract reasoning trails Gemini


What’s new in GPT-5.1

GPT-5.1 addresses GPT-5’s most criticised shortcomings: speed and cost efficiency. The improvements fall into three categories.

Adaptive reasoning architecture

The model now operates in two modes. Instant mode (gpt-5.1-chat-latest) handles everyday queries with a warmer, more conversational tone and 2-3x faster response times. Thinking mode (gpt-5.1) engages deeper reasoning for complex problems, allocating twice the compute on difficult tasks while using 88% fewer tokens on easy ones. An automatic routing layer selects between modes based on query complexity.

A new reasoning_effort parameter gives developers fine-grained control with four levels: none, low, medium, and high. The none setting effectively transforms GPT-5.1 into a non-reasoning model for latency-sensitive applications.
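
As a minimal sketch of how the four levels might be wired into application code, the helper below maps a latency budget to a reasoning_effort value and assembles the request arguments. Only the parameter name and its four values come from the text above; the function names and the latency thresholds are illustrative assumptions, not OpenAI guidance.

```python
# Hypothetical helper: the thresholds are illustrative assumptions, not official guidance.
REASONING_EFFORTS = ("none", "low", "medium", "high")


def pick_reasoning_effort(latency_budget_s: float) -> str:
    """Map a latency budget in seconds to one of the four effort levels."""
    if latency_budget_s < 0:
        raise ValueError("latency budget must be non-negative")
    if latency_budget_s < 2:
        return "none"
    if latency_budget_s < 10:
        return "low"
    if latency_budget_s < 60:
        return "medium"
    return "high"


def build_request(prompt: str, latency_budget_s: float) -> dict:
    """Assemble keyword arguments for a chat completion call."""
    return {
        "model": "gpt-5.1",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": pick_reasoning_effort(latency_budget_s),
    }


if __name__ == "__main__":
    print(build_request("Summarise this log line", latency_budget_s=1.0))
```

The resulting dictionary can be unpacked into a client call; tune the thresholds against measured latencies rather than treating these numbers as defaults.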

Extended prompt caching

The standout infrastructure improvement is 24-hour prompt caching (enabled via prompt_cache_retention='24h'), a massive upgrade from the previous few-minute retention. Early adopters report 73% average cache hit rates, making the 90% input discount practical for production workloads rather than theoretical.
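
To see what a 73% hit rate means in practice, the sketch below blends the cached and standard input prices from the Quick specs table. It assumes the hit rate is measured as the share of input tokens served from cache, which the reports quoted above do not specify.

```python
STANDARD_INPUT_PER_MTOK = 1.25   # USD, from the Quick specs table
CACHED_INPUT_PER_MTOK = 0.125    # USD, 90% discount


def effective_input_price(cache_hit_rate: float) -> float:
    """Blend cached and uncached input prices for a hit rate between 0.0 and 1.0."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return (
        cache_hit_rate * CACHED_INPUT_PER_MTOK
        + (1.0 - cache_hit_rate) * STANDARD_INPUT_PER_MTOK
    )


if __name__ == "__main__":
    price = effective_input_price(0.73)
    saving = 1.0 - price / STANDARD_INPUT_PER_MTOK
    print(f"${price:.2f} per MTok, {saving:.0%} below list price")
```

Under that assumption, a 73% hit rate brings the effective input price to about $0.43 per MTok, roughly 66% below the list price.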

New developer tools

Two tools target agentic coding workflows. The apply_patch tool enables surgical code editing using structured diffs instead of full file rewrites—dramatically improving accuracy for large codebases. The shell tool provides controlled command-line access for autonomous agents.
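
The difference between a structured diff and a full file rewrite can be illustrated with Python's standard difflib module. This is only a sketch of the general idea using the ordinary unified diff format; the text above does not specify the payload format apply_patch uses, and the file name and contents here are made up for illustration.

```python
import difflib

# Hypothetical file contents before and after an edit.
original = [
    "def area(radius):\n",
    "    return 3.14 * radius * radius\n",
]
patched = [
    "import math\n",
    "\n",
    "def area(radius):\n",
    "    return math.pi * radius * radius\n",
]


def make_patch(before, after, path):
    """Return a unified diff that describes only the changed lines."""
    return "".join(
        difflib.unified_diff(before, after, fromfile=f"a/{path}", tofile=f"b/{path}")
    )


if __name__ == "__main__":
    print(make_patch(original, patched, "geometry.py"))
```

The diff carries the changed lines plus a little context, so the untouched parts of a large file never have to be regenerated.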

The GPT-5.1 model family

OpenAI released seven variants designed for different use cases:

Variant	API Identifier	Purpose	Notes
GPT-5.1 Instant	`gpt-5.1-chat-latest`	Fast conversational responses	2-3x faster than GPT-5
GPT-5.1 Thinking	`gpt-5.1`	Complex multi-step reasoning	Adaptive compute allocation
GPT-5.1 Auto	—	Automatic mode routing	Selects Instant vs Thinking
GPT-5.1 Pro	—	Research-grade intelligence	Pro subscribers only ($200/mo)
GPT-5.1-Codex	`gpt-5.1-codex`	Standard coding tasks	—
GPT-5.1-Codex-Mini	`gpt-5.1-codex-mini`	Lightweight coding	—
GPT-5.1-Codex-Max	`gpt-5.1-codex-max`	Long-running agentic coding	24+ hour tasks, 77.9% SWE-bench

GPT-5.1-Codex-Max deserves special attention. Released November 19, 2025, it’s OpenAI’s first model trained to operate coherently across millions of tokens in single tasks. It uses 30% fewer thinking tokens than standard Codex while achieving 77.9% on SWE-bench Verified. OpenAI claims it completed a 24-hour internal coding task autonomously.

Benchmark performance

GPT-5.1 shows incremental rather than dramatic gains over GPT-5. Independent testing from Vals.ai reveals smaller improvements than OpenAI’s marketing suggests.

Coding benchmarks

Model	SWE-bench Verified	Notes
Claude Opus 4.5	80.9%	Industry leader
GPT-5.1-Codex-Max	77.9%	OpenAI’s best
Claude Sonnet 4.5	77.2%	Best value for coding
GPT-5.1	76.3%	+1.4pp over GPT-5
Gemini 3 Pro Preview	76.2%	—
GPT-5	74.9%	—

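SWE-bench Verified tests models on real GitHub issues from open-source Python repositories; a task counts as resolved when the model’s patch passes the repository’s tests. A 76.3% score means GPT-5.1 resolved roughly 3 out of 4 benchmark tasks without human intervention, not that it will fix 3 out of 4 bugs in any codebase. Real-world IDE performance typically runs 40-60% lower due to latency constraints.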

Reasoning benchmarks

Benchmark	GPT-5.1	GPT-5	Best in class
GPQA Diamond	~87%	87.3%	Claude Opus 4.5 (87%)
AIME 2025	94.0%	94.6%	Virtually tied
ARC-AGI-2	17.6%	—	Gemini 3 Pro (45.1%)
MMMU	85.4%	84.2%	GPT-5.1 leads

The AIME 2025 result is notable: GPT-5.1 actually scores marginally lower than GPT-5 on this math competition benchmark. Abstract reasoning (ARC-AGI-2) significantly trails Gemini 3 Pro. GPT-5.1’s strength is visual reasoning (MMMU 85.4%).

Industry rankings

Artificial Analysis ranks GPT-5.1 as the second most intelligent LLM as of November 2025, with the GPT-5 family scoring 68 on their composite Intelligence Index at high reasoning effort.

Pricing breakdown

GPT-5.1 matches GPT-5’s pricing while introducing more aggressive caching discounts.

Tier	Input (per MTok)	Output (per MTok)
Standard	$1.25	$10.00
Cached input	$0.125	—
Batch API	$0.625	$5.00
Priority	$2.50	$20.00
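
The tiers above translate into per-request costs as follows. This is a small sketch using only the prices in the table; the function name and the example token counts are illustrative.

```python
# Prices in USD per million tokens, copied from the pricing table above.
PRICES = {
    "standard": {"input": 1.25, "output": 10.00},
    "batch": {"input": 0.625, "output": 5.00},
    "priority": {"input": 2.50, "output": 20.00},
}


def request_cost(input_tokens: int, output_tokens: int, tier: str = "standard") -> float:
    """Return the cost in USD of one request at the given pricing tier."""
    price = PRICES[tier]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


if __name__ == "__main__":
    for tier in PRICES:
        print(tier, round(request_cost(20_000, 2_000, tier), 4))
```

For a request with 20,000 input tokens and 2,000 output tokens, this works out to about $0.045 at the standard tier, half that in batch, and double at priority.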

Cost comparison with competitors

Model	Input	Output	Notes
GPT-5.1	$1.25	$10.00	Best price-performance
Claude Opus 4.5	$5.00	$25.00	4x more expensive input
Claude Sonnet 4.5	$3.00	$15.00	Best coding value
Gemini 3 Pro	~$1.25	~$5.00	Competitive

One developer reported spending $2,220 on GPT-5.1 versus $6,020 on Claude Opus 4.5 for identical workloads—though Claude’s code quality rated higher. The trade-off is real: GPT-5.1 wins on cost, Claude wins on capability.
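
A rough sanity check on that anecdote: if both models processed the same number of tokens at list prices, with no caching or batch discounts (assumptions the report does not confirm), the cost ratio depends only on the mix of input and output tokens. The sketch below computes that ratio from the prices in the table above.

```python
# List prices in USD per million tokens, from the comparison table above.
GPT_5_1 = {"input": 1.25, "output": 10.00}
CLAUDE_OPUS_4_5 = {"input": 5.00, "output": 25.00}


def cost(prices: dict, input_mtok: float, output_mtok: float) -> float:
    """Cost in USD for a workload measured in millions of tokens."""
    return prices["input"] * input_mtok + prices["output"] * output_mtok


def opus_to_gpt_ratio(input_mtok: float, output_mtok: float) -> float:
    """How many times more the same workload costs on Claude Opus 4.5."""
    return cost(CLAUDE_OPUS_4_5, input_mtok, output_mtok) / cost(GPT_5_1, input_mtok, output_mtok)


if __name__ == "__main__":
    print(opus_to_gpt_ratio(1, 0))   # input-only workload
    print(opus_to_gpt_ratio(0, 1))   # output-only workload
    print(6020 / 2220)               # ratio reported by the developer
```

The ratio is 4x for an input-only workload and 2.5x for an output-only one, so the reported ratio of about 2.7x is consistent with a workload where output tokens account for most of the bill.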

Hidden cost: reasoning tokens

Be aware that hidden reasoning tokens can inflate costs 4-5x compared to GPT-4.1 for complex tasks. The model generates internal “thinking” tokens that are billed as output tokens but never appear in the response.
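
The arithmetic behind this is simple. The sketch below uses made-up token counts to show how reasoning tokens change the output bill; note that it compares a request with and without reasoning tokens at GPT-5.1's own output price, which is a different comparison from the GPT-4.1 figure quoted above. In the Chat Completions API, the usage object reports the count under completion_tokens_details.reasoning_tokens, so the inputs here can be read from real responses.

```python
OUTPUT_PRICE_PER_MTOK = 10.00  # USD, GPT-5.1 standard output price


def output_cost(visible_tokens: int, reasoning_tokens: int) -> float:
    """Cost in USD of the output side of one request, reasoning included."""
    billed = visible_tokens + reasoning_tokens
    return billed * OUTPUT_PRICE_PER_MTOK / 1_000_000


def inflation_factor(visible_tokens: int, reasoning_tokens: int) -> float:
    """Ratio of billed output tokens to the tokens visible in the response."""
    return (visible_tokens + reasoning_tokens) / visible_tokens


if __name__ == "__main__":
    # Illustrative numbers: a 500-token answer preceded by 2,000 reasoning tokens.
    print(output_cost(500, 2_000), inflation_factor(500, 2_000))
```

With these illustrative numbers, a 500-token answer is billed as 2,500 output tokens, five times what the visible response alone would suggest.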

How to access GPT-5.1

Via API

GPT-5.1 is generally available with no waitlist. Basic usage:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5.1",
    messages=[{"role": "user", "content": "Your prompt here"}],
    reasoning_effort="medium",  # none, low, medium, high
)
print(response.choices[0].message.content)

Enable 24-hour prompt caching:

response = client.chat.completions.create(
    model="gpt-5.1",
    messages=[{"role": "user", "content": "Your prompt"}],
    extra_body={"prompt_cache_retention": "24h"}
)

Rate limits scale with spend: 500K tokens/minute at Tier 1 ($5 spent), up to 40M tokens/minute at Tier 5 ($1,000+ spent).
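
One way to stay under a tokens-per-minute cap is to track usage on the client side before sending each request. The class below is a minimal sketch of a sliding 60-second window; the class and method names are hypothetical, and the way OpenAI actually meters usage may differ from this approximation.

```python
import time
from collections import deque


class TokenBudget:
    """Track tokens sent in the last 60 seconds against a per-minute cap."""

    def __init__(self, tokens_per_minute: int, clock=time.monotonic):
        self.limit = tokens_per_minute
        self.clock = clock          # injectable so the window can be tested
        self.events = deque()       # (timestamp, tokens) pairs, oldest first
        self.used = 0

    def _expire(self) -> None:
        """Drop requests that have left the 60-second window."""
        cutoff = self.clock() - 60.0
        while self.events and self.events[0][0] <= cutoff:
            _, tokens = self.events.popleft()
            self.used -= tokens

    def try_spend(self, tokens: int) -> bool:
        """Record the request and return True if it fits, otherwise return False."""
        self._expire()
        if self.used + tokens > self.limit:
            return False
        self.events.append((self.clock(), tokens))
        self.used += tokens
        return True


if __name__ == "__main__":
    budget = TokenBudget(500_000)  # Tier 1 limit from the text above
    print(budget.try_spend(120_000))
```

When try_spend returns False, the caller can wait and retry instead of sending a request that is likely to be rejected.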

Via ChatGPT

Access varies dramatically by subscription:

Tier	Context	Rate limit	Cost
Free	8K tokens	10 msgs / 5 hours	$0
Plus	32K tokens	160 msgs / 3 hours	$20/mo
Pro	196K tokens	Unlimited	$200/mo
Enterprise	196K tokens	Unlimited	Custom

Free and Plus users fall back to GPT-5.1-mini when limits are reached. Pro subscribers get access to GPT-5.1 Pro, the research-grade variant.

How GPT-5.1 compares

vs Claude Opus 4.5

Claude leads on raw coding capability (80.9% vs 76.3% SWE-bench) and sustained engineering work. Developer Mckay Wrigley declared “Claude Code with Opus is still king and frankly it’s not close.” However, Claude costs 4x more for input tokens and 2.5x more for output.

Choose Claude for: complex refactoring, multi-file codebases, when code quality matters more than cost.

Choose GPT-5.1 for: rapid iteration, cost-sensitive production deployments, when “good enough” code ships faster.

vs Gemini 3 Pro

Gemini dominates abstract reasoning (45.1% vs 17.6% on ARC-AGI-2) and offers the largest context window at 1M tokens. Developers report reaching for Gemini “for UI/frontend tasks” where it excels.

Choose Gemini for: abstract reasoning tasks, UI/UX design, when you need massive context.

Choose GPT-5.1 for: most general-purpose work, better ecosystem integration.

The practical consensus

From community feedback: developers use GPT-5.1 for speed and daily iteration, Gemini for UI tasks, and reserve Claude or GPT-5.1 Pro “when they cannot afford to be wrong.”

Known limitations

Independent testing and community reports reveal several weaknesses:

Context degradation becomes noticeable around 120K tokens, well short of the 272K theoretical input limit.

Hidden reasoning costs can inflate bills 4-5x. The model generates invisible “thinking” tokens that are billed as output tokens.

Creative writing feels “pre-packaged” according to professional writers. The model resists style customisation and tends toward generic responses.

Settings drift over extended conversations. Personal preferences like tone and punctuation fade as context grows.

Frontend/UX design remains “far worse than Gemini 3” according to multiple reviewers.

Safety routing silently redirects some users to stricter model variants without notification—a major source of community frustration.
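
For the context degradation noted above, one possible mitigation is to keep each request below the point where quality reportedly drops. The sketch below greedily groups documents into batches under a soft limit. The 120K figure comes from the reports above; the four-characters-per-token estimate is a rough assumption, and a real tokenizer should replace it in practice.

```python
SOFT_LIMIT_TOKENS = 120_000  # degradation point reported above
CHARS_PER_TOKEN = 4          # rough heuristic (assumption), not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a string from its length."""
    return len(text) // CHARS_PER_TOKEN + 1


def batch_documents(docs, limit=SOFT_LIMIT_TOKENS):
    """Greedily group documents so each batch stays under the token limit.

    A single document larger than the limit still gets its own batch.
    """
    batches, current, used = [], [], 0
    for doc in docs:
        size = estimate_tokens(doc)
        if current and used + size > limit:
            batches.append(current)
            current, used = [], 0
        current.append(doc)
        used += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    docs = ["a" * 400_000, "b" * 200_000, "c" * 40]
    print([len(batch) for batch in batch_documents(docs)])
```

Each batch can then be sent as a separate request, trading one long context for several shorter ones.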

Community reception

The developer community responded more favourably to GPT-5.1 than GPT-5’s mixed August launch, though significant frustrations persist.

The positives

Enterprise partners report concrete wins:

Balyasny Asset Management: “Outperformed both GPT-4.1 and GPT-5 while running 2-3x faster”
Pace AI: “50% faster agents while exceeding accuracy”
Sierra: “20% improvement on low-latency tool calling”

AI influencer Matt Shumer called GPT-5.1 Pro “ridiculously smart… feels like a better reasoner than most humans.” JetBrains’ Denis Shiryaev described it as “genuinely agentic, the most naturally autonomous model I’ve ever tested.”

The negatives

OpenAI’s Reddit AMA about GPT-5.1 became a “karma massacre” with 1,300+ downvotes and 1,200+ comments—most highly critical. Primary complaints:

Safety router silently redirecting to restricted models
Overzealous content filters triggering on benign content (D&D roleplay, work venting)
Loss of GPT-4o’s personality that users had formed attachments to
The warmer default personality dividing users between those relieved it “finally feels less like a PDF” and those frustrated by “talking to a life coach”

The verdict

The consensus: Claude excels for engineers who need peak code quality; GPT-5.1 wins for builders focused on speed, cost, and iteration.

Version history

Version	Released	Key changes
GPT-5.1-Codex-Max	Nov 19, 2025	Flagship agentic coding, 24+ hour tasks
GPT-5.1	Nov 12-13, 2025	Adaptive reasoning, 24h caching, 2-3x speed
GPT-5	Aug 2025	Initial GPT-5 release
GPT-4.1	Apr 2025	1M context, agentic focus
GPT-4o	May 2024	Multimodal, 128K context

GPT-5 remains available in the ChatGPT legacy dropdown for approximately 90 days after the GPT-5.1 launch.

FAQ

Is GPT-5.1 better than Claude Opus 4.5?

Not for coding—Claude leads with 80.9% vs 76.3% on SWE-bench. GPT-5.1 wins on price (4x cheaper input) and speed. Choose based on whether you prioritise capability or cost.

How much does GPT-5.1 cost?

$1.25 per million input tokens, $10.00 per million output tokens. Cached inputs drop to $0.125/MTok (90% off). A typical 1,000-word conversation costs roughly $0.01-0.03.

Can I use GPT-5.1 for free?

Yes, but with tight limits: 10 messages per 5 hours on ChatGPT Free, then you fall back to GPT-5.1-mini. API access requires payment.

What’s the difference between GPT-5.1 and GPT-5.1-Codex-Max?

Codex-Max is optimised for long-running agentic coding tasks—it can work autonomously for 24+ hours across millions of tokens. Standard GPT-5.1 is for general-purpose use.

Is GPT-5.1 worth upgrading from GPT-5?

If you use the API regularly, yes—the 2-3x speed improvement and extended caching offer real value. If you’re on ChatGPT Plus, the upgrade is automatic.

When will fine-tuning be available?

OpenAI hasn’t announced fine-tuning for GPT-5.1. Fine-tuning is currently limited to earlier model families such as GPT-4.1 and GPT-4o.