Claude Opus 4.8
- Provider: Anthropic
- Status: Current
- Context: 1,000,000 tokens
- SWE-bench Verified: 88.6%
- Price: $5 / $25 per MTok (input / output)
- Knowledge cutoff: January 2026
Claude Opus 4.8 is Anthropic’s current flagship, released on 28 May 2026 as an incremental upgrade to Opus 4.7. Anthropic itself frames it as “a modest but tangible improvement” rather than a generational leap — pricing is unchanged at $5/$25 per million tokens, the 1M-token context window stays the same, and the headline benchmark gains over 4.7 are mostly in the low single digits.
The more interesting story sits below the benchmark table. The standout change in Opus 4.8 is honesty: Anthropic reports the model is around four times less likely than Opus 4.7 to let flaws in code it has written pass unremarked (Anthropic). It is also the first generally-available Claude to be benchmarked at “Mythos-class” alignment levels. Where it does post hard numbers, Opus 4.8 leads its frontier-class peers on five of the six benchmarks Anthropic published — including a category-best 69.2% on the memorisation-resistant SWE-bench Pro (Vellum).
Quick specs
| Spec | Value |
|---|---|
| Provider | Anthropic |
| Released | 28 May 2026 (API + Claude apps same day) |
| API model ID | claude-opus-4-8 |
| Context window | 1,000,000 tokens (200K on Microsoft Foundry) |
| Max output | 128,000 tokens |
| Knowledge cutoff | January 2026 |
| Input price | $5.00 / MTok |
| Output price | $25.00 / MTok |
| Fast mode | $10 / $50 per MTok at ~2.5x speed (research preview) |
| SWE-bench Verified | 88.6% |
| SWE-bench Pro | 69.2% |
| OSWorld-Verified | 83.4% |
| Humanity’s Last Exam (with tools) | 57.9% |
| Best for | Agentic coding, long-running autonomous work, professional/legal/finance knowledge work, computer use |
| Limitations | Modest gains over 4.7; trails Fable 5 on coding; trails GPT-5.5 on Terminal-Bench 2.1 (on both the public harness and OpenAI’s own) |
What’s new in Claude Opus 4.8
Opus 4.8 builds directly on Opus 4.7 with the same price, context window and knowledge cutoff (January 2026). The differences are concentrated in four areas (Anthropic, Simon Willison).
Honesty about its own work
This is the change Anthropic leads with. A persistent failure mode in coding agents is the model declaring a task done — tests unrun, edge cases unchecked — and the user only noticing because something feels off. Anthropic says Opus 4.8 is around 4x less likely than Opus 4.7 to allow flaws in its own code to pass unremarked, and that early testers find it more willing to flag uncertainty and less likely to make unsupported claims.
The System Card adds a telling detail: Opus 4.8 “had the lowest incorrect-rate of the six models on every benchmark — the most direct measure of factual hallucination,” and it achieved this “mainly by abstaining on questions about which it was uncertain rather than by answering more questions correctly” (Anthropic System Card, via Simon Willison). In other words, the model became more honest partly by becoming more willing to say “I don’t know.”
Accuracy flag: the 4x figure and the hallucination results come from Anthropic’s own Alignment team via the System Card, not from an independent evaluator, and the evaluation protocol is not designed to be replayed outside Anthropic’s environment. Treat it as a credible directional claim rather than a third-party-verified number until independent testing lands.
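To make the abstention finding concrete, the sketch below shows how a model can lower its incorrect-rate without answering any more questions correctly. The counts are made up for illustration and are not figures from the System Card; the `rates` helper is likewise hypothetical.

```python
def rates(correct: int, incorrect: int, abstained: int) -> dict[str, float]:
    """Return correct, incorrect and abstention rates as fractions of all questions."""
    total = correct + incorrect + abstained
    return {
        "correct": correct / total,
        "incorrect": incorrect / total,
        "abstained": abstained / total,
    }

# Hypothetical counts over 100 questions; NOT taken from the System Card.
before = rates(correct=70, incorrect=25, abstained=5)
after = rates(correct=70, incorrect=10, abstained=20)

# Same number of correct answers, fewer wrong ones, more "I don't know".
print(before["incorrect"], after["incorrect"])  # 0.25 0.1
```

In this toy case the correct-rate is identical in both runs; only the split between wrong answers and abstentions changes, which is the pattern the System Card describes.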
Effort control
A new control sits alongside the model selector in claude.ai and Cowork, letting users choose how hard Claude works on a response. Opus 4.8 exposes five levels — low, medium, high, xhigh and max — and defaults to high, which Anthropic judges the best overall quality/experience balance. On coding tasks the default spends roughly the same tokens as Opus 4.7’s default but performs better; xhigh (“extra”) and max spend more for harder problems. Effort control is available on all plans.
Dynamic workflows in Claude Code
Launched alongside Opus 4.8 as a research preview, dynamic workflows let Claude Code plan a large task, fan out hundreds of parallel subagents in a single session, then verify its own outputs before reporting back. Anthropic’s example is a codebase-scale migration across hundreds of thousands of lines, from kickoff to merge, with the existing test suite as the bar. The feature is gated to Enterprise, Team and Max plans during the preview.
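The plan, fan-out and verify shape described above can be sketched generically. This is not how Claude Code implements dynamic workflows, which Anthropic has not detailed here; `plan`, `run_subagent` and `verify` are hypothetical stand-ins, and the subagent does no real work.

```python
from concurrent.futures import ThreadPoolExecutor

def plan(task: str, chunks: int) -> list[str]:
    """Split one large task into independent subtasks (hypothetical planner)."""
    return [f"{task} [part {i + 1}/{chunks}]" for i in range(chunks)]

def run_subagent(subtask: str) -> dict:
    """Stand-in for a subagent; a real one would edit code and run the test suite."""
    return {"subtask": subtask, "tests_passed": True}

def verify(results: list[dict]) -> bool:
    """Accept the run only if every subtask reports passing tests."""
    return all(r["tests_passed"] for r in results)

subtasks = plan("migrate logging calls", chunks=8)

# Fan out: run the subtasks in parallel, keeping results in plan order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_subagent, subtasks))

print(len(results), verify(results))  # 8 True
```

The point of the final `verify` step is that nothing is reported back until the whole batch clears the same bar, which mirrors the "existing test suite as the bar" framing in Anthropic's migration example.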
Cheaper fast mode and a new API primitive
Fast mode runs Opus 4.8 at ~2.5x speed for $10/$50 per million input/output tokens, a third of the $30/$150 that fast mode cost on Opus 4.6 and 4.7, though it remains limited to research-preview organisations that request access via an account manager (Anthropic fast mode docs). On the developer side, the Messages API now accepts role: "system" entries mid-conversation, so agents can update Claude’s instructions without restating the full system prompt or breaking the prompt cache. The minimum cacheable prompt has also dropped to 1,024 tokens (from 4,096 on Opus 4.7).
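As a rough illustration of the mid-conversation system entry, the sketch below only builds the message list and makes no API call. The exact payload shape (a plain-string `content`, and where the entry may appear) is an assumption based on the description above, and `with_instruction_update` is a hypothetical helper; check the Messages API reference before relying on it.

```python
def with_instruction_update(messages: list[dict], instruction: str) -> list[dict]:
    """Return a new message list with a system entry appended.

    Earlier turns are copied unchanged, so the existing prefix is not rewritten.
    """
    return [*messages, {"role": "system", "content": instruction}]

history = [
    {"role": "user", "content": "Refactor the billing module."},
    {"role": "assistant", "content": "Starting with invoice.py."},
]

# Assumed shape: the new instruction rides along as its own entry.
updated = with_instruction_update(history, "From now on, run the tests after every edit.")
updated.append({"role": "user", "content": "Continue."})

print([m["role"] for m in updated])  # ['user', 'assistant', 'system', 'user']
```

The list could then be passed as `messages=updated` to `client.messages.create(...)`. Appending rather than editing earlier turns is what keeps the already-sent prefix byte-identical, which is the property prompt caching depends on.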
The Opus 4.8 model family
Unlike OpenAI’s GPT-5.1 Codex line, Opus 4.8 is not split into separate model IDs. There is one model — claude-opus-4-8 — and its behaviour is shaped by two dials.
| Dial | Options | What it does |
|---|---|---|
| Effort level | low, medium, high (default), xhigh, max | Trades latency and token spend for answer quality |
| Fast mode | On / off | ~2.5x speed at 2x token price; research-preview access only |
Most reported benchmark scores are at default high effort; Anthropic notes the higher tiers improve quality further. Within the wider Claude lineup, Opus 4.8 sits below the new Mythos-class Fable 5 and above the mid-tier Sonnet 4.6 and fast/cheap Haiku 4.5.
Benchmark performance
Anthropic published a six-benchmark comparison against Opus 4.7, GPT-5.5 and Gemini 3.1 Pro in the System Card. Note the Google comparator is Gemini 3.1 Pro — the model that was generally available on 28 May; Google’s newer Gemini 3.5 Pro began rolling out in June 2026 (still limited availability) and was not in Anthropic’s table. All figures below are Anthropic-reported.
Coding
| Benchmark | Opus 4.8 | Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Pro | 69.2% | 64.3% | 58.6% | 54.2% |
| SWE-bench Verified | 88.6% | 87.6% | — | 80.6% |
| Terminal-Bench 2.1 (Terminus-2 harness) | 74.6% | 66.1% | 78.2% | 70.3% |
SWE-bench Pro is the hardest SWE-bench variant: multi-file diffs from actively-maintained repos with no public ground-truth leakage, the closest the field has to a coding benchmark that resists memorisation (Vellum). Opus 4.8’s 69.2% is ~5 points clear of Opus 4.7 and more than 10 points ahead of both GPT-5.5 and Gemini 3.1 Pro. The lead over Opus 4.7 is also wider on the harder variant: 4.9 points on SWE-bench Pro versus 1.0 on SWE-bench Verified.
Terminal-Bench is the exception, and it’s a methodology story. On the public Terminus-2 harness used for every model, GPT-5.5 leads at 78.2% with Opus 4.8 at 74.6%. GPT-5.5’s headline 83.4% uses OpenAI’s own Codex CLI harness (Anthropic footnote). The like-for-like read is that GPT-5.5 is genuinely ahead on agentic terminal coding, and that Opus 4.8 jumped ~8 points over Opus 4.7 on the same harness.
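The point gaps quoted in this section can be recomputed from the coding table. The snippet below is a small sketch that copies two rows of Anthropic-reported scores and prints Opus 4.8's percentage-point lead (or deficit) against each comparator; the `gap` helper is illustrative.

```python
# Scores in percent, copied from the coding table (Anthropic-reported).
scores = {
    "SWE-bench Pro": {"Opus 4.8": 69.2, "Opus 4.7": 64.3, "GPT-5.5": 58.6, "Gemini 3.1 Pro": 54.2},
    "Terminal-Bench 2.1": {"Opus 4.8": 74.6, "Opus 4.7": 66.1, "GPT-5.5": 78.2, "Gemini 3.1 Pro": 70.3},
}

def gap(benchmark: str, model: str, baseline: str = "Opus 4.8") -> float:
    """Percentage-point gap: baseline score minus the other model's score."""
    row = scores[benchmark]
    return round(row[baseline] - row[model], 1)

for bench in scores:
    for model in ("Opus 4.7", "GPT-5.5", "Gemini 3.1 Pro"):
        print(f"{bench}: Opus 4.8 vs {model}: {gap(bench, model):+.1f} pts")
```

A negative value means Opus 4.8 trails, which happens only for GPT-5.5 on Terminal-Bench 2.1 in these two rows.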
Reasoning and knowledge
| Benchmark | Opus 4.8 | Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Humanity’s Last Exam (with tools) | 57.9% | 54.7% | 52.2% | 51.4% |
| Humanity’s Last Exam (no tools) | 49.8% | 46.9% | 41.4% | 44.4% |
| GPQA Diamond | 93.6% | 94.2% | — | 94.3% |
Humanity’s Last Exam (HLE) is the hardest general-knowledge reasoning benchmark in regular rotation, and the one where real headroom still exists — Opus 4.8 takes the top of the range in both tool and no-tool configurations. GPQA Diamond, by contrast, is effectively saturated: Opus 4.8 (93.6), Opus 4.7 (94.2) and Gemini 3.1 Pro (94.3) are statistically tied, so the small dip for 4.8 here is noise rather than a regression.
Agentic and computer use
| Benchmark | Opus 4.8 | Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| OSWorld-Verified | 83.4% | 82.8%* | 78.7% | 76.2% |
| Online-Mind2Web | 84% (Browserbase) | — | — | — |
| MCP-Atlas | 82.2% | — | — | — |
OSWorld-Verified runs an agent through real document, browsing and file tasks on a live Ubuntu VM. *Anthropic restated Opus 4.7’s score after a zoom-tool fix and a token-limit raise; its news footnote cites 82.3% while the System Card table reads 82.8% — a minor source conflict either way. Browserbase separately reports Opus 4.8 scoring 84% on Online-Mind2Web, “a meaningful jump over both Opus 4.7 and GPT-5.5” (Anthropic). These computer-use gains are the prerequisite for dynamic workflows being useful, not a vanity metric.
Professional knowledge work
| Benchmark | Opus 4.8 | Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| GDPval-AA (scale to 2000) | 1,890 | 1,753 | 1,769 | 1,314 |
| Finance Agent v2 | 53.9% | 51.5% | 51.8% | 43.0% |
GDPval-AA measures economically valuable knowledge work across professional domains, and shows the widest spread on the board: a 576-point gap between Opus 4.8 and Gemini 3.1 Pro. For knowledge-work-heavy applications the choice between an Anthropic model and Gemini 3.1 Pro is structural rather than incremental — though, again, Google’s newer Gemini 3.5 Pro was not in this comparison.
Finance Agent v2 is the one benchmark here where a model outside the frontier-class comparison comes out on top: the smaller, cheaper Gemini 3.5 Flash scores 57.9% (Vals AI, via Anthropic footnote). Opus 4.8 still leads the frontier-class field at 53.9%. The broader pattern, in which small specialised models win specific verticals, is one to watch.
Independent verification
Artificial Analysis lists Opus 4.8 (proprietary, reasoning model, text + image input, 1M context, released May 2026) but, at the time of writing, had not published a numeric Intelligence Index score for it, so no independent composite figure is available yet. The most-cited independent explainer (Vellum) reproduces Anthropic’s System Card numbers rather than re-running them. Until third parties test it directly, the headline figures should be read as vendor-reported.
Pricing breakdown
Opus 4.8 keeps the Opus price that has held since Opus 4.5: $5 input, $25 output per million tokens (Anthropic).
| Mode | Input (per MTok) | Output (per MTok) | Notes |
|---|---|---|---|
| Standard | $5.00 | $25.00 | Unchanged from Opus 4.5/4.6/4.7 |
| Fast mode | $10.00 | $50.00 | ~2.5x speed; research-preview access only |
| Batch API | $2.50 | $12.50 | Standard 50% discount |
| Cached input | substantial discount | — | Standard Anthropic prompt caching; min cacheable prompt now 1,024 tokens |
The fast-mode change is the real pricing news: at $10/$50 it is a third of the $30/$150 that fast mode cost on Opus 4.6 and 4.7 (Simon Willison). Anthropic has not separately published an Opus 4.8 cache-read rate; standard prompt caching applies, with cache writes billed separately.
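For budgeting, the per-MTok rates in the table translate into request costs as sketched below. The workload size is a made-up example, and prompt caching is ignored because no Opus 4.8 cache-read rate has been published; `cost_usd` is an illustrative helper.

```python
# USD per million tokens (input, output), from the pricing table.
RATES = {
    "standard": (5.00, 25.00),
    "fast": (10.00, 50.00),
    "batch": (2.50, 12.50),
}

def cost_usd(input_tokens: int, output_tokens: int, mode: str = "standard") -> float:
    """Estimate the cost of one request, ignoring prompt caching."""
    in_rate, out_rate = RATES[mode]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Hypothetical workload: 200K input tokens and 20K output tokens.
for mode in RATES:
    print(mode, round(cost_usd(200_000, 20_000, mode), 2))
# standard 1.5
# fast 3.0
# batch 0.75
```

Because fast mode is exactly twice the standard rate on both input and output, and batch exactly half, the ratio between modes is the same for any mix of input and output tokens.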
Cost comparison with contemporaries
| Model | Input | Output | Notes |
|---|---|---|---|
| Claude Opus 4.8 | $5.00 | $25.00 | Frontier coding/agentic leader at this price |
| Claude Fable 5 | $10.00 | $50.00 | Mythos-class tier above Opus (access suspended — see below) |
| GPT-5.5 | $5.00 | $30.00 | Roughly matched on input; leads Terminal-Bench |
| Gemini 3.5 Pro | see Google | — | Newer Google flagship; not in Anthropic’s table |
For workflows priced on token spend, Anthropic also points to efficiency gains rather than headline price cuts — Databricks reports its Genie agent reasons over PDFs and diagrams at 61% cheaper token cost than Opus 4.7, and Cursor reports Opus 4.8 uses fewer steps for the same result (Anthropic).
How to access Claude Opus 4.8
Via API
Opus 4.8 is generally available with no waitlist as claude-opus-4-8 on the Claude API, and on Amazon Bedrock, Google Cloud Vertex AI, Microsoft Foundry (200K context cap) and GitHub Copilot (AWS Bedrock model card).
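```python
from anthropic import Anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = Anthropic()

message = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=4096,
    messages=[{"role": "user", "content": "Your prompt here"}],
)

# message.content is a list of content blocks; print the first block's text.
print(message.content[0].text)
```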
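Since the context cap differs by platform, a pre-flight size check can be useful before sending a large prompt. The sketch below uses only the limits quoted in this article; the platform keys are illustrative labels rather than official identifiers, and it assumes the prompt and the requested output must fit in the window together.

```python
# Limits quoted in this article; the platform keys are illustrative labels.
CONTEXT_WINDOW = {"claude_api": 1_000_000, "microsoft_foundry": 200_000}
MAX_OUTPUT_TOKENS = 128_000

def fits(prompt_tokens: int, max_tokens: int, platform: str = "claude_api") -> bool:
    """Check a request against the quoted limits.

    Assumes prompt tokens plus requested output tokens share one window.
    """
    if max_tokens > MAX_OUTPUT_TOKENS:
        return False
    return prompt_tokens + max_tokens <= CONTEXT_WINDOW[platform]

print(fits(300_000, 4_096))                       # True
print(fits(300_000, 4_096, "microsoft_foundry"))  # False
```

Token counts here are taken as given; in practice they would come from a tokenizer or a token-counting endpoint.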
Via the Claude apps
Access depends on subscription tier (Anthropic pricing). The free tier does not include Opus 4.8 — free users get Sonnet 4.6 and Haiku 4.5.
| Tier | Price | Opus 4.8 | Notes |
|---|---|---|---|
| Free | $0 | No | Sonnet 4.6 / Haiku 4.5 only |
| Pro | $20/mo | Yes | Subject to usage limits |
| Max | from $100/mo | Yes | Higher limits; dynamic workflows (preview) |
| Team | $30/user/mo | Yes | Dynamic workflows available |
| Enterprise | Custom | Yes | Dynamic workflows available |
Opus 4.8 also powers Cowork and is selectable in Claude Code, where the xhigh effort level and dynamic workflows are most relevant.
How Claude Opus 4.8 compares
Opus 4.8’s June-2026 contemporaries are the Mythos-class Fable 5, OpenAI’s GPT-5.5, Google’s Gemini 3.5 Pro and xAI’s Grok 4.1.
vs Claude Fable 5
Fable 5 is Anthropic’s first Mythos-class model (released 9 June 2026) and sits above Opus 4.8 in the lineup, aimed at sustained-autonomy workloads like repository-scale coding. On coding it is clearly stronger — Fable 5 is reported at ~80.3% on SWE-bench Pro versus Opus 4.8’s 69.2% (Eden AI).
Important availability note (as of 16 June 2026): Fable 5 (and the gov-only Mythos 5) are currently suspended worldwide. On 12 June 2026 the US government issued an export-control directive ordering Anthropic to disable access for all foreign nationals; Anthropic complied by disabling the models for all customers while it seeks to restore access, and says it believes the underlying trigger was a misunderstanding (Anthropic statement; Al Jazeera). Opus 4.8 and the rest of the Claude lineup are unaffected. In practice, that makes Opus 4.8 Anthropic’s most capable generally available model right now.
vs GPT-5.5
On Anthropic’s own benchmarks, Opus 4.8 leads GPT-5.5 on SWE-bench Pro (69.2% vs 58.6%), HLE with tools (57.9% vs 52.2%), OSWorld-Verified (83.4% vs 78.7%) and GDPval-AA (1,890 vs 1,769). GPT-5.5’s clear win is agentic terminal coding: it leads Terminal-Bench 2.1 on both its own harness (83.4%) and the public one (78.2% vs Opus 4.8’s 74.6%). GPT-5.5 also matches Opus 4.8 on input price ($5) and offers a larger published presence in long-context/multimodal workloads.
Choose Opus 4.8 for coding quality, professional/legal/finance knowledge work, and reliability on long agentic runs. Choose GPT-5.5 for terminal-heavy agentic coding and where OpenAI’s ecosystem and harnesses are already in place.
vs Gemini 3.5 Pro
Anthropic benchmarked against Gemini 3.1 Pro, not the newer Gemini 3.5 Pro, so a clean Opus 4.8 vs Gemini 3.5 Pro head-to-head is not available from primary sources at the time of writing, and we won’t invent one. What we can say: against 3.1 Pro, Opus 4.8 led every published benchmark except the saturated GPQA Diamond (93.6% vs 94.3%, a statistical tie), with the largest gap on professional knowledge work (GDPval-AA 1,890 vs 1,314). Separately, Google’s smaller Gemini 3.5 Flash beat every model, Opus 4.8 included, on Finance Agent v2 (57.9%), a useful reminder that Google’s value tier is strong on specific verticals. See the Google hub for the current Gemini lineup.
vs Grok 4.1
xAI’s Grok 4.1 was not part of Anthropic’s published comparison, and we don’t have like-for-like, same-harness numbers against Opus 4.8 from a primary source, so head-to-head benchmarks are not available. Directionally, Opus 4.8’s strengths are agentic coding reliability, professional knowledge work and computer use; Grok’s pitch centres on reasoning throughput and tight X/real-time integration. We’ll add verified figures once independent testing covers both on the same harness.
The practical consensus
Opus 4.8 leads its frontier-class peers on five of the six benchmarks Anthropic published. The exceptions discussed above are Terminal-Bench 2.1, where GPT-5.5 is ahead, and Finance Agent v2, where the smaller, cheaper Gemini 3.5 Flash tops the board even though Opus 4.8 leads the frontier-class field. For day-to-day building, the honesty improvement (fewer silent “it’s done” claims) is the change most likely to be felt, even though it’s the hardest to put on a chart (Vellum). See our best AI models and best AI for coding rankings for where it sits across the wider field.
Known limitations
Modest gains over 4.7. Anthropic is upfront that this is an incremental release. On most benchmarks the Opus 4.7 → 4.8 delta is low single digits, and on GPQA Diamond the models are tied. If you’re on Opus 4.7, the upgrade is real but small.
Loses Terminal-Bench to GPT-5.5. On agentic terminal coding, GPT-5.5 is ahead on the public harness (78.2% vs 74.6%) and further ahead on its own (83.4%).
Not the strongest Claude for coding. Fable 5 is materially better on coding benchmarks — but it’s currently suspended (see above), so the gap is academic for most users right now.
Vendor-reported benchmarks. Every headline figure is Anthropic-run via the System Card, including the 4x honesty claim, which is not independently replayable. Independent composite scores (e.g. Artificial Analysis Intelligence Index) were not yet available at the time of writing.
Fast mode is gated. The cheaper 2.5x fast mode is limited to research-preview organisations; most users can’t access it yet.
Free tier excluded. Opus 4.8 requires a paid plan; free users fall back to Sonnet 4.6 / Haiku 4.5.
Community reception
The reception skewed positive, with much of the praise aimed at the honesty framing rather than raw capability.
Simon Willison called it “so refreshing to see an AI lab honestly describe a release as a minor incremental improvement,” and highlighted the abstain-when-uncertain finding as the most interesting result (Simon Willison).
Anthropic’s launch partners reported concrete agentic wins (Anthropic):
- Cursor (Michael Truell): on CursorBench, Opus 4.8 “exceeds prior Opus models across every effort level,” with more efficient tool calling — “fewer steps for the same intelligence.”
- Cognition / Devin (Scott Wu): it “fixes the comment-verbosity and tool-calling issues we saw with Opus 4.7.”
- Browserbase (Miguel Gonzalez): “the strongest computer-use and browser-agent model we’ve tested,” 84% on Online-Mind2Web.
- Harvey (Niko Grupen): the first model to break 10% on its Legal Agent Benchmark all-pass standard.
- Databricks (Hanlin Tang): a “step change in agentic reasoning” in Genie, reasoning over PDFs and diagrams at 61% cheaper token cost than Opus 4.7.
The recurring theme across testers — better judgement, catching its own mistakes, pushing back on weak plans — lines up with the honesty story Anthropic leads with.
Version history
| Version | Released | Key changes |
|---|---|---|
| Claude Opus 4.8 | 28 May 2026 | Honesty (4x fewer unflagged code flaws), effort control, dynamic workflows, fast mode at a third of the previous price |
| Claude Opus 4.7 | 16 Apr 2026 | First model with automatic cyber safeguards post-Glasswing; new tokenizer; stronger vision |
| Claude Opus 4.6 | (earlier 2026) | Prior Opus iteration |
| Claude Opus 4.5 | (late 2025) | Established the $5/$25 Opus price point |
Anthropic has also said it expects to bring Mythos-class models to all customers “in the coming weeks” once stronger cyber safeguards are in place, pending the export-control situation (Anthropic; Help Net Security). Opus 4.7 remains available via the API for teams that haven’t migrated.
FAQ
Is Claude Opus 4.8 better than Opus 4.7?
Yes, but modestly. It beats Opus 4.7 on nearly every published benchmark (e.g. 69.2% vs 64.3% on SWE-bench Pro), with the saturated GPQA Diamond a statistical tie, and is ~4x less likely to let its own code flaws go unflagged. Even so, Anthropic itself calls it “a modest but tangible improvement.” The honesty and tool-calling gains matter more in daily use than the benchmark deltas.
How much does Claude Opus 4.8 cost?
$5 per million input tokens and $25 per million output tokens — unchanged from Opus 4.5/4.6/4.7. Fast mode is $10/$50 at ~2.5x speed but is limited to research-preview organisations. On the apps, it requires a Pro ($20/mo) or higher plan.
Can I use Claude Opus 4.8 for free?
No. The free tier on claude.ai gives you Sonnet 4.6 and Haiku 4.5, not Opus 4.8. You need Pro, Max, Team or Enterprise.
What is the context window and knowledge cutoff?
A 1,000,000-token context window (capped at 200K on Microsoft Foundry) with up to 128,000 output tokens, and a January 2026 knowledge cutoff — the same as Opus 4.7.
How does Opus 4.8 compare to GPT-5.5?
Opus 4.8 leads on SWE-bench Pro, Humanity’s Last Exam, OSWorld-Verified and GDPval-AA on Anthropic’s benchmarks. GPT-5.5 leads agentic terminal coding (Terminal-Bench 2.1). They’re matched on input price ($5/MTok). Choose Opus 4.8 for coding quality and knowledge work; GPT-5.5 for terminal-heavy agentic coding.
Is Opus 4.8 Anthropic’s most capable model?
It’s the most capable generally available one. The Mythos-class Fable 5 is stronger on coding but, as of 16 June 2026, is suspended worldwide under a US export-control directive.
What is “effort control”?
A user-facing dial (low, medium, high, xhigh, max) that controls how many tokens Claude spends on a response. It defaults to high; higher levels improve quality on hard tasks at the cost of slower responses and higher token spend. It’s available on all plans.
Last verified 16 June 2026. Benchmark figures are Anthropic-reported via the Claude Opus 4.8 System Card unless otherwise noted; independent composite scores were not yet available at the time of writing. Pricing and availability current as of the publication date and subject to change.