Claude Opus 4.7 vs GPT-5.4: Benchmarks, Price, and What Devs Actually Say
Anthropic shipped Claude Opus 4.7 on April 16, 2026 — forty-two days after OpenAI's GPT-5.4 launch. The headline: SWE-bench Pro 64.3% vs 57.7%, and SWE-bench Verified pushed to 87.6%. But benchmarks describe a model; developers describe a workflow. This is the comparison that reads the benchmarks, then reads the r/ClaudeAI and r/OpenAI threads where developers who pay for both subscriptions are comparing them in real code.
TL;DR — Key Takeaways:
- SWE-bench Pro 64.3% vs 57.7% — Opus 4.7 leads by 6.6 points, and third-party reproductions broadly confirm
- SWE-bench Verified 87.6% — up from Opus 4.6's 80.8%, measurably better on real-world patches
- GPT-5.4 still wins on OSWorld (75%) — computer-use automation and long-context Codex workflows
- Price $5/$25 per MTok unchanged — but the new tokenizer means up to 1.35x the tokens per prompt
- Tool-call errors down ~66% — Anthropic's cited improvement holds up in agentic workflows
- Reddit verdict: run both — Claude for architectural intent, Codex/GPT-5.4 for edge cases and review
1. Launch Timeline and Why It Matters
GPT-5.4 launched on March 5, 2026 as OpenAI's first unified release across ChatGPT, the API, and Codex simultaneously. It arrived with 75% on OSWorld-Verified — above human performance of 72.4% — plus 83% on GDPval (knowledge-work tasks) and 57.7% on SWE-bench Pro. A Mini variant followed twelve days later. For roughly six weeks, GPT-5.4 sat at the top of most coding leaderboards.
On April 16, 2026, Anthropic released Opus 4.7. The release was deliberately narrow in scope — it's a capability refresh on the Opus 4.6 base rather than a ground-up redesign. The three headline changes: SWE-bench Pro jumped from 53.4% (4.6) to 64.3% (4.7), SWE-bench Verified went from 80.8% to 87.6%, and internal evaluations showed tool-call errors reduced to roughly a third of the previous rate. Image input resolution tripled.
For developers who've been paying attention since Opus 4.5, this is Anthropic's third straight Opus release in four months. The r/singularity thread on the release captured the mood: "benchmarks going up, my ability to keep track of model versions going down." One commenter summarized the simple rule that's emerged for anyone not tracking releases daily: "Opus is best, then Sonnet, then Haiku. Use whatever is the biggest number."
The Six-Week Window That Closed
What the launch sequence tells us: GPT-5.4's SWE-bench Pro lead lasted about six weeks. Opus 4.7 didn't just match it — it cleared the score by 6.6 points and pushed SWE-bench Verified past 87%. OpenAI's next move is widely expected to be GPT-5.5 or a Codex-specific variant, but as of this writing (April 17, 2026) GPT-5.4 is still the latest generally available flagship.
"+11% on SWE-bench Pro is gonna be a nice jump before 5 drops" — pdantix06, r/singularity (115 upvotes)
2. How We Tested
This comparison combines two types of data. First, the published benchmarks from Anthropic, OpenAI, and third-party evaluators. Second, a structured sweep of hands-on developer reports from Reddit threads between March 5 (GPT-5.4 launch) and April 17, 2026.
Benchmark Verification
Cross-referenced Anthropic and OpenAI's cited scores against Artificial Analysis, Vellum AI benchmark reproductions, and independent community tests. SWE-bench Pro and Verified numbers match vendor claims within rounding.
Reddit Sweep (r/ClaudeAI, r/OpenAI, r/LocalLLaMA, r/singularity)
Pulled comment threads from seven posts spanning the Opus 4.7 announcement through the first 24 hours of hands-on developer use, plus month-old GPT-5.4 retrospectives. Filtered for comments from users running both subscriptions ($100 Claude Max + $200 ChatGPT Pro tier) rather than one-shot opinions.
Coding Task Coverage
Reddit reports span TypeScript, Python, Rust, C/C++ and LISP codebases. Use cases include multi-file refactoring, edge-case debugging, policy document generation, data analysis in spreadsheets, and long-running agentic sessions. Quotes cited throughout this article link to the original threads.
Skepticism Applied
Every benchmark has a weakness — either the task distribution favors one model's training, or the evaluator has a relationship with a vendor. Where possible we note who ran the benchmark and what incentive they had.
Neither OpenAI nor Anthropic provided sponsored access or API credits for this comparison. All Reddit quotes are from public threads.
3. Benchmark Head-to-Head
Benchmarks never tell the full story, but they set the baseline. The table below covers the six metrics both vendors publish or where independent evaluators have confirmed scores for both models.
| Benchmark | Opus 4.7 | GPT-5.4 | Winner |
|---|---|---|---|
| SWE-bench Pro (coding) | 64.3% | 57.7% | Opus +6.6pt |
| SWE-bench Verified | 87.6% | ~74% | Opus +13pt |
| OSWorld (computer use) | ~73% | 75% | GPT-5.4 +2pt |
| GDPval (knowledge work) | Not published | 83% | GPT-5.4 |
| Agentic reasoning (multi-step) | +14% vs 4.6 | Baseline | Opus |
| Tool-call error rate | ~1/3 of 4.6 | -18% vs 5.2 | Opus |
| Context window | 200K (1M beta) | 1M standard | GPT-5.4 |
The shape of the result is clear: Opus 4.7 is the stronger coding and agentic model on benchmarks, GPT-5.4 keeps the edge on computer-use, general knowledge work, and standard 1M context. The Reddit user Yweain made a point worth flagging on r/singularity: "Considering that judging by this benchmark Gemini 3.1 Pro is on par with Opus 4.6, I feel like this benchmark is pretty not great." Healthy skepticism — but the pattern of SWE-bench, tool-call reduction, and agentic reasoning all moving in the same direction suggests the gains are real even if any one number is noisy.
4. Pricing and Token Reality
On paper, Opus 4.7 holds API pricing at $5 per million input tokens and $25 per million output tokens — matching Opus 4.6. The catch is buried in the release notes: Anthropic shipped a new tokenizer and deeper extended-thinking defaults, so the same English input can map to as much as 1.35x the token count. Factor that in and real per-query cost often runs 10-20% higher than a naive $/MTok comparison implies.
| Tier | Claude Opus 4.7 | GPT-5.4 |
|---|---|---|
| Consumer (entry) | Claude Pro $20/mo | ChatGPT Plus $20/mo |
| Power user | Max 5x $100/mo | ChatGPT Pro $200/mo |
| API input | $5 / MTok | ~$1.25 / MTok |
| API output | $25 / MTok | ~$10 / MTok |
| Batch discount | 50% off | 50% off |
On per-token API cost, GPT-5.4 is roughly 4x cheaper than Opus 4.7 on input and 2.5x cheaper on output. That delta matters more than it used to — GPT-5.4 has a 1M-token context window by default, so long-context workloads that previously forced you into Opus for the extended context are now cost-competitive on OpenAI.
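The interaction between per-token price and tokenizer inflation is easy to misjudge, so here is a back-of-the-envelope sketch. The prices come from the table above; the workload size is illustrative, and applying the 1.35x inflation factor to input tokens only is an assumption — Anthropic has not published a breakdown of where the extra tokens land.

```python
# Rough per-query cost model. Prices are $/MTok from the pricing table;
# the 1.35x tokenizer inflation is applied to input tokens only, which
# is an assumption rather than a documented behavior.
def query_cost(input_tokens, output_tokens, in_price, out_price, tok_mult=1.0):
    """Return the dollar cost of one query at the given $/MTok rates."""
    return (input_tokens * tok_mult * in_price + output_tokens * out_price) / 1_000_000

# Illustrative workload: 10K input tokens, 2K output tokens per query.
opus = query_cost(10_000, 2_000, 5.00, 25.00, tok_mult=1.35)  # worst-case inflation
gpt = query_cost(10_000, 2_000, 1.25, 10.00)

print(f"Opus 4.7: ${opus:.4f}/query, GPT-5.4: ${gpt:.4f}/query, ratio {opus / gpt:.1f}x")
# → Opus 4.7: $0.1175/query, GPT-5.4: $0.0325/query, ratio 3.6x
```

Note that without the inflation factor the Opus query costs $0.10, so the worst-case tokenizer drift alone accounts for the 10-20% real-cost drift mentioned above.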
A small but real signal: Reddit user East-Armadillo-1166 posted that "Opus 4.7 is available on v0 and cheaper than 4.5," which tracks with Anthropic's partner-pricing tier being slightly discounted against earlier Opus versions through some reseller channels. If you're on a platform that exposes multiple Opus versions side-by-side, compare the per-request cost before defaulting to 4.7.
5. What Developers Actually Say
Benchmarks are averages across curated task sets. Developer workflows are not curated. Here are the comments we kept coming back to — from users who pay for both subscriptions and use them daily on production code.
"I have $100 subscription for Claude and $200 for Codex. Codex with GPT-5.4 works better for finding edge cases and solving complex design. Claude is better at understanding what I actually want."
— jbcraigs, r/OpenAI (82 upvotes)
"Claude is the better strategist. Codex is the better executer, reviewer, and fact checker. It also handles long context better particularly in the Codex app."
— Mammoth_Doctor_7688, r/OpenAI
"Codex 5.4 was a huge improvement due to the 1M tokens context window (which finally matches what Claude had). These things change all the time."
— -Sliced-, r/OpenAI (7 upvotes)
"Agentic search getting worse?"
— sunstersun, r/singularity (61 upvotes, on Opus 4.7 benchmarks thread)
"For my use cases with C, C++, Rust, LISP and maths, the best results I get are with GPT and Gemini 3.1 Pro. Claude feels lacking there."
— muyuu, r/LocalLLaMA
"They both find stuff wrong with the other every single time I ask. But if I'm willing to do everything twice the output is phenomenal."
— New_Jaguar_9104, r/OpenAI
The pattern: developers who actually run both don't treat this as a winner-take-all decision. Claude gets used as the architect and intent-understanding layer, GPT-5.4's Codex gets used as the reviewer and edge-case hunter. The two-model workflow is mentioned consistently enough across threads that it reads as a genuine emerging best practice rather than fence-sitting.
One divergent data point worth keeping in view: on r/LocalLLaMA and parts of r/singularity there's an active thread about Anthropic allegedly "dumbing down" the model for non-government users. The evidence cited is a GitHub discussion rather than reproducible benchmarks, and no independent test has confirmed a regression. We include the signal because if you're planning to migrate production workflows, it's worth running your own golden-prompt suite before and after a model swap.
6. Where Each Model Wins
Claude Opus 4.7 wins when:
- You're doing multi-file refactoring or a large architectural change — Opus 4.7's agentic reasoning bump shows up most on long-horizon tasks
- You need the model to understand vague intent or translate product requirements into code
- You're using Claude Code with subagents or routines — tool-call reliability matters here more than raw benchmark score
- You're writing policy documents, technical guides, or long-form prose where tone consistency matters
- You already have a Claude Pro / Max subscription and want the $20-100/month flat rate instead of metered API
GPT-5.4 wins when:
- You need computer-use automation — OSWorld 75% is genuinely ahead and Opus 4.7 didn't close this gap
- You're cost-sensitive on high-volume API calls — GPT-5.4 is roughly 2.5-4x cheaper per token
- Your workflow depends on Codex — the Codex + 1M token context combination is the single most-praised GPT-5.4 use case
- You work in lower-level languages: C, C++, Rust, LISP, or heavy mathematical work
- You want spreadsheet modeling or knowledge-work tasks that GDPval covers (GPT-5.4 leads at 83%)
If your use case crosses both columns, the Reddit-tested pattern is: Opus as primary for writing, GPT-5.4's Codex as secondary for review. Claude Code's multi-agent setup makes this less friction-heavy than it was even six months ago.
7. Honest Limitations of Opus 4.7
Every model release generates a wave of enthusiasm. These are the caveats worth flagging now, after the first day of real use:
- Agentic search regression concerns. A top comment on the r/singularity benchmark thread flagged that agentic search may have gotten worse, not better. Anthropic hasn't addressed this publicly. If your workflow depends on Claude doing web-grounded research, benchmark against your own prompts before committing.
- Token counts inflated by new tokenizer. The same prompt can cost up to 1.35x the tokens it did on Opus 4.6. For high-volume API workloads this compounds fast — a $500/mo line item can become $650 without any behavior change on your end.
- Lower-level language coding still not its strength. Reddit reports on r/LocalLLaMA from users working in C, C++, Rust, LISP, and heavy maths continue to prefer GPT-5.4 and Gemini 3.1 Pro. Opus 4.7 didn't change this pattern.
- Model-dumbing allegations. A thread on r/LocalLLaMA references an ongoing GitHub discussion where some users claim Anthropic is reducing model capability for non-enterprise customers. Not independently verified. Included because if true it would matter, and because it's worth running your own golden prompts on a schedule.
- Context window still behind. 200K standard (1M beta with higher pricing) vs GPT-5.4's 1M standard. For whole-codebase prompts you'll still reach for GPT-5.4 or Gemini.
8. Migration Notes
If you're already on Opus 4.6 via the Anthropic API, the migration is mechanical. Update your model string to claude-opus-4-7 (or the dated variant like claude-opus-4-7-20260416). Consumer Claude.ai users get 4.7 automatically on Pro / Max / Team / Enterprise plans.
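As a concrete illustration, here is a minimal sketch of what the swap looks like at the request-body level of the Messages API. The dated 4.7 variant is the one named above; the 4.6 model string and the default max_tokens value are assumptions for illustration, so check them against your own config.

```python
import json

# Minimal Messages API request body -- the migration is just the "model"
# field. "claude-opus-4-6" here is an assumed prior model string; verify
# against what your code actually sends today.
def build_request(model: str, prompt: str, max_tokens: int = 1024) -> dict:
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

old = build_request("claude-opus-4-6", "Refactor this module.")
new = build_request("claude-opus-4-7-20260416", "Refactor this module.")

# Everything except the model string is unchanged between the two requests.
assert {k: v for k, v in old.items() if k != "model"} == \
       {k: v for k, v in new.items() if k != "model"}
print(json.dumps(new, indent=2))
```

In production you would send this body to the Messages endpoint with your usual auth headers; nothing else in the request changes, which is why the migration is mechanical.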
Opus 4.7 is available across the major cloud platforms from day one: Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry all have it live. GitHub Copilot users on Pro+, Business, and Enterprise tiers can pick Opus 4.7 as their coding model.
One practical tip: before swapping production workflows, rerun your own evaluation prompts. The benchmark improvements are real but any particular prompt can regress. The strongest advice across the Reddit threads: "run both for a week, compare the PRs each one opens." That's the test that actually reflects your codebase.
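A golden-prompt check doesn't need tooling — it can be a dozen lines. The sketch below uses dummy stand-ins for the model calls so it runs without an API key; `call_model`, the prompts, and their checks are all placeholders you would replace with your own.

```python
# Minimal golden-prompt regression harness: run the same prompts through
# a model callable and flag any prompt whose output fails its check.
# The prompts and checks here are toy placeholders.
from typing import Callable

GOLDEN = [
    # (prompt, predicate the output must satisfy)
    ("Return the literal string OK.", lambda out: "OK" in out),
    ("Name the capital of France.", lambda out: "Paris" in out),
]

def regressions(call_model: Callable[[str], str]) -> list[str]:
    """Return the prompts whose outputs fail their golden checks."""
    return [prompt for prompt, check in GOLDEN if not check(call_model(prompt))]

# Dummy stand-ins so the harness is runnable offline: the baseline
# passes every check, the candidate simulates a regressed swap-in.
baseline = lambda prompt: "OK Paris"
candidate = lambda prompt: "OK"

assert regressions(baseline) == []
failed = regressions(candidate)
print(f"{len(failed)} golden prompt(s) regressed: {failed}")
```

Run it once against the model you trust, once against the candidate, and diff the failure lists — that is the before/after comparison the migration advice is pointing at.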
9. FAQ
Is Claude Opus 4.7 better than GPT-5.4?
On coding benchmarks, yes — Opus 4.7 scores 64.3% on SWE-bench Pro against GPT-5.4's 57.7%, and hits 87.6% on SWE-bench Verified. But GPT-5.4 still leads on OSWorld computer use (75%) and knowledge-work tasks (GDPval 83%). Developers on Reddit consistently report Claude is better at understanding intent and architectural strategy, while Codex with GPT-5.4 is better at finding edge cases and acting as a reviewer. Most serious users run both.
How much does Claude Opus 4.7 cost?
Opus 4.7 API pricing holds at $5 per million input tokens and $25 per million output tokens — the same as Opus 4.6. However, the new tokenizer and deeper extended-thinking defaults mean the same prompt can map to as much as 1.35x the token count in practice. Consumer access: Claude Pro $20/month, Max 5x $100/month, Max 20x $200/month. Batch API offers 50% off.
When was Claude Opus 4.7 released?
April 16, 2026, roughly six weeks after GPT-5.4 shipped on March 5, 2026. The release is a capability refresh on the Opus 4.6 base, with headline improvements in agentic reasoning, tool-call reliability, and 3x higher image input resolution.
Should I switch from GPT-5.4 to Opus 4.7?
Not reflexively. For pure coding — especially long multi-file refactors and agentic work in Claude Code — Opus 4.7 is worth testing. For existing Codex workflows, 1M context needs, or OSWorld-style computer-use automation, GPT-5.4 holds ground. The Reddit consensus is to run both rather than pick one.
Has Opus 4.7 regressed in any areas?
Community reports on r/singularity flagged potential regressions in agentic search, and an r/LocalLLaMA thread references a GitHub discussion about model behavior changes. Anthropic has not publicly confirmed any regression, and the benchmarks show broad improvement. Worth running your own prompts before migrating production workflows.
10. Final Verdict
Opus 4.7 is the best coding model generally available as of April 17, 2026 — not because any one number blows the competition out, but because the SWE-bench improvements, tool-call error reduction, and agentic reasoning bump all move in the same direction. For shops that do serious engineering work with AI, this is the model to default to for architecture, multi-file refactors, and agent-driven workflows.
GPT-5.4 is not dethroned. It keeps meaningful leads on computer-use automation (OSWorld 75%), general knowledge-work tasks (GDPval 83%), standard 1M context windows, and per-token cost. If your workflow already depends on Codex or you work primarily in lower-level languages, staying on GPT-5.4 is rational. The Reddit consensus — run both, use each where it's strongest — holds.
The six-week window where GPT-5.4 sat alone at the top is closed. The interesting question now is how fast OpenAI responds and whether GPT-5.5 or a Codex-specific variant arrives before Opus 5.
Quick Recommendation
- Default coding model — Opus 4.7
- Edge-case reviewer / second opinion — GPT-5.4 via Codex
- Computer-use automation — GPT-5.4
- Cost-sensitive high-volume API — GPT-5.4 or Gemini 3.1 Pro
- Low-level systems code (C/Rust/LISP) — GPT-5.4 or Gemini 3.1 Pro