Claude Opus 4.7 vs GPT-5.4: Benchmarks, Price, and What Devs Actually Say
Anthropic shipped Claude Opus 4.7 on April 16, 2026 — forty-two days after OpenAI's GPT-5.4 launch. The headline: SWE-bench Pro 64.3% vs 57.7%, and SWE-bench Verified pushed to 87.6%. But benchmarks describe a model; developers describe a workflow. This is the comparison that reads the benchmarks, then reads the r/ClaudeAI and r/OpenAI threads where developers who pay for both subscriptions are comparing them in real code.
TL;DR — Key Takeaways:
- SWE-bench Pro 64.3% vs 57.7% — Opus 4.7 leads by 6.6 points, and third-party reproductions broadly confirm
- SWE-bench Verified 87.6% — up from Opus 4.6's 80.8%, measurably better on real-world patches
- GPT-5.4 still wins on OSWorld (75%) — computer-use automation and long-context Codex workflows
- Price $5/$25 per MTok unchanged — but the new tokenizer means up to 1.35x the tokens per prompt
- Tool-call errors down ~66% — Anthropic's cited improvement holds up in agentic workflows
- Reddit verdict: run both — Claude for architectural intent, Codex/GPT-5.4 for edge cases and review
1. Launch Timeline and Why It Matters
GPT-5.4 launched on March 5, 2026 as OpenAI's first unified release across ChatGPT, the API, and Codex simultaneously. It arrived with 75% on OSWorld-Verified — above human performance of 72.4% — plus 83% on GDPval (knowledge-work tasks) and 57.7% on SWE-bench Pro. A Mini variant followed twelve days later. For roughly six weeks, GPT-5.4 sat at the top of most coding leaderboards.
On April 16, 2026, Anthropic released Opus 4.7. The release was deliberately narrow in scope — it's a capability refresh on the Opus 4.6 base rather than a ground-up redesign. The three headline changes: SWE-bench Pro jumped from 53.4% (4.6) to 64.3% (4.7), SWE-bench Verified went from 80.8% to 87.6%, and internal evaluations showed tool-call errors reduced to roughly a third of the previous rate. Image input resolution tripled.
For developers who've been paying attention since Opus 4.5, this is Anthropic's third straight Opus release in four months. The r/singularity thread on the release captured the mood: "benchmarks going up, my ability to keep track of model versions going down." One commenter summarized the simple rule that's emerged for anyone not tracking releases daily: "Opus is best, then Sonnet, then Haiku. Use whatever is the biggest number."
The Six-Week Window That Closed
What the launch sequence tells us: GPT-5.4's SWE-bench Pro lead lasted about six weeks. Opus 4.7 didn't just match it — it cleared the score by 6.6 points and pushed SWE-bench Verified past 87%. OpenAI's next move is widely expected to be GPT-5.5 or a Codex-specific variant, but as of this writing (April 17, 2026) GPT-5.4 is still the latest generally available flagship.
"+11% on SWE-bench Pro is gonna be a nice jump before 5 drops" — pdantix06, r/singularity (115 upvotes)
2. How We Tested
This comparison combines two types of data. First, the published benchmarks from Anthropic, OpenAI, and third-party evaluators. Second, a structured sweep of hands-on developer reports from Reddit threads between March 5 (GPT-5.4 launch) and April 17, 2026.
Benchmark Verification
Cross-referenced Anthropic and OpenAI's cited scores against Artificial Analysis, Vellum AI benchmark reproductions, and independent community tests. SWE-bench Pro and Verified numbers match vendor claims within rounding.
Reddit Sweep (r/ClaudeAI, r/OpenAI, r/LocalLLaMA, r/singularity)
Pulled comment threads from seven posts spanning the Opus 4.7 announcement through the first 24 hours of hands-on developer use, plus month-old GPT-5.4 retrospectives. Filtered for comments from users running both subscriptions ($100 Claude Max + $200 ChatGPT Pro tier) rather than one-shot opinions.
Coding Task Coverage
Reddit reports span TypeScript, Python, Rust, C/C++ and LISP codebases. Use cases include multi-file refactoring, edge-case debugging, policy document generation, data analysis in spreadsheets, and long-running agentic sessions. Quotes cited throughout this article link to the original threads.
Skepticism Applied
Every benchmark has a weakness — either the task distribution favors one model's training, or the evaluator has a relationship with a vendor. Where possible we note who ran the benchmark and what incentive they had.
Neither OpenAI nor Anthropic provided sponsored access or API credits for this comparison. All Reddit quotes are from public threads.
3. Benchmark Head-to-Head
Benchmarks never tell the full story, but they set the baseline. The table below covers the six metrics both vendors publish or where independent evaluators have confirmed scores for both models.
| Benchmark | Opus 4.7 | GPT-5.4 | Winner |
|---|---|---|---|
| SWE-bench Pro (coding) | 64.3% | 57.7% | Opus +6.6pt |
| SWE-bench Verified | 87.6% | ~74% | Opus +13pt |
| OSWorld (computer use) | ~73% | 75% | GPT-5.4 +2pt |
| GDPval (knowledge work) | Not published | 83% | GPT-5.4 |
| Agentic reasoning (multi-step) | +14% vs 4.6 | Baseline | Opus |
| Tool-call error rate | ~1/3 of 4.6 | -18% vs 5.2 | Opus |
| Context window | 200K (1M beta) | 1M standard | GPT-5.4 |
The shape of the result is clear: Opus 4.7 is the stronger coding and agentic model on benchmarks, GPT-5.4 keeps the edge on computer-use, general knowledge work, and standard 1M context. The Reddit user Yweain made a point worth flagging on r/singularity: "Considering that judging by this benchmark Gemini 3.1 Pro is on par with Opus 4.6, I feel like this benchmark is pretty not great." Healthy skepticism — but the pattern of SWE-bench, tool-call reduction, and agentic reasoning all moving in the same direction suggests the gains are real even if any one number is noisy.
4. Pricing and Token Reality
On paper, Opus 4.7 holds API pricing at $5 per million input tokens and $25 per million output tokens — matching Opus 4.6. The catch is buried in the release notes: Anthropic shipped a new tokenizer and deeper extended-thinking defaults, so the same English input can map to as much as 1.35x the token count. Factor that in and real per-query cost often runs 10-20% higher than a naive $/MTok comparison implies.
| Tier | Claude Opus 4.7 | GPT-5.4 |
|---|---|---|
| Consumer (entry) | Claude Pro $20/mo | ChatGPT Plus $20/mo |
| Power user | Max 5x $100/mo | ChatGPT Pro $200/mo |
| API input | $5 / MTok | ~$1.25 / MTok |
| API output | $25 / MTok | ~$10 / MTok |
| Batch discount | 50% off | 50% off |
On per-token API cost, GPT-5.4 is roughly 4x cheaper than Opus 4.7 on input and 2.5x cheaper on output. That delta matters more than it used to — GPT-5.4 has a 1M-token context window by default, so long-context workloads that previously forced you into Opus for the extended context are now cost-competitive on OpenAI.
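The interaction between per-token price and tokenizer inflation is easy to misjudge, so here is a back-of-the-envelope sketch. The prices come from the table above; the workload size is illustrative, and applying the 1.35x inflation factor to input tokens only is an assumption — Anthropic has not published a breakdown of where the extra tokens land.

```python
# Rough per-query cost model. Prices are $/MTok from the pricing table;
# the 1.35x tokenizer inflation is applied to input tokens only, which
# is an assumption rather than a documented behavior.
def query_cost(input_tokens, output_tokens, in_price, out_price, tok_mult=1.0):
    """Return the dollar cost of one query at the given $/MTok rates."""
    return (input_tokens * tok_mult * in_price + output_tokens * out_price) / 1_000_000

# Illustrative workload: 10K input tokens, 2K output tokens per query.
opus = query_cost(10_000, 2_000, 5.00, 25.00, tok_mult=1.35)  # worst-case inflation
gpt = query_cost(10_000, 2_000, 1.25, 10.00)

print(f"Opus 4.7: ${opus:.4f}/query, GPT-5.4: ${gpt:.4f}/query, ratio {opus / gpt:.1f}x")
# → Opus 4.7: $0.1175/query, GPT-5.4: $0.0325/query, ratio 3.6x
```

Note that without the inflation factor the Opus query costs $0.10, so the worst-case tokenizer drift alone accounts for the 10-20% real-cost drift mentioned above.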
A small but real signal: Reddit user East-Armadillo-1166 posted that "Opus 4.7 is available on v0 and cheaper than 4.5," which tracks with Anthropic's partner-pricing tier being slightly discounted against earlier Opus versions through some reseller channels. If you're on a platform that exposes multiple Opus versions side-by-side, compare the per-request cost before defaulting to 4.7.
5. What Developers Actually Say
Benchmarks are averages across curated task sets. Developer workflows are not curated. Here are the comments we kept coming back to — from users who pay for both subscriptions and use them daily on production code.
"I have $100 subscription for Claude and $200 for Codex. Codex with GPT-5.4 works better for finding edge cases and solving complex design. Claude is better at understanding what I actually want."
— jbcraigs, r/OpenAI (82 upvotes)
"Claude is the better strategist. Codex is the better executer, reviewer, and fact checker. It also handles long context better particularly in the Codex app."
— Mammoth_Doctor_7688, r/OpenAI
"Codex 5.4 was a huge improvement due to the 1M tokens context window (which finally matches what Claude had). These things change all the time."
— -Sliced-, r/OpenAI (7 upvotes)
"Agentic search getting worse?"
— sunstersun, r/singularity (61 upvotes, on Opus 4.7 benchmarks thread)
"For my use cases with C, C++, Rust, LISP and maths, the best results I get are with GPT and Gemini 3.1 Pro. Claude feels lacking there."
— muyuu, r/LocalLLaMA
"They both find stuff wrong with the other every single time I ask. But if I'm willing to do everything twice the output is phenomenal."
— New_Jaguar_9104, r/OpenAI
The pattern: developers who actually run both don't treat this as a winner-take-all decision. Claude gets used as the architect and intent-understanding layer, GPT-5.4's Codex gets used as the reviewer and edge-case hunter. The two-model workflow is mentioned consistently enough across threads that it reads as a genuine emerging best practice rather than fence-sitting.
One divergent data point worth keeping in view: on r/LocalLLaMA and parts of r/singularity there's an active thread about Anthropic allegedly "dumbing down" the model for non-government users. The evidence cited is a GitHub discussion rather than reproducible benchmarks, and no independent test has confirmed a regression. We include the signal because if you're planning to migrate production workflows, it's worth running your own golden-prompt suite before and after a model swap.
6. Where Each Model Wins
Claude Opus 4.7 wins when:
- You're doing multi-file refactoring or a large architectural change — Opus 4.7's agentic reasoning bump shows up most on long-horizon tasks
- You need the model to understand vague intent or translate product requirements into code
- You're using Claude Code with subagents or routines — tool-call reliability matters here more than raw benchmark score
- You're writing policy documents, technical guides, or long-form prose where tone consistency matters
- You already have a Claude Pro / Max subscription and want the $20-100/month flat rate instead of metered API
GPT-5.4 wins when:
- You need computer-use automation — OSWorld 75% is genuinely ahead and Opus 4.7 didn't close this gap
- You're cost-sensitive on high-volume API calls — GPT-5.4 is roughly 2.5-4x cheaper per token
- Your workflow depends on Codex — the Codex + 1M token context combination is the single most-praised GPT-5.4 use case
- You work in lower-level languages: C, C++, Rust, LISP, or heavy mathematical work
- You want spreadsheet modeling or knowledge-work tasks that GDPval covers (GPT-5.4 leads at 83%)
If your use case crosses both columns, the Reddit-tested pattern is: Opus as primary for writing, GPT-5.4's Codex as secondary for review. Claude Code's multi-agent setup makes this less friction-heavy than it was even six months ago.
7. Honest Limitations of Opus 4.7
Every model release generates a wave of enthusiasm. These are the caveats worth flagging now, after the first day of real use:
- Agentic search regression concerns. A top comment on the r/singularity benchmark thread flagged that agentic search may have gotten worse, not better. Anthropic hasn't addressed this publicly. If your workflow depends on Claude doing web-grounded research, benchmark against your own prompts before committing.
- Token counts inflated by new tokenizer. The same prompt can cost up to 1.35x the tokens it did on Opus 4.6. For high-volume API workloads this compounds fast — a $500/mo line item can become $650 without any behavior change on your end.
- Lower-level language coding still not its strength. Reddit reports on r/LocalLLaMA from users working in C, C++, Rust, LISP, and heavy maths continue to prefer GPT-5.4 and Gemini 3.1 Pro. Opus 4.7 didn't change this pattern.
- Model-dumbing allegations. A thread on r/LocalLLaMA references an ongoing GitHub discussion where some users claim Anthropic is reducing model capability for non-enterprise customers. Not independently verified. Included because if true it would matter, and because it's worth running your own golden prompts on a schedule.
- Context window still behind. 200K standard (1M beta with higher pricing) vs GPT-5.4's 1M standard. For whole-codebase prompts you'll still reach for GPT-5.4 or Gemini.
8. Migration Notes
If you're already on Opus 4.6 via the Anthropic API, the migration is mechanical. Update your model string to claude-opus-4-7 (or the dated variant like claude-opus-4-7-20260416). Consumer Claude.ai users get 4.7 automatically on Pro / Max / Team / Enterprise plans.
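As a concrete illustration, here is a minimal sketch of what the swap looks like at the request-body level of the Messages API. The dated 4.7 variant is the one named above; the 4.6 model string and the default max_tokens value are assumptions for illustration, so check them against your own config.

```python
import json

# Minimal Messages API request body -- the migration is just the "model"
# field. "claude-opus-4-6" here is an assumed prior model string; verify
# against what your code actually sends today.
def build_request(model: str, prompt: str, max_tokens: int = 1024) -> dict:
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

old = build_request("claude-opus-4-6", "Refactor this module.")
new = build_request("claude-opus-4-7-20260416", "Refactor this module.")

# Everything except the model string is unchanged between the two requests.
assert {k: v for k, v in old.items() if k != "model"} == \
       {k: v for k, v in new.items() if k != "model"}
print(json.dumps(new, indent=2))
```

In production you would send this body to the Messages endpoint with your usual auth headers; nothing else in the request changes, which is why the migration is mechanical.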
Opus 4.7 is available across the major cloud platforms from day one: Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry all have it live. GitHub Copilot users on Pro+, Business, and Enterprise tiers can pick Opus 4.7 as their coding model.
One practical tip: before swapping production workflows, rerun your own evaluation prompts. The benchmark improvements are real but any particular prompt can regress. The strongest advice across the Reddit threads: "run both for a week, compare the PRs each one opens." That's the test that actually reflects your codebase.
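A golden-prompt check doesn't need tooling — it can be a dozen lines. The sketch below uses dummy stand-ins for the model calls so it runs without an API key; `call_model`, the prompts, and their checks are all placeholders you would replace with your own.

```python
# Minimal golden-prompt regression harness: run the same prompts through
# a model callable and flag any prompt whose output fails its check.
# The prompts and checks here are toy placeholders.
from typing import Callable

GOLDEN = [
    # (prompt, predicate the output must satisfy)
    ("Return the literal string OK.", lambda out: "OK" in out),
    ("Name the capital of France.", lambda out: "Paris" in out),
]

def regressions(call_model: Callable[[str], str]) -> list[str]:
    """Return the prompts whose outputs fail their golden checks."""
    return [prompt for prompt, check in GOLDEN if not check(call_model(prompt))]

# Dummy stand-ins so the harness is runnable offline: the baseline
# passes every check, the candidate simulates a regressed swap-in.
baseline = lambda prompt: "OK Paris"
candidate = lambda prompt: "OK"

assert regressions(baseline) == []
failed = regressions(candidate)
print(f"{len(failed)} golden prompt(s) regressed: {failed}")
```

Run it once against the model you trust, once against the candidate, and diff the failure lists — that is the before/after comparison the migration advice is pointing at.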
9. FAQ
Is Claude Opus 4.7 better than GPT-5.4?
On coding benchmarks, yes — Opus 4.7 scores 64.3% on SWE-bench Pro against GPT-5.4's 57.7%, and hits 87.6% on SWE-bench Verified. But GPT-5.4 still leads on OSWorld computer use (75%) and knowledge-work tasks (GDPval 83%). Developers on Reddit consistently report Claude is better at understanding intent and architectural strategy, while Codex with GPT-5.4 is better at finding edge cases and acting as a reviewer. Most serious users run both.
How much does Claude Opus 4.7 cost?
Opus 4.7 API pricing holds at $5 per million input tokens and $25 per million output tokens — the same as Opus 4.6. However, the new tokenizer and deeper extended-thinking defaults mean the same prompt can map to as much as 1.35x the token count in practice. Consumer access: Claude Pro $20/month, Max 5x $100/month, Max 20x $200/month. Batch API offers 50% off.
When was Claude Opus 4.7 released?
April 16, 2026, roughly six weeks after GPT-5.4 shipped on March 5, 2026. The release is a capability refresh on the Opus 4.6 base, with headline improvements in agentic reasoning, tool-call reliability, and 3x higher image input resolution.
Should I switch from GPT-5.4 to Opus 4.7?
Not reflexively. For pure coding — especially long multi-file refactors and agentic work in Claude Code — Opus 4.7 is worth testing. For existing Codex workflows, 1M context needs, or OSWorld-style computer-use automation, GPT-5.4 holds ground. The Reddit consensus is to run both rather than pick one.
Has Opus 4.7 regressed in any areas?
Community reports on r/singularity flagged potential regressions in agentic search, and an r/LocalLLaMA thread references a GitHub discussion about model behavior changes. Anthropic has not publicly confirmed any regression, and the benchmarks show broad improvement. Worth running your own prompts before migrating production workflows.
10. Final Verdict
Opus 4.7 is the best coding model generally available as of April 17, 2026 — not because any one number blows the competition out, but because the SWE-bench improvements, tool-call error reduction, and agentic reasoning bump all move in the same direction. For shops that do serious engineering work with AI, this is the model to default to for architecture, multi-file refactors, and agent-driven workflows.
GPT-5.4 is not dethroned. It keeps meaningful leads on computer-use automation (OSWorld 75%), general knowledge-work tasks (GDPval 83%), standard 1M context windows, and per-token cost. If your workflow already depends on Codex or you work primarily in lower-level languages, staying on GPT-5.4 is rational. The Reddit consensus — run both, use each where it's strongest — holds.
The six-week window where GPT-5.4 sat alone at the top is closed. The interesting question now is how fast OpenAI responds and whether GPT-5.5 or a Codex-specific variant arrives before Opus 5.
Quick Recommendation
- Default coding model — Opus 4.7
- Edge-case reviewer / second opinion — GPT-5.4 via Codex
- Computer-use automation — GPT-5.4
- Cost-sensitive high-volume API — GPT-5.4 or Gemini 3.1 Pro
- Low-level systems code (C/Rust/LISP) — GPT-5.4 or Gemini 3.1 Pro