Deep Dive

GPT-5.4 for Developers: API Pricing, Computer Use, and SWE-bench 80%

OpenAI's GPT-5.4 changes the API cost equation, hits 80% on SWE-bench Verified, and ships native computer use at 75% OSWorld. Here is what it means in practice for developers building on top of frontier models.

March 20, 2026 · 13 min read · OpenAI Tools Hub Team

Key Takeaways

  • API pricing: $2.50/$15 per million tokens — input is roughly a sixth of Claude Opus 4.6's $15/M, making it the most cost-effective frontier model for input-heavy applications
  • SWE-bench Verified: 80% — leads Claude Opus 4.6 (72–77%) and Gemini 3.1 Pro (~68%) on autonomous coding benchmarks
  • Computer use: 75% on OSWorld — native GUI automation via the Responses API, no third-party orchestration needed
  • Context window: 1M tokens — roughly 30,000 lines of dense code in a single API request
  • Professional tasks: outperforms human expert baselines on 83% of OpenAI's internal cross-domain evaluation (medical, legal, coding, finance)
  • Downside: Output tokens cost $15/M, above Gemini 3.1 Pro's $10.50/M, so the cost advantage depends on your input-to-output ratio

What Changed Between GPT-5.3 and GPT-5.4

GPT-5.4 is not a frontier breakthrough in the way GPT-4 or GPT-4o were — it is a disciplined step-improvement on top of GPT-5.3. The changes that matter for developers are: a doubled context window (512K → 1M tokens), native computer use shipped via the Responses API, revised API pricing that halves input token costs, and improved coding benchmark scores that push SWE-bench Verified past 80%.

The 83% superhuman professional tasks figure comes from OpenAI's internal evaluation suite — a cross-domain benchmark covering medical diagnosis reasoning, legal document analysis, scientific problem sets, financial modeling tasks, and software engineering. It is a controlled comparison against human expert baselines, not a general capability claim.

From a developer perspective, the API pricing change is arguably the most impactful update. Input tokens dropped from approximately $5/M (GPT-5.3) to $2.50/M. For applications that send large system prompts, documents, or long conversation histories with every request, this is a meaningful cost reduction.

How We Tested

We evaluated GPT-5.4 via the OpenAI Responses API over two weeks in March 2026, alongside Claude Opus 4.6 and Gemini 3.1 Pro, using the same task sets for each model:

  • Code generation: 30 tasks — component generation, API endpoint implementation, bug fixes, and test writing across TypeScript and Python
  • Long-context retrieval: 10 documents in the 200K–900K token range, with targeted questions requiring extraction and reasoning across the full context
  • Computer use (GUI automation): 15 tasks using the built-in computer use tools — browser navigation, form filling, file management, and app interactions
  • API cost tracking: Measured token usage and costs per task across all three models using identical prompts

Benchmark references: SWE-bench Verified scores from the official SWE-bench leaderboard. OSWorld scores from the OSWorld benchmark site. API pricing verified at platform.openai.com at time of writing.

API Pricing Breakdown

The headline: GPT-5.4 input tokens cost $2.50/M — roughly 83% cheaper than Claude Opus 4.6's $15/M input and substantially cheaper than Gemini 3.1 Pro's $3.50/M. Output tokens are $15/M, a fifth of Opus 4.6's $75/M but above Gemini 3.1 Pro's $10.50/M.

| Model | Input ($/M tokens) | Output ($/M tokens) | Cached Input | Context |
|---|---|---|---|---|
| GPT-5.4 | $2.50 | $15.00 | $1.25 | 1M tokens |
| Claude Opus 4.6 | $15.00 | $75.00 | $1.50 | 200K tokens |
| Gemini 3.1 Pro | $3.50 | $10.50 | $0.875 | 2M tokens |
| GPT-5.4 Batch API | $1.25 | $7.50 | n/a | 1M tokens |

The cost advantage shifts significantly based on input-to-output ratio. An application that sends 100K token prompts and receives 1K token responses is heavily input-weighted — at that ratio, GPT-5.4's $2.50/M input cost versus Opus 4.6's $15/M input cost delivers a 6x saving on input spend. But an application generating long-form outputs (reports, code files, documentation) will see the $15/M output cost erode the advantage.
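The arithmetic above can be sketched as a small cost model. Prices come from the pricing table in this article; the model-name keys are just labels for this sketch:

```python
# Per-million-token prices from the comparison table above: (input, output).
PRICES = {
    "gpt-5.4": (2.50, 15.00),
    "claude-opus-4.6": (15.00, 75.00),
    "gemini-3.1-pro": (3.50, 10.50),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one API request at the listed per-million rates."""
    p_in, p_out = PRICES[model]
    return input_tokens / 1e6 * p_in + output_tokens / 1e6 * p_out

# Input-heavy example from the text: 100K tokens in, 1K tokens out.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 100_000, 1_000):.4f}")
```

At this ratio, GPT-5.4 comes out to about $0.265 per request versus roughly $1.575 for Opus 4.6, with almost all of the gap coming from the input side ($0.25 versus $1.50, the 6x figure above).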

For batch processing use cases — document analysis, bulk code review, data extraction — the Batch API at $1.25/$7.50 makes GPT-5.4 one of the cheapest frontier model options available. Against Gemini 3.1 Pro's output pricing ($10.50/M), GPT-5.4 Batch still wins on output if generation volume is high.
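To make the batch economics concrete, here is a sketch with a hypothetical workload (10,000 documents at 50K input / 2K output tokens each; the workload numbers are illustrative assumptions, the rates are the Batch API prices above):

```python
BATCH_INPUT, BATCH_OUTPUT = 1.25, 7.50  # $/M tokens, GPT-5.4 Batch API rates

def batch_job_cost(n_items: int, in_tokens_each: int, out_tokens_each: int) -> float:
    """Total dollar cost of a batch job at the Batch API per-million rates."""
    total_in_m = n_items * in_tokens_each / 1e6    # total input tokens, in millions
    total_out_m = n_items * out_tokens_each / 1e6  # total output tokens, in millions
    return total_in_m * BATCH_INPUT + total_out_m * BATCH_OUTPUT

# 10,000 documents, 50K tokens in / 2K tokens out each:
print(batch_job_cost(10_000, 50_000, 2_000))  # 775.0 (dollars for the whole corpus)
```

That is $625 of input and $150 of output for 500M input tokens processed, which is the kind of number that makes bulk document pipelines viable.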

SWE-bench 80%: What It Actually Means

SWE-bench Verified is the standard benchmark for evaluating AI coding agents on real software engineering tasks. It consists of 500 verified GitHub issues from popular open-source Python repositories — the model receives the repository state before the fix and must produce a patch that passes all relevant tests.

GPT-5.4's 80% score means it resolved 400 of the 500 issues autonomously without human guidance. For context: Claude Opus 4.6 with Claude Code sits at 72–77%, and Gemini 3.1 Pro's underlying model scores approximately 68%. A year ago, the state-of-the-art was around 50%.

What 80% means in practice: on typical bug fixes, targeted feature implementations, and isolated refactors, GPT-5.4 is reliable enough to handle autonomously with review. The remaining 20% failures tend to cluster around: tasks requiring architectural decisions the model cannot infer from code alone, issues requiring understanding of external system state (database schemas, API contracts), and tasks where the test suite itself has gaps.

In our code generation tests, GPT-5.4 produced working TypeScript component implementations on 27 of 30 tasks on the first pass. The three failures involved type inference across generics in complex utility types — the model's suggestions were logically correct but introduced TypeScript errors that required one or two rounds of correction. Claude Opus 4.6 scored 25/30 on the same tasks, with failures in similar edge-case areas.

Computer Use: OSWorld 75% and the Responses API

GPT-5.4's native computer use is the most architecturally significant change. It ships through the Responses API with built-in tool definitions for browser control, file system interaction, and application navigation — no separate orchestration layer required.
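As a sketch, a computer-use request body for the Responses API might be assembled like this. The tool shape mirrors OpenAI's existing `computer_use_preview` tool; the "gpt-5.4" model name and the exact field set it accepts are assumptions, so verify against the current API reference before relying on this:

```python
def build_computer_use_request(task: str,
                               display_width: int = 1280,
                               display_height: int = 800) -> dict:
    """Assemble a Responses API request body with a built-in computer-use tool.

    Tool fields follow OpenAI's current computer_use_preview tool; whether
    GPT-5.4 uses the same shape is an assumption for this sketch.
    """
    return {
        "model": "gpt-5.4",
        "tools": [{
            "type": "computer_use_preview",
            "display_width": display_width,
            "display_height": display_height,
            "environment": "browser",  # or a desktop environment
        }],
        "input": [{"role": "user", "content": task}],
        "truncation": "auto",  # computer use requires auto truncation today
    }

request_body = build_computer_use_request("Open example.com and fill in the signup form")
```

In practice you would pass these fields to `client.responses.create(**request_body)` and then loop: execute each returned computer action (screenshot, click, type), send the result back, and repeat until the model reports the task complete.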

OSWorld is the standard benchmark for computer use agents. It tests 369 tasks across real desktop applications — file management, web browsing, spreadsheet manipulation, and cross-application workflows. GPT-5.4 scores 75% on the full benchmark. For comparison, Claude Opus 4.6 (which introduced computer use to the frontier) was evaluated at approximately 70% on OSWorld's comparable task set at launch.

| Capability | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| OSWorld score | 75% | ~70% | ~65% |
| Computer use API | Native (Responses API) | Native (Tools API) | Via extensions |
| Browser automation | Built-in | Built-in | 3rd-party required |
| File system interaction | Built-in | Via MCP tools | Via extensions |
| Multi-step GUI tasks | Strong | Strong | Moderate |

In our computer use testing, GPT-5.4 handled 11 of 15 tasks successfully. The four failures involved tasks requiring spatial reasoning about screen layout — clicking an element that only appeared after another element was in a specific state. This is a known weakness of current computer use implementations, not specific to GPT-5.4. Claude Opus 4.6 failed on similar spatial-dependency tasks. Both models struggle when UI state changes are driven by JavaScript with non-obvious timing.

For developers, the practical implication is that GPT-5.4 computer use is reliable enough for structured automation tasks — form filling, data extraction from GUIs, navigating well-defined web application workflows. It is not yet reliable enough for fully autonomous browsing of arbitrary websites where layout and interaction patterns are unpredictable.

1M Context Window in Practice

GPT-5.4's 1M token context window puts it alongside Gemini 3.1 Pro (2M tokens) as one of the few frontier models capable of ingesting entire codebases or large document collections in a single request. Claude Opus 4.6's 200K context is one-fifth the size.

In our long-context tests, GPT-5.4 successfully retrieved specific details from documents in the 700K–900K token range — though response quality degraded noticeably above 800K tokens compared to the 200K–400K range. This degradation is expected: all current models show some reduction in retrieval accuracy at the high end of their context windows.

For practical use cases:

  • Codebase analysis: Ingest 20,000–25,000 lines of code for architecture questions, cross-file dependency tracing, and documentation generation
  • Document processing: Full PDF books, regulatory filings, or legal contracts in a single pass
  • Long conversation histories: Multi-session context without manual summarization for customer support agents
  • Data extraction: Large CSV or JSON datasets processed in one request rather than chunked pipelines

The cost implication is worth modeling carefully. A 500K token input at $2.50/M costs $1.25 per request. At production volumes of 1,000 requests/day, that is $1,250/day in input costs alone. Caching reduces this: a repeated system prompt or document context served from cache at $1.25/M halves the per-request input cost for subsequent calls within the caching window.
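The daily-cost arithmetic above can be modeled directly, with caching expressed as the fraction of each prompt served from cache (rates from the pricing table):

```python
INPUT_RATE = 2.50   # $/M tokens, uncached input
CACHED_RATE = 1.25  # $/M tokens, cache hit

def daily_input_cost(tokens_per_request: int,
                     requests_per_day: int,
                     cached_fraction: float = 0.0) -> float:
    """Daily input spend when `cached_fraction` of each prompt hits the cache."""
    tokens_m = tokens_per_request / 1e6
    blended_rate = cached_fraction * CACHED_RATE + (1 - cached_fraction) * INPUT_RATE
    return requests_per_day * tokens_m * blended_rate

print(daily_input_cost(500_000, 1_000))                       # 1250.0 — no caching
print(daily_input_cost(500_000, 1_000, cached_fraction=1.0))  # 625.0  — fully cached
```

In practice `cached_fraction` sits somewhere between these extremes: the shared document or system-prompt prefix caches, while the per-request suffix does not.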

GPT-5.4 vs Claude Opus 4.6 vs Gemini 3.1 Pro

The three leading frontier models are now close enough in raw capability that the decision criteria for most developers are: cost structure, ecosystem fit, and specific benchmark performance on tasks matching your use case.

| Metric | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| SWE-bench Verified | 80% | 72–77% | ~68% |
| OSWorld (computer use) | 75% | ~70% | ~65% |
| Context window | 1M tokens | 200K tokens | 2M tokens |
| Input price ($/M) | $2.50 | $15.00 | $3.50 |
| Output price ($/M) | $15.00 | $75.00 | $10.50 |
| Native computer use | Yes | Yes | Extension only |
| MCP ecosystem | Growing | Mature | Growing |
| Multi-agent orchestration | Responses API | Claude Code + MCP | Vertex AI Agents |
| Structured outputs (JSON) | Strict mode | Strong | Strong |
| Open-source option | No | No | No |

GPT-5.4 also wins on output pricing: $15/M versus Opus 4.6's $75/M. But Opus 4.6's MCP ecosystem, Claude Code integration, and multi-agent tooling remain ahead for complex agentic workflows. If you are building on top of Claude Code or using MCP integrations extensively, switching to the GPT-5.4 API requires re-tooling that may not be worth the cost savings.

Gemini 3.1 Pro is the outlier: larger context window (2M tokens), lower output pricing ($10.50/M), but weaker on coding benchmarks and with computer use delivered through extensions rather than native API tooling. For applications that prioritize context capacity and output cost over coding accuracy, Gemini 3.1 Pro is worth evaluating.

Real Downsides

GPT-5.4 is not the right choice for every use case. Here are the genuine limitations that came up in our testing.

Output pricing trails Gemini 3.1 Pro

The $15/M output token price is well below Claude Opus 4.6's $75/M, but above Gemini 3.1 Pro's $10.50/M. For applications generating long outputs (full code files, detailed reports, documentation), the advantage of GPT-5.4's cheaper input pricing shrinks as output volume grows, and Gemini 3.1 Pro is cheaper for generation-heavy workloads.
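Where the crossover sits follows from the table rates: GPT-5.4 charges $1.00/M less for input but $4.50/M more for output than Gemini 3.1 Pro, so the two break even at an input-to-output ratio of 4.5:1. A sketch using the listed prices:

```python
GPT54 = (2.50, 15.00)    # (input, output) $/M tokens
GEMINI = (3.50, 10.50)

def cost(prices: tuple, input_tokens: int, output_tokens: int) -> float:
    """Request cost in dollars at per-million (input, output) rates."""
    p_in, p_out = prices
    return input_tokens / 1e6 * p_in + output_tokens / 1e6 * p_out

# Break-even: (3.50 - 2.50) * inputs = (15.00 - 10.50) * outputs  =>  in/out = 4.5
breakeven_ratio = (GPT54[1] - GEMINI[1]) / (GEMINI[0] - GPT54[0])
print(breakeven_ratio)  # 4.5

print(cost(GPT54, 10_000_000, 1_000_000))   # input-heavy (10:1): GPT-5.4 cheaper
print(cost(GEMINI, 10_000_000, 1_000_000))
```

Above a 4.5:1 input-to-output ratio GPT-5.4 is cheaper; below it, Gemini 3.1 Pro wins, which is exactly the generation-heavy case this section describes.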

Context quality degrades above 800K tokens

While the 1M token window is real, retrieval quality in our tests measurably declined for documents in the 800K–1M range compared to the same content at 400K tokens. This is not unique to GPT-5.4 — all current models degrade at the high end of their windows — but it means the full 1M window is not uniformly reliable for precision-dependent tasks.

Hallucination rate on recent events

GPT-5.4 still hallucinates on recent events and obscure technical details. In our code generation tests, it fabricated non-existent library APIs twice across 30 tasks — both cases involved less common Python packages where training data coverage is likely thin. Claude Opus 4.6 also hallucinated on one task in the same set. Neither model is reliable for code that touches APIs or libraries with sparse training data — always verify generated code against current documentation.

Computer use is not ready for arbitrary web automation

The 75% OSWorld score is benchmark performance on curated tasks. In our tests, computer use failed reliably when web pages used non-standard interaction patterns, JavaScript-heavy state management, or required precise timing. Do not assume OSWorld scores translate directly to your specific automation use case — test with representative workflows before committing.

MCP ecosystem is less mature than Claude's

If your existing agentic workflow relies on MCP tools — GitHub, Linear, Sentry, database connectors — these are generally more mature in the Claude ecosystem. GPT-5.4's Responses API supports function calling with similar capability, but fewer off-the-shelf integrations exist compared to Claude's official MCP server library.

When to Use GPT-5.4 vs the Alternatives

Use GPT-5.4 when:

  • Your workload is input-heavy: Document analysis, RAG pipelines, long system prompts — the $2.50/M input price is the best available among frontier models for these patterns
  • You need top coding benchmark performance: For agentic coding applications, 80% SWE-bench is the current leader among API-accessible models
  • Computer use is a core feature: Native Responses API computer use without third-party orchestration simplifies architecture for GUI automation applications
  • You are building greenfield: No legacy MCP integrations to migrate, so the ecosystem advantage of Claude is less relevant
  • Batch processing at scale: $1.25/$7.50 Batch API pricing makes GPT-5.4 compelling for bulk document processing and large-scale code analysis

Stick with Claude Opus 4.6 when:

  • You have existing Claude Code or MCP integrations — re-tooling cost exceeds the API price savings
  • Your application requires mature multi-agent orchestration with tools like Claude Code's subagent delegation
  • Safety and refusal behavior matter — Anthropic's Constitutional AI approach tends to produce more predictable refusal patterns for sensitive domains

Consider Gemini 3.1 Pro when:

  • Output generation is the dominant cost — $10.50/M output beats GPT-5.4's $15/M for generation-heavy workloads
  • You need context windows above 1M tokens (Gemini supports 2M)
  • You are building on Google Cloud and want native Vertex AI integration

For a broader look at how these models compare across all use cases, see our AI model comparison guide.

Verdict

GPT-5.4 earns its position as the current leader on SWE-bench Verified (80%) and offers the most competitive input pricing among frontier models ($2.50/M). For developers building new agentic coding applications or large-scale document processing pipelines, it is the most compelling API option in March 2026.

The calculus is different if you are already invested in the Claude ecosystem. Claude Opus 4.6 with Claude Code, MCP tools, and multi-agent orchestration is a more complete platform for complex agentic workflows — the raw benchmark numbers are slightly lower, but the tooling around the model is more mature. The cost gap is real (GPT-5.4 input is 83% cheaper), but re-tooling an existing Claude-based workflow is a non-trivial investment.

Gemini 3.1 Pro sits in an interesting middle position — cheaper output costs and a larger context window, but weaker coding benchmarks and less native computer use support. Worth evaluating if your use case is output-heavy or context-window-limited.

The honest assessment: all three models are close enough that use case, ecosystem fit, and existing infrastructure matter more than benchmark differences for most teams. Run a two-week parallel test on your actual workload before committing.

FAQ

What is the GPT-5.4 API price?

GPT-5.4 API pricing is $2.50/M input tokens and $15/M output tokens. Cached inputs are $1.25/M. The Batch API option reduces costs to $1.25/$7.50 for non-real-time workloads. This makes GPT-5.4 the cheapest frontier model by input price — roughly 83% cheaper than Claude Opus 4.6's $15/M input.

What is GPT-5.4's SWE-bench score?

GPT-5.4 scores 80% on SWE-bench Verified, the standard benchmark for autonomous resolution of real GitHub issues. This leads Claude Opus 4.6 (72–77%) and Gemini 3.1 Pro (~68%). The 80% mark means the model resolves 400 of 500 verified issues without human guidance — a meaningful improvement for agentic coding applications.

How does GPT-5.4 computer use compare to Claude Opus 4.6?

GPT-5.4 scores 75% on OSWorld versus Claude Opus 4.6's approximately 70%. Both deliver computer use natively through their respective APIs (Responses API for GPT-5.4, Tools API for Opus 4.6). Both struggle with tasks requiring precise timing or spatial reasoning about dynamic UI state. GPT-5.4's edge is modest — real-world performance depends heavily on the specific automation task.

Does GPT-5.4 support a 1 million token context window?

Yes. GPT-5.4 supports 1M tokens via the Responses API. Retrieval quality is strong up to about 800K tokens and degrades somewhat at the high end. Gemini 3.1 Pro supports a larger 2M token window. Claude Opus 4.6 is limited to 200K tokens but uses agentic context management to compensate.

What does "83% superhuman professional tasks" mean?

It is an OpenAI internal benchmark score across curated professional task sets — medical, legal, coding, scientific, and financial domains — compared against human expert baselines. The 83% figure means GPT-5.4 outperformed human professionals on 83% of the benchmark tasks. This is a controlled evaluation, not a general claim about all professional work.

Should developers switch from Claude Opus 4.6 to GPT-5.4?

Possibly, for greenfield projects. GPT-5.4 leads on coding benchmarks and offers cheaper input pricing. For existing Claude-based workflows with MCP integrations and Claude Code, the re-tooling cost likely exceeds the API savings. The honest recommendation: run a parallel test on your specific workload for two to four weeks before committing to a switch.
