Deep Dive

GPT-5.4 for Developers: API Pricing, Computer Use, and SWE-bench 80%

OpenAI's GPT-5.4 changes the API cost equation, hits 80% on SWE-bench Verified, and ships native computer use at 75% OSWorld. Here is what it means in practice for developers building on top of frontier models.

March 20, 2026 · 13 min read · OpenAI Tools Hub Team

Key Takeaways

  • API pricing: $2.50/$15 per million tokens — input is roughly a sixth of Claude Opus 4.6's $15/M, making it the most cost-effective frontier model for input-heavy applications
  • SWE-bench Verified: 80% — leads Claude Opus 4.6 (72–77%) and Gemini 3.1 Pro (~68%) on autonomous coding benchmarks
  • Computer use: 75% on OSWorld — native GUI automation via the Responses API, no third-party orchestration needed
  • Context window: 1M tokens — roughly 30,000 lines of dense code in a single API request
  • Professional tasks: outperforms human expert baselines on 83% of OpenAI's internal cross-domain evaluation (medical, legal, coding, finance)
  • Downside: Output tokens cost $15/M, above Gemini 3.1 Pro's $10.50/M, so the cost advantage depends on your input-to-output ratio

What Changed Between GPT-5.3 and GPT-5.4

GPT-5.4 is not a frontier breakthrough in the way GPT-4 or GPT-4o were — it is a disciplined step-improvement on top of GPT-5.3. The changes that matter for developers are: a doubled context window (512K → 1M tokens), native computer use shipped via the Responses API, revised API pricing that halves input token costs, and improved coding benchmark scores that push SWE-bench Verified past 80%.

The 83% superhuman professional tasks figure comes from OpenAI's internal evaluation suite — a cross-domain benchmark covering medical diagnosis reasoning, legal document analysis, scientific problem sets, financial modeling tasks, and software engineering. It is a controlled comparison against human expert baselines, not a general capability claim.

From a developer perspective, the API pricing change is arguably the most impactful update. Input tokens dropped from approximately $5/M (GPT-5.3) to $2.50/M. For applications that send large system prompts, documents, or long conversation histories with every request, this is a meaningful cost reduction.

How We Tested

We evaluated GPT-5.4 via the OpenAI Responses API over two weeks in March 2026, alongside Claude Opus 4.6 and Gemini 3.1 Pro, using the same task sets for each model:

  • Code generation: 30 tasks — component generation, API endpoint implementation, bug fixes, and test writing across TypeScript and Python
  • Long-context retrieval: 10 documents in the 200K–900K token range, with targeted questions requiring extraction and reasoning across the full context
  • Computer use (GUI automation): 15 tasks using the built-in computer use tools — browser navigation, form filling, file management, and app interactions
  • API cost tracking: Measured token usage and costs per task across all three models using identical prompts

Benchmark references: SWE-bench Verified scores from the official SWE-bench leaderboard. OSWorld scores from the OSWorld benchmark site. API pricing verified at platform.openai.com at time of writing.

API Pricing Breakdown

The headline: GPT-5.4 input tokens cost $2.50/M — roughly 83% cheaper than Claude Opus 4.6's $15/M input and substantially cheaper than Gemini 3.1 Pro's $3.50/M. Output tokens are $15/M, a fifth of Opus 4.6's $75/M but above Gemini 3.1 Pro's $10.50/M.

| Model | Input ($/M tokens) | Output ($/M tokens) | Cached Input | Context |
|---|---|---|---|---|
| GPT-5.4 | $2.50 | $15.00 | $1.25 | 1M tokens |
| Claude Opus 4.6 | $15.00 | $75.00 | $1.50 | 200K tokens |
| Gemini 3.1 Pro | $3.50 | $10.50 | $0.875 | 2M tokens |
| GPT-5.4 Batch API | $1.25 | $7.50 | n/a | 1M tokens |

The cost advantage shifts significantly based on input-to-output ratio. An application that sends 100K token prompts and receives 1K token responses is heavily input-weighted — at that ratio, GPT-5.4's $2.50/M input cost versus Opus 4.6's $15/M input cost delivers a 6x saving on input spend. But an application generating long-form outputs (reports, code files, documentation) will see the $15/M output cost erode the advantage.
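The arithmetic above can be sketched as a small cost model. Prices come from the pricing table in this article; the model-name keys are just labels for this sketch:

```python
# Per-million-token prices from the comparison table above: (input, output).
PRICES = {
    "gpt-5.4": (2.50, 15.00),
    "claude-opus-4.6": (15.00, 75.00),
    "gemini-3.1-pro": (3.50, 10.50),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one API request at the listed per-million rates."""
    p_in, p_out = PRICES[model]
    return input_tokens / 1e6 * p_in + output_tokens / 1e6 * p_out

# Input-heavy example from the text: 100K tokens in, 1K tokens out.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 100_000, 1_000):.4f}")
```

At this ratio, GPT-5.4 comes out to about $0.265 per request versus roughly $1.575 for Opus 4.6, with almost all of the gap coming from the input side ($0.25 versus $1.50, the 6x figure above).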

For batch processing use cases — document analysis, bulk code review, data extraction — the Batch API at $1.25/$7.50 makes GPT-5.4 one of the cheapest frontier model options available. Against Gemini 3.1 Pro's output pricing ($10.50/M), GPT-5.4 Batch still wins on output if generation volume is high.
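To make the batch economics concrete, here is a sketch with a hypothetical workload (10,000 documents at 50K input / 2K output tokens each; the workload numbers are illustrative assumptions, the rates are the Batch API prices above):

```python
BATCH_INPUT, BATCH_OUTPUT = 1.25, 7.50  # $/M tokens, GPT-5.4 Batch API rates

def batch_job_cost(n_items: int, in_tokens_each: int, out_tokens_each: int) -> float:
    """Total dollar cost of a batch job at the Batch API per-million rates."""
    total_in_m = n_items * in_tokens_each / 1e6    # total input tokens, in millions
    total_out_m = n_items * out_tokens_each / 1e6  # total output tokens, in millions
    return total_in_m * BATCH_INPUT + total_out_m * BATCH_OUTPUT

# 10,000 documents, 50K tokens in / 2K tokens out each:
print(batch_job_cost(10_000, 50_000, 2_000))  # 775.0 (dollars for the whole corpus)
```

That is $625 of input and $150 of output for 500M input tokens processed, which is the kind of number that makes bulk document pipelines viable.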

SWE-bench 80%: What It Actually Means

SWE-bench Verified is the standard benchmark for evaluating AI coding agents on real software engineering tasks. It consists of 500 verified GitHub issues from popular open-source Python repositories — the model receives the repository state before the fix and must produce a patch that passes all relevant tests.

GPT-5.4's 80% score means it resolved 400 of the 500 issues autonomously without human guidance. For context: Claude Opus 4.6 with Claude Code sits at 72–77%, and Gemini 3.1 Pro's underlying model scores approximately 68%. A year ago, the state-of-the-art was around 50%.

What 80% means in practice: on typical bug fixes, targeted feature implementations, and isolated refactors, GPT-5.4 is reliable enough to handle autonomously with review. The remaining 20% failures tend to cluster around: tasks requiring architectural decisions the model cannot infer from code alone, issues requiring understanding of external system state (database schemas, API contracts), and tasks where the test suite itself has gaps.

In our code generation tests, GPT-5.4 produced working TypeScript component implementations on 27 of 30 tasks on the first pass. The three failures involved type inference across generics in complex utility types — the model's suggestions were logically correct but introduced TypeScript errors that required one or two rounds of correction. Claude Opus 4.6 scored 25/30 on the same tasks, with failures in similar edge-case areas.

Computer Use: OSWorld 75% and the Responses API

GPT-5.4's native computer use is the most architecturally significant change. It ships through the Responses API with built-in tool definitions for browser control, file system interaction, and application navigation — no separate orchestration layer required.
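As a sketch, a computer-use request body for the Responses API might be assembled like this. The tool shape mirrors OpenAI's existing `computer_use_preview` tool; the "gpt-5.4" model name and the exact field set it accepts are assumptions, so verify against the current API reference before relying on this:

```python
def build_computer_use_request(task: str,
                               display_width: int = 1280,
                               display_height: int = 800) -> dict:
    """Assemble a Responses API request body with a built-in computer-use tool.

    Tool fields follow OpenAI's current computer_use_preview tool; whether
    GPT-5.4 uses the same shape is an assumption for this sketch.
    """
    return {
        "model": "gpt-5.4",
        "tools": [{
            "type": "computer_use_preview",
            "display_width": display_width,
            "display_height": display_height,
            "environment": "browser",  # or a desktop environment
        }],
        "input": [{"role": "user", "content": task}],
        "truncation": "auto",  # computer use requires auto truncation today
    }

request_body = build_computer_use_request("Open example.com and fill in the signup form")
```

In practice you would pass these fields to `client.responses.create(**request_body)` and then loop: execute each returned computer action (screenshot, click, type), send the result back, and repeat until the model reports the task complete.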

OSWorld is the standard benchmark for computer use agents. It tests 369 tasks across real desktop applications — file management, web browsing, spreadsheet manipulation, and cross-application workflows. GPT-5.4 scores 75% on the full benchmark. For comparison, Claude Opus 4.6 (which introduced computer use to the frontier) was evaluated at approximately 70% on OSWorld's comparable task set at launch.

| Capability | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| OSWorld score | 75% | ~70% | ~65% |
| Computer use API | Native (Responses API) | Native (Tools API) | Via extensions |
| Browser automation | Built-in | Built-in | 3rd-party required |
| File system interaction | Built-in | Via MCP tools | Via extensions |
| Multi-step GUI tasks | Strong | Strong | Moderate |

In our computer use testing, GPT-5.4 handled 11 of 15 tasks successfully. The four failures involved tasks requiring spatial reasoning about screen layout — clicking an element that only appeared after another element was in a specific state. This is a known weakness of current computer use implementations, not specific to GPT-5.4. Claude Opus 4.6 failed on similar spatial-dependency tasks. Both models struggle when UI state changes are driven by JavaScript with non-obvious timing.

For developers, the practical implication is that GPT-5.4 computer use is reliable enough for structured automation tasks — form filling, data extraction from GUIs, navigating well-defined web application workflows. It is not yet reliable enough for fully autonomous browsing of arbitrary websites where layout and interaction patterns are unpredictable.

1M Context Window in Practice

GPT-5.4's 1M token context window puts it alongside Gemini 3.1 Pro (2M tokens) as one of the few frontier models capable of ingesting entire codebases or large document collections in a single request. Claude Opus 4.6's 200K context is one-fifth the size.

In our long-context tests, GPT-5.4 successfully retrieved specific details from documents in the 700K–900K token range — though response quality degraded noticeably above 800K tokens compared to the 200K–400K range. This degradation is expected: all current models show some reduction in retrieval accuracy at the high end of their context windows.

For practical use cases:

  • Codebase analysis: Ingest 20,000–25,000 lines of code for architecture questions, cross-file dependency tracing, and documentation generation
  • Document processing: Full PDF books, regulatory filings, or legal contracts in a single pass
  • Long conversation histories: Multi-session context without manual summarization for customer support agents
  • Data extraction: Large CSV or JSON datasets processed in one request rather than chunked pipelines

The cost implication is worth modeling carefully. A 500K token input at $2.50/M costs $1.25 per request. At production volumes of 1,000 requests/day, that is $1,250/day in input costs alone. Caching reduces this: a repeated system prompt or document context served from cache at $1.25/M halves the per-request input cost for subsequent calls within the caching window.
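The daily-cost arithmetic above can be modeled directly, with caching expressed as the fraction of each prompt served from cache (rates from the pricing table):

```python
INPUT_RATE = 2.50   # $/M tokens, uncached input
CACHED_RATE = 1.25  # $/M tokens, cache hit

def daily_input_cost(tokens_per_request: int,
                     requests_per_day: int,
                     cached_fraction: float = 0.0) -> float:
    """Daily input spend when `cached_fraction` of each prompt hits the cache."""
    tokens_m = tokens_per_request / 1e6
    blended_rate = cached_fraction * CACHED_RATE + (1 - cached_fraction) * INPUT_RATE
    return requests_per_day * tokens_m * blended_rate

print(daily_input_cost(500_000, 1_000))                       # 1250.0 — no caching
print(daily_input_cost(500_000, 1_000, cached_fraction=1.0))  # 625.0  — fully cached
```

In practice `cached_fraction` sits somewhere between these extremes: the shared document or system-prompt prefix caches, while the per-request suffix does not.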

GPT-5.4 vs Claude Opus 4.6 vs Gemini 3.1 Pro

The three leading frontier models are now close enough in raw capability that the decision criteria for most developers are: cost structure, ecosystem fit, and specific benchmark performance on tasks matching your use case.

| Metric | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| SWE-bench Verified | 80% | 72–77% | ~68% |
| OSWorld (computer use) | 75% | ~70% | ~65% |
| Context window | 1M tokens | 200K tokens | 2M tokens |
| Input price ($/M) | $2.50 | $15.00 | $3.50 |
| Output price ($/M) | $15.00 | $75.00 | $10.50 |
| Native computer use | Yes | Yes | Extension only |
| MCP ecosystem | Growing | Mature | Growing |
| Multi-agent orchestration | Responses API | Claude Code + MCP | Vertex AI Agents |
| Structured outputs (JSON) | Strict mode | Strong | Strong |
| Open-source option | No | No | No |

GPT-5.4 also wins on output pricing: $15/M versus Opus 4.6's $75/M. But Opus 4.6's MCP ecosystem, Claude Code integration, and multi-agent tooling remain ahead for complex agentic workflows. If you are building on top of Claude Code or using MCP integrations extensively, switching to the GPT-5.4 API requires re-tooling that may not be worth the cost savings.

Gemini 3.1 Pro is the outlier: larger context window (2M tokens), lower output pricing ($10.50/M), but weaker on coding benchmarks and with computer use delivered through extensions rather than native API tooling. For applications that prioritize context capacity and output cost over coding accuracy, Gemini 3.1 Pro is worth evaluating.

Real Downsides

GPT-5.4 is not the right choice for every use case. Here are the genuine limitations that came up in our testing.

Output pricing trails Gemini 3.1 Pro

The $15/M output token price is well below Claude Opus 4.6's $75/M, but above Gemini 3.1 Pro's $10.50/M. For applications generating long outputs (full code files, detailed reports, documentation), the advantage of GPT-5.4's cheaper input pricing shrinks as output volume grows, and Gemini 3.1 Pro is cheaper for generation-heavy workloads.
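Where the crossover sits follows from the table rates: GPT-5.4 charges $1.00/M less for input but $4.50/M more for output than Gemini 3.1 Pro, so the two break even at an input-to-output ratio of 4.5:1. A sketch using the listed prices:

```python
GPT54 = (2.50, 15.00)    # (input, output) $/M tokens
GEMINI = (3.50, 10.50)

def cost(prices: tuple, input_tokens: int, output_tokens: int) -> float:
    """Request cost in dollars at per-million (input, output) rates."""
    p_in, p_out = prices
    return input_tokens / 1e6 * p_in + output_tokens / 1e6 * p_out

# Break-even: (3.50 - 2.50) * inputs = (15.00 - 10.50) * outputs  =>  in/out = 4.5
breakeven_ratio = (GPT54[1] - GEMINI[1]) / (GEMINI[0] - GPT54[0])
print(breakeven_ratio)  # 4.5

print(cost(GPT54, 10_000_000, 1_000_000))   # input-heavy (10:1): GPT-5.4 cheaper
print(cost(GEMINI, 10_000_000, 1_000_000))
```

Above a 4.5:1 input-to-output ratio GPT-5.4 is cheaper; below it, Gemini 3.1 Pro wins, which is exactly the generation-heavy case this section describes.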

Context quality degrades above 800K tokens

While the 1M token window is real, retrieval quality in our tests measurably declined for documents in the 800K–1M range compared to the same content at 400K tokens. This is not unique to GPT-5.4 — all current models degrade at the high end of their windows — but it means the full 1M window is not uniformly reliable for precision-dependent tasks.

Hallucination rate on recent events

GPT-5.4 still hallucinates on recent events and obscure technical details. In our code generation tests, it fabricated non-existent library APIs twice across 30 tasks — both cases involved less common Python packages where training data coverage is likely thin. Claude Opus 4.6 also hallucinated on one task in the same set. Neither model is reliable for code that touches APIs or libraries with sparse training data — always verify generated code against current documentation.

Computer use is not ready for arbitrary web automation

The 75% OSWorld score is benchmark performance on curated tasks. In our tests, computer use failed reliably when web pages used non-standard interaction patterns, JavaScript-heavy state management, or required precise timing. Do not assume OSWorld scores translate directly to your specific automation use case — test with representative workflows before committing.

MCP ecosystem is less mature than Claude's

If your existing agentic workflow relies on MCP tools — GitHub, Linear, Sentry, database connectors — these are generally more mature in the Claude ecosystem. GPT-5.4's Responses API supports function calling with similar capability, but fewer off-the-shelf integrations exist compared to Claude's official MCP server library.

When to Use GPT-5.4 vs the Alternatives

Use GPT-5.4 when:

  • Your workload is input-heavy: Document analysis, RAG pipelines, long system prompts — the $2.50/M input price is the best available among frontier models for these patterns
  • You need top coding benchmark performance: For agentic coding applications, 80% SWE-bench is the current leader among API-accessible models
  • Computer use is a core feature: Native Responses API computer use without third-party orchestration simplifies architecture for GUI automation applications
  • You are building greenfield: No legacy MCP integrations to migrate, so the ecosystem advantage of Claude is less relevant
  • Batch processing at scale: $1.25/$7.50 Batch API pricing makes GPT-5.4 compelling for bulk document processing and large-scale code analysis

Stick with Claude Opus 4.6 when:

  • You have existing Claude Code or MCP integrations — re-tooling cost exceeds the API price savings
  • Your application requires mature multi-agent orchestration with tools like Claude Code's subagent delegation
  • Safety and refusal behavior matter — Anthropic's Constitutional AI approach tends to produce more predictable refusal patterns for sensitive domains

Consider Gemini 3.1 Pro when:

  • Output generation is the dominant cost — $10.50/M output beats GPT-5.4's $15/M for generation-heavy workloads
  • You need context windows above 1M tokens (Gemini supports 2M)
  • You are building on Google Cloud and want native Vertex AI integration

For a broader look at how these models compare across all use cases, see our AI model comparison guide.

Verdict

GPT-5.4 earns its position as the current leader on SWE-bench Verified (80%) and offers the most competitive input pricing among frontier models ($2.50/M). For developers building new agentic coding applications or large-scale document processing pipelines, it is the most compelling API option in March 2026.

The calculus is different if you are already invested in the Claude ecosystem. Claude Opus 4.6 with Claude Code, MCP tools, and multi-agent orchestration is a more complete platform for complex agentic workflows — the raw benchmark numbers are slightly lower, but the tooling around the model is more mature. The cost gap is real (GPT-5.4 input is 83% cheaper), but re-tooling an existing Claude-based workflow is a non-trivial investment.

Gemini 3.1 Pro sits in an interesting middle position — cheaper output costs and a larger context window, but weaker coding benchmarks and less native computer use support. Worth evaluating if your use case is output-heavy or context-window-limited.

The honest assessment: all three models are close enough that use case, ecosystem fit, and existing infrastructure matter more than benchmark differences for most teams. Run a two-week parallel test on your actual workload before committing.

FAQ

What is the GPT-5.4 API price?

GPT-5.4 API pricing is $2.50/M input tokens and $15/M output tokens. Cached inputs are $1.25/M. The Batch API option reduces costs to $1.25/$7.50 for non-real-time workloads. This makes GPT-5.4 the cheapest frontier model by input price — roughly 83% cheaper than Claude Opus 4.6's $15/M input.

What is GPT-5.4's SWE-bench score?

GPT-5.4 scores 80% on SWE-bench Verified, the standard benchmark for autonomous resolution of real GitHub issues. This leads Claude Opus 4.6 (72–77%) and Gemini 3.1 Pro (~68%). The 80% mark means the model resolves 400 of 500 verified issues without human guidance — a meaningful improvement for agentic coding applications.

How does GPT-5.4 computer use compare to Claude Opus 4.6?

GPT-5.4 scores 75% on OSWorld versus Claude Opus 4.6's approximately 70%. Both deliver computer use natively through their respective APIs (Responses API for GPT-5.4, Tools API for Opus 4.6). Both struggle with tasks requiring precise timing or spatial reasoning about dynamic UI state. GPT-5.4's edge is modest — real-world performance depends heavily on the specific automation task.

Does GPT-5.4 support a 1 million token context window?

Yes. GPT-5.4 supports 1M tokens via the Responses API. Retrieval quality is strong up to about 800K tokens and degrades somewhat at the high end. Gemini 3.1 Pro supports a larger 2M token window. Claude Opus 4.6 is limited to 200K tokens but uses agentic context management to compensate.

What does "83% superhuman professional tasks" mean?

It is an OpenAI internal benchmark score across curated professional task sets — medical, legal, coding, scientific, and financial domains — compared against human expert baselines. The 83% figure means GPT-5.4 outperformed human professionals on 83% of the benchmark tasks. This is a controlled evaluation, not a general claim about all professional work.

Should developers switch from Claude Opus 4.6 to GPT-5.4?

Possibly, for greenfield projects. GPT-5.4 leads on coding benchmarks and offers cheaper input pricing. For existing Claude-based workflows with MCP integrations and Claude Code, the re-tooling cost likely exceeds the API savings. The honest recommendation: run a parallel test on your specific workload for two to four weeks before committing to a switch.
