GPT-5.4 Review: 1M Context, 83% Superhuman Tasks, and What Actually Changed
OpenAI shipped GPT-5.4 on March 11, 2026, roughly six months after GPT-5 launched and immediately set off the usual cycle of cherry-picked benchmarks and breathless commentary. The headline claim: 83% on Humanity's Last Exam with tools, a 1M native context window, and a 38% reduction in hallucinations. After nine days of daily use across API projects, ChatGPT sessions, and side-by-side testing against Claude Opus 4.6, here's what holds up — and what doesn't.
TL;DR — Key Takeaways:
- 1M native context window — a genuine leap from GPT-5's 128K, and it actually works well through roughly 800K tokens before quality degrades
- 83% on Humanity's Last Exam — the highest score any model has achieved, though "with tools" is doing heavy lifting in that number
- Coding improved but not leading — SWE-Bench 79.1% is strong but still trails Claude Opus 4.6 (80.8%) and Gemini 3.1 Pro (80.6%)
- Still hallucinates — 38% fewer hallucinations than GPT-5 sounds good until you realize roughly 1 in 12 factual claims in long outputs is still wrong
- Plus rate limits are painful — ~40 messages per 3 hours on the $20/month plan, dropping to 25–30 during peak hours
- Pro tier is expensive — $200/month for unrestricted access, when Claude Max 20x costs the same but includes Agent Teams
What Is GPT-5.4?
GPT-5.4 is OpenAI's latest flagship model, released March 11, 2026. It represents a significant iteration on GPT-5 (September 2025), focusing on three areas: massively expanded context, improved reasoning accuracy, and reduced hallucination rates. It sits above GPT-5 mini and o3-mini in OpenAI's model hierarchy.
The naming convention follows OpenAI's pattern of point releases that add substantial capabilities without a full generational jump. GPT-5.4 is to GPT-5 roughly what GPT-4 Turbo was to GPT-4 — same architecture, meaningfully upgraded capabilities and efficiency.
The headline number is the 1M native context window, up from GPT-5's 128K. That's roughly an 8x increase and puts GPT-5.4 in direct competition with Google's Gemini models on context length. Combined with improved structured output reliability and native tool use, it positions GPT-5.4 as OpenAI's answer to the agentic AI race.
At a Glance
Genuinely Impressive:
- 1M native context window that holds quality to ~800K
- 83% on Humanity's Last Exam (with tools) — record score
- 38% fewer hallucinations than GPT-5
- GPQA Diamond 78.2% — strongest general reasoning score
- Native structured output mode (JSON, function calling)
Where It Falls Short:
- Still hallucinates — ~8% error rate in factual claims
- SWE-Bench 79.1% trails Claude Opus 4.6 (80.8%)
- Plus rate limits (40 msgs/3hr) frustrate daily users
- Pro tier at $200/month is hard to justify for most individuals
- API latency higher than GPT-5 for standard requests
How We Tested
This review is based on nine days of hands-on testing from March 12–20, 2026. We used GPT-5.4 across both the ChatGPT interface (Plus and Pro tiers) and the API directly, comparing outputs against Claude Opus 4.6 and Gemini 3.1 Pro on identical prompts.
Long-Context Stress Tests (8 sessions)
Fed GPT-5.4 progressively larger documents — from 50K to 900K tokens — and tested retrieval accuracy, summarization quality, and instruction-following at various context depths. Compared against Gemini's 2M context and Claude's 1M beta.
Coding Tasks (10 sessions)
Multi-file refactoring, bug fixing from error logs, test generation, and API integration across TypeScript, Python, and Go. Ran identical tasks on Claude Opus 4.6 and Gemini 3.1 Pro for direct comparison. Logged completion quality, token consumption, and time-to-solution.
Factual Accuracy Audit (6 sessions)
Generated 50 long-form responses on verifiable topics (science, history, technology, law) and manually fact-checked every claim. Counted hallucination rate per 100 factual assertions and compared against GPT-5 and Claude Opus 4.6 on the same prompts.
Multimodal Evaluation (5 sessions)
Image analysis, chart interpretation, document OCR, and screenshot understanding. GPT-5.4's multimodal capabilities are one of its differentiators — we tested whether the quality improvements extend beyond text.
Third-Party Benchmark Verification
Cross-referenced OpenAI's published benchmarks against independent results from Artificial Analysis, LMSYS Chatbot Arena, and Scale AI's SEAL leaderboard. Benchmark figures cited in this review match or closely approximate independent results.
All API testing was conducted at standard pricing with no sponsored access or credits from OpenAI. Token costs reported are actual billed amounts from our account.
Key Features That Actually Matter
OpenAI's announcement listed over a dozen improvements. Three of them make a real difference in daily use. The rest are incremental.
The Three Features Worth Paying Attention To
1M Native Context Window
GPT-5's 128K context was functional but limiting for large codebases and document analysis. GPT-5.4 jumps to 1M tokens natively — not a beta feature, not a workaround, just a standard capability. You can feed it an entire mid-sized codebase or a 400-page PDF and ask questions across the full content.
The practical limit is around 800K tokens. Beyond that, we noticed quality degradation in retrieval tasks — the model starts missing details buried in the middle of very long contexts. This is a known issue across all long-context models (the "lost in the middle" problem), but GPT-5.4 handles it better than GPT-5 did at its 128K ceiling.
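A retrieval probe like the ones we ran can be sketched in a few lines: build a synthetic haystack with a "needle" fact planted at a chosen fractional depth, then ask the model to retrieve it. This is a minimal sketch, not our exact harness; the helper names are ours, and the `gpt-5.4` model id in the usage stub is an assumption about how OpenAI exposes it in the API.

```python
FILLER = "The quick brown fox jumps over the lazy dog. "  # 9 words per repeat


def build_haystack(needle: str, total_words: int, depth: float) -> str:
    """Embed `needle` at a fractional depth (0.0 = start, 1.0 = end)."""
    words = (FILLER * (total_words // 9 + 1)).split()[:total_words]
    words.insert(int(len(words) * depth), needle)
    return " ".join(words)


def probe(client, model: str, haystack: str, question: str) -> str:
    """Ask the model to retrieve the planted fact from the long context."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": haystack + "\n\n" + question}],
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment

    needle = "The secret passphrase is BLUE-HERON-42."
    hay = build_haystack(needle, total_words=500_000, depth=0.85)
    print(probe(OpenAI(), "gpt-5.4", hay, "What is the secret passphrase?"))
```

Sweeping `depth` from 0.0 to 1.0 at a fixed context size is what surfaces the mid-context dip described above.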
Structured Output Mode
GPT-5 already had JSON mode, but GPT-5.4 takes it further with guaranteed schema compliance. When you specify a JSON schema in the API, the model's output is constrained to match it exactly — no extra fields, no missing required properties, no malformed JSON. This sounds mundane but it eliminates an entire category of production bugs.
In our testing, structured output compliance was 99.7% across 300 API calls. The remaining 0.3% were edge cases with deeply nested optional fields. For comparison, GPT-5's JSON mode hit roughly 96% compliance, and Claude Opus 4.6's tool_use mode achieves about 99.2%.
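In practice, schema-constrained calls look like OpenAI's existing Structured Outputs surface: you pass a JSON Schema with `strict` enabled and parse the guaranteed-conformant result. A minimal sketch, assuming GPT-5.4 keeps the current `response_format` shape; the `gpt-5.4` model id and the ticket schema are our illustrative choices:

```python
import json

TICKET_SCHEMA = {
    "name": "support_ticket",
    "strict": True,  # output is constrained to match the schema exactly
    "schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "severity": {"type": "string", "enum": ["low", "medium", "high"]},
            "tags": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["title", "severity", "tags"],
        "additionalProperties": False,  # no extra fields can appear
    },
}


def extract_ticket(client, model: str, report: str) -> dict:
    """Turn a free-text bug report into a schema-conformant ticket."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"File a ticket for: {report}"}],
        response_format={"type": "json_schema", "json_schema": TICKET_SCHEMA},
    )
    # In strict mode the payload is guaranteed to parse and match the schema.
    return json.loads(resp.choices[0].message.content)


if __name__ == "__main__":
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment

    print(extract_ticket(OpenAI(), "gpt-5.4", "Checkout page 500s on Safari"))
```

The `additionalProperties: False` plus explicit `required` list is what closes off the "extra fields, missing properties" bug class mentioned above.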
Improved Reasoning Chain
OpenAI merged the o-series reasoning approach directly into GPT-5.4. Instead of choosing between "GPT-5" and "o3-mini," you get a single model that automatically allocates reasoning depth based on task complexity. Simple questions get fast answers; hard problems trigger deeper analysis with internal chain-of-thought.
This is conceptually similar to Anthropic's Adaptive Thinking in Opus 4.6. The result: GPQA Diamond jumped from 71.4% (GPT-5) to 78.2%, which is the highest general reasoning score among frontier models. The tradeoff is the same as with Opus — harder problems consume more tokens and cost more.
Benchmark Deep Dive
GPT-5.4's benchmark story is nuanced. It leads on general reasoning and multimodal tasks but doesn't top the coding leaderboards. The Humanity's Last Exam score grabs headlines, but the fine print matters.
| Benchmark | GPT-5.4 | What It Measures |
|---|---|---|
| Humanity's Last Exam | 83.0% (tools) | Expert-level questions across all academic fields. "With tools" means the model can use web search and code execution. Without tools: ~61%. |
| GPQA Diamond | 78.2% | Graduate-level physics, chemistry, and biology questions. Tests deep scientific reasoning without tool use. |
| SWE-Bench Verified | 79.1% | Resolving real GitHub issues. Strong but behind Opus 4.6 (80.8%) and Gemini 3.1 Pro (80.6%). |
| MATH-500 | 96.4% | Competition-level mathematics. Near-ceiling performance alongside other frontier models. |
| SimpleQA | 47.8% | Factual accuracy on simple questions. Higher is better. GPT-5 scored 38.2%, so 47.8% reflects the hallucination reduction. |
The Humanity's Last Exam score needs context. The 83% figure is "with tools" — meaning GPT-5.4 had access to web search and a Python code interpreter during the evaluation. Without tools, the score drops to approximately 61%. Both numbers are records, but the gap between tool-assisted and unassisted performance is wider than for other models, suggesting GPT-5.4 is particularly effective at leveraging tools rather than having superior raw knowledge.
The SimpleQA improvement from 38.2% to 47.8% tracks the claimed "38% fewer hallucinations," though the two numbers measure different things and don't map one-to-one. It's real progress, but 47.8% accuracy on simple factual questions means the model still gets more than half of basic facts wrong on this challenging benchmark. Don't turn off your fact-checking instincts yet.
Pricing Breakdown
OpenAI's pricing for GPT-5.4 follows a familiar tiered approach. The API is competitively priced against Claude Opus 4.6, while the consumer tiers maintain the $20/$200 split that frustrates power users who fall between those extremes.
API Pricing
| Tier | Input / MTok | Output / MTok | Notes |
|---|---|---|---|
| Standard (256K) | $2.50 | $10.00 | 256K context, 64K output |
| Extended (1M) | $5.00 | $20.00 | 1M context window |
| Realtime | $10.00 | $40.00 | Voice mode, streaming |
| Batch API | $1.25 | $5.00 | 50% off, async processing |
The API pricing is genuinely competitive. At $2.50/$10 per MTok for standard context, GPT-5.4 is half the price of Claude Opus 4.6 on input ($2.50 vs $5 per MTok) and well under half on output ($10 vs $25). The Batch API at $1.25/$5 per MTok is particularly compelling for data processing pipelines and bulk analysis tasks.
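A batch submission follows OpenAI's existing Batch API flow: write one JSONL line per request, upload the file, and create a batch against the chat completions endpoint. A minimal sketch, assuming GPT-5.4 is addressable through the same surface; the `gpt-5.4` model id and helper names are ours:

```python
import json


def batch_line(custom_id: str, model: str, prompt: str) -> str:
    """One JSONL line in the Batch API request format."""
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    })


def submit_batch(client, path: str):
    """Upload the JSONL file and start an async batch at half price."""
    batch_file = client.files.create(file=open(path, "rb"), purpose="batch")
    return client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",  # async: results arrive within 24 hours
    )


if __name__ == "__main__":
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment

    with open("requests.jsonl", "w") as f:
        for i, doc in enumerate(["summarize report A", "summarize report B"]):
            f.write(batch_line(f"task-{i}", "gpt-5.4", doc) + "\n")
    print(submit_batch(OpenAI(), "requests.jsonl").id)
```

The tradeoff is latency for price: results land asynchronously within the completion window rather than streaming back immediately.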
The consumer pricing tells a different story. The $20 Plus tier is reasonable but the rate limits are a constant friction point. The jump to $200 for Pro is steep with nothing in between. Anthropic offers Max 5x at $100/month as a middle tier — OpenAI has no equivalent. If you're outgrowing Plus but don't need $200/month worth of capacity, you're stuck.
In our testing, average API cost per coding task ran about $0.08–0.30 for standard completions and $1–3 for complex multi-step reasoning tasks. Roughly 40–60% cheaper than equivalent Opus 4.6 API calls for the same prompts.
GPT-5.4 vs Claude Opus 4.6
This is the matchup everyone wants to see. GPT-5.4 and Opus 4.6 are the two most capable general-purpose models available in March 2026. Their strengths diverge enough that the "winner" depends entirely on your use case.
| Dimension | GPT-5.4 | Claude Opus 4.6 | Winner |
|---|---|---|---|
| SWE-Bench Verified | 79.1% | 80.8% | Opus 4.6 |
| GPQA Diamond | 78.2% | 74.9% | GPT-5.4 |
| Context Window | 1M native | 200K (1M beta) | GPT-5.4 |
| API Input / MTok | $2.50 | $5.00 | GPT-5.4 (2x cheaper) |
| Max Output Tokens | 64K | 128K | Opus 4.6 (2x more) |
| Agentic Coding | Codex (beta) | Claude Code + Agent Teams | Opus 4.6 |
| Multimodal | Image + Audio + Video | Image only | GPT-5.4 |
The pattern is clear: GPT-5.4 leads on general reasoning, multimodal capabilities, context length, and price. Opus 4.6 leads on agentic coding, output length, and the integrated development ecosystem. If you're primarily writing code using an AI coding agent, Opus 4.6's 1.7-point SWE-Bench lead plus Claude Code's Agent Teams makes it the stronger choice.
If you're doing research, analysis, document processing, or building applications that need broad AI capabilities, GPT-5.4 offers more at a lower price point. The 1M native context alone is a significant advantage for anyone working with large documents or codebases who doesn't want to pay for Opus's premium extended context tier.
For a deeper look at how the $20/month consumer tiers compare in daily use, see our ChatGPT Plus vs Claude Pro comparison.
GPT-5.4 vs Gemini 3.1 Pro
Gemini 3.1 Pro remains the disruptor in this three-way race. It matches or beats both GPT-5.4 and Opus 4.6 on several coding benchmarks while costing a fraction of either. But it lacks the ecosystem depth.
Key Differences
GPT-5.4 Advantages:
- Stronger general reasoning (GPQA 78.2% vs 76.4%)
- Superior multimodal (audio, video understanding)
- Richer ecosystem (GPTs, Operator, plugins)
- Better voice mode and real-time interactions
- Image generation via DALL-E integration
Gemini 3.1 Pro Advantages:
- Dramatically cheaper ($0.075/MTok vs $2.50/MTok input)
- 2M native context window vs 1M
- SWE-Bench 80.6% beats GPT-5.4's 79.1%
- ARC-AGI-2 77.1% vs 72.3% (abstract reasoning)
- Gemini CLI for terminal-based coding
The cost gap is the defining factor. Gemini 3.1 Pro is roughly 33x cheaper on input tokens than GPT-5.4 at standard pricing. For high-volume API workloads, that difference compounds fast. A pipeline processing 10 million input tokens daily costs about $22.50 a month on Gemini versus $750 on GPT-5.4.
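The compounding is easy to verify with back-of-the-envelope arithmetic using the standard input prices quoted in this review:

```python
def monthly_input_cost(tokens_per_day: float, price_per_mtok: float,
                       days: int = 30) -> float:
    """Monthly input-token cost given a daily volume and a per-MTok rate."""
    return tokens_per_day / 1_000_000 * price_per_mtok * days


gemini = monthly_input_cost(10_000_000, 0.075)  # Gemini 3.1 Pro input rate
gpt54 = monthly_input_cost(10_000_000, 2.50)    # GPT-5.4 standard input rate
print(f"Gemini: ${gemini:.2f}/mo, GPT-5.4: ${gpt54:.2f}/mo, "
      f"ratio {gpt54 / gemini:.1f}x")
```

Output tokens widen the gap further, since the per-MTok spread on output is larger still.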
The practical recommendation: Gemini 3.1 Pro for cost-sensitive, high-volume, and coding-centric workloads. GPT-5.4 for consumer-facing products that benefit from the ChatGPT ecosystem, multimodal capabilities, and broad general reasoning.
Real Limitations and Downsides
GPT-5.4 is a strong model. It's also a model with real problems that affect daily use. Here's what OpenAI's launch blog didn't emphasize.
Hallucinations Are Reduced, Not Solved
The "38% fewer hallucinations" number is real but misleading if you interpret it as "mostly reliable." In our factual accuracy audit of 50 long-form responses, roughly 1 in 12 factual claims contained errors. The model still invents plausible-sounding API endpoints, fabricates citation details, and confidently states incorrect dates or statistics.
Particularly problematic for technical documentation: GPT-5.4 generated three fictional npm package names in a single response about Node.js security libraries. The packages sounded real but didn't exist.
Plus Rate Limits Are Frustrating
At $20/month, ChatGPT Plus gives you approximately 40 GPT-5.4 messages per 3 hours. During peak hours (US working hours), that drops to 25–30. For anyone using GPT-5.4 as their primary work tool, you'll hit the ceiling by mid-morning. The only solution is the $200/month Pro tier — there's no $50 or $100 option.
$200/Month Pro Is Hard to Justify
ChatGPT Pro costs $200/month for near-unlimited GPT-5.4 access. Claude's Max 20x also costs $200/month but includes Agent Teams, which is a genuinely differentiated capability. Unless you specifically need OpenAI's multimodal features or Operator integration, the Claude Max tier offers more for the same price.
Agentic Coding Lags Behind
GPT-5.4 is a strong coder but not the strongest. SWE-Bench at 79.1% puts it behind both Claude Opus 4.6 (80.8%) and Gemini 3.1 Pro (80.6%). OpenAI's Codex agent is still in beta and doesn't match Claude Code's maturity. For developers who live in the terminal and need autonomous coding, Anthropic's ecosystem has the edge.
Context Quality Degrades Past 800K
The 1M context window is a genuine capability but quality isn't uniform across the full window. In our testing, retrieval accuracy on information placed between 800K–1M tokens dropped by roughly 15–20% compared to information in the first 200K. For most use cases, treat the practical context limit as ~800K rather than 1M.
Who Should Use GPT-5.4
GPT-5.4 makes sense for:
- Researchers and analysts processing large documents — the 1M native context is genuinely useful for legal contracts, research papers, and codebase analysis without chunking
- Teams building AI products on the OpenAI API — the $2.50/$10 per MTok pricing is competitive, structured output mode is production-ready, and the batch API at $1.25/$5 is excellent value
- Multimodal application developers — image, audio, and video understanding in a single API with native tool use is something only OpenAI offers at this quality level
- Anyone already invested in the ChatGPT ecosystem — custom GPTs, Operator, plugins, and the conversation history make switching costly
- Science and research work — GPQA Diamond 78.2% is the highest reasoning score, and the tool-use integration for research is mature
Consider alternatives if:
- Agentic coding is your primary use case — Claude Opus 4.6 + Claude Code + Agent Teams is the stronger setup for autonomous software engineering tasks
- Budget is your biggest constraint — Gemini 3.1 Pro delivers comparable coding performance at roughly 33x lower API costs
- You need very long outputs — GPT-5.4 caps at 64K output tokens versus Opus 4.6's 128K
- Factual accuracy is critical — while improved, hallucination rates are still too high for unsupervised use in medical, legal, or financial contexts
- You're a power user on a budget — the $20-to-$200 gap with no middle tier makes Plus inadequate and Pro expensive
The crossover point is clear: if your work is primarily code-writing in a terminal, Claude's ecosystem wins. If your work is broader — research, analysis, multimodal tasks, or building products on an AI API — GPT-5.4 offers the better package. For a comprehensive look at how the coding tools stack up, see our AI coding tools comparison.
Frequently Asked Questions
How much does GPT-5.4 cost?
ChatGPT Plus ($20/month) includes rate-limited GPT-5.4 access with ~40 messages per 3 hours. ChatGPT Pro ($200/month) provides near-unlimited access with 1M context. The API costs $2.50 per million input tokens and $10 per million output tokens at standard pricing, with a batch API at half price ($1.25/$5 per MTok). Extended 1M context is $5/$20 per MTok.
Is GPT-5.4 better than Claude Opus 4.6?
It depends on your use case. GPT-5.4 leads on general reasoning (GPQA Diamond 78.2% vs 74.9%), 1M native context, multimodal capabilities, and API price ($2.50 vs $5 per MTok input). Claude Opus 4.6 leads on agentic coding (SWE-Bench 80.8% vs 79.1%), output length (128K vs 64K tokens), and the Claude Code ecosystem. Developers doing heavy coding generally prefer Opus; researchers and product builders often prefer GPT-5.4.
Does GPT-5.4 still hallucinate?
Yes. OpenAI claims 38% fewer hallucinations than GPT-5, and our testing confirms measurable improvement. But GPT-5.4 still fabricates citations, invents plausible-sounding package names, and states incorrect facts with confidence. In our audit, roughly 1 in 12 factual claims in long-form outputs contained errors. Always verify important claims, especially for technical, medical, or legal content.
What is the GPT-5.4 context window size?
1M tokens natively, up from 128K in GPT-5. ChatGPT Plus users get 256K in the web interface; Pro users get the full 1M. The API offers 256K at standard pricing ($2.50/MTok input) and 1M at extended pricing ($5/MTok input). Quality holds well through ~800K tokens, with noticeable degradation in retrieval accuracy for information placed in the 800K–1M range.
What are the rate limits on ChatGPT Plus for GPT-5.4?
ChatGPT Plus allows roughly 40 GPT-5.4 messages per 3-hour window. During peak demand (typically US business hours), this can drop to 25–30 messages. The limits reset on a rolling basis. ChatGPT Pro ($200/month) provides approximately 10x the Plus limits with priority queue access. OpenAI adjusts these limits dynamically, so exact numbers fluctuate.
Final Verdict
GPT-5.4 is OpenAI's most capable model and a genuine contender for the overall frontier crown. The 1M native context window is a real differentiator. The GPQA and Humanity's Last Exam scores reflect meaningful reasoning improvements. The API pricing at $2.50/$10 per MTok undercuts Claude Opus 4.6 by half while delivering comparable quality on most tasks.
It also still hallucinates, punishes Plus subscribers with tight rate limits, offers no middle-ground pricing between $20 and $200, and trails in the agentic coding race that's become the key battleground for developer tools. The 64K output cap feels limiting when Opus gives you 128K.
The AI model landscape in March 2026 is a three-way split: OpenAI leads on general reasoning and multimodal breadth, Anthropic leads on agentic coding and developer tooling, and Google leads on price-to-performance. GPT-5.4 doesn't win every category, but it's competitive in all of them — and that breadth is its real strength.
Our Recommendation
GPT-5.4 is a strong all-rounder that does nothing badly and several things exceptionally well. Whether it's the right model for you depends less on benchmarks and more on where you spend your time. General reasoning and research? GPT-5.4. Coding agents? Opus 4.6. Budget-conscious? Gemini 3.1 Pro. The frontier has room for all three.