AI Tool Review · ~15 min read

GPT-5.4 Review: 1M Context, 83% Superhuman Tasks, and What Actually Changed

OpenAI shipped GPT-5.4 on March 11, 2026, roughly six months after GPT-5 launched and immediately set off the usual cycle of cherry-picked benchmarks and breathless commentary. The headline claim: 83% on Humanity's Last Exam with tools, a 1M native context window, and a 38% reduction in hallucinations. After nine days of daily use across API projects, ChatGPT sessions, and side-by-side testing against Claude Opus 4.6, here's what holds up — and what doesn't.

TL;DR — Key Takeaways:

  • 1M native context window — a genuine leap from GPT-5's 128K, and it actually works well through roughly 800K tokens before quality degrades
  • 83% on Humanity's Last Exam — the highest score any model has achieved, though "with tools" is doing heavy lifting in that number
  • Coding improved but not leading — SWE-Bench 79.1% is strong but still trails Claude Opus 4.6 (80.8%) and Gemini 3.1 Pro (80.6%)
  • Still hallucinates — 38% fewer hallucinations than GPT-5 sounds good until you realize roughly 1 in 12 factual claims in long outputs is still wrong
  • Plus rate limits are painful — ~40 messages per 3 hours on the $20/month plan, dropping to 25–30 during peak hours
  • Pro tier is expensive — $200/month for unrestricted access, when Claude Max 20x costs the same but includes Agent Teams

What Is GPT-5.4?

GPT-5.4 is OpenAI's latest flagship model, released March 11, 2026. It represents a significant iteration on GPT-5 (September 2025), focusing on three areas: massively expanded context, improved reasoning accuracy, and reduced hallucination rates. It sits above GPT-5 mini and o3-mini in OpenAI's model hierarchy.

The naming convention follows OpenAI's pattern of point releases that add substantial capabilities without a full generational jump. GPT-5.4 is to GPT-5 what GPT-4 Turbo was to GPT-4 — same architecture, meaningfully upgraded capabilities and efficiency.

The headline number is the 1M native context window, up from GPT-5's 128K. That's nearly an 8x increase and puts GPT-5.4 in direct competition with Google's Gemini models on context length. Combined with improved structured output reliability and native tool use, it positions GPT-5.4 as OpenAI's answer to the agentic AI race.

At a Glance

Genuinely Impressive:

  • 1M native context window that holds quality to ~800K
  • 83% on Humanity's Last Exam (with tools) — record score
  • 38% fewer hallucinations than GPT-5
  • GPQA Diamond 78.2% — strongest general reasoning score
  • Native structured output mode (JSON, function calling)

Where It Falls Short:

  • Still hallucinates — ~8% error rate in factual claims
  • SWE-Bench 79.1% trails Claude Opus 4.6 (80.8%)
  • Plus rate limits (40 msgs/3hr) frustrate daily users
  • Pro tier at $200/month is hard to justify for most individuals
  • Time-to-first-token higher than GPT-5 for standard API requests

How We Tested

This review is based on nine days of hands-on testing from March 12–20, 2026. We used GPT-5.4 across both the ChatGPT interface (Plus and Pro tiers) and the API directly, comparing outputs against Claude Opus 4.6 and Gemini 3.1 Pro on identical prompts.

Long-Context Stress Tests (8 sessions)

Fed GPT-5.4 progressively larger documents — from 50K to 900K tokens — and tested retrieval accuracy, summarization quality, and instruction-following at various context depths. Compared against Gemini's 2M context and Claude's 1M beta.

Coding Tasks (10 sessions)

Multi-file refactoring, bug fixing from error logs, test generation, and API integration across TypeScript, Python, and Go. Ran identical tasks on Claude Opus 4.6 and Gemini 3.1 Pro for direct comparison. Logged completion quality, token consumption, and time-to-solution.

Factual Accuracy Audit (6 sessions)

Generated 50 long-form responses on verifiable topics (science, history, technology, law) and manually fact-checked every claim. Counted hallucination rate per 100 factual assertions and compared against GPT-5 and Claude Opus 4.6 on the same prompts.

Multimodal Evaluation (5 sessions)

Image analysis, chart interpretation, document OCR, and screenshot understanding. GPT-5.4's multimodal capabilities are one of its differentiators — we tested whether the quality improvements extend beyond text.

Third-Party Benchmark Verification

Cross-referenced OpenAI's published benchmarks against independent results from Artificial Analysis, LMSYS Chatbot Arena, and Scale AI's SEAL leaderboard. Benchmark figures cited in this review match or closely approximate independent results.

All API testing was conducted at standard pricing with no sponsored access or credits from OpenAI. Token costs reported are actual billed amounts from our account.

Key Features That Actually Matter

OpenAI's announcement listed over a dozen improvements. Three of them make a real difference in daily use. The rest are incremental.

The Three Features Worth Paying Attention To

1M Native Context Window

GPT-5's 128K context was functional but limiting for large codebases and document analysis. GPT-5.4 jumps to 1M tokens natively — not a beta feature, not a workaround, just a standard capability. You can feed it an entire mid-sized codebase or a 400-page PDF and ask questions across the full content.

The practical limit is around 800K tokens. Beyond that, we noticed quality degradation in retrieval tasks — the model starts missing details buried in the middle of very long contexts. This is a known issue across all long-context models (the "lost in the middle" problem), but GPT-5.4 handles it better than GPT-5 did at its 128K ceiling.
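The depth-dependent degradation described above is easy to probe yourself. Below is a minimal sketch of the kind of harness we used: it plants a "needle" fact at a controlled fractional depth inside filler text, so you can ask the model to retrieve it from different positions. The filler sentence, needle, and the commented-out `query_model` call are placeholders for illustration; everything that runs here is local.

```python
# Sketch of a long-context retrieval probe: place a "needle" fact at a
# controlled depth inside filler text, then ask the model to retrieve it.
# query_model() is a placeholder for whatever API client you use; the
# prompt construction below runs entirely locally.

FILLER = "The quick brown fox jumps over the lazy dog. "
NEEDLE = "The vault access code is 4471. "

def build_probe(total_chars: int, depth: float) -> str:
    """Return a document of exactly total_chars with NEEDLE at fractional depth."""
    assert 0.0 <= depth <= 1.0
    n_filler = max(total_chars - len(NEEDLE), 0)
    cut = int(n_filler * depth)
    body = (FILLER * (n_filler // len(FILLER) + 1))[:n_filler]
    return body[:cut] + NEEDLE + body[cut:]

def needle_position(doc: str) -> float:
    """Fractional position of the needle within the document."""
    return doc.index(NEEDLE) / len(doc)

if __name__ == "__main__":
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        doc = build_probe(400_000, depth)  # ~100K tokens at ~4 chars/token
        print(f"depth {depth:.2f} -> needle at {needle_position(doc):.2f}")
        # answer = query_model(doc + "\nWhat is the vault access code?")
```

Sweeping `depth` from 0.0 to 1.0 at a fixed document size is what surfaces the mid-context dip; repeating the sweep at growing sizes is what revealed the ~800K practical ceiling.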

Structured Output Mode

GPT-5 already had JSON mode, but GPT-5.4 takes it further with guaranteed schema compliance. When you specify a JSON schema in the API, the model's output is constrained to match it exactly — no extra fields, no missing required properties, no malformed JSON. This sounds mundane but it eliminates an entire category of production bugs.

In our testing, structured output compliance was 99.7% across 300 API calls. The remaining 0.3% were edge cases with deeply nested optional fields. For comparison, GPT-5's JSON mode hit roughly 96% compliance, and Claude Opus 4.6's tool_use mode achieves about 99.2%.
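The compliance figures above come from validating each response against the schema we requested. Here is a simplified local sketch of that check — not the OpenAI SDK, and the vulnerability-report schema is invented for the example — showing the three failure modes that schema-constrained output eliminates: malformed JSON, missing required fields, and extra fields.

```python
import json

# Minimal compliance check of the kind used in our structured-output
# tests. Local sketch only: verifies that a response is valid JSON,
# carries every required field, and adds no extra top-level fields.
# The schema itself is a made-up example.

SCHEMA = {
    "required": {"title", "severity", "summary"},
    "optional": {"cve_id"},
}

def is_compliant(raw: str) -> bool:
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False                      # malformed JSON
    if not isinstance(obj, dict):
        return False
    keys = set(obj)
    allowed = SCHEMA["required"] | SCHEMA["optional"]
    # All required fields present, no fields outside the schema.
    return SCHEMA["required"] <= keys and keys <= allowed

good = '{"title": "Buffer overflow", "severity": "high", "summary": "..."}'
bad = '{"title": "Buffer overflow", "extra_field": true}'
print(is_compliant(good), is_compliant(bad))  # True False
```

With guaranteed schema compliance on the model side, a guard like this becomes a belt-and-suspenders check rather than a hot path that handles routine failures.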

Improved Reasoning Chain

OpenAI merged the o-series reasoning approach directly into GPT-5.4. Instead of choosing between "GPT-5" and "o3-mini," you get a single model that automatically allocates reasoning depth based on task complexity. Simple questions get fast answers; hard problems trigger deeper analysis with internal chain-of-thought.

This is conceptually similar to Anthropic's Adaptive Thinking in Opus 4.6. The result: GPQA Diamond jumped from 71.4% (GPT-5) to 78.2%, which is the highest general reasoning score among frontier models. The tradeoff is the same as with Opus — harder problems consume more tokens and cost more.
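To make the routing concept concrete, here is a toy client-side illustration. GPT-5.4 does this allocation internally and OpenAI has not published its heuristic; the keyword markers and thresholds below are invented purely to show the shape of complexity-based effort routing.

```python
# Toy illustration of complexity-based effort routing. The real routing
# happens inside the model; this client-side heuristic only makes the
# concept concrete. Markers and thresholds are invented for the example.

HARD_MARKERS = ("prove", "derive", "optimize", "refactor", "multi-step")

def pick_effort(prompt: str) -> str:
    """Map a prompt to a coarse effort level: 'low', 'medium', or 'high'."""
    p = prompt.lower()
    hits = sum(marker in p for marker in HARD_MARKERS)
    if hits >= 2 or len(p) > 2000:
        return "high"
    if hits == 1 or len(p) > 500:
        return "medium"
    return "low"

print(pick_effort("What year did the Apollo 11 mission land?"))    # low
print(pick_effort("Prove the bound, then derive the closed form"))  # high
```

The cost implication is the same as the tradeoff noted above: prompts that route to deeper effort consume more tokens, so identical-looking requests can bill very differently.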

Other Notable Upgrades

  • 38% fewer hallucinations — measurable improvement but far from solved
  • Improved image understanding — chart reading, screenshot OCR, document analysis
  • Native tool use — web search, code execution, and file analysis built-in
  • Faster inference — ~20% higher token throughput than GPT-5, though time-to-first-token on standard requests is often higher
  • Operator integration — computer use agent for browser automation tasks
  • Batch API — 50% discount for asynchronous workloads
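The Batch API discount applies to asynchronous jobs submitted as a JSONL file of request lines. The sketch below builds such a file locally, assuming OpenAI's existing batch request-line format (`custom_id`, `method`, `url`, `body`) and the "gpt-5.4" model name from this review; the actual upload and job-creation calls are network steps and are omitted.

```python
import json

# Build a JSONL request file for OpenAI's Batch API (50% off, async).
# Each line is one independent request. The line shape follows OpenAI's
# existing batch format; "gpt-5.4" is the model name assumed here.

def batch_lines(prompts):
    for i, prompt in enumerate(prompts):
        yield json.dumps({
            "custom_id": f"req-{i}",          # your key for matching results
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-5.4",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 512,
            },
        })

prompts = ["Summarize doc A", "Summarize doc B"]
jsonl = "\n".join(batch_lines(prompts))
print(jsonl.count("\n") + 1)  # 2 request lines
# Next steps (network calls, omitted): upload this file with purpose
# "batch", then create a batch job referencing the returned file ID.
```

For bulk summarization or classification pipelines with no latency requirement, this halves the already-low per-token cost.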

Benchmark Deep Dive

GPT-5.4's benchmark story is nuanced. It leads on general reasoning and multimodal tasks but doesn't top the coding leaderboards. The Humanity's Last Exam score grabs headlines, but the fine print matters.

Benchmark Scores: GPT-5.4 vs Opus 4.6 vs Gemini 3.1 Pro (higher is better; sources: OpenAI, Anthropic, Google DeepMind, Artificial Analysis)

| Benchmark | GPT-5.4 | Opus 4.6 | Gemini 3.1 Pro |
| --- | --- | --- | --- |
| SWE-Bench | 79.1 | 80.8 | 80.6 |
| GPQA Diamond | 78.2 | 74.9 | 76.4 |
| HLE (tools) | 83.0 | 53.1 | ~72 |
| BrowseComp | 81.2 | 84.0 | ~70 |
| ARC-AGI-2 | 72.3 | — | 77.1 |

Scores marked ~ are estimates from independent evaluations. Some benchmarks are not reported by all providers.
GPT-5.4 leads on general reasoning (GPQA, HLE) while trailing on agentic coding (SWE-Bench). Gemini 3.1 Pro leads on ARC-AGI-2.
| Benchmark | GPT-5.4 | What It Measures |
| --- | --- | --- |
| Humanity's Last Exam | 83.0% (tools) | Expert-level questions across all academic fields. "With tools" means the model can use web search and code execution. Without tools: ~61%. |
| GPQA Diamond | 78.2% | Graduate-level physics, chemistry, and biology questions. Tests deep scientific reasoning without tool use. |
| SWE-Bench Verified | 79.1% | Resolving real GitHub issues. Strong but behind Opus 4.6 (80.8%) and Gemini 3.1 Pro (80.6%). |
| MATH-500 | 96.4% | Competition-level mathematics. Near-ceiling performance alongside other frontier models. |
| SimpleQA | 47.8% | Factual accuracy on simple questions. Higher is better. GPT-5 scored 38.2%, so 47.8% reflects the hallucination reduction. |

The Humanity's Last Exam score needs context. The 83% figure is "with tools" — meaning GPT-5.4 had access to web search and a Python code interpreter during the evaluation. Without tools, the score drops to approximately 61%. Both numbers are records, but the gap between tool-assisted and unassisted performance is wider than for other models, suggesting GPT-5.4 is particularly effective at leveraging tools rather than having superior raw knowledge.

The SimpleQA improvement from 38.2% to 47.8% maps roughly to the claimed "38% fewer hallucinations." It's real progress, but 47.8% accuracy on simple factual questions means the model still gets more than half of basic facts wrong on this challenging benchmark. Don't turn off your fact-checking instincts yet.

Pricing Breakdown

OpenAI's pricing for GPT-5.4 follows a familiar tiered approach. The API is competitively priced against Claude Opus 4.6, while the consumer tiers maintain the $20/$200 split that frustrates power users who fall between those extremes.

ChatGPT Subscription Tiers (GPT-5.4 Access) — consumer pricing as of March 2026

| Tier | Price | Access | Best For |
| --- | --- | --- | --- |
| Free | $0 | GPT-5.4 mini only; limited messages/day, no file uploads, no API access | Exploration |
| Plus | $20/month | GPT-5.4 (rate-limited), ~40 msgs/3hr, 256K context | Most popular |
| Pro | $200/month | GPT-5.4 (near-unlimited), 1M context, priority access | Power users |
| Team | $25–30/seat/month | Admin controls, shared workspace, higher limits | Small teams |
| Enterprise | Custom | SSO/SCIM, no rate limits, data isolation | Organizations |
ChatGPT subscription tiers. The gap between Plus ($20) and Pro ($200) leaves no option for moderate power users.

API Pricing

| Tier | Input / MTok | Output / MTok | Notes |
| --- | --- | --- | --- |
| Standard (256K) | $2.50 | $10.00 | 256K context, 64K output |
| Extended (1M) | $5.00 | $20.00 | 1M context window |
| Realtime | $10.00 | $40.00 | Voice mode, streaming |
| Batch API | $1.25 | $5.00 | 50% off, async processing |

The API pricing is genuinely competitive. At $2.50/$10 per MTok for standard context, GPT-5.4 is half the price of Claude Opus 4.6 ($5/$25 per MTok) on both input and output tokens. The Batch API at $1.25/$5 per MTok is particularly compelling for data processing pipelines and bulk analysis tasks.

The consumer pricing tells a different story. The $20 Plus tier is reasonable but the rate limits are a constant friction point. The jump to $200 for Pro is steep with nothing in between. Anthropic offers Max 5x at $100/month as a middle tier — OpenAI has no equivalent. If you're outgrowing Plus but don't need $200/month worth of capacity, you're stuck.

In our testing, average API cost per coding task ran about $0.08–0.30 for standard completions and $1–3 for complex multi-step reasoning tasks. Roughly 40–60% cheaper than equivalent Opus 4.6 API calls for the same prompts.
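The per-task figures above can be sanity-checked with simple arithmetic against the published rates. The helper below is purely illustrative; the example token counts are assumptions picked to represent a typical coding request.

```python
# Per-request cost arithmetic using the standard-tier rates quoted above
# ($2.50 input / $10.00 output per million tokens). The 20K-in/3K-out
# example is an assumed "typical coding task", not a measured workload.

def request_cost(input_tokens: int, output_tokens: int,
                 in_rate: float = 2.50, out_rate: float = 10.00) -> float:
    """Cost in dollars; rates are per million tokens."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# GPT-5.4 standard tier: ~20K tokens of context in, ~3K tokens out.
print(f"${request_cost(20_000, 3_000):.3f}")                # $0.080
# Same shape at Opus 4.6 rates ($5 / $25 per MTok):
print(f"${request_cost(20_000, 3_000, 5.00, 25.00):.3f}")   # $0.175
```

That single request lands at the bottom of the $0.08–0.30 range we observed, and the Opus figure is a little over twice the cost, consistent with the 40–60% savings noted above.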

GPT-5.4 vs Claude Opus 4.6

This is the matchup everyone wants to see. GPT-5.4 and Opus 4.6 are the two most capable general-purpose models available in March 2026. Their strengths diverge enough that the "winner" depends entirely on your use case.

| Dimension | GPT-5.4 | Claude Opus 4.6 | Winner |
| --- | --- | --- | --- |
| SWE-Bench Verified | 79.1% | 80.8% | Opus 4.6 |
| GPQA Diamond | 78.2% | 74.9% | GPT-5.4 |
| Context Window | 1M native | 200K (1M beta) | GPT-5.4 |
| API Input / MTok | $2.50 | $5.00 | GPT-5.4 (2x cheaper) |
| Max Output Tokens | 64K | 128K | Opus 4.6 (2x more) |
| Agentic Coding | Codex (beta) | Claude Code + Agent Teams | Opus 4.6 |
| Multimodal | Image + Audio + Video | Image only | GPT-5.4 |

The pattern is clear: GPT-5.4 leads on general reasoning, multimodal capabilities, context length, and price. Opus 4.6 leads on agentic coding, output length, and the integrated development ecosystem. If you're primarily writing code using an AI coding agent, Opus 4.6's 1.7-point SWE-Bench lead plus Claude Code's Agent Teams makes it the stronger choice.

If you're doing research, analysis, document processing, or building applications that need broad AI capabilities, GPT-5.4 offers more at a lower price point. The 1M native context alone is a significant advantage for anyone working with large documents or codebases who doesn't want to pay for Opus's premium extended context tier.

For a deeper look at how the $20/month consumer tiers compare in daily use, see our ChatGPT Plus vs Claude Pro comparison.

GPT-5.4 vs Gemini 3.1 Pro

Gemini 3.1 Pro remains the disruptor in this three-way race. It matches or beats both GPT-5.4 and Opus 4.6 on several coding benchmarks while costing a fraction of either. But it lacks the ecosystem depth.

Key Differences

GPT-5.4 Advantages:

  • Stronger general reasoning (GPQA 78.2% vs 76.4%)
  • Superior multimodal (audio, video understanding)
  • Richer ecosystem (GPTs, Operator, plugins)
  • Better voice mode and real-time interactions
  • Image generation via DALL-E integration

Gemini 3.1 Pro Advantages:

  • Dramatically cheaper ($0.075/MTok vs $2.50/MTok input)
  • 2M native context window vs 1M
  • SWE-Bench 80.6% beats GPT-5.4's 79.1%
  • ARC-AGI-2 77.1% vs 72.3% (abstract reasoning)
  • Gemini CLI for terminal-based coding

The cost gap is the defining factor. Gemini 3.1 Pro is roughly 33x cheaper on input tokens than GPT-5.4 at standard pricing. For high-volume API workloads, that difference compounds fast: a pipeline processing 10 million input tokens daily costs about $25 a day on GPT-5.4 versus roughly $0.75 on Gemini.

The practical recommendation: Gemini 3.1 Pro for cost-sensitive, high-volume, and coding-centric workloads. GPT-5.4 for consumer-facing products that benefit from the ChatGPT ecosystem, multimodal capabilities, and broad general reasoning.

Real Limitations and Downsides

GPT-5.4 is a strong model. It's also a model with real problems that affect daily use. Here's what OpenAI's launch blog didn't emphasize.

Hallucinations Are Reduced, Not Solved

The "38% fewer hallucinations" number is real but misleading if you interpret it as "mostly reliable." In our factual accuracy audit of 50 long-form responses, roughly 1 in 12 factual claims contained errors. The model still invents plausible-sounding API endpoints, fabricates citation details, and confidently states incorrect dates or statistics.
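The two numbers above are mutually consistent, which is worth checking: if roughly 1 in 12 claims is wrong and that represents a 38% reduction, the implied GPT-5 baseline follows directly.

```python
# Sanity-checking the hallucination figures: an ~8.3% error rate that
# is a 38% reduction implies a GPT-5 baseline of ~13.4% (about 1 in 7).

gpt54_rate = 1 / 12                  # ~8.3% of factual claims wrong
gpt5_rate = gpt54_rate / (1 - 0.38)  # invert the claimed 38% reduction
print(f"GPT-5.4: {gpt54_rate:.1%}, implied GPT-5 baseline: {gpt5_rate:.1%}")
# -> GPT-5.4: 8.3%, implied GPT-5 baseline: 13.4%
```

In other words, "38% fewer" moves you from roughly one error per seven claims to one per twelve: better, but nowhere near trustworthy without verification.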

Particularly problematic for technical documentation: GPT-5.4 generated three fictional npm package names in a single response about Node.js security libraries. The packages sounded real but didn't exist.

Plus Rate Limits Are Frustrating

At $20/month, ChatGPT Plus gives you approximately 40 GPT-5.4 messages per 3 hours. During peak hours (US working hours), that drops to 25–30. For anyone using GPT-5.4 as their primary work tool, you'll hit the ceiling by mid-morning. The only solution is the $200/month Pro tier — there's no $50 or $100 option.

$200/Month Pro Is Hard to Justify

ChatGPT Pro costs $200/month for near-unlimited GPT-5.4 access. Claude's Max 20x also costs $200/month but includes Agent Teams, which is a genuinely differentiated capability. Unless you specifically need OpenAI's multimodal features or Operator integration, the Claude Max tier offers more for the same price.

Agentic Coding Lags Behind

GPT-5.4 is a strong coder but not the strongest. SWE-Bench at 79.1% puts it behind both Claude Opus 4.6 (80.8%) and Gemini 3.1 Pro (80.6%). OpenAI's Codex agent is still in beta and doesn't match Claude Code's maturity. For developers who live in the terminal and need autonomous coding, Anthropic's ecosystem has the edge.

Context Quality Degrades Past 800K

The 1M context window is a genuine capability but quality isn't uniform across the full window. In our testing, retrieval accuracy on information placed between 800K–1M tokens dropped by roughly 15–20% compared to information in the first 200K. For most use cases, treat the practical context limit as ~800K rather than 1M.
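If you adopt the ~800K practical limit, a cheap pre-flight check can keep your inputs inside it. The sketch below uses the rough 4-characters-per-token ratio for English text — an approximation, not a tokenizer — and the headroom value is an arbitrary assumption.

```python
# Guarding inputs against the ~800K practical context limit observed in
# our retrieval testing. The 4-chars-per-token ratio is a coarse English-
# text heuristic, not an exact tokenizer count.

PRACTICAL_LIMIT = 800_000  # tokens; quality degrades beyond this
CHARS_PER_TOKEN = 4        # rough approximation for English prose

def estimated_tokens(text: str) -> int:
    return len(text) // CHARS_PER_TOKEN

def fits_practical_window(text: str, reserve: int = 8_000) -> bool:
    """Leave headroom (reserve) for the prompt and the model's reply."""
    return estimated_tokens(text) + reserve <= PRACTICAL_LIMIT

print(fits_practical_window("x" * 3_000_000))  # True: ~758K tokens
print(fits_practical_window("x" * 3_300_000))  # False: ~833K tokens
```

For documents that fail the check, splitting into overlapping chunks or summarizing the least relevant sections keeps retrieval quality in the well-behaved range.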

Who Should Use GPT-5.4

GPT-5.4 makes sense for:

  • Researchers and analysts processing large documents — the 1M native context is genuinely useful for legal contracts, research papers, and codebase analysis without chunking
  • Teams building AI products on the OpenAI API — the $2.50/$10 per MTok pricing is competitive, structured output mode is production-ready, and the batch API at $1.25/$5 is excellent value
  • Multimodal application developers — image, audio, and video understanding in a single API with native tool use is something only OpenAI offers at this quality level
  • Anyone already invested in the ChatGPT ecosystem — custom GPTs, Operator, plugins, and the conversation history make switching costly
  • Science and research work — GPQA Diamond 78.2% is the highest reasoning score, and the tool-use integration for research is mature

Consider alternatives if:

  • Agentic coding is your primary use case — Claude Opus 4.6 + Claude Code + Agent Teams is the stronger setup for autonomous software engineering tasks
  • Budget is your biggest constraint — Gemini 3.1 Pro delivers comparable coding performance at roughly 33x lower API costs
  • You need very long outputs — GPT-5.4 caps at 64K output tokens versus Opus 4.6's 128K
  • Factual accuracy is critical — while improved, hallucination rates are still too high for unsupervised use in medical, legal, or financial contexts
  • You're a power user on a budget — the $20-to-$200 gap with no middle tier makes Plus inadequate and Pro expensive

The crossover point is clear: if your work is primarily code-writing in a terminal, Claude's ecosystem wins. If your work is broader — research, analysis, multimodal tasks, or building products on an AI API — GPT-5.4 offers the better package. For a comprehensive look at how the coding tools stack up, see our AI coding tools comparison.


Frequently Asked Questions

How much does GPT-5.4 cost?

ChatGPT Plus ($20/month) includes rate-limited GPT-5.4 access with ~40 messages per 3 hours. ChatGPT Pro ($200/month) provides near-unlimited access with 1M context. The API costs $2.50 per million input tokens and $10 per million output tokens at standard pricing, with a batch API at half price ($1.25/$5 per MTok). Extended 1M context is $5/$20 per MTok.

Is GPT-5.4 better than Claude Opus 4.6?

It depends on your use case. GPT-5.4 leads on general reasoning (GPQA Diamond 78.2% vs 74.9%), 1M native context, multimodal capabilities, and API price ($2.50 vs $5 per MTok input). Claude Opus 4.6 leads on agentic coding (SWE-Bench 80.8% vs 79.1%), output length (128K vs 64K tokens), and the Claude Code ecosystem. Developers doing heavy coding generally prefer Opus; researchers and product builders often prefer GPT-5.4.

Does GPT-5.4 still hallucinate?

Yes. OpenAI claims 38% fewer hallucinations than GPT-5, and our testing confirms measurable improvement. But GPT-5.4 still fabricates citations, invents plausible-sounding package names, and states incorrect facts with confidence. In our audit, roughly 1 in 12 factual claims in long-form outputs contained errors. Always verify important claims, especially for technical, medical, or legal content.

What is the GPT-5.4 context window size?

1M tokens natively, up from 128K in GPT-5. ChatGPT Plus users get 256K in the web interface; Pro users get the full 1M. The API offers 256K at standard pricing ($2.50/MTok input) and 1M at extended pricing ($5/MTok input). Quality holds well through ~800K tokens, with noticeable degradation in retrieval accuracy for information placed in the 800K–1M range.

What are the rate limits on ChatGPT Plus for GPT-5.4?

ChatGPT Plus allows roughly 40 GPT-5.4 messages per 3-hour window. During peak demand (typically US business hours), this can drop to 25–30 messages. The limits reset on a rolling basis. ChatGPT Pro ($200/month) provides approximately 10x the Plus limits with priority queue access. OpenAI adjusts these limits dynamically, so exact numbers fluctuate.

Final Verdict

GPT-5.4 is OpenAI's most capable model and a genuine contender for the overall frontier crown. The 1M native context window is a real differentiator. The GPQA and Humanity's Last Exam scores reflect meaningful reasoning improvements. The API pricing at $2.50/$10 per MTok undercuts Claude Opus 4.6 by half while delivering comparable quality on most tasks.

It also still hallucinates, punishes Plus subscribers with tight rate limits, offers no middle-ground pricing between $20 and $200, and trails in the agentic coding race that's become the key battleground for developer tools. The 64K output cap feels limiting when Opus gives you 128K.

The AI model landscape in March 2026 is a three-way split: OpenAI leads on general reasoning and multimodal breadth, Anthropic leads on agentic coding and developer tooling, and Google leads on price-to-performance. GPT-5.4 doesn't win every category, but it's competitive in all of them — and that breadth is its real strength.

Our Recommendation

GO — API developers and product builders: the $2.50/$10 per MTok pricing is the best value among frontier models. Structured output mode is production-ready. The 1M context opens use cases that weren't practical before.
MAYBE — Power users on Plus: if you're hitting rate limits daily, either upgrade to Pro ($200/month) or consider Claude Max 5x ($100/month) as a more balanced option. Run the math on your actual usage.
WAIT — Developers primarily doing agentic coding: Claude Opus 4.6 with Claude Code and Agent Teams is the stronger package for software engineering workflows. Switch when OpenAI's Codex exits beta.

GPT-5.4 is a strong all-rounder that does nothing badly and several things exceptionally well. Whether it's the right model for you depends less on benchmarks and more on where you spend your time. General reasoning and research? GPT-5.4. Coding agents? Opus 4.6. Budget-conscious? Gemini 3.1 Pro. The frontier has room for all three.


OpenAI Tools Hub Team

Testing AI models and developer tools since 2023

This review is based on nine days of hands-on testing across API integrations, ChatGPT sessions, and head-to-head model comparisons. Benchmarks cross-referenced with Artificial Analysis, LMSYS, and Scale AI. Pricing and features accurate as of March 2026.
