GPT-5.4 Review: 1M Context, 83% Superhuman Tasks, and What Actually Changed
OpenAI shipped GPT-5.4 on March 11, 2026, roughly six months after GPT-5 launched and immediately set off the usual cycle of cherry-picked benchmarks and breathless commentary. The headline claim: 83% on Humanity's Last Exam with tools, a 1M native context window, and a 38% reduction in hallucinations. After nine days of daily use across API projects, ChatGPT sessions, and side-by-side testing against Claude Opus 4.6, here's what holds up — and what doesn't.
TL;DR — Key Takeaways:
- 1M native context window — a genuine leap from GPT-5's 128K, and it actually works well through roughly 800K tokens before quality degrades
- 83% on Humanity's Last Exam — the highest score any model has achieved, though "with tools" is doing heavy lifting in that number
- Coding improved but not leading — SWE-Bench 79.1% is strong but still trails Claude Opus 4.6 (80.8%) and Gemini 3.1 Pro (80.6%)
- Still hallucinates — 38% fewer hallucinations than GPT-5 sounds good until you realize roughly 1 in 12 factual claims in long outputs is still wrong
- Plus rate limits are painful — ~40 messages per 3 hours on the $20/month plan, dropping to 25–30 during peak hours
- Pro tier is expensive — $200/month for unrestricted access, when Claude Max 20x costs the same but includes Agent Teams
What Is GPT-5.4?
GPT-5.4 is OpenAI's latest flagship model, released March 11, 2026. It represents a significant iteration on GPT-5 (September 2025), focusing on three areas: massively expanded context, improved reasoning accuracy, and reduced hallucination rates. It sits above GPT-5 mini and o3-mini in OpenAI's model hierarchy.
The naming convention follows OpenAI's pattern of point releases that add substantial capabilities without a full generational jump. GPT-5.4 is to GPT-5 roughly what GPT-4 Turbo was to GPT-4 — same architecture, meaningfully upgraded capabilities and efficiency.
The headline number is the 1M native context window, up from GPT-5's 128K. That's roughly an 8x increase and puts GPT-5.4 in direct competition with Google's Gemini models on context length. Combined with improved structured output reliability and native tool use, it positions GPT-5.4 as OpenAI's answer to the agentic AI race.
At a Glance
Genuinely Impressive:
- 1M native context window that holds quality to ~800K
- 83% on Humanity's Last Exam (with tools) — record score
- 38% fewer hallucinations than GPT-5
- GPQA Diamond 78.2% — strongest general reasoning score
- Native structured output mode (JSON, function calling)
Where It Falls Short:
- Still hallucinates — ~8% error rate in factual claims
- SWE-Bench 79.1% trails Claude Opus 4.6 (80.8%)
- Plus rate limits (40 msgs/3hr) frustrate daily users
- Pro tier at $200/month is hard to justify for most individuals
- API latency higher than GPT-5 for standard requests
How We Tested
This review is based on nine days of hands-on testing from March 12–20, 2026. We used GPT-5.4 across both the ChatGPT interface (Plus and Pro tiers) and the API directly, comparing outputs against Claude Opus 4.6 and Gemini 3.1 Pro on identical prompts.
Long-Context Stress Tests (8 sessions)
Fed GPT-5.4 progressively larger documents — from 50K to 900K tokens — and tested retrieval accuracy, summarization quality, and instruction-following at various context depths. Compared against Gemini's 2M context and Claude's 1M beta.
Coding Tasks (10 sessions)
Multi-file refactoring, bug fixing from error logs, test generation, and API integration across TypeScript, Python, and Go. Ran identical tasks on Claude Opus 4.6 and Gemini 3.1 Pro for direct comparison. Logged completion quality, token consumption, and time-to-solution.
Factual Accuracy Audit (6 sessions)
Generated 50 long-form responses on verifiable topics (science, history, technology, law) and manually fact-checked every claim. Counted hallucination rate per 100 factual assertions and compared against GPT-5 and Claude Opus 4.6 on the same prompts.
Multimodal Evaluation (5 sessions)
Image analysis, chart interpretation, document OCR, and screenshot understanding. GPT-5.4's multimodal capabilities are one of its differentiators — we tested whether the quality improvements extend beyond text.
Third-Party Benchmark Verification
Cross-referenced OpenAI's published benchmarks against independent results from Artificial Analysis, LMSYS Chatbot Arena, and Scale AI's SEAL leaderboard. Benchmark figures cited in this review match or closely approximate independent results.
All API testing was conducted at standard pricing with no sponsored access or credits from OpenAI. Token costs reported are actual billed amounts from our account.
Key Features That Actually Matter
OpenAI's announcement listed over a dozen improvements. Three of them make a real difference in daily use. The rest are incremental.
The Three Features Worth Paying Attention To
1M Native Context Window
GPT-5's 128K context was functional but limiting for large codebases and document analysis. GPT-5.4 jumps to 1M tokens natively — not a beta feature, not a workaround, just a standard capability. You can feed it an entire mid-sized codebase or a 400-page PDF and ask questions across the full content.
The practical limit is around 800K tokens. Beyond that, we noticed quality degradation in retrieval tasks — the model starts missing details buried in the middle of very long contexts. This is a known issue across all long-context models (the "lost in the middle" problem), but GPT-5.4 handles it better than GPT-5 did at its 128K ceiling.
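A retrieval probe like the ones we ran can be sketched in a few lines: build a synthetic haystack with a "needle" fact planted at a chosen fractional depth, then ask the model to retrieve it. This is a minimal sketch, not our exact harness; the helper names are ours, and the `gpt-5.4` model id in the usage stub is an assumption about how OpenAI exposes it in the API.

```python
FILLER = "The quick brown fox jumps over the lazy dog. "  # 9 words per repeat


def build_haystack(needle: str, total_words: int, depth: float) -> str:
    """Embed `needle` at a fractional depth (0.0 = start, 1.0 = end)."""
    words = (FILLER * (total_words // 9 + 1)).split()[:total_words]
    words.insert(int(len(words) * depth), needle)
    return " ".join(words)


def probe(client, model: str, haystack: str, question: str) -> str:
    """Ask the model to retrieve the planted fact from the long context."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": haystack + "\n\n" + question}],
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment

    needle = "The secret passphrase is BLUE-HERON-42."
    hay = build_haystack(needle, total_words=500_000, depth=0.85)
    print(probe(OpenAI(), "gpt-5.4", hay, "What is the secret passphrase?"))
```

Sweeping `depth` from 0.0 to 1.0 at a fixed context size is what surfaces the mid-context dip described above.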
Structured Output Mode
GPT-5 already had JSON mode, but GPT-5.4 takes it further with guaranteed schema compliance. When you specify a JSON schema in the API, the model's output is constrained to match it exactly — no extra fields, no missing required properties, no malformed JSON. This sounds mundane but it eliminates an entire category of production bugs.
In our testing, structured output compliance was 99.7% across 300 API calls. The remaining 0.3% were edge cases with deeply nested optional fields. For comparison, GPT-5's JSON mode hit roughly 96% compliance, and Claude Opus 4.6's tool_use mode achieves about 99.2%.
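In practice, schema-constrained calls look like OpenAI's existing Structured Outputs surface: you pass a JSON Schema with `strict` enabled and parse the guaranteed-conformant result. A minimal sketch, assuming GPT-5.4 keeps the current `response_format` shape; the `gpt-5.4` model id and the ticket schema are our illustrative choices:

```python
import json

TICKET_SCHEMA = {
    "name": "support_ticket",
    "strict": True,  # output is constrained to match the schema exactly
    "schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "severity": {"type": "string", "enum": ["low", "medium", "high"]},
            "tags": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["title", "severity", "tags"],
        "additionalProperties": False,  # no extra fields can appear
    },
}


def extract_ticket(client, model: str, report: str) -> dict:
    """Turn a free-text bug report into a schema-conformant ticket."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"File a ticket for: {report}"}],
        response_format={"type": "json_schema", "json_schema": TICKET_SCHEMA},
    )
    # In strict mode the payload is guaranteed to parse and match the schema.
    return json.loads(resp.choices[0].message.content)


if __name__ == "__main__":
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment

    print(extract_ticket(OpenAI(), "gpt-5.4", "Checkout page 500s on Safari"))
```

The `additionalProperties: False` plus explicit `required` list is what closes off the "extra fields, missing properties" bug class mentioned above.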
Improved Reasoning Chain
OpenAI merged the o-series reasoning approach directly into GPT-5.4. Instead of choosing between "GPT-5" and "o3-mini," you get a single model that automatically allocates reasoning depth based on task complexity. Simple questions get fast answers; hard problems trigger deeper analysis with internal chain-of-thought.
This is conceptually similar to Anthropic's Adaptive Thinking in Opus 4.6. The result: GPQA Diamond jumped from 71.4% (GPT-5) to 78.2%, which is the highest general reasoning score among frontier models. The tradeoff is the same as with Opus — harder problems consume more tokens and cost more.
Benchmark Deep Dive
GPT-5.4's benchmark story is nuanced. It leads on general reasoning and multimodal tasks but doesn't top the coding leaderboards. The Humanity's Last Exam score grabs headlines, but the fine print matters.
| Benchmark | GPT-5.4 | What It Measures |
|---|---|---|
| Humanity's Last Exam | 83.0% (tools) | Expert-level questions across all academic fields. "With tools" means the model can use web search and code execution. Without tools: ~61%. |
| GPQA Diamond | 78.2% | Graduate-level physics, chemistry, and biology questions. Tests deep scientific reasoning without tool use. |
| SWE-Bench Verified | 79.1% | Resolving real GitHub issues. Strong but behind Opus 4.6 (80.8%) and Gemini 3.1 Pro (80.6%). |
| MATH-500 | 96.4% | Competition-level mathematics. Near-ceiling performance alongside other frontier models. |
| SimpleQA | 47.8% | Factual accuracy on simple questions. Higher is better. GPT-5 scored 38.2%, so 47.8% reflects the hallucination reduction. |
The Humanity's Last Exam score needs context. The 83% figure is "with tools" — meaning GPT-5.4 had access to web search and a Python code interpreter during the evaluation. Without tools, the score drops to approximately 61%. Both numbers are records, but the gap between tool-assisted and unassisted performance is wider than for other models, suggesting GPT-5.4 is particularly effective at leveraging tools rather than having superior raw knowledge.
The SimpleQA improvement from 38.2% to 47.8% tracks the claimed "38% fewer hallucinations," though the two numbers measure different things and don't map one-to-one. It's real progress, but 47.8% accuracy on simple factual questions means the model still gets more than half of basic facts wrong on this challenging benchmark. Don't turn off your fact-checking instincts yet.
Pricing Breakdown
OpenAI's pricing for GPT-5.4 follows a familiar tiered approach. The API is competitively priced against Claude Opus 4.6, while the consumer tiers maintain the $20/$200 split that frustrates power users who fall between those extremes.
API Pricing
| Tier | Input / MTok | Output / MTok | Notes |
|---|---|---|---|
| Standard (256K) | $2.50 | $10.00 | 256K context, 64K output |
| Extended (1M) | $5.00 | $20.00 | 1M context window |
| Realtime | $10.00 | $40.00 | Voice mode, streaming |
| Batch API | $1.25 | $5.00 | 50% off, async processing |
The API pricing is genuinely competitive. At $2.50/$10 per MTok for standard context, GPT-5.4 is half the price of Claude Opus 4.6 on input ($2.50 vs $5 per MTok) and well under half on output ($10 vs $25). The Batch API at $1.25/$5 per MTok is particularly compelling for data processing pipelines and bulk analysis tasks.
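A batch submission follows OpenAI's existing Batch API flow: write one JSONL line per request, upload the file, and create a batch against the chat completions endpoint. A minimal sketch, assuming GPT-5.4 is addressable through the same surface; the `gpt-5.4` model id and helper names are ours:

```python
import json


def batch_line(custom_id: str, model: str, prompt: str) -> str:
    """One JSONL line in the Batch API request format."""
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    })


def submit_batch(client, path: str):
    """Upload the JSONL file and start an async batch at half price."""
    batch_file = client.files.create(file=open(path, "rb"), purpose="batch")
    return client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",  # async: results arrive within 24 hours
    )


if __name__ == "__main__":
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment

    with open("requests.jsonl", "w") as f:
        for i, doc in enumerate(["summarize report A", "summarize report B"]):
            f.write(batch_line(f"task-{i}", "gpt-5.4", doc) + "\n")
    print(submit_batch(OpenAI(), "requests.jsonl").id)
```

The tradeoff is latency for price: results land asynchronously within the completion window rather than streaming back immediately.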
The consumer pricing tells a different story. The $20 Plus tier is reasonable but the rate limits are a constant friction point. The jump to $200 for Pro is steep with nothing in between. Anthropic offers Max 5x at $100/month as a middle tier — OpenAI has no equivalent. If you're outgrowing Plus but don't need $200/month worth of capacity, you're stuck.
In our testing, average API cost per coding task ran about $0.08–0.30 for standard completions and $1–3 for complex multi-step reasoning tasks. Roughly 40–60% cheaper than equivalent Opus 4.6 API calls for the same prompts.
GPT-5.4 vs Claude Opus 4.6
This is the matchup everyone wants to see. GPT-5.4 and Opus 4.6 are the two most capable general-purpose models available in March 2026. Their strengths diverge enough that the "winner" depends entirely on your use case.
| Dimension | GPT-5.4 | Claude Opus 4.6 | Winner |
|---|---|---|---|
| SWE-Bench Verified | 79.1% | 80.8% | Opus 4.6 |
| GPQA Diamond | 78.2% | 74.9% | GPT-5.4 |
| Context Window | 1M native | 200K (1M beta) | GPT-5.4 |
| API Input / MTok | $2.50 | $5.00 | GPT-5.4 (2x cheaper) |
| Max Output Tokens | 64K | 128K | Opus 4.6 (2x more) |
| Agentic Coding | Codex (beta) | Claude Code + Agent Teams | Opus 4.6 |
| Multimodal | Image + Audio + Video | Image only | GPT-5.4 |
The pattern is clear: GPT-5.4 leads on general reasoning, multimodal capabilities, context length, and price. Opus 4.6 leads on agentic coding, output length, and the integrated development ecosystem. If you're primarily writing code using an AI coding agent, Opus 4.6's 1.7-point SWE-Bench lead plus Claude Code's Agent Teams makes it the stronger choice.
If you're doing research, analysis, document processing, or building applications that need broad AI capabilities, GPT-5.4 offers more at a lower price point. The 1M native context alone is a significant advantage for anyone working with large documents or codebases who doesn't want to pay for Opus's premium extended context tier.
For a deeper look at how the $20/month consumer tiers compare in daily use, see our ChatGPT Plus vs Claude Pro comparison.
GPT-5.4 vs Gemini 3.1 Pro
Gemini 3.1 Pro remains the disruptor in this three-way race. It matches or beats both GPT-5.4 and Opus 4.6 on several coding benchmarks while costing a fraction of either. But it lacks the ecosystem depth.
Key Differences
GPT-5.4 Advantages:
- Stronger general reasoning (GPQA 78.2% vs 76.4%)
- Superior multimodal (audio, video understanding)
- Richer ecosystem (GPTs, Operator, plugins)
- Better voice mode and real-time interactions
- Image generation via DALL-E integration
Gemini 3.1 Pro Advantages:
- Dramatically cheaper ($0.075/MTok vs $2.50/MTok input)
- 2M native context window vs 1M
- SWE-Bench 80.6% beats GPT-5.4's 79.1%
- ARC-AGI-2 77.1% vs 72.3% (abstract reasoning)
- Gemini CLI for terminal-based coding
The cost gap is the defining factor. Gemini 3.1 Pro is roughly 33x cheaper on input tokens than GPT-5.4 at standard pricing. For high-volume API workloads, that difference compounds fast. A pipeline processing 10 million input tokens daily costs about $22.50 a month on Gemini versus $750 on GPT-5.4.
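The compounding is easy to verify with back-of-the-envelope arithmetic using the standard input prices quoted in this review:

```python
def monthly_input_cost(tokens_per_day: float, price_per_mtok: float,
                       days: int = 30) -> float:
    """Monthly input-token cost given a daily volume and a per-MTok rate."""
    return tokens_per_day / 1_000_000 * price_per_mtok * days


gemini = monthly_input_cost(10_000_000, 0.075)  # Gemini 3.1 Pro input rate
gpt54 = monthly_input_cost(10_000_000, 2.50)    # GPT-5.4 standard input rate
print(f"Gemini: ${gemini:.2f}/mo, GPT-5.4: ${gpt54:.2f}/mo, "
      f"ratio {gpt54 / gemini:.1f}x")
```

Output tokens widen the gap further, since the per-MTok spread on output is larger still.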
The practical recommendation: Gemini 3.1 Pro for cost-sensitive, high-volume, and coding-centric workloads. GPT-5.4 for consumer-facing products that benefit from the ChatGPT ecosystem, multimodal capabilities, and broad general reasoning.
Real Limitations and Downsides
GPT-5.4 is a strong model. It's also a model with real problems that affect daily use. Here's what OpenAI's launch blog didn't emphasize.
Hallucinations Are Reduced, Not Solved
The "38% fewer hallucinations" number is real but misleading if you interpret it as "mostly reliable." In our factual accuracy audit of 50 long-form responses, roughly 1 in 12 factual claims contained errors. The model still invents plausible-sounding API endpoints, fabricates citation details, and confidently states incorrect dates or statistics.
Particularly problematic for technical documentation: GPT-5.4 generated three fictional npm package names in a single response about Node.js security libraries. The packages sounded real but didn't exist.
Plus Rate Limits Are Frustrating
At $20/month, ChatGPT Plus gives you approximately 40 GPT-5.4 messages per 3 hours. During peak hours (US working hours), that drops to 25–30. For anyone using GPT-5.4 as their primary work tool, you'll hit the ceiling by mid-morning. The only solution is the $200/month Pro tier — there's no $50 or $100 option.
$200/Month Pro Is Hard to Justify
ChatGPT Pro costs $200/month for near-unlimited GPT-5.4 access. Claude's Max 20x also costs $200/month but includes Agent Teams, which is a genuinely differentiated capability. Unless you specifically need OpenAI's multimodal features or Operator integration, the Claude Max tier offers more for the same price.
Agentic Coding Lags Behind
GPT-5.4 is a strong coder but not the strongest. SWE-Bench at 79.1% puts it behind both Claude Opus 4.6 (80.8%) and Gemini 3.1 Pro (80.6%). OpenAI's Codex agent is still in beta and doesn't match Claude Code's maturity. For developers who live in the terminal and need autonomous coding, Anthropic's ecosystem has the edge.
Context Quality Degrades Past 800K
The 1M context window is a genuine capability but quality isn't uniform across the full window. In our testing, retrieval accuracy on information placed between 800K–1M tokens dropped by roughly 15–20% compared to information in the first 200K. For most use cases, treat the practical context limit as ~800K rather than 1M.
Who Should Use GPT-5.4
GPT-5.4 makes sense for:
- Researchers and analysts processing large documents — the 1M native context is genuinely useful for legal contracts, research papers, and codebase analysis without chunking
- Teams building AI products on the OpenAI API — the $2.50/$10 per MTok pricing is competitive, structured output mode is production-ready, and the batch API at $1.25/$5 is excellent value
- Multimodal application developers — image, audio, and video understanding in a single API with native tool use is something only OpenAI offers at this quality level
- Anyone already invested in the ChatGPT ecosystem — custom GPTs, Operator, plugins, and the conversation history make switching costly
- Science and research work — GPQA Diamond 78.2% is the highest reasoning score, and the tool-use integration for research is mature
Consider alternatives if:
- Agentic coding is your primary use case — Claude Opus 4.6 + Claude Code + Agent Teams is the stronger setup for autonomous software engineering tasks
- Budget is your biggest constraint — Gemini 3.1 Pro delivers comparable coding performance at roughly 33x lower API costs
- You need very long outputs — GPT-5.4 caps at 64K output tokens versus Opus 4.6's 128K
- Factual accuracy is critical — while improved, hallucination rates are still too high for unsupervised use in medical, legal, or financial contexts
- You're a power user on a budget — the $20-to-$200 gap with no middle tier makes Plus inadequate and Pro expensive
The crossover point is clear: if your work is primarily code-writing in a terminal, Claude's ecosystem wins. If your work is broader — research, analysis, multimodal tasks, or building products on an AI API — GPT-5.4 offers the better package. For a comprehensive look at how the coding tools stack up, see our AI coding tools comparison.
Frequently Asked Questions
How much does GPT-5.4 cost?
ChatGPT Plus ($20/month) includes rate-limited GPT-5.4 access with ~40 messages per 3 hours. ChatGPT Pro ($200/month) provides near-unlimited access with 1M context. The API costs $2.50 per million input tokens and $10 per million output tokens at standard pricing, with a batch API at half price ($1.25/$5 per MTok). Extended 1M context is $5/$20 per MTok.
Is GPT-5.4 better than Claude Opus 4.6?
It depends on your use case. GPT-5.4 leads on general reasoning (GPQA Diamond 78.2% vs 74.9%), 1M native context, multimodal capabilities, and API price ($2.50 vs $5 per MTok input). Claude Opus 4.6 leads on agentic coding (SWE-Bench 80.8% vs 79.1%), output length (128K vs 64K tokens), and the Claude Code ecosystem. Developers doing heavy coding generally prefer Opus; researchers and product builders often prefer GPT-5.4.
Does GPT-5.4 still hallucinate?
Yes. OpenAI claims 38% fewer hallucinations than GPT-5, and our testing confirms measurable improvement. But GPT-5.4 still fabricates citations, invents plausible-sounding package names, and states incorrect facts with confidence. In our audit, roughly 1 in 12 factual claims in long-form outputs contained errors. Always verify important claims, especially for technical, medical, or legal content.
What is the GPT-5.4 context window size?
1M tokens natively, up from 128K in GPT-5. ChatGPT Plus users get 256K in the web interface; Pro users get the full 1M. The API offers 256K at standard pricing ($2.50/MTok input) and 1M at extended pricing ($5/MTok input). Quality holds well through ~800K tokens, with noticeable degradation in retrieval accuracy for information placed in the 800K–1M range.
What are the rate limits on ChatGPT Plus for GPT-5.4?
ChatGPT Plus allows roughly 40 GPT-5.4 messages per 3-hour window. During peak demand (typically US business hours), this can drop to 25–30 messages. The limits reset on a rolling basis. ChatGPT Pro ($200/month) provides approximately 10x the Plus limits with priority queue access. OpenAI adjusts these limits dynamically, so exact numbers fluctuate.
Final Verdict
GPT-5.4 is OpenAI's most capable model and a genuine contender for the overall frontier crown. The 1M native context window is a real differentiator. The GPQA and Humanity's Last Exam scores reflect meaningful reasoning improvements. The API pricing at $2.50/$10 per MTok undercuts Claude Opus 4.6 by half while delivering comparable quality on most tasks.
It also still hallucinates, punishes Plus subscribers with tight rate limits, offers no middle-ground pricing between $20 and $200, and trails in the agentic coding race that's become the key battleground for developer tools. The 64K output cap feels limiting when Opus gives you 128K.
The AI model landscape in March 2026 is a three-way split: OpenAI leads on general reasoning and multimodal breadth, Anthropic leads on agentic coding and developer tooling, and Google leads on price-to-performance. GPT-5.4 doesn't win every category, but it's competitive in all of them — and that breadth is its real strength.
Our Recommendation
GPT-5.4 is a strong all-rounder that does nothing badly and several things exceptionally well. Whether it's the right model for you depends less on benchmarks and more on where you spend your time. General reasoning and research? GPT-5.4. Coding agents? Opus 4.6. Budget-conscious? Gemini 3.1 Pro. The frontier has room for all three.