Gemini 3.1 Pro Review: Google's Cheapest Flagship Model Tested
Released in February 2026 with a 2.5x jump on ARC-AGI-2 and the lowest API price of any frontier model. We tested it head-to-head against Claude Opus 4.6 and GPT-5.4 to find out where the advantage is real and where it overpromises.
TL;DR — Key Takeaways
- Gemini 3.1 Pro launched February 19, 2026 — 77.1% on ARC-AGI-2 (up from 31.1%, a 2.5x jump) and 80.6% on SWE-bench Verified
- API pricing is $2/$12 per million tokens — cheapest among frontier models; Claude Opus 4.6 is $15/$75 and GPT-5.4 is $2.50/$15
- Four thinking modes (Flash, Lite, Standard, Deep Think) let you tune latency vs. reasoning depth per request
- On pure benchmark numbers, Gemini 3.1 Pro leads both Claude Opus 4.6 and GPT-5.4 on coding and abstract reasoning
- Downsides: tool-use reliability in long agentic loops still trails Claude Opus 4.6, and Deep Think mode adds meaningful latency
- If you are building API-heavy applications, Gemini 3.1 Pro is the obvious cost-performance choice. For production agent workflows, Claude still has an edge in reliability
What Is Gemini 3.1 Pro?
Gemini 3.1 Pro is Google DeepMind's latest flagship model, released on February 19, 2026. It is the first model in the Gemini 3.x series and represents a significant step up from Gemini 2.5 Pro — most notably on abstract reasoning and coding tasks.
The headline number from Google's release: 77.1% on ARC-AGI-2, up from Gemini 2.5 Pro's 31.1%. ARC-AGI-2 (Abstraction and Reasoning Corpus, second generation) is considered one of the hardest AI reasoning benchmarks because it tests pattern generalization on problems that models could not have seen in training. A 46-percentage-point jump in one generation is unusual — and it reflects a fundamental architectural change in how the model approaches novel problems, not just more training data.
The second highlight is pricing. At $2.00 input / $12.00 output per million tokens, Gemini 3.1 Pro is the cheapest frontier model available. That gap matters when running high-volume applications.
The model supports up to a 2 million token context window (with billing tiers), 4 inference thinking modes, multimodal input (text, images, audio, video, documents), and native function calling with structured output.
How We Tested
We evaluated Gemini 3.1 Pro, Claude Opus 4.6, and GPT-5.4 across four task categories over three weeks in February–March 2026:
- Coding tasks — 25 real GitHub issues pulled from open-source projects (Next.js, FastAPI, Go stdlib). Scored by whether the generated fix resolves the issue without introducing regressions (manual review).
- Reasoning tasks — 40 multi-step logic problems, math proofs, and planning tasks. Scored for correct final answer and reasoning chain quality.
- Long-context tasks — Processing 500K+ token documents (legal contracts, research papers). Scored on extraction accuracy and summarization quality.
- Agentic tool use — 15 tool-calling workflows including database queries, API chains, and file operations. Scored on task completion rate and error recovery.
We used Standard thinking mode for Gemini 3.1 Pro in all tests unless specified, to match a realistic API usage pattern. Model versions: gemini-3.1-pro, claude-opus-4-6-20260219, gpt-5.4-turbo.
Third-party benchmark data: ARC-AGI-2 scores from ARC Prize leaderboard. SWE-bench from swebench.com. MMLU from Papers With Code.
Benchmark Results: ARC-AGI-2, SWE-bench, MMLU
The benchmark picture strongly favors Gemini 3.1 Pro on coding and abstract reasoning. The gap is less clear on language understanding.
| Benchmark | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.4 |
|---|---|---|---|
| ARC-AGI-2 | 77.1% | 43.2% | 65.3% |
| SWE-bench Verified | 80.6% | 77.1% | 79.2% |
| MMLU (5-shot) | 91.4% | 92.1% | 91.8% |
| MATH (competition) | 89.3% | 85.7% | 87.1% |
| HumanEval (coding) | 97.6% | 96.4% | 96.9% |
| Context window | 2M tokens | 200K tokens | 512K tokens |
The ARC-AGI-2 gap is the most striking number. Going from 31.1% to 77.1% in one model generation represents a qualitative shift in generalization capability. Claude Opus 4.6 at 43.2% is still strong — better than any model from early 2025 — but Gemini 3.1 Pro is in a different tier on this benchmark.
SWE-bench differences are tighter: a 3.5-point lead over Claude Opus 4.6 (80.6% vs 77.1%) and a 1.4-point lead over GPT-5.4. At the top of the benchmark range these differences are real but not decisive — all three models resolve around 4 in 5 real GitHub issues autonomously, which is extraordinary compared to where these tools were eighteen months ago.
MMLU (general language understanding) shows Claude Opus 4.6 with a slight edge, 92.1% vs 91.4%. Practically speaking, all three are within noise range on language tasks.
Four Thinking Modes Explained
One of Gemini 3.1 Pro's most practical features is its tiered thinking system. Unlike models where you toggle "thinking on/off," Gemini 3.1 Pro lets you set a continuous thinking_budget that maps to four named modes:
| Mode | Thinking Tokens | Latency | Best For |
|---|---|---|---|
| Flash | 0 | ~1s | Classification, extraction, simple Q&A |
| Lite | ~500 | ~3s | Summarization, translation, light analysis |
| Standard | ~2,000 | ~8s | Code generation, complex instructions, most tasks |
| Deep Think | ~8,000–32,000 | ~30–90s | Theorem proving, architecture design, hard coding tasks |
In practice, Standard mode covers roughly 85% of real workloads at a latency that feels responsive for interactive use. Flash mode is worth using in pipelines where you are processing thousands of items and the tasks are genuinely simple — the cost and speed improvement is significant. Deep Think mode produces noticeably better results on multi-step math and complex refactoring tasks, but the 30–90 second wait makes it unsuitable for anything interactive.
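The mode-to-budget mapping can be wrapped in a small helper so callers pick a named mode instead of a raw token count. This is a sketch based on the table above; the `thinking_budget` parameter name follows the article's own API example, and the budgets are the approximate midpoints quoted, not official constants:

```python
# Hypothetical helper: map the four named modes to thinking budgets.
# Budgets are the approximate values from the table above.
THINKING_BUDGETS = {
    "flash": 0,          # no thinking: classification, extraction
    "lite": 500,         # summarization, light analysis
    "standard": 2000,    # default: code generation, most tasks
    "deep_think": 8000,  # hard reasoning; can be raised toward ~32,000
}

def generation_config_for(mode: str) -> dict:
    """Build a generation_config dict for the requested thinking mode."""
    if mode not in THINKING_BUDGETS:
        raise ValueError(f"unknown thinking mode: {mode!r}")
    return {"thinking_budget": THINKING_BUDGETS[mode]}
```

A pipeline can then route simple items to `generation_config_for("flash")` and escalate only the hard cases to `"deep_think"`.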
One thing to know: thinking tokens are not billed separately; they are included in the standard output token pricing. A Deep Think response consuming 10,000 thinking tokens plus 500 output tokens is billed as 10,500 output tokens, about $0.126 at $12/million. A typical Standard call (roughly 2,500 billed tokens) costs about $0.03, so an expensive Deep Think call costs roughly $0.10 more. That is negligible for single queries, meaningful at scale.
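The billing rule is easy to sanity-check in code. A minimal sketch using the $12/million output price quoted above (the function name and default are ours, not from any SDK):

```python
def call_cost_usd(thinking_tokens: int, output_tokens: int,
                  output_price_per_million: float = 12.0) -> float:
    """Per-call cost: thinking tokens bill as output tokens."""
    billed = thinking_tokens + output_tokens
    return billed * output_price_per_million / 1_000_000

deep_think = call_cost_usd(10_000, 500)  # 10,500 billed tokens
standard = call_cost_usd(2_000, 500)     # 2,500 billed tokens
```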
Coding Performance in Practice
The SWE-bench lead over Claude Opus 4.6 held up in our real-world tests, but the difference showed up in a different way than we expected.
Gemini 3.1 Pro excelled on tasks requiring creative problem decomposition: finding non-obvious ways to refactor deeply coupled code, proposing architectural improvements rather than just patching, and generating cleaner abstractions. On our 25-issue test set, it resolved 20 correctly on the first attempt vs Claude Opus 4.6's 19.
Where Claude Opus 4.6 held its ground: consistency in agentic workflows. In our tool-use tests, Gemini 3.1 Pro's function calling was less reliable in long chains. Claude Opus 4.6 completed 13/15 multi-step tool workflows on the first run. Gemini 3.1 Pro completed 11/15, with two failures caused by malformed function call arguments after 6+ tool calls in sequence. This is a real limitation for production agents that need deterministic behavior.
For most developers building applications rather than complex agent pipelines, this distinction will not matter much. For teams building systems that depend on reliable multi-turn tool use, it is worth testing with your specific workflow before switching.
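One practical mitigation for the malformed-argument failures described above is to validate tool-call arguments before executing the tool and re-prompt on failure. A minimal sketch; the schema check and key names are illustrative, not part of any SDK:

```python
import json

def validate_tool_args(raw_args: str, required_keys: set[str]):
    """Parse a model's function-call arguments defensively.

    Returns the parsed dict if it is well-formed JSON containing all
    required keys, else None so the caller can re-prompt the model.
    Cheap insurance in long chains, where we saw malformed arguments
    after 6+ sequential tool calls.
    """
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError:
        return None
    if not isinstance(args, dict) or not required_keys <= args.keys():
        return None
    return args
```

In an agent loop, a `None` result would trigger one retry with an error message appended to the conversation rather than executing a broken call.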
On long-context coding tasks — summarizing a 200K-token codebase, finding bugs across a large repo — Gemini 3.1 Pro's 2M token context window is a meaningful advantage over Claude's 200K. We fed it a ~300K token repository and asked it to identify all locations where a deprecated API was called. It found all 23 instances in one pass. Claude required an agentic search loop that took 3x longer.
Pricing: The Real Advantage
The pricing comparison is the clearest argument for Gemini 3.1 Pro if you are building API-heavy applications.
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Context Window | Free Tier |
|---|---|---|---|---|
| Gemini 3.1 Pro | $2.00 | $12.00 | 2M tokens | Yes |
| Claude Opus 4.6 | $15.00 | $75.00 | 200K tokens | No |
| GPT-5.4 | $2.50 | $15.00 | 512K tokens | No |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 200K tokens | No |
At $2/$12, Gemini 3.1 Pro input is 7.5x cheaper than Claude Opus 4.6 and output is 6.25x cheaper. Even compared to the mid-tier Claude Sonnet 4.6, the output cost difference is meaningful at scale.
For a concrete example: an application making 100,000 API calls per day with an average of 500 input tokens and 1,000 output tokens per call would cost approximately $1,300 per day with Gemini 3.1 Pro versus $8,250 per day with Claude Opus 4.6. Over a 30-day month, that is roughly $39,000 versus $247,500, a gap of more than $200,000. For startups or high-volume applications, that is a significant operational difference.
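The arithmetic behind that example can be reproduced in a few lines (prices from the pricing table above; the call volume and token averages are the example's assumptions):

```python
def daily_cost_usd(calls_per_day: int, in_tokens: int, out_tokens: int,
                   in_price: float, out_price: float) -> float:
    """Daily API spend given per-call token averages and $/1M-token prices."""
    total_in = calls_per_day * in_tokens
    total_out = calls_per_day * out_tokens
    return (total_in * in_price + total_out * out_price) / 1_000_000

gemini = daily_cost_usd(100_000, 500, 1_000, 2.00, 12.00)
claude = daily_cost_usd(100_000, 500, 1_000, 15.00, 75.00)
```

At this volume the output tokens dominate the bill, which is why the output-price gap matters more than the input-price gap.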
The free tier via Google AI Studio is also worth noting for development and low-volume use. Rate limits apply, but for building prototypes or running occasional analyses, it costs nothing.
Gemini 3.1 Pro vs Claude Opus 4.6 vs GPT-5.4
Here is where each model leads and where it does not:
| Category | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.4 |
|---|---|---|---|
| API pricing | Winner ($2/$12) | Expensive ($15/$75) | Moderate ($2.50/$15) |
| ARC-AGI-2 | Winner (77.1%) | 43.2% | 65.3% |
| SWE-bench coding | Winner (80.6%) | 77.1% | 79.2% |
| Context window | Winner (2M) | 200K | 512K |
| Agentic tool use reliability | Good | Winner | Good |
| Instruction following | Good | Winner | Good |
| OpenAI ecosystem fit | Moderate (compat layer) | Moderate | Winner (native) |
| Free tier | Yes | No | No |
| Multimodal input | Text/image/audio/video | Text/image | Text/image/audio |
API Access and Integration
Gemini 3.1 Pro is accessible through two primary paths: Google AI Studio (for development and lower-volume use) and Vertex AI (for production with enterprise SLAs).
```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3.1-pro")

# Standard mode (default)
response = model.generate_content("Explain this codebase...")

# Deep Think mode
response = model.generate_content(
    "Solve this algorithmic problem...",
    generation_config={"thinking_budget": 8000},
)
```

Google also provides an OpenAI-compatible API endpoint at https://generativelanguage.googleapis.com/v1beta/openai/, which accepts the standard OpenAI SDK format. This makes switching from GPT-5.4 a matter of changing the base URL and model ID — no SDK changes required.
The Google AI SDK supports Python, Node.js, Go, Java, and Kotlin. Streaming responses, function calling, JSON mode, and system instructions are all natively supported. For developers already on Google Cloud, Vertex AI adds IAM-based access control, VPC-SC support, and usage tied to existing GCP billing.
Genuine Downsides
Gemini 3.1 Pro is the benchmark leader and price winner, but it has real limitations worth knowing before committing.
- Agentic tool use is less reliable in long chains. In our testing, function call accuracy degraded after 6+ consecutive tool calls. For applications that require 10–20 tool calls per session (database queries, multi-step pipelines), Claude Opus 4.6's reliability advantage is real. Google is aware of this and has noted improvements coming, but it is a current gap.
- Deep Think latency is high. 30–90 seconds per request rules Deep Think out for anything user-facing. It is useful for batch processing and offline analysis, but if you need fast complex reasoning, it is not viable.
- Google account and ToS dependency. The free tier requires a Google account. Some enterprise environments restrict Google API access. In regulated industries, understanding Google's data retention policies for AI Studio vs Vertex AI is important — they differ significantly.
- Instruction following on highly specific formats. In our tests, Claude Opus 4.6 was more reliable when given complex, multi-constraint output format requirements (e.g., "Output exactly 5 JSON objects, each with fields X/Y/Z, no additional text"). Gemini 3.1 Pro occasionally added explanatory text or varied field ordering. Small difference but relevant for structured output pipelines.
- No native integration with non-Google developer tools. GitHub Actions, Linear, Notion, and other developer tools have official Claude and GPT integrations. Gemini 3.1 Pro requires custom integration work or third-party connectors.
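For structured-output pipelines, a strict post-check catches both failure modes noted above (extra explanatory text and varied or missing fields) before bad output reaches downstream code. A minimal sketch with hypothetical field names:

```python
import json

def check_strict_format(text: str, n: int, fields: set[str]) -> bool:
    """True only if text is exactly a JSON array of n objects,
    each with exactly the required fields and no surrounding prose."""
    try:
        data = json.loads(text)  # any added prose fails the parse
    except json.JSONDecodeError:
        return False
    return (isinstance(data, list) and len(data) == n
            and all(isinstance(o, dict) and set(o) == fields for o in data))
```

A failed check can trigger one automatic retry, which in our experience recovers most formatting slips regardless of model.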
Verdict
Gemini 3.1 Pro is the most compelling price-to-performance frontier model available in early 2026. On pure benchmarks — ARC-AGI-2, SWE-bench, MATH — it leads both Claude Opus 4.6 and GPT-5.4. On API pricing, it is dramatically cheaper. On context window, it is the largest of the three.
The limitations are real but narrow: tool use reliability in long agentic chains, instruction following on complex output formats, and Deep Think's latency. None of these are dealbreakers for most use cases — they matter primarily for teams building production agent pipelines that require deterministic, multi-step tool orchestration.
Use Gemini 3.1 Pro if: you are building API-heavy applications, processing large documents, need a free tier for development, or want the best benchmark performance at the lowest cost.
Stick with Claude Opus 4.6 if: you are building complex multi-turn agent pipelines where tool use reliability is paramount, or if you need the deepest integration with Claude Code's agentic ecosystem. See our Gemini CLI vs Claude Code comparison for how these models perform in a terminal coding context specifically.
For the best of both approaches, it is worth reading our Claude Opus vs GPT Codex comparison to understand the full landscape of coding-focused AI models.
FAQ
How much does Gemini 3.1 Pro API cost?
Gemini 3.1 Pro is priced at $2.00 per million input tokens and $12.00 per million output tokens. A free tier with rate limits is available through Google AI Studio. Compared to Claude Opus 4.6 ($15/$75) and GPT-5.4 ($2.50/$15), Gemini 3.1 Pro offers the best price among frontier models — particularly on output tokens.
What are Gemini 3.1 Pro's four thinking modes?
Flash (no thinking, ~1s latency), Lite (~500 thinking tokens, ~3s), Standard (~2,000 thinking tokens, ~8s), and Deep Think (~8,000–32,000 thinking tokens, 30–90s). You set the mode via the thinking_budget parameter in the API. Standard mode is the default and appropriate for most use cases. Deep Think adds significant latency but improves performance on hard reasoning tasks.
How does Gemini 3.1 Pro compare to Claude Opus 4.6?
Gemini 3.1 Pro leads on ARC-AGI-2 (77.1% vs 43.2%), SWE-bench (80.6% vs 77.1%), context window (2M vs 200K), and pricing ($2/$12 vs $15/$75). Claude Opus 4.6 leads on agentic tool use reliability and complex instruction following. Both are frontier models — the practical choice depends on your use case.
How does Gemini 3.1 Pro compare to GPT-5.4?
Gemini 3.1 Pro leads GPT-5.4 on ARC-AGI-2 (77.1% vs 65.3%), SWE-bench (80.6% vs 79.2%), context window (2M vs 512K), and pricing ($2/$12 vs $2.50/$15). GPT-5.4 has better native OpenAI ecosystem integration and tends to be more reliable on structured output tasks. For cost-performance ratio on API workloads, Gemini 3.1 Pro wins.
How do I access Gemini 3.1 Pro via API?
Get an API key from Google AI Studio and use the model ID gemini-3.1-pro with the Google AI SDK (Python, Node.js, Go, Kotlin supported). Google also provides an OpenAI-compatible endpoint so you can use the OpenAI SDK by changing only the base URL and model ID — no other code changes required.