Gemini 3.1 Pro Review: Google's Cheapest Flagship Model Tested
Released in February 2026 with a 2.5x jump on ARC-AGI-2 and the lowest API price of any frontier model. We tested it head-to-head against Claude Opus 4.6 and GPT-5.4 to find out where the advantage is real and where it overpromises.
TL;DR — Key Takeaways
- Gemini 3.1 Pro launched February 19, 2026 — 77.1% on ARC-AGI-2 (up from 31.1%, a 2.5x jump) and 80.6% on SWE-bench Verified
- API pricing is $2/$12 per million tokens — cheapest among frontier models; Claude Opus 4.6 is $15/$75 and GPT-5.4 is $2.50/$15
- Four thinking modes (Flash, Lite, Standard, Deep Think) let you tune latency vs. reasoning depth per request
- On pure benchmark numbers, Gemini 3.1 Pro leads both Claude Opus 4.6 and GPT-5.4 on coding and abstract reasoning
- Downsides: tool-use reliability in long agentic loops still trails Claude Opus 4.6, and Deep Think mode adds meaningful latency
- If you are building API-heavy applications, Gemini 3.1 Pro is the obvious cost-performance choice. For production agent workflows, Claude still has an edge in reliability
What Is Gemini 3.1 Pro?
Gemini 3.1 Pro is Google DeepMind's latest flagship model, released on February 19, 2026. It is the first model in the Gemini 3.x series and represents a significant step up from Gemini 2.5 Pro — most notably on abstract reasoning and coding tasks.
The headline number from Google's release: 77.1% on ARC-AGI-2, up from Gemini 2.5 Pro's 31.1%. ARC-AGI-2 (Abstraction and Reasoning Corpus, second generation) is considered one of the hardest AI reasoning benchmarks because it tests pattern generalization on problems that models could not have seen in training. A 46-percentage-point jump in one generation is unusual — and it reflects a fundamental architectural change in how the model approaches novel problems, not just more training data.
The second highlight is pricing. At $2.00 input / $12.00 output per million tokens, Gemini 3.1 Pro is the cheapest frontier model available. That gap matters when running high-volume applications.
The model supports up to a 2 million token context window (with billing tiers), 4 inference thinking modes, multimodal input (text, images, audio, video, documents), and native function calling with structured output.
How We Tested
We evaluated Gemini 3.1 Pro, Claude Opus 4.6, and GPT-5.4 across four task categories over three weeks in February–March 2026:
- Coding tasks — 25 real GitHub issues pulled from open-source projects (Next.js, FastAPI, Go stdlib). Scored by whether the generated fix resolves the issue without introducing regressions (manual review).
- Reasoning tasks — 40 multi-step logic problems, math proofs, and planning tasks. Scored for correct final answer and reasoning chain quality.
- Long-context tasks — Processing 500K+ token documents (legal contracts, research papers). Scored on extraction accuracy and summarization quality.
- Agentic tool use — 15 tool-calling workflows including database queries, API chains, and file operations. Scored on task completion rate and error recovery.
We used Standard thinking mode for Gemini 3.1 Pro in all tests unless specified, to match a realistic API usage pattern. Model versions: gemini-3.1-pro, claude-opus-4-6-20260219, gpt-5.4-turbo.
Third-party benchmark data: ARC-AGI-2 scores from ARC Prize leaderboard. SWE-bench from swebench.com. MMLU from Papers With Code.
Benchmark Results: ARC-AGI-2, SWE-bench, MMLU
The benchmark picture strongly favors Gemini 3.1 Pro on coding and abstract reasoning. The gap is less clear on language understanding.
| Benchmark | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.4 |
|---|---|---|---|
| ARC-AGI-2 | 77.1% | 43.2% | 65.3% |
| SWE-bench Verified | 80.6% | 77.1% | 79.2% |
| MMLU (5-shot) | 91.4% | 92.1% | 91.8% |
| MATH (competition) | 89.3% | 85.7% | 87.1% |
| HumanEval (coding) | 97.6% | 96.4% | 96.9% |
| Context window | 2M tokens | 200K tokens | 512K tokens |
The ARC-AGI-2 gap is the most striking number. Going from 31.1% to 77.1% in one model generation represents a qualitative shift in generalization capability. Claude Opus 4.6 at 43.2% is still strong — better than any model from early 2025 — but Gemini 3.1 Pro is in a different tier on this benchmark.
SWE-bench differences are tighter: a 3.5-point lead over Claude Opus 4.6 (80.6% vs 77.1%) and a 1.4-point lead over GPT-5.4. At the top of the benchmark range these differences are real but not decisive — all three models resolve around 4 in 5 real GitHub issues autonomously, which is extraordinary compared to where these tools were eighteen months ago.
MMLU (general language understanding) shows Claude Opus 4.6 with a slight edge, 92.1% vs 91.4%. Practically speaking, all three are within noise range on language tasks.
Four Thinking Modes Explained
One of Gemini 3.1 Pro's most practical features is its tiered thinking system. Unlike models where you toggle "thinking on/off," Gemini 3.1 Pro lets you set a continuous thinking_budget that maps to four named modes:
| Mode | Thinking Tokens | Latency | Best For |
|---|---|---|---|
| Flash | 0 | ~1s | Classification, extraction, simple Q&A |
| Lite | ~500 | ~3s | Summarization, translation, light analysis |
| Standard | ~2,000 | ~8s | Code generation, complex instructions, most tasks |
| Deep Think | ~8,000–32,000 | ~30–90s | Theorem proving, architecture design, hard coding tasks |
In practice, Standard mode covers roughly 85% of real workloads at a latency that feels responsive for interactive use. Flash mode is worth using in pipelines where you are processing thousands of items and the tasks are genuinely simple — the cost and speed improvement is significant. Deep Think mode produces noticeably better results on multi-step math and complex refactoring tasks, but the 30–90 second wait makes it unsuitable for anything interactive.
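The mode-to-budget mapping can be wrapped in a small helper so callers pick a named mode instead of a raw token count. This is a sketch based on the table above; the `thinking_budget` parameter name follows the article's own API example, and the budgets are the approximate midpoints quoted, not official constants:

```python
# Hypothetical helper: map the four named modes to thinking budgets.
# Budgets are the approximate values from the table above.
THINKING_BUDGETS = {
    "flash": 0,          # no thinking: classification, extraction
    "lite": 500,         # summarization, light analysis
    "standard": 2000,    # default: code generation, most tasks
    "deep_think": 8000,  # hard reasoning; can be raised toward ~32,000
}

def generation_config_for(mode: str) -> dict:
    """Build a generation_config dict for the requested thinking mode."""
    if mode not in THINKING_BUDGETS:
        raise ValueError(f"unknown thinking mode: {mode!r}")
    return {"thinking_budget": THINKING_BUDGETS[mode]}
```

A pipeline can then route simple items to `generation_config_for("flash")` and escalate only the hard cases to `"deep_think"`.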
One thing to know: thinking tokens are not billed separately; they are included in the standard output token pricing. A Deep Think response consuming 10,000 thinking tokens plus 500 output tokens is billed as 10,500 output tokens, about $0.126 at $12/million. A typical Standard call (roughly 2,500 billed tokens) costs about $0.03, so an expensive Deep Think call costs roughly $0.10 more. That is negligible for single queries, meaningful at scale.
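The billing rule is easy to sanity-check in code. A minimal sketch using the $12/million output price quoted above (the function name and default are ours, not from any SDK):

```python
def call_cost_usd(thinking_tokens: int, output_tokens: int,
                  output_price_per_million: float = 12.0) -> float:
    """Per-call cost: thinking tokens bill as output tokens."""
    billed = thinking_tokens + output_tokens
    return billed * output_price_per_million / 1_000_000

deep_think = call_cost_usd(10_000, 500)  # 10,500 billed tokens
standard = call_cost_usd(2_000, 500)     # 2,500 billed tokens
```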
Coding Performance in Practice
The SWE-bench lead over Claude Opus 4.6 held up in our real-world tests, but the difference showed up in a different way than we expected.
Gemini 3.1 Pro excelled on tasks requiring creative problem decomposition: finding non-obvious ways to refactor deeply coupled code, proposing architectural improvements rather than just patching, and generating cleaner abstractions. On our 25-issue test set, it resolved 20 correctly on the first attempt vs Claude Opus 4.6's 19.
Where Claude Opus 4.6 held its ground: consistency in agentic workflows. In our tool-use tests, Gemini 3.1 Pro's function calling was less reliable in long chains. Claude Opus 4.6 completed 13/15 multi-step tool workflows on the first run. Gemini 3.1 Pro completed 11/15, with two failures caused by malformed function call arguments after 6+ tool calls in sequence. This is a real limitation for production agents that need deterministic behavior.
For most developers building applications rather than complex agent pipelines, this distinction will not matter much. For teams building systems that depend on reliable multi-turn tool use, it is worth testing with your specific workflow before switching.
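One practical mitigation for the malformed-argument failures described above is to validate tool-call arguments before executing the tool and re-prompt on failure. A minimal sketch; the schema check and key names are illustrative, not part of any SDK:

```python
import json

def validate_tool_args(raw_args: str, required_keys: set[str]):
    """Parse a model's function-call arguments defensively.

    Returns the parsed dict if it is well-formed JSON containing all
    required keys, else None so the caller can re-prompt the model.
    Cheap insurance in long chains, where we saw malformed arguments
    after 6+ sequential tool calls.
    """
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError:
        return None
    if not isinstance(args, dict) or not required_keys <= args.keys():
        return None
    return args
```

In an agent loop, a `None` result would trigger one retry with an error message appended to the conversation rather than executing a broken call.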
On long-context coding tasks — summarizing a 200K-token codebase, finding bugs across a large repo — Gemini 3.1 Pro's 2M token context window is a meaningful advantage over Claude's 200K. We fed it a ~300K token repository and asked it to identify all locations where a deprecated API was called. It found all 23 instances in one pass. Claude required an agentic search loop that took 3x longer.
Pricing: The Real Advantage
The pricing comparison is the clearest argument for Gemini 3.1 Pro if you are building API-heavy applications.
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Context Window | Free Tier |
|---|---|---|---|---|
| Gemini 3.1 Pro | $2.00 | $12.00 | 2M tokens | Yes |
| Claude Opus 4.6 | $15.00 | $75.00 | 200K tokens | No |
| GPT-5.4 | $2.50 | $15.00 | 512K tokens | No |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 200K tokens | No |
At $2/$12, Gemini 3.1 Pro input is 7.5x cheaper than Claude Opus 4.6 and output is 6.25x cheaper. Even compared to the mid-tier Claude Sonnet 4.6, the output cost difference is meaningful at scale.
For a concrete example: an application making 100,000 API calls per day with an average of 500 input tokens and 1,000 output tokens per call would cost approximately $1,300 per day with Gemini 3.1 Pro versus $8,250 per day with Claude Opus 4.6. Over a 30-day month, that is roughly $39,000 versus $247,500, a gap of more than $200,000. For startups or high-volume applications, that is a significant operational difference.
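The arithmetic behind that example can be reproduced in a few lines (prices from the pricing table above; the call volume and token averages are the example's assumptions):

```python
def daily_cost_usd(calls_per_day: int, in_tokens: int, out_tokens: int,
                   in_price: float, out_price: float) -> float:
    """Daily API spend given per-call token averages and $/1M-token prices."""
    total_in = calls_per_day * in_tokens
    total_out = calls_per_day * out_tokens
    return (total_in * in_price + total_out * out_price) / 1_000_000

gemini = daily_cost_usd(100_000, 500, 1_000, 2.00, 12.00)
claude = daily_cost_usd(100_000, 500, 1_000, 15.00, 75.00)
```

At this volume the output tokens dominate the bill, which is why the output-price gap matters more than the input-price gap.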
The free tier via Google AI Studio is also worth noting for development and low-volume use. Rate limits apply, but for building prototypes or running occasional analyses, it costs nothing.
Gemini 3.1 Pro vs Claude Opus 4.6 vs GPT-5.4
Here is where each model leads and where it does not:
| Category | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.4 |
|---|---|---|---|
| API pricing | Winner ($2/$12) | Expensive ($15/$75) | Moderate ($2.50/$15) |
| ARC-AGI-2 | Winner (77.1%) | 43.2% | 65.3% |
| SWE-bench coding | Winner (80.6%) | 77.1% | 79.2% |
| Context window | Winner (2M) | 200K | 512K |
| Agentic tool use reliability | Good | Winner | Good |
| Instruction following | Good | Winner | Good |
| OpenAI ecosystem fit | Moderate (compat layer) | Moderate | Winner (native) |
| Free tier | Yes | No | No |
| Multimodal input | Text/image/audio/video | Text/image | Text/image/audio |
API Access and Integration
Gemini 3.1 Pro is accessible through two primary paths: Google AI Studio (for development and lower-volume use) and Vertex AI (for production with enterprise SLAs).
```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3.1-pro")

# Standard mode (default)
response = model.generate_content("Explain this codebase...")

# Deep Think mode
response = model.generate_content(
    "Solve this algorithmic problem...",
    generation_config={"thinking_budget": 8000},
)
```

Google also provides an OpenAI-compatible API endpoint at https://generativelanguage.googleapis.com/v1beta/openai/, which accepts the standard OpenAI SDK format. This makes switching from GPT-5.4 a matter of changing the base URL and model ID — no SDK changes required.
The Google AI SDK supports Python, Node.js, Go, Java, and Kotlin. Streaming responses, function calling, JSON mode, and system instructions are all natively supported. For developers already on Google Cloud, Vertex AI adds IAM-based access control, VPC-SC support, and usage tied to existing GCP billing.
Genuine Downsides
Gemini 3.1 Pro is the benchmark leader and price winner, but it has real limitations worth knowing before committing.
- Agentic tool use is less reliable in long chains. In our testing, function call accuracy degraded after 6+ consecutive tool calls. For applications that require 10–20 tool calls per session (database queries, multi-step pipelines), Claude Opus 4.6's reliability advantage is real. Google is aware of this and has noted improvements coming, but it is a current gap.
- Deep Think latency is high. 30–90 seconds per request rules Deep Think out for anything user-facing. It is useful for batch processing and offline analysis, but if you need fast complex reasoning, it is not viable.
- Google account and ToS dependency. The free tier requires a Google account. Some enterprise environments restrict Google API access. In regulated industries, understanding Google's data retention policies for AI Studio vs Vertex AI is important — they differ significantly.
- Instruction following on highly specific formats. In our tests, Claude Opus 4.6 was more reliable when given complex, multi-constraint output format requirements (e.g., "Output exactly 5 JSON objects, each with fields X/Y/Z, no additional text"). Gemini 3.1 Pro occasionally added explanatory text or varied field ordering. Small difference but relevant for structured output pipelines.
- No native integration with non-Google developer tools. GitHub Actions, Linear, Notion, and other developer tools have official Claude and GPT integrations. Gemini 3.1 Pro requires custom integration work or third-party connectors.
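For structured-output pipelines, a strict post-check catches both failure modes noted above (extra explanatory text and varied or missing fields) before bad output reaches downstream code. A minimal sketch with hypothetical field names:

```python
import json

def check_strict_format(text: str, n: int, fields: set[str]) -> bool:
    """True only if text is exactly a JSON array of n objects,
    each with exactly the required fields and no surrounding prose."""
    try:
        data = json.loads(text)  # any added prose fails the parse
    except json.JSONDecodeError:
        return False
    return (isinstance(data, list) and len(data) == n
            and all(isinstance(o, dict) and set(o) == fields for o in data))
```

A failed check can trigger one automatic retry, which in our experience recovers most formatting slips regardless of model.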
Verdict
Gemini 3.1 Pro is the most compelling price-to-performance frontier model available in early 2026. On pure benchmarks — ARC-AGI-2, SWE-bench, MATH — it leads both Claude Opus 4.6 and GPT-5.4. On API pricing, it is dramatically cheaper. On context window, it is the largest of the three.
The limitations are real but narrow: tool use reliability in long agentic chains, instruction following on complex output formats, and Deep Think's latency. None of these are dealbreakers for most use cases — they matter primarily for teams building production agent pipelines that require deterministic, multi-step tool orchestration.
Use Gemini 3.1 Pro if: you are building API-heavy applications, processing large documents, need a free tier for development, or want the best benchmark performance at the lowest cost.
Stick with Claude Opus 4.6 if: you are building complex multi-turn agent pipelines where tool use reliability is paramount, or if you need the deepest integration with Claude Code's agentic ecosystem. See our Gemini CLI vs Claude Code comparison for how these models perform in a terminal coding context specifically.
For the best of both approaches, it is worth reading our Claude Opus vs GPT Codex comparison to understand the full landscape of coding-focused AI models.
FAQ
How much does Gemini 3.1 Pro API cost?
Gemini 3.1 Pro is priced at $2.00 per million input tokens and $12.00 per million output tokens. A free tier with rate limits is available through Google AI Studio. Compared to Claude Opus 4.6 ($15/$75) and GPT-5.4 ($2.50/$15), Gemini 3.1 Pro offers the best price among frontier models — particularly on output tokens.
What are Gemini 3.1 Pro's four thinking modes?
Flash (no thinking, ~1s latency), Lite (~500 thinking tokens, ~3s), Standard (~2,000 thinking tokens, ~8s), and Deep Think (~8,000–32,000 thinking tokens, 30–90s). You set the mode via the thinking_budget parameter in the API. Standard mode is the default and appropriate for most use cases. Deep Think adds significant latency but improves performance on hard reasoning tasks.
How does Gemini 3.1 Pro compare to Claude Opus 4.6?
Gemini 3.1 Pro leads on ARC-AGI-2 (77.1% vs 43.2%), SWE-bench (80.6% vs 77.1%), context window (2M vs 200K), and pricing ($2/$12 vs $15/$75). Claude Opus 4.6 leads on agentic tool use reliability and complex instruction following. Both are frontier models — the practical choice depends on your use case.
How does Gemini 3.1 Pro compare to GPT-5.4?
Gemini 3.1 Pro leads GPT-5.4 on ARC-AGI-2 (77.1% vs 65.3%), SWE-bench (80.6% vs 79.2%), context window (2M vs 512K), and pricing ($2/$12 vs $2.50/$15). GPT-5.4 has better native OpenAI ecosystem integration and tends to be more reliable on structured output tasks. For cost-performance ratio on API workloads, Gemini 3.1 Pro wins.
How do I access Gemini 3.1 Pro via API?
Get an API key from Google AI Studio and use the model ID gemini-3.1-pro with the Google AI SDK (Python, Node.js, Go, Kotlin supported). Google also provides an OpenAI-compatible endpoint so you can use the OpenAI SDK by changing only the base URL and model ID — no other code changes required.