
Google Gemma 4 Review — The Open-Source AI Model I Actually Switched To

Google dropped Gemma 4 in early April with an Apache 2.0 license, multimodal input across text, images, video, and audio, plus agentic function calling. I pulled the 31B Dense weights onto my dev machine and ran it against my own Next.js projects for about a week. This is what happened.

TL;DR — Key Takeaways:

  • Gemma 4 ships in four variants: E2B (2.3B), E4B (4.5B), 26B MoE (3.8B active), 31B Dense — all Apache 2.0 with no MAU limits. The 31B Dense ranks roughly #3 on Arena AI, beating models 20x its size
  • Multimodal by default: text + image + video + audio understanding, 140+ languages. I fed it screenshots of UI bugs and it identified layout issues without any prompting tricks
  • Agentic function calling is baked in. The 31B can chain 2-3 tool calls reliably before accuracy drops off. Not Claude-level, but usable for real automation
  • Runs locally via Ollama with one command. The 26B MoE is the sweet spot — only 3.8B params active per query, so it flies on a 16GB GPU. The 31B Dense needs 24GB+ or gets slow
  • Honest downside: it is not GPT-5 or Claude Opus. On complex multi-file refactoring, it loses context and hallucinates function signatures. Fine-tuning via LoRA (Unsloth day-0 support) helps, but takes effort

What Is Google Gemma 4 and Why Should You Care?

Gemma 4 is Google DeepMind's latest open-weight AI model family, released around April 2, 2026. It builds on the Gemma 3 line, but the jump is not incremental. Google rewrote the architecture to support native multimodal input (text, images, video, and audio), added first-class function calling for agentic workflows, expanded language coverage to 140+, and — this is the part that got my attention — kept the entire thing under Apache 2.0 with no MAU limits and no acceptable use policy restrictions.

That license matters. Apache 2.0 means you can download the weights, fine-tune them on your proprietary data, deploy them commercially, and never pay Google a cent. Meta's Llama 4 has a community license with some restrictions. Gemma 4 has none.

The model comes in four variants: E2B (2.3B parameters, edge/mobile), E4B (4.5B, lightweight local), 26B MoE (26B total but only 3.8B active per query via mixture-of-experts — the local deployment sweet spot), and 31B Dense (the flagship). The 31B Dense is ranked roughly #3 on Arena AI and scores about 85% on MMLU Pro and 89% on AIME 2026. I spent most of my testing on the 31B Dense, with experiments on the 26B MoE to see how it handles lighter hardware.

Google positions Gemma 4 as “enterprise-grade open source” — meaning it is meant for production deployments, not just research experiments. The 31B beating models 20x its size on Arena AI suggests this is not just marketing. I run five websites on various Next.js stacks, so I had plenty of real code to throw at it.

How We Tested This

I ran Gemma 4 from April 2 through April 8, 2026 across two setups:

  • Local machine: RTX 4090 (24GB VRAM), running the 31B Dense and 26B MoE via Ollama on Ubuntu. Also tested the 31B Dense quantized (Q5_K_M) to see quality trade-offs
  • Cloud API: Google AI Studio free tier and Vertex AI for comparison. Used the same prompts to benchmark local vs cloud performance
  • Test projects: A Next.js 15 SaaS app with Prisma and Docker (~8K lines), a Next.js 16 site with Supabase (~5K lines), and a Python data pipeline (~2K lines)
  • Tasks tested: Code review, bug diagnosis from screenshots, function calling chains (3-5 steps), code generation, refactoring across multiple files, and translation (English to Chinese)
  • Comparison models: Llama 4 Scout 109B (via Together AI), Claude 3.5 Sonnet, GPT-5, Qwen 3.5 27B, and Gemma 3 27B (as baseline)

Total API cost for the cloud portion: roughly $8 over the week. Local runs cost nothing beyond the electricity to keep my GPU warm.

Which Gemma 4 Variant Do You Actually Need?

This is the first decision you need to make with Gemma 4, and it shapes everything else. I wasted about half a day trying to use the E2B for development work before accepting that it is fundamentally a different tool than the 31B Dense. They share a name, but the gap in capability is enormous.

Gemma 4 E2B (2.3B) and E4B (4.5B) — Mobile, Edge, and Light Tasks

The E2B is fast. Impressively fast. It runs on my phone through MediaPipe and responds in under a second. But it cannot do serious code generation. I asked it to write a React component with state management and it produced something that looked right but had subtle bugs in the useEffect cleanup. The E4B is a step up — it handles simple single-file tasks and basic Q&A better — but still not something I would use for real development work. Both have 128K token context windows.

Gemma 4 26B MoE — The Local Deployment Sweet Spot

This is where things get interesting. The 26B MoE has 26 billion total parameters, but the mixture-of-experts architecture means only about 3.8 billion activate per query. The result: it fits on a 16GB GPU and responds nearly as fast as the smaller models — roughly 1.5-2 seconds per query on my M2 Pro MacBook — while delivering quality closer to the 31B Dense than you would expect.

It scores about 83% on MMLU Pro and ranks roughly #6 on Arena AI. For practical coding work, it wrote correct Prisma schema migrations, generated working API route handlers, and caught about 70% of the bugs I deliberately introduced in test code. Where it struggles is anything requiring holding many files in context simultaneously. The 256K context window helps, but quality degrades on complex multi-file tasks. Honestly, this is better than I expected — Gemma 3 at the same price point could not even attempt this.

Gemma 4 31B Dense — The Flagship

The 31B Dense is the reason this review exists. It is the first open-weight model I have used where I genuinely considered switching some of my daily Claude API calls to a local model. Not all of them — but for routine code review and generation tasks, the 31B was close enough that the savings mattered. It scores roughly 85% on MMLU Pro, 89% on AIME 2026, and sits at #3 on Arena AI. Those numbers put it in the same conversation as models with hundreds of billions of parameters.

On my RTX 4090, it runs at about 25-30 tokens per second with full-precision weights. The quantized Q5 version bumps that to roughly 40 tokens/second with a barely noticeable quality drop. It has a 256K token context window. For context: Claude through the API feels near-instant because of their infrastructure, while Gemma 4 31B locally has a noticeable 3-5 second delay on longer responses. Tolerable, not instant.

How Good Is the Multimodal Understanding?

Gemma 4's multimodal capability was the feature I was most skeptical about. Open-source models have historically been text-only or shipped with bolted-on vision that felt like an afterthought. Gemma 4 is different: it handles text, images, video, and audio natively through a single integrated architecture.

I tested this in three ways that map to my actual daily workflow:

Screenshot-Based Bug Reports

I took screenshots of CSS layout issues from one of my sites — a sidebar that overflowed on mobile, a modal that did not center properly, and a table that broke on narrow screens. I fed each screenshot to Gemma 4 31B with the prompt “What is wrong with this layout and how would you fix it in Tailwind CSS?”

It correctly identified the overflow issue and suggested overflow-x-auto on the container. For the modal centering, it gave a correct fix using fixed inset-0 flex items-center justify-center. The table fix was partially right — it suggested horizontal scrolling but missed that the header was also misaligned. Two out of three fully correct is not bad for a local model.
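If you want to reproduce this screenshot workflow against a local Ollama server, the request is just a chat message with a base64-encoded image attached. A minimal Python sketch, assuming Ollama's documented `/api/chat` endpoint and a hypothetical `gemma4:31b` model tag:

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

def build_screenshot_review(image_path: str, model: str = "gemma4:31b") -> dict:
    """Build an Ollama chat payload that attaches a screenshot as base64."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "stream": False,
        "messages": [{
            "role": "user",
            "content": "What is wrong with this layout and "
                       "how would you fix it in Tailwind CSS?",
            # Ollama's native API accepts base64 images alongside the text prompt
            "images": [image_b64],
        }],
    }

def send(payload: dict) -> dict:
    """POST the payload to a locally running Ollama server."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Swap in your own screenshot path; the `images` field is how Ollama attaches pictures to a message.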

Diagram Understanding

I fed it architecture diagrams (hand-drawn on a whiteboard, photographed with my phone) and asked it to describe the system. It got the general structure right — identified boxes as services, arrows as data flow — but occasionally confused the direction of arrows and mislabeled one service. Claude handles this kind of task more accurately, to be fair.

Video Frame Analysis

Gemma 4 can process video by extracting key frames. I tested it with a short screen recording of a user flow on my site (about 15 seconds, ~8 frames). It described the navigation steps correctly and identified where the user paused, suggesting that might be a UX friction point. This was more of a demo than a practical workflow for me, but the capability is real.

For a broader look at how Google's AI models have evolved in their approach to multimodal, our Gemini 3.1 Pro review covers the proprietary side of Google's vision capabilities.

Does the Agentic Function Calling Actually Work?

Function calling is the feature that separates a chatbot from an agent. Gemma 4 ships with native support for it — you define tools as JSON schemas, include them in your system prompt, and the model generates structured function calls instead of plain text when appropriate.

I set up a test rig with five tools: a file reader, a file writer, a shell command executor, a web search stub, and a database query tool. Gemma 4 supports structured JSON output natively, so defining tool schemas was straightforward. Then I gave it tasks that required chaining these tools in sequence.
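For reference, this is roughly how the rig was wired. A trimmed sketch; the tool names, the JSON call format, and the `dispatch` helper are my conventions for this test, not anything Gemma 4 mandates:

```python
import json

# Tool schemas included in the system prompt so the model knows what it can call
TOOLS = [
    {
        "name": "read_file",
        "description": "Read a text file and return its contents",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    {
        "name": "write_file",
        "description": "Overwrite a text file with new contents",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"},
                           "contents": {"type": "string"}},
            "required": ["path", "contents"],
        },
    },
]

SYSTEM_PROMPT = (
    "You can call these tools by replying with JSON of the form "
    '{"tool": "<name>", "arguments": {...}}.\n'
    f"Available tools: {json.dumps(TOOLS)}"
)

# Host-side implementations the dispatcher maps tool names onto
def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

def write_file(path: str, contents: str) -> str:
    with open(path, "w") as f:
        f.write(contents)
    return f"wrote {len(contents)} chars"

REGISTRY = {"read_file": read_file, "write_file": write_file}

def dispatch(raw_reply: str):
    """Parse a model reply; execute it if it is a well-formed tool call."""
    try:
        call = json.loads(raw_reply)
        fn = REGISTRY[call["tool"]]
        return fn(**call["arguments"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # malformed tool call: surface it instead of crashing
```

The dispatcher returning `None` on anything malformed is what let me count failures like the off-by-one-directory file paths instead of crashing the run.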

Simple Chains (2-3 calls): Reliable

Tasks like “read this file, find the bug, and write a fix” worked reliably on the 31B Dense. I ran roughly 20 of these and it completed about 17 correctly. The three failures were cases where the model called the right tool but with slightly wrong parameters — a file path off by one directory level, that sort of thing.

Complex Chains (4-5 calls): Hit or Miss

Longer chains like “search for the package version, update the dependency file, run the test suite, read the error output, and fix the failing test” worked maybe 60% of the time. The model tended to lose track of earlier context by step 4, sometimes re-reading a file it had already read or skipping a step entirely. Claude and GPT-5 handle these longer chains with roughly 85-90% reliability in my experience, so there is a meaningful gap.

The 26B MoE was surprisingly decent for 2-step chains but not reliable beyond that. Despite its efficiency, the mixture-of-experts routing sometimes produced malformed tool-call JSON by step 3. The E2B and E4B are simply too small for agentic use.
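Given those failure modes (lost context, repeated calls, malformed JSON), it pays to put guardrails in the driver loop rather than trusting the model. A sketch of the loop I ended up using, assuming the model replies with one JSON tool call per turn and signals completion with a `done` tool, both of which are my conventions:

```python
import json

def run_chain(ask_model, execute, max_steps: int = 5):
    """Drive a tool-call chain with a step cap and a duplicate-call guard.

    ask_model(history) -> raw model reply (str)
    execute(call)      -> tool result (str)
    """
    history, seen = [], set()
    for _ in range(max_steps):
        reply = ask_model(history)
        try:
            call = json.loads(reply)
        except json.JSONDecodeError:
            return {"status": "malformed", "history": history}
        if call.get("tool") == "done":
            return {"status": "done", "history": history}
        key = json.dumps(call, sort_keys=True)
        if key in seen:  # the model is looping, e.g. re-reading the same file
            return {"status": "loop", "history": history}
        seen.add(key)
        history.append({"call": call, "result": execute(call)})
    return {"status": "step_limit", "history": history}
```

The duplicate-call guard directly catches the "re-reading a file it had already read" failure, and the step cap keeps a confused model from burning tokens indefinitely.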

If you are building tools that use AI models as the reasoning engine, our AI coding tools guide compares the function calling capabilities across the major players.

How Do You Run Gemma 4 Locally?

Four options, depending on what you want:

1. Ollama (Easiest Local Setup)

Install Ollama, then run ollama pull gemma4:31b and ollama run gemma4:31b. That is it. The model downloads (roughly 18GB for the quantized version) and you are chatting in your terminal. Ollama handles memory management, quantization options, and exposes an OpenAI-compatible API on localhost. This is how I ran most of my tests.

For the 26B MoE: ollama pull gemma4:26b-moe. Since only ~3.8B params activate per query, it is surprisingly fast even on mid-range hardware.
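Because Ollama exposes an OpenAI-compatible API on localhost, pointing existing client code at the local model is mostly a base-URL swap. A stdlib-only sketch; the `gemma4:26b-moe` tag mirrors the pull command above and is an assumption:

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible API

def build_request(messages, model="gemma4:26b-moe", temperature=0.2):
    """Assemble an OpenAI-style chat completion payload."""
    return {"model": model, "messages": messages, "temperature": temperature}

def chat(messages, **kwargs) -> str:
    """Send a chat request to the local Ollama server, return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_request(messages, **kwargs)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Any OpenAI-SDK-based tooling can do the same by overriding its base URL, which is how I reused my existing prompt scripts against the local model.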

2. Google AI Studio (Free Tier, Cloud)

Google AI Studio gives you free access to Gemma 4 through a web interface and API. The free tier has rate limits (roughly 15 requests per minute, 1,500 per day), but for experimentation and light usage it is sufficient. I used this to compare cloud-served Gemma 4 against my local runs — latency was about 3-4x better on Google's infrastructure, as you would expect.

3. Vertex AI (Production, Pay-Per-Use)

If you need an SLA, autoscaling, and enterprise features, Vertex AI hosts Gemma 4 with standard Google Cloud pricing. Roughly $0.15 per million input tokens for the 26B MoE and $0.60 per million for the 31B Dense. Cheaper than GPT-5 API pricing by a wide margin, and you get Google's infrastructure reliability.

4. Hugging Face Transformers (Maximum Control)

The model weights are on Hugging Face (google/gemma-4-31b-it). You can load them directly with the Transformers library, use vLLM for optimized serving, or integrate into any Python pipeline. This is the route for researchers and teams that need full control over inference parameters, batching, and fine-tuning.

How Does Gemma 4 Compare to Llama 4, Claude, and GPT-5?

Numbers from published benchmarks are useful, but they rarely match real-world performance. This table combines benchmark data with what I actually experienced running these models on my own projects. Take the benchmark columns as reference points, not gospel.

| Dimension | Gemma 4 31B Dense | Llama 4 Scout | Claude 3.5 Sonnet | GPT-5 |
| --- | --- | --- | --- | --- |
| Parameters | 31B Dense | 109B (MoE, ~17B active) | Undisclosed | Undisclosed |
| License | Apache 2.0 | Llama Community License | Proprietary | Proprietary |
| Price (API per 1M tokens) | ~$0.60 (Vertex) / Free locally | ~$0.80 (Together AI) / Free locally | ~$3.00 input / $15.00 output | ~$5.00 input / $15.00 output |
| Multimodal | Text + Image + Video + Audio | Text + Image | Text + Image | Text + Image + Audio |
| Context Window | 256K tokens | 10M tokens (claimed) | 200K tokens | 1M tokens |
| Function Calling | Native (2-3 steps reliable) | Native (similar reliability) | Native (4-5 steps reliable) | Native (5+ steps reliable) |
| Code Quality (my testing) | Good — single file tasks, routine refactoring | Good — similar to Gemma 4 on code | Excellent — multi-file, architectural | Excellent — strongest overall |
| Multilingual | 140+ languages (strongest CJK) | Good | Strong | Strong |
| Runs Fully Local | Yes (24GB VRAM for 31B, 16GB for 26B MoE) | Yes (needs ~48GB+ for Scout) | No | No |
| Fine-Tunable | Yes (open weights) | Yes (open weights) | No | Fine-tuning API only |

The pattern that emerges: Gemma 4 31B Dense is the most capable open-source dense model you can run on a single consumer GPU. Llama 4 Scout has more total parameters but uses MoE routing. Gemma 4 also has a 26B MoE variant that is even more hardware-friendly. Proprietary models still lead on complex tasks, but the gap has narrowed — the 31B Dense sitting at #3 on Arena AI is not a fluke.

One area where Gemma 4 surprised me was multilingual performance. With support for 140+ languages, I tested English-to-Chinese translation extensively for my sites, and the 31B Dense produced translations that were noticeably more natural than Llama 4 Scout. Google's training data for CJK languages seems more comprehensive — which makes sense given their search engine covers these markets deeply. It also edges out Qwen 3.5 27B on math benchmarks (roughly 89% vs 87% on AIME), though Qwen is marginally better on MMLU Pro (about 86% vs 85%).

For a complete breakdown of how these models perform across different coding tasks, see our AI model comparison guide which we update monthly.

What I Did Not Expect

Three things caught me off guard during the week I spent with Gemma 4. One was good, one was weird, and one was disappointing.

The Good: Structured Output Quality

I did not expect the JSON output to be this consistent. When I asked Gemma 4 31B to generate structured data — API response schemas, database migration configs, package.json updates — it produced valid JSON on roughly 95% of attempts. Claude is slightly better at this (~98%), but the gap is much smaller than I assumed it would be. For tasks like “read this TypeScript interface and generate a Zod schema that matches it,” Gemma 4 was correct every single time I tried it. That specific capability made me start using it for schema generation work instead of Claude.
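Even at ~95% validity, you still want to wrap generation in a validate-and-retry loop to absorb the malformed replies. A sketch, assuming a caller-supplied `ask_model` function; the fence-stripping handles the common case where a model wraps its JSON in a markdown code block:

```python
import json
import re

FENCE = "`" * 3  # markdown code fence, assembled to keep this listing fence-safe

def extract_json(reply: str):
    """Parse model output as JSON, tolerating a markdown code fence around it."""
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, reply, re.DOTALL)
    candidate = fenced.group(1) if fenced else reply
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None

def generate_json(ask_model, prompt: str, retries: int = 2):
    """Ask for structured output; re-ask on invalid JSON up to `retries` times."""
    for _ in range(retries + 1):
        parsed = extract_json(ask_model(prompt))
        if parsed is not None:
            return parsed
        prompt = "Your last reply was not valid JSON. " + prompt
    raise ValueError("model never produced valid JSON")
```

With a 95% base rate, two retries push the effective failure rate well under 1%, which was enough for my schema-generation scripts.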

The Weird: Inconsistent Persona Stability

When I set a system prompt telling Gemma 4 to act as a senior code reviewer, it would maintain that role for about 5-6 turns and then gradually drift into more casual, less precise responses. By turn 10 it was basically a different model — more verbose, less critical, and occasionally agreeing with code it should have flagged. Re-sending the system prompt mid-conversation fixed it temporarily. I have not seen this behavior as strongly in Claude or GPT-5.
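The workaround generalizes: instead of re-sending the system prompt by hand, re-inject it into the message history on a fixed cadence. A sketch using OpenAI-style role/content messages; the five-turn cadence matches where I saw drift begin, not any documented guidance:

```python
def with_system_refresh(messages, system_prompt: str, every: int = 5):
    """Return a copy of the history with the system prompt re-inserted
    every `every` user turns, to counter persona drift."""
    out = [{"role": "system", "content": system_prompt}]
    user_turns = 0
    for msg in messages:
        if msg["role"] == "user":
            user_turns += 1
            # Re-anchor the persona just before user turns 6, 11, 16, ...
            if user_turns > 1 and (user_turns - 1) % every == 0:
                out.append({"role": "system", "content": system_prompt})
        out.append(msg)
    return out
```

Call it on the history right before each request; the model sees a freshly re-anchored persona without you having to remember to re-paste anything.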

The Disappointing: Context Window Usage

Gemma 4 31B Dense claims a 256K token context window, and technically that is true — you can send that many tokens. But I found quality degradation starting around 60-80K tokens. When I loaded a full Next.js project into context (about 100K tokens worth of code files), the model started confusing function names from different files, referencing variables that existed in one file but not the file it was supposed to be editing. At 50K tokens, everything was fine. The claimed 256K is a theoretical maximum, not a practical one for detailed code tasks.

This is not unique to Gemma 4. Llama 4 claims a 10 million token context but shows similar quality degradation at scale. The honest usable context for detailed code work is probably 40-60K tokens across all these open-source models right now.
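Until that improves, the practical move is to budget context yourself rather than filling the window. A sketch of the greedy file picker I used, with the usual chars/4 token approximation (a rule of thumb, not Gemma 4's actual tokenizer):

```python
def fit_context(files: dict[str, str], budget_tokens: int = 50_000) -> dict[str, str]:
    """Greedily keep whole files until a rough token budget is exhausted.

    Token count is approximated as len(text) / 4; the 50K default reflects
    where output quality held up in my testing, well under the claimed 256K.
    """
    kept, used = {}, 0
    # Smallest files first, so more files make it into context
    for path, text in sorted(files.items(), key=lambda kv: len(kv[1])):
        cost = len(text) // 4 + 1
        if used + cost > budget_tokens:
            continue  # skip files that would blow the budget
        kept[path] = text
        used += cost
    return kept
```

For multi-file tasks I paired this with a one-line summary of each omitted file, so the model at least knew what it was not seeing.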

What Are the Real Downsides?

I try to be direct about limitations because the official announcements and early reviews tend to cherry-pick benchmarks where the model wins.

  • It is not GPT-5 or Claude Opus. On complex tasks — multi-file refactoring, nuanced architectural decisions, interpreting ambiguous requirements — there is still a clear gap. The 31B Dense is ranked #3 on Arena AI, which is remarkable for its size, but frontier proprietary models still lead on the hardest tasks. If you switch expecting full parity, you will be frustrated.
  • Fine-tuning is practically necessary for specialized domains. Out of the box, Gemma 4 is a generalist. If your codebase uses unusual frameworks, internal DSLs, or domain-specific patterns, you will need to fine-tune. The good news: Unsloth has day-0 LoRA support for SFT, vision, audio, and RL fine-tuning. LoRA on the 26B MoE is accessible (a few hours on a single GPU), but fine-tuning the 31B Dense requires more serious hardware or cloud compute.
  • Hardware requirements are real. The 31B Dense needs 24GB VRAM for full precision. The 26B MoE is more forgiving (16GB works), but if you do not have a recent mid-to-high-end GPU, you are limited to E4B or cloud APIs.
  • The ecosystem is immature compared to OpenAI's and Anthropic's. Tooling around Gemma 4 — monitoring, eval frameworks, production deployment guides — is still catching up. Documentation exists but is scattered across Google AI, Hugging Face, and community repos. When something went wrong during my testing, I often had to read source code rather than documentation to figure out why.
  • Safety guardrails can be overly aggressive. Gemma 4 refused several completely legitimate code-related prompts during my testing — things like “generate a password hashing function” and “write a script to test SQL injection vulnerabilities on my own app.” The safety layer seems tuned for general-purpose chat, not developer workflows. This is configurable if you are running locally, but it is annoying out of the box.

None of these are surprising for an open-source model. The question is whether the trade-offs are worth it for your situation. If your primary need is the strongest possible AI output and cost is secondary, stick with Claude or GPT-5. If you need local execution, Apache 2.0 licensing, fine-tuning capability, or are simply tired of paying $20-200/month for API access, Gemma 4 is the strongest option available right now.

Who Should Use Gemma 4?

After a week of daily use, my recommendations are more nuanced than “it is great for everyone” or “just use Claude.” It depends on what you value and what hardware you have.

Strong fit:

  • Indie developers and small teams watching costs. If you are spending $50-200/month on AI API calls, Gemma 4 locally can replace a significant chunk of that. I estimated I could move about 40% of my daily AI usage to local Gemma 4 and save roughly $30-40/month.
  • Companies with data residency requirements. Financial services, healthcare, government contractors — anyone who cannot send code to third-party APIs. Gemma 4 on your own infrastructure gives you full control over where data goes.
  • Teams building AI-powered products. If you are embedding an LLM into your own product, Gemma 4's Apache 2.0 license and fine-tunability are significant advantages. You can customize it for your domain, deploy it on your infrastructure, and not worry about API rate limits or pricing changes.
  • Developers who need strong multilingual support. If you work across English, Chinese, Japanese, Korean, or other languages Google covers well, Gemma 4 is currently the strongest open-source option for multilingual tasks.

Not a good fit:

  • Developers who need the absolute strongest coding AI. If your work involves complex architectural reasoning, large codebase refactoring, or you are used to Claude Opus / GPT-5 quality, Gemma 4 will feel like a step down on the hardest tasks. It is good, not frontier-class.
  • People without GPU hardware. Yes, you can use the cloud API, but the cost advantage disappears. And running the 31B Dense on CPU is impractical — response times measured in minutes, not seconds.

My personal setup after this review: I kept Claude as my primary for complex tasks and architectural decisions. I added Gemma 4 31B Dense (local via Ollama) for code review, schema generation, translation, and routine bug fixes. For lighter tasks during the day when I want speed over accuracy, the 26B MoE handles it. For those routine tasks, the quality is close enough that paying API fees does not make sense anymore.

For a broader view of open-source AI tools and how they fit into a developer workflow, check our OpenCode review — another open-source tool that pairs well with local models like Gemma 4.

FAQ

Is Google Gemma 4 free to use?

Yes. The model weights are Apache 2.0 — download, modify, fine-tune, and deploy commercially at no cost. Google AI Studio offers a free API tier with rate limits (~15 requests/min). Vertex AI charges roughly $0.15-0.60 per million tokens depending on the model size. Running locally via Ollama costs nothing beyond hardware and electricity.

What variants does Gemma 4 come in?

Four variants: E2B (2.3B, edge/mobile), E4B (4.5B, lightweight local), 26B MoE (3.8B active params, local sweet spot), and 31B Dense (flagship, ranked #3 on Arena AI). The 26B MoE fits on 16GB VRAM. The 31B Dense needs 24GB+ for full precision.

How does Gemma 4 compare to Llama 4?

Gemma 4 31B Dense scores roughly 85% on MMLU Pro and 89% on AIME 2026, leading Llama 4 Scout on reasoning and math. Gemma wins on multilingual (140+ languages) and multimodal (text + image + video + audio). Llama 4 claims a 10M token context window. Both have open licenses. The practical difference: Google infrastructure (Vertex AI, AI Studio) vs Meta (Together AI, Fireworks).

Can I run Gemma 4 locally on my laptop?

E2B (2.3B) runs on most modern laptops with no GPU. E4B (4.5B) needs about 8GB VRAM. The 26B MoE is the local sweet spot — only 3.8B active params, so 16GB VRAM is enough. The 31B Dense needs 24GB+ VRAM. Ollama makes setup easy — one command to pull and run. Quantized versions reduce memory by about 40-50%.

Does Gemma 4 support function calling and tool use?

Yes, native function calling with structured JSON output on 26B MoE and 31B Dense. The 31B handles 2-3 step chains reliably, 4-5 step chains about 60% of the time. The 26B MoE is decent for 2-step chains. E2B and E4B are too small for reliable tool use. Compared to Claude or GPT-5, the error rate on complex chains is roughly 2-3x higher.

Last Updated: April 8, 2026 • Written by: Jim Liu, Sydney-based developer running 5 websites. Tested Gemma 4 on real Next.js and Python projects for a week before writing this review.

Written by Jim Liu, full-stack developer in Sydney. Hands-on AI tool reviews since 2022.