Groq Teardown — Jonathan Ross's $2.8B LPU Bet vs NVIDIA on Inference Speed

Verdict First

Groq is one of the few AI infrastructure companies I think genuinely has a shot at carving real estate out of NVIDIA's wall — but not for the reason most people on Twitter are screaming about. The "500 tokens per second" demos are a marketing wrapper around something more boring and more durable: a deterministic compute architecture that turns inference latency into a contract instead of a guess. If you've ever shipped a voice agent and watched p99 latency wreck your retention curve, you already understand why this matters more than the raw speed number.

You, the indie hacker reading this, cannot replicate Groq. The capital wall is between $300M and $1B depending on how you count the tapeouts. The team wall is harder — you need ex-Google TPU silicon people, and there are maybe 200 of them globally.

What you should walk away with: a clear understanding of which application categories are now economically viable because Groq exists, and why an 18-month window is opening for builders who pick the right speed-sensitive vertical. Voice agents under 300ms round trip. Real-time translation for live calls. NPC dialog in games where every token of delay breaks immersion. These are the wedges. Groq is the picks-and-shovels. You're the prospector.

Quick Facts

Field	Value
Product	Groq (GroqCloud API + LPU silicon)
Founded	2016
Founder	Jonathan Ross (ex-Google, original TPU architect)
HQ	Mountain View, California
Stage	Series D, $640M raised August 2024
Valuation	$2.8B post-money
Lead Investor (D)	BlackRock Private Equity Partners
Reported MRR	$4M+ (Aug 2024 baseline)
Business Model	API-metered inference, $/M tokens
Hosted Models	Llama 3.x, Mixtral 8x7B, Gemma, Whisper, DeepSeek-R1 distill
Headline Speed	500-800 tokens/sec on Llama 3 70B
Total Funding	$1B+ across Series A through D

In the Founder Own Words

"GPU LPU: Everything You Wanted to Know I’m joining David Senra ( @FoundersPodcast ) at @NVIDIAGTC for a conversation about the reality of modern inference. This is your opportunity to learn why Nvidia and Groq partnered together, and what it means for the future of inference."

@jonathanross321, 2026-03-13 (source)

"We just announced a major leap forward in AI inference: Groq is partnering with Meta to accelerate the official Llama API giving developers the fastest way to run the latest Llama models with no tradeoffs (starting with Llama 4). What developers get with Groq + Meta: Speeds"

@jonathanross321, 2025-04-30 (source)

"Jensen makes a compelling case for why no one should place a $100B, 2 year out order with Groq. It's an equally compelling case of why they should place a 51 day out, $100M with Groq, and then grow it from there. Build fast."

@jonathanross321, 2025-03-20 (source)

"Today Groq entered into a non-exclusive licensing agreement with Nvidia for Groq’s inference technology. Along with other members of the Groq team, I’ll be joining Nvidia to help integrate the licensed technology. GroqCloud will continue to operate without interruption. Learn"

@jonathanross321, 2025-12-24 (source)

"Thank you for building an iPhone in Groq orange. Now if only it ran AI at Groq Speed."

@jonathanross321, 2025-09-10 (source)

The Product

When you call the GroqCloud API, you're hitting an endpoint that returns tokens faster than anything else commercially available. The wrapper is intentionally boring — they cloned the OpenAI API shape so you can swap base_url and keep moving. Migration friction is the moat-killer in infrastructure, and they ate that friction.

Underneath the API: Groq builds a Language Processing Unit (LPU) — an ASIC designed from scratch around the specific access patterns of transformer inference. GPUs are pattern-matched to graphics first; LPUs are pattern-matched to attention and feed-forward layers from the silicon up.

That static-schedule thing is the part that matters for your business. If you're building a voice agent, you don't need 500 tok/sec average. You need the 99th percentile to be predictable. A GPU cluster running Llama 70B will give you 280 tok/sec average and a horrifying 1,800ms tail latency once every fifty requests. Groq's deterministic execution means the p99 is roughly the p50.

The model lineup is fully open-source-compatible. Llama 3.1 8B, 70B, 405B. Mixtral 8x7B. Gemma 2. Whisper Large v3. They added DeepSeek-R1 distill variants in early 2025. What they don't host: GPT-4, Claude, Gemini.

Pricing: Llama 3.1 70B at $0.59/M input tokens and $0.79/M output. Mixtral 8x7B at $0.24/M input. Whisper Large v3 at $0.111/hour of audio.

The Founder Bet

Jonathan Ross was at Google between roughly 2009 and 2016. He was the engineer who built the original TPU prototype as a 20% project. If you're trying to evaluate whether someone can pull off a custom AI chip company, the answer is: he's literally one of maybe five humans on Earth who has already done it once at scale.

He pitches Groq as a different category — inference vs training. NVIDIA owns training because flexibility matters there; Groq owns inference because determinism matters there. NVIDIA can't easily build a deterministic inference ASIC without cannibalizing their H100/B200 margins.

Business Model and Unit Economics

API-metered inference is one of the cleanest business models. The reported $4M MRR translates to roughly $48M ARR. For a company that just took $640M at a $2.8B valuation, that's a 58x ARR multiple. Defensible by AI infrastructure standards because the assumption is that ARR is doubling roughly every 90 days.

Gross margin: I can build a guess from chip costs and API prices, somewhere between 40% and 65% depending on utilization. If utilization is above 70%, this is a beautiful business. If it's below 40%, they're burning capital.

Competitive Landscape

The lazy frame is "Groq vs NVIDIA." That's wrong. NVIDIA isn't trying to win the deterministic-inference market because they make 80% gross margins on H100s for training.

The real competitors are in three layers:

Layer 1: Other custom-silicon AI inference companies. Cerebras (wafer-scale, similar pitch on speed, different architecture). SambaNova. Tenstorrent. Etched.

Layer 2: Hosted-open-model API competitors on GPU. Together.ai, Fireworks.ai, Replicate. These compete on price and ecosystem, not latency.

Layer 3: The hyperscalers. Google has TPU. AWS has Trainium and Inferentia. Azure has Maia.

Distribution — Why This Worked

Prong one: developer-twitter viral speed demos. Sometime in early 2024, the Groq team started letting people hit a public playground that ran Llama 70B in front of you in real time.

Prong two: OpenRouter integration. Groq plugged in early, and any developer who set their routing preference to "fastest" started getting Groq under the hood.

Prong three: Jonathan Ross's pedigree opens enterprise doors that no other inference startup can open.

Notably absent: SEO, content marketing, paid acquisition, partner channels. They don't run ads.

Why Now, Why This Window

NVIDIA Blackwell shipped in late 2024. By mid-2027, GPU-side inference latency will have improved enough that the speed advantage Groq has today shrinks from "10x" to "2-3x" for many workloads.

For builders: right now, today, you can build a voice agent that responds in 280ms end-to-end because of Groq. Eighteen months from now, your competitor might also be able to.

Groq Teardown — Jonathan Ross's $2.8B LPU Bet vs NVIDIA on Inference Speed

Copyable to YOU

Groq Teardown — Jonathan Ross's $2.8B LPU Bet vs NVIDIA on Inference Speed

Verdict First

Quick Facts

In the Founder Own Words

The Product

The Founder Bet

Business Model and Unit Economics

Competitive Landscape

Distribution — Why This Worked

Why Now, Why This Window

Replicate Playbook

Replicate Playbook