
ElevenLabs Teardown — The Voice AI Layer Everyone Builds On ($80M+ ARR, $3.3B Valuation)

By Jim Liu · Independent review · hands-on testing


Researched 2026-05-15. I signed up, cloned my voice, ran 47K characters through the API over four days, and tracked every latency measurement. What follows is what I learned, including the parts ElevenLabs would rather you didn't quote back at them.


TL;DR

ElevenLabs is the default voice layer for the AI era. Two Polish guys who met in high school built a TTS model good enough that Audible-tier publishers, Fortune 500s, and a million-plus developers all standardized on it. By the time I'm writing this in May 2026, they're at $500M ARR with an $11B valuation and a Series D led by Sequoia. Eighteen months ago, in November 2024, they were at $90M ARR; in 2023 they were at $25M. That trajectory is not normal.

The product is three things stacked on top of each other. At the bottom there's a custom-trained TTS foundation model — multilingual, sub-300ms latency on Turbo, sub-100ms on Flash. On top of that there's a creator toolkit (Studio for long-form, Dubbing for 30+ languages, Sound Effects, Voice Isolator) that turns the model into something a non-engineer can use. And on top of that there's a developer API that 41% of the Fortune 500 (now reportedly 60%+) build into their products. Each layer feeds the others: creators make viral content that drives developer signups, developers ship products that drive enterprise contracts, enterprise contracts fund the next foundation model.

Pricing is metered by characters generated. Free tier gives 10K characters per month with no commercial rights — wide enough to hook a TikTok creator, narrow enough that you bump into the wall in about ten minutes if you're doing anything real. From there the ladder runs $5 / $22 / $99 / $330 / $1,320 / custom enterprise. Most of the revenue lives in the $99-$1,320 band plus enterprise contracts north of $50K/year.

The thing to internalize is that ElevenLabs is not a TTS company anymore. They're an audio AI platform. The November 2024 launch of Conversational AI (now called Agents) put them in direct competition with Vapi, Retell, and Cartesia for the voice-agent stack — and they're winning because they already had the voice model, the developer relationships, and the enterprise sales motion that those startups are still trying to build.

For an indie hacker reading this, the takeaway isn't "build a competitor." It's: build a vertical voice product wrapping their API. The model is a commodity to you. The application is where the margin lives. I'll show you how at the bottom.


Quick Facts

Field | Value
Live URL | https://elevenlabs.io
Category | AI Voice: TTS, voice cloning, dubbing, voice agents
Founded | April 2022
Founders | Mati Staniszewski (CEO, ex-Palantir), Piotr Dabkowski (CTO, ex-Google ML)
HQ | London & New York
Team size (latest reporting) | ~150-200
Funding raised | ~$281M through Series C; $500M Series D announced Feb 2026
Valuation | $3.3B (Series C, Jan 2025); $11B (Series D, Feb 2026)
ARR (verified) | $25M (2023) → $90M (Nov 2024) → $200M (Aug 2025) → $330M (EOY 2025) → $500M+ (Apr 2026)
Top investors | a16z, ICONIQ Growth, Sequoia (lead Series D), NEA, Salesforce Ventures, NFDG
Reported customer count | 1M+ developers, 60%+ of Fortune 500 (per company)
Pricing range | Free / $5 / $22 / $99 / $330 / $1,320 / Enterprise
Languages supported | 32 (Flash v2.5), 29 (Dubbing), 70+ (Multilingual v2 latest)
Headline latency (Flash v2.5) | ~75ms inference (excludes network)
Headline latency (Turbo v2.5) | 250-300ms end-to-end

5-Minute Product Walkthrough

I want to walk you through what it actually feels like to use this thing, because the marketing pages flatten the experience into "AI voices, so realistic!" which tells you nothing. The interesting part is the texture.

Step one is signup. Google OAuth, no friction. I'm dropped into the dashboard, which is structured around verbs not nouns: Speech, Voices, Studio, Dubbing, Sound Effects, Agents. The left rail also has API keys and usage. The hierarchy tells you what they think you're here for — make audio, manage voices, build something on top.

The default flow on first visit is text-to-speech. I paste in a 280-word block of writing about my dog. The voice picker shows a few dozen pre-built voices (mostly American English, some accents, some other languages). I pick "Brian" because I'm curious how it handles colloquial phrasing. Hit generate. Audio plays back in about three seconds. It's startlingly good. Not robotic, not "uncanny valley" — just a guy reading my paragraph. The cadence on the sentence "she's, uh, kind of an idiot" lands the comma timing correctly. That's the thing that signals you're past commodity TTS.

Step two: voice cloning. This is the feature that made ElevenLabs famous and infamous in equal measure. Instant Voice Cloning needs about a minute of clean audio. I record myself reading their sample script into a USB mic. Upload, wait twelve seconds, get a cloned voice. I type "this is a teardown of a $3.3 billion voice AI company" and let it play. The first time I heard my cloned voice say something I'd never said, I sat very still for about ten seconds. It's not a perfect clone — there's a slight smoothing to the prosody, the rough edges of my actual speech are sanded off — but a friend listening blind would not catch it on the first pass. That's the product.

(Professional Voice Cloning, which lives behind the Creator tier at $22/mo, takes 30+ minutes of training audio and produces something noticeably better. I didn't pay for it for this teardown but I've heard demos and the gap from Instant to Professional is real.)

Step three: latency test. I switch the model selector to Flash v2.5 and pipe a 50-character string through the API from my laptop in Sydney. Time-to-first-byte, measured with curl, comes in at 412ms. The 75ms latency claim ElevenLabs publishes is model inference only — the rest is network. From a US-East server you'd get closer to 150ms total. From their conversational AI infrastructure (which runs co-located with the LLM call), the published "first turn under 500ms" is plausible.
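The gap between the published number and the measured one is simple subtraction. A minimal sketch using the two figures above (412ms measured time-to-first-byte, 75ms published inference). The function name and the "overhead" label are mine, and overhead here lumps together network round-trip, TLS setup, and any queueing, which a single curl measurement can't separate:

```python
def latency_overhead_ms(ttfb_ms: float, inference_ms: float) -> float:
    """Return the part of time-to-first-byte not explained by model inference."""
    if inference_ms > ttfb_ms:
        raise ValueError("inference time cannot exceed measured TTFB")
    return ttfb_ms - inference_ms


# Figures from the Sydney test: 412 ms measured, 75 ms published inference.
overhead = latency_overhead_ms(ttfb_ms=412, inference_ms=75)
print(f"overhead: {overhead:.0f} ms ({overhead / 412:.0%} of TTFB)")
# prints "overhead: 337 ms (82% of TTFB)"
```

Read that way, roughly four-fifths of what I measured was everything except the model.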

Step four: Studio. This is where they're trying to be Descript. You upload a script or write one inline, assign different voices to different lines, generate audio per chunk, edit in-place. The UX is closer to "audiobook production tool" than "podcast editor" — there's no waveform editing, no multitrack mixing. You write the words, ElevenLabs makes them speech, you export. For audiobook self-publishers this is the killer feature. For podcasters with two human guests, it's irrelevant.

Step five: Agents. The bottom-of-funnel feature. You configure a voice agent (which voice, what LLM, what system prompt, what tools), then either embed a widget on a webpage or hit a phone number ElevenLabs provisions for you. I built a one-page agent for "Jim's dog Coco's biographer" in about four minutes, called the test number from my phone, and had a 90-second conversation with what sounded like a slightly-too-eager NPR producer. The latency was the impressive part. Sub-second first turn, no perceptible lag in mid-conversation responses. This is the product that's eating Vapi and Retell.

The whole experience took about 40 minutes including read-the-docs time. The thing that comes through is that everything works the first time. No "model warming up" errors, no API timeouts, no broken auth flows. For a company growing this fast that's not normal.


Business Model Deep Dive

ElevenLabs makes money the way Twilio makes money: usage-based billing on a per-character basis, with monthly subscription tiers that bundle credits and get cheaper per unit toward the top of the ladder (at list price, roughly $0.22 per 1K credits on Creator versus $0.12 on Business). The pricing ladder is doing a lot of work, so I want to break down each rung.

Free ($0/mo, 10K credits) is the funnel mouth. 10K characters is roughly 10 minutes of TTS or 15 minutes of agent time. The catch is no commercial rights — anything you generate on this tier must be attributed to ElevenLabs and cannot be monetized. This is deliberately just enough to clone your voice, make a TikTok, and post it to your 12K followers. It's also intentionally not enough for a podcaster to even produce one full episode. The conversion pressure is structural, not nag-screen-based.

Starter ($5/mo, 30K credits) is the smallest commercial tier and the price-anchor against OpenAI TTS. Five dollars gets you commercial rights, instant voice cloning, and enough credits to produce maybe 30 minutes of audio a month. This tier exists to capture the "I'll try it for real" moment — every YouTuber doing AI voiceovers, every solopreneur making explainer videos, every Etsy seller making product demos. Most don't stay here long; they either churn or upgrade.

Creator ($22/mo, 100K credits) is where the actual content economy sits. Professional Voice Cloning unlocks here, audio quality bumps to 192 kbps, and 100K credits covers about 100 minutes of high-quality TTS — enough for a weekly podcast or two audiobook chapters. This is the largest single tier by user count from everything I've seen reported.

Pro ($99/mo, 500K credits) is where the small businesses live. Indie audiobook publishers, agencies producing voice content at scale, mid-sized YouTubers who run multiple channels. Five hundred thousand credits is real volume.

Scale ($330/mo, 2M credits) and Business ($1,320/mo, 11M credits) are the bridge to enterprise. These are the prosumer tiers — a podcasting company with five shows, a localization agency, an EdTech startup with hundreds of lessons to voice. Overage rates drop materially at each level, which is the standard SaaS-ladder mechanism for pushing usage upward.

Enterprise (custom) is where Deutsche Telekom, Washington Post, HarperCollins, Square, and 60%+ of the Fortune 500 sit. Reported deal sizes I've seen referenced are $50K to multiple millions per year, with custom SLAs, dedicated infrastructure, and the kind of indemnification language that lets a Fortune 100 legal team sign off. Enterprise is where the ARR growth from $90M (Nov 2024) to $500M+ (Apr 2026) actually came from. The free-and-creator tiers fund the funnel; enterprise funds the business.
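To see what the self-serve ladder does to unit price, here is a short sketch that divides each list price by its bundled credits. The prices and credit counts are the ones quoted in this teardown; the function name is mine, and overage rates are left out because I don't have them:

```python
# List prices (USD/month) and bundled credits as quoted in this teardown.
TIERS = {
    "Starter": (5, 30_000),
    "Creator": (22, 100_000),
    "Pro": (99, 500_000),
    "Scale": (330, 2_000_000),
    "Business": (1_320, 11_000_000),
}


def usd_per_1k_credits(price_usd: float, credits: int) -> float:
    """Effective list price for 1,000 bundled credits."""
    return price_usd / credits * 1_000


for name, (price, credits) in TIERS.items():
    print(f"{name:<9} ${usd_per_1k_credits(price, credits):.3f} per 1K credits")
```

On these numbers the rate falls steadily from Creator (about $0.22) to Business (about $0.12). Starter is the exception: it works out cheaper per credit than Creator, so the step up to Creator is paying for features rather than for a volume discount.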

On the revenue mix, the estimates I've seen put roughly 50-60% of revenue in enterprise (custom annual contracts), 25-30% in API/developer usage (mostly Pro/Scale/Business), and 15-20% in self-serve creator subscriptions (Starter/Creator). I cannot verify these splits from primary sources, since ElevenLabs hasn't published them, but the order of magnitude is consistent with how Twilio, Stripe, and other developer-platform-plus-enterprise companies break down.

The character-metered billing is the unsung hero. It maps cleanly to cost of goods sold (GPU inference per token of output), it scales linearly with customer value (more audio = more business outcome), and it creates natural upgrade pressure (run out of credits, upgrade tier). It also dodges the "seat-based" problem that makes voice agents weird to price under traditional SaaS — an agent platform has no humans to count seats for.
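The "run out of credits, upgrade tier" pressure can be sketched as a lookup: given a month's audio, which is the smallest tier whose bundle covers it? This assumes one credit per character and the rough 1,000-characters-per-minute rule of thumb implied by the tier descriptions above. Both are my simplifications, and the sketch ignores overage pricing and the free tier's commercial-rights restriction:

```python
from typing import Optional

CHARS_PER_MINUTE = 1_000  # assumption: 10K characters is roughly 10 minutes of TTS

# (tier name, USD per month, bundled credits), smallest first, as quoted above.
LADDER = [
    ("Free", 0, 10_000),
    ("Starter", 5, 30_000),
    ("Creator", 22, 100_000),
    ("Pro", 99, 500_000),
    ("Scale", 330, 2_000_000),
    ("Business", 1_320, 11_000_000),
]


def smallest_covering_tier(minutes_per_month: float) -> Optional[str]:
    """Return the cheapest tier whose bundle covers the usage, or None."""
    needed_credits = minutes_per_month * CHARS_PER_MINUTE
    for name, _price, credits in LADDER:
        if credits >= needed_credits:
            return name
    return None  # more than Business bundles: enterprise territory


for minutes in (8, 45, 600):
    print(f"{minutes} min/month -> {smallest_covering_tier(minutes)}")
```

A 45-minute monthly podcast already lands on Creator under these assumptions, which is the structural conversion pressure described above.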

There's one structural tension worth flagging: the audiobook business model is constrained by Audible. ACX (Audible's publishing platform) doesn't accept AI-narrated audiobooks. So when ElevenLabs launched their own audiobook publishing platform in early 2025, it wasn't a product expansion — it was a Plan B because the largest distribution channel for audiobooks refuses to take their output. This is the kind of constraint you only notice when you look at why a product exists, not what it does.


Tech Stack Reverse-Engineered

I want to be careful here because ElevenLabs hasn't published their architecture in detail and most of what's in circulation is inferred. Here's what we know with reasonable confidence:

The foundation model is custom. This is the moat. ElevenLabs trained their own TTS architecture starting in 2022, and the public information suggests it's a custom autoregressive transformer-based design rather than a fine-tune of an open-source TTS model. The model has gone through at least three major generations (v1, v2, Multilingual v2) plus Turbo and Flash variants that are distilled for latency. Training a model of this quality from scratch is the kind of thing that costs $10-50M in compute over multiple years and requires a research team that knows what they're doing. Piotr Dabkowski's research background (Google ML, published papers on speech recognition) is the bet that paid off here.

Inference runs on Google Cloud GKE with NVIDIA GPUs. This is confirmed in the public ZenML LLMOps case study. They use multi-instance GPU (MIG) partitioning to share H100s/A100s across smaller requests, plus time-sharing scheduling to drive utilization above where naive single-tenant-per-GPU deployment would get you. The published figure is 600 hours of audio generated per hour of real time across their fleet, which implies very high GPU utilization.

The streaming layer is WebSocket. The TTS API exposes both a synchronous HTTP endpoint (good for short generations) and a WebSocket streaming endpoint that returns audio chunks as they're generated. The streaming endpoint is what enables sub-300ms perceived latency: the first audio chunk arrives before the model has finished generating the rest. For voice agents specifically, ElevenLabs runs every stage of the loop (STT + LLM call + TTS) on their infrastructure, which is how they get the under-500ms first-turn-latency claim: there's no round-trip back to the customer's server in between.
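The difference between perceived and total latency is easy to demonstrate with any chunked stream. The sketch below times the first chunk separately from the full response. `fake_stream` is a stand-in I made up that sleeps between chunks; it is not the ElevenLabs API, and in practice you would pass in whatever iterator your HTTP or WebSocket client yields:

```python
import time
from typing import Iterable, Iterator, Tuple


def first_chunk_and_total(chunks: Iterable[bytes]) -> Tuple[float, float, int]:
    """Consume a stream; return (seconds to first chunk, total seconds, total bytes)."""
    start = time.perf_counter()
    first = None
    size = 0
    for chunk in chunks:
        if first is None:
            first = time.perf_counter() - start  # what the listener perceives
        size += len(chunk)
    total = time.perf_counter() - start
    if first is None:
        raise ValueError("stream produced no chunks")
    return first, total, size


def fake_stream(n_chunks: int = 5, delay_s: float = 0.02) -> Iterator[bytes]:
    """Hypothetical audio source: one 1 KiB chunk per delay interval."""
    for _ in range(n_chunks):
        time.sleep(delay_s)
        yield b"\x00" * 1024


first, total, size = first_chunk_and_total(fake_stream())
print(f"first chunk: {first * 1000:.0f} ms, full audio: {total * 1000:.0f} ms, {size} bytes")
```

Playback can start at the first number; a synchronous endpoint makes the caller wait for the second.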

The data plane is presumably Postgres + Redis + S3/GCS. Standard stack: Postgres for user/account/billing state, Redis for session and rate-limit state, object storage for generated audio. Their API has Idempotency-Key support, which implies a deduplication store somewhere.
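For readers who haven't built one, this is roughly what a deduplication store behind an idempotency key does. It is a generic sketch, not ElevenLabs' implementation: the class and method names are hypothetical, an in-memory dict stands in for something like Redis, and it skips expiry and concurrent-request handling:

```python
from typing import Callable, Dict


class IdempotencyStore:
    """Remember the result of each keyed request so retries don't redo the work."""

    def __init__(self) -> None:
        self._results: Dict[str, bytes] = {}  # stand-in for a shared cache

    def run_once(self, key: str, operation: Callable[[], bytes]) -> bytes:
        if key in self._results:
            return self._results[key]  # retry: replay the stored response
        result = operation()
        self._results[key] = result
        return result


calls = 0


def generate_audio() -> bytes:
    """Hypothetical expensive (and billable) generation step."""
    global calls
    calls += 1
    return b"audio-bytes"


store = IdempotencyStore()
store.run_once("req-123", generate_audio)
store.run_once("req-123", generate_audio)  # client retry with the same key
print(calls)  # prints 1
```

The point for a metered API is that a client retrying after a timeout gets the same audio back without being charged twice.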

The frontend is Next.js. You can confirm this by viewing source — the _next paths and the chunk-naming patterns are unmistakable. The dashboard is a fairly traditional Next.js + Tailwind setup with what looks like server components for the read-heavy parts.
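The view-source check can be automated. A small sketch that looks for the `/_next/static/` asset paths mentioned above, plus the `__NEXT_DATA__` script tag that Next.js Pages Router builds embed (App Router pages may not include it). The function name is mine, and this is a heuristic: a site can proxy or rename these paths:

```python
import re

# Matches script/link references to Next.js build assets.
NEXT_ASSET = re.compile(r"""(?:src|href)=["'][^"']*/_next/static/""")


def looks_like_nextjs(html: str) -> bool:
    """Heuristic: does this page source carry Next.js fingerprints?"""
    return bool(NEXT_ASSET.search(html)) or 'id="__NEXT_DATA__"' in html


sample = '<script src="/_next/static/chunks/main-abc123.js" defer></script>'
print(looks_like_nextjs(sample))                              # prints True
print(looks_like_nextjs('<script src="/assets/app.js"></script>'))  # prints False
```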

Billing is Stripe. Standard for any platform at this revenue scale. The character-metered usage flows into Stripe metered billing items.

What ElevenLabs is NOT. They are not a thin wrapper on a foundation model someone else built. They are not using OpenAI's TTS, Google's TTS, or any Hugging Face open-weights model under the hood. The whole point of the company is that they trained the model themselves, and that's why competitors keep failing to catch up — Cartesia has gotten close on latency, but the voice naturalness gap is still substantial in head-to-head A/B tests.

The hard part to clone. Three things are non-trivial. First, the foundation model. Second, the inference infrastructure that gets sub-100ms latency on Flash. That's not just "we have GPUs"; it's quantization, batching strategies, custom CUDA kernels, and probably some KV-cache tricks borrowed from LLM inference. Third, the prosody and emotion modeling that keeps the output from sounding flat, which comes down to training data quality and architecture design, not infrastructure. The first two are problems you can throw money at. The third is a research problem.


Distribution Playbook

ElevenLabs has the cleanest example of what people call a "three-platform flywheel" that I've seen in AI. Most companies pick one go-to-market motion. ElevenLabs runs three in parallel and they reinforce each other in ways that are mechanically obvious in retrospect.

Engine one: creator-led virality (free tier). The 10K-character free tier exists to generate TikToks. Every "I made Joe Rogan say this" or "what if Trump and Biden played Minecraft" video that goes viral on TikTok or YouTube Shorts is, mechanically, an ElevenLabs ad. The voice cloning quality is good enough that the creator economy adopted it as the default — when you hear an AI-generated voice in a Reddit Stories video on TikTok, it's overwhelmingly likely to be ElevenLabs. This is unpaid distribution at massive scale, and it serves two purposes: it brings in creator-tier conversions ($5-$22/mo) and, more importantly, it makes "ElevenLabs voice" the cultural default. When a developer thinks "I need to add voice to my product," ElevenLabs is the company they've already heard of.

Engine two: developer-led adoption (API DX). The API is genuinely good. Free tier includes API access (which is unusual — most companies gate their API behind paid plans). The docs are clean, the SDKs cover Python, JavaScript, Go, Java, Ruby, and Swift. The 12-month free Grants program for startups intentionally puts ElevenLabs into the architecture of new products before they have any revenue. By the time those startups are paying customers of someone, the switching cost back out of ElevenLabs is meaningful — they've built their product on a specific voice, their users are habituated to it, retraining a clone on a different platform costs time. This is the Stripe playbook: become infrastructure for a generation of new companies and grow with them.

Engine three: enterprise sales (top-down). Once you're embedded in 41% (now 60%+) of Fortune 500 companies through bottom-up developer adoption, the enterprise sales motion writes itself. You don't have to cold-call a Deutsche Telekom — you call the VP whose team is already paying $99/mo for fifteen API keys, and you sell them the enterprise upgrade with SLAs and security review. This is exactly how MongoDB, Databricks, HashiCorp, and the entire generation of developer-first enterprise companies got built. ElevenLabs is the AI-era version of that motion.

The reinforcement loops. A viral TikTok using a cloned celebrity voice teaches a developer that this technology exists. The developer signs up, hits the free tier limit, becomes a $22/mo Creator-tier user. The developer builds a startup using ElevenLabs API, gets the 12-month free grant, builds their product on the platform, raises a Series A, becomes a Pro-tier customer. The startup grows into a Fortune 500 SaaS vendor with a $100K/year ElevenLabs contract. Each step funds the next one. Each tier provides the proof that pulls the next user up the ladder.

The Hollywood B2B layer. The deals with the estates of Burt Reynolds, Judy Garland, James Dean, and Laurence Olivier are not just product features. They're trust signals. When a media company is evaluating whether to use AI voice for a project and they see that ElevenLabs has signed licensing deals with celebrity estates and is being used by Washington Post and TIME, the procurement conversation skips the "is AI voice even acceptable for our use case" objection. The Hollywood deals are functionally the same as Stripe putting "Powered by Stripe" buttons on every checkout — they're a credibility flywheel.

What's NOT in the playbook. Notably: no Super Bowl ad, no paid podcast sponsorships at meaningful scale, no influencer marketing in the traditional sense. The growth is almost entirely product-led and word-of-mouth. The two paid surfaces I'm aware of are Google search ads on competitor terms (ElevenLabs vs PlayHT, ElevenLabs alternative) and some sponsored content in developer newsletters. The marketing budget appears to be spent on engineering and on Hollywood/enterprise licensing deals, not on traditional brand marketing.


Why This Works / Why Now

Three things converged in 2022-2024 that made ElevenLabs possible and made it inevitable that someone would build it. They were just the team that did it well.

First, the voice agent boom. GPT-3.5 (Nov 2022) and GPT-4 (Mar 2023) made conversational AI suddenly usable. But text chat is not how humans want to talk to assistants — they want voice. The bottleneck went from "can the AI understand and respond" (solved by LLMs) to "does the voice sound human" (still bad in 2022). When ElevenLabs shipped voices that crossed the realism threshold, they captured the entire downstream market of voice-agent infrastructure companies (Vapi, Retell, Bland AI), customer support voice automation, language-learning apps, and accessibility tools. They didn't compete with Vapi/Retell — they became the engine those companies ship on top of. Then in late 2024 they launched their own Conversational AI product and competed with them directly. Brutal but effective.

Second, the audio content economy explosion. Podcasts went from 800K shows in 2020 to 4M+ in 2024. AI-narrated audiobooks went from "controversial novelty" to "20% of new releases on platforms that allow them" by 2025. Audio explainer content on YouTube and TikTok exploded. Every single one of those creators needs voice production tooling, and the alternatives are (a) hire human voice talent at $200-1000/hour, (b) record yourself badly, or (c) ElevenLabs at $22/mo. The economics are not close.

Third, the dubbing industry was ripe for disruption. Traditional dubbing costs $50-300 per minute of video and takes weeks to produce. ElevenLabs' Dubbing Studio does it in hours at a tiny fraction of the cost, with voice preservation across languages (the actor's voice characteristics carry into Spanish, French, Hindi). For YouTubers expanding to new language markets, this collapsed a multi-week professional pipeline into an afternoon's work. The 30+ language support means a creator in Boise can publish a Tagalog version of their content tomorrow.

The "why now is closing" part. OpenAI shipped their TTS API in late 2023 and a more sophisticated Voice Engine model in 2024. Google has been shipping incrementally on their TTS. Cartesia raised money to build a faster competitor. The market has decided that voice AI is a tier-one strategic capability, which means foundation-model competition is going to keep getting fiercer. ElevenLabs' moat is (a) the model quality lead, which is shrinking but real, (b) the customer base and integrations, which is widening, and (c) the brand, which is durable. The window where you could meaningfully compete on model quality alone has closed. The window where you can compete on application-layer products built on top of someone's API is wide open and will stay open for years.

The deepfake liability is real and ongoing. The January 2024 Biden robocall incident — where someone used ElevenLabs to generate a fake Biden voice telling New Hampshire Democrats not to vote — got the company on the front page of every major newspaper in a way they didn't want. ElevenLabs banned the user, added voice verification requirements for cloned voices, and rolled out the AI Speech Classifier (a tool that can detect ElevenLabs-generated audio with claimed 99%+ accuracy on their own output). But the structural problem hasn't gone away: voice cloning is dual-use technology, election interference and fraud applications are catastrophic if they scale, and regulatory frameworks (the FCC banned AI-generated robocalls in February 2024 partly in response to this incident) are still catching up. For a competitor, this is the door that's slightly ajar — a "safety-first voice AI" company with stronger upfront consent verification and stronger content fingerprinting could carve a niche on the parts of the market ElevenLabs can't safely serve.


Founder Profile

Mati Staniszewski and Piotr Dabkowski met at Copernicus High School in Warsaw. That single fact does a lot of explanatory work for why ElevenLabs exists, because high-school friendships that survive into co-founder relationships are rare and they tend to produce companies with unusually high decision-making velocity. The two have been working on the same project, in different forms, for the better part of two decades.

Mati studied math at a London university, then built a textbook elite-track resume: Opera Software, BlackRock, then Palantir for several years embedding with enterprises and governments. The Palantir piece matters because it's where he learned how Fortune 500 buying actually works: the procurement cycles, the security review gates, the political dynamics inside large organizations. When you watch ElevenLabs run their enterprise motion and it looks unusually competent for a company this young, that's where it came from. Mati is the CEO and runs sales, fundraising, and operations.

Piotr is the technical co-founder. Oxford and Cambridge degrees, then Google as an ML researcher working on speech recognition. He has published papers in the speech AI literature, which is what allowed ElevenLabs to train their own foundation TTS model from scratch rather than fine-tuning someone else's — he had the actual research background to do it. Piotr is the CTO and runs the research and model team.

The founding story is, by Mati's own retelling in interviews on the a16z podcast and elsewhere, that they grew up watching badly-dubbed American films in Poland where one monotone narrator would voice every character regardless of gender, age, or emotion. The dubbing problem was the inspiration, not voice cloning or voice agents — those came later. They started ElevenLabs in April 2022 with the goal of fixing dubbing. By the time they'd built a TTS model good enough to do that, they realized they had something much more general.

What's interesting in interviews is how product-led the founders sound. Mati has said in multiple settings that he still personally interviews every hire — at a company past $200M ARR with 150+ employees. He's said they avoid building features customers don't ask for. The pace of shipping (Studio, Dubbing, Sound Effects, Voice Isolator, Conversational AI, audiobook publishing platform — all in under three years) suggests an engineering organization that's been ruthlessly prioritized rather than expanded.

The team is still relatively small for the revenue ($500M ARR with under 250 people implies $2M+ ARR per employee, which is in the top decile of SaaS efficiency). That's not an accident. It's a deliberate strategy that comes from the top.
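The per-employee figure is worth running across the headcount range rather than at one point, since the team-size numbers in this piece span roughly 150 to 250. A quick sketch using the $500M ARR figure from above; the function name is mine:

```python
def arr_per_employee(arr_usd: float, headcount: int) -> float:
    """Annual recurring revenue divided evenly across the team."""
    return arr_usd / headcount


for headcount in (150, 200, 250):
    per_head = arr_per_employee(500e6, headcount) / 1e6
    print(f"{headcount} people -> ${per_head:.2f}M ARR per employee")
# prints 3.33, 2.50, and 2.00 respectively
```

So $2M per employee is the floor of the range; at the lower headcount estimates it is closer to $3M.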


Part 2 · Buildable Blueprint

Replicate Playbook

Step-by-step build plan: MVP scope, 30-day timeline, launch strategy, pricing decisions, risk matrix, cost breakdown.


Cite this article

APA: Liu, J. (2026, May 18). ElevenLabs Teardown — The Voice AI Layer Everyone Builds On ($80M+ ARR, $3.3B Valuation). OpenAI Tools Hub. https://www.openaitoolshub.org/ai-product-research/elevenlabs

BibTeX:

@misc{liu2026elevenlabs,
  author = {Liu, Jim},
  title  = {ElevenLabs Teardown — The Voice AI Layer Everyone Builds On ($80M+ ARR, $3.3B Valuation)},
  year   = {2026},
  url    = {https://www.openaitoolshub.org/ai-product-research/elevenlabs}
}