AI Product Research - Indie Hacker Teardowns | OpenAI Tools Hub

TL;DR

Retell AI is the second name people mention when they talk about AI voice agent infrastructure, and the gap between first and second is smaller than you would guess from the funding announcements. Vapi raised the louder round and gets the louder Twitter discourse. Retell, founded by Yifei Wang out of YC W24, quietly built to roughly $600K MRR through a different route: ship a working dashboard before shipping a working SDK, court the call-center buyer instead of the developer-evangelist, and plant a flag in India and Southeast Asia where the unit economics of voice automation are violent enough to make procurement teams move in weeks rather than quarters.

The product is a hosted runtime that stitches together speech-to-text, an LLM, text-to-speech, and a telephony layer into a single websocket pipeline with a sub-500ms target turn latency. You point it at a number, give it a prompt and some function-call tools, and it picks up the phone. Pricing is the same usage-based model the whole category settled on around mid-2024: roughly $0.07 per minute of conversation, which bundles the LLM cost, the voice synthesis, and the carrier minutes into one line item. The seed round closed in December 2024 at $4.6M led by Altimeter, with participation from a handful of operator angels who run actual call centers.

What makes Retell interesting as a teardown subject is not the technology. The technology is now a commodity — every team in this space is wiring the same four vendors in roughly the same order. What makes it interesting is the geographic concentration. Most US infra companies treat India and SEA as an afterthought, a region that will eventually convert. Retell treated it as the beachhead. Call center labor in Bangalore or Manila is not free, and the buyers there are sophisticated enough to evaluate latency, accent, and barge-in behavior with a stopwatch. Winning there is harder than winning the US developer audience, but the contracts close faster and the customers stick longer.

The headline for any founder reading this: the platform layer is crowded and capital-intensive. The vertical layer sitting on top of Retell or Vapi — voice agents for a specific industry with a specific workflow — is wide open, low-capital, and where the next twenty $5K MRR products will come from.

Dimension	Score
Capital required	25 / 100
Stack reproducibility	35 / 100
Distribution channel	45 / 100
Network effect	30 / 100
Timing	60 / 100

In the Founder's Own Words

"Alerting is live on Retell AI Set Alerts that trigger the moment something goes wrong with your agents. → Know when an agent burns $100 in minutes, success rate hits 0%, sentiment drops 20% and more. → Delivered via email or integrate into your existing tools via webhook"

@retellai, 2026-01-20 (source)

"Your Retell AI voice agents can now text during a call. Send an SMS. Receive a reply. See images. Keep talking. All in the same conversation. → Send links, forms, and payment URLs during the call without dictating → Collect structured info like email, address, order ID"

@retellai, 2026-04-03 (source)

"Retell AI now supports A/B testing Split live call traffic across multiple agents by percentage. Test prompts, voices, or flows on real calls. See which agent performs best, then scale the winner. Try it, it's live!"

@retellai, 2026-03-12 (source)

"Retell AI is now in ChatGPT Build, deploy, test, and monitor AI voice agents for real phone workflows, directly from ChatGPT. Reception, support, outreach, collections and more."

@retellai, 2026-03-10 (source)

"Retell AI voice agents now dynamically match the caller's speaking pace. Here's a quick demo Slow caller — agent slows down. Fast caller — agent speeds up. Caller says "slow down" — it does. One agent can now handle every speaking pace, automatically."

@retellai, 2026-03-06 (source)

Five Minutes Inside Retell AI

I signed up on a Tuesday evening with a throwaway Gmail and a US virtual number I keep around for these reviews. The dashboard loaded fast. The onboarding did not try to be clever — no animated mascot, no five-step tour with a progress bar. There was a sidebar with Agents, Phone Numbers, Knowledge Base, Call History, and a button that said Create Agent. I clicked it.

The agent builder is a single scrollable form. You pick a voice from a dropdown that pulls from ElevenLabs, PlayHT, and Cartesia. You write a system prompt in a textarea. You add tools — functions the agent can call mid-conversation to look something up or trigger a workflow. You set the LLM (GPT-4o-mini was the default, with options for Claude Haiku and a few Llama variants). You can attach a knowledge base, which is a vector store hidden behind a normal file uploader. The whole thing took me maybe six minutes to fill out for a fake dental appointment scheduler.

Then came the test. There is a Talk to Agent button at the top of the page that opens a browser-based call. I clicked it and it picked up. The first thing I noticed was that the agent did not wait for me to finish a sentence — it started responding while I was still trailing off, which is the right behavior for natural conversation but feels strange the first time. End-to-end turn latency felt like maybe 700-900ms in the browser, which is fine but not magical. The voice was an ElevenLabs Rachel-equivalent that sounded human enough that I forgot I was testing for about three turns.

Then I bought a phone number from the Numbers tab — $2 per month, instant provisioning through Twilio passthrough — and pointed my new agent at it. I dialed from my real phone. Over the actual PSTN, on a real cellular connection, latency was noticeably better than the browser test, probably because the audio codec was friendlier. The barge-in worked. When I tried to interrupt mid-sentence to change my appointment time, the agent stopped talking within maybe 200ms and adjusted. That is the moment that separates a working voice product from a demo.

The honest critiques: the analytics view is thin. You get call recordings and transcripts, you get a duration column, you do not get sentiment or topic clustering without bolting on your own pipeline. The Knowledge Base is fine for FAQs but does not handle structured data like business hours or pricing tables gracefully — I ended up putting all of that in the system prompt. The function-calling UI is clean but the error messages when a function fails mid-call are buried in the call log, not surfaced to the agent itself.

Three things stood out as well-engineered. First, the websocket reconnect logic — I deliberately killed my wifi mid-call to see what happened, and the agent gracefully degraded to a hold tone and picked up where it left off. Second, the prompt templates for common verticals (booking, qualification, support triage) are short and sharp rather than the bloated 2000-word system prompts most platforms ship as defaults. Third, the pricing meter on the dashboard ticks in real time during a call, which sounds like a small thing but is the kind of detail that makes finance teams trust the platform enough to put it in front of a CFO.

Business Model Deep Dive

Retell charges roughly $0.07 per minute of conversation. That number is not a list price you would find on a pricing page in 2023 — it is a 2024-and-later number that exists because the underlying components got cheap enough simultaneously. GPT-4o-mini at sub-penny per 1K tokens, ElevenLabs Flash v2 at fractions of a cent per character, Twilio inbound at $0.0085 per minute. The unit economics work because the markup on the combined bundle is healthy without any single line item being expensive enough to scare away a buyer.

The pricing structure has three tiers in practice even though the marketing page reads like a single rate. Self-serve usage starts at the $0.07 number with no commitment. Volume customers — anyone burning more than roughly 100K minutes a month — get a per-minute rate that drops into the $0.04-0.05 range with annual commits. Enterprise contracts include a custom SLA, dedicated capacity, and usually a six-figure annual minimum. The mix of these three tiers is what gets you from the headline $600K MRR to a believable revenue picture: a long tail of small accounts paying retail, a middle layer of mid-market SaaS companies running outbound qualification, and a small number of large contracts that probably account for 40-50% of revenue.
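The tier structure above can be sketched as a small rate function. The $0.07 retail rate and the roughly 100K-minute threshold come from the text; the $0.045 volume rate is an assumed midpoint of the $0.04-0.05 range, and `monthly_bill` is a hypothetical name, not anything from Retell's billing system. Enterprise contracts are custom and are left out.

```python
RETAIL_RATE = 0.07          # USD per minute, self-serve (from the text)
VOLUME_THRESHOLD = 100_000  # minutes per month (approximate, from the text)
VOLUME_RATE = 0.045         # ASSUMPTION: midpoint of the $0.04-0.05 range


def monthly_bill(minutes: int, annual_commit: bool = False) -> float:
    """Estimate a monthly bill in USD under the assumed tiers."""
    if annual_commit and minutes > VOLUME_THRESHOLD:
        rate = VOLUME_RATE
    else:
        rate = RETAIL_RATE
    return round(minutes * rate, 2)


small = monthly_bill(5_000)                        # long-tail retail account
large = monthly_bill(200_000, annual_commit=True)  # volume account on a commit
print(small, large)
```

Under these assumptions a 200K-minute account on a commit pays about what 129K minutes would cost at retail, which is the kind of discount that makes the annual commit an easy sell.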

YC W24 was a meaningful accelerant. The batch graduated in March 2024, and the next twelve months were the window where every YC company that needed a voice layer defaulted to either Retell, Vapi, or Bland. Yifei has talked publicly about the YC alumni effect being worth more than any single piece of paid distribution — internal Slack channels, demo day partner intros, a stamp of credibility that lets enterprise buyers move past the procurement risk question of "what if you go out of business." The $4.6M seed in December 2024, led by Altimeter, came in at a valuation that has not been publicly disclosed but that operators in the space have estimated in the $40-80M range based on comparable rounds.

Customer concentration tilts toward three verticals. Call centers and BPOs in India and the Philippines use Retell to handle the long tail of low-complexity calls that human agents resent — appointment confirmations, address verification, post-call surveys. Real estate is the second cluster, mostly outbound lead qualification and inbound buyer-intent screening for high-volume agencies. Healthcare scheduling is the third, with a few mid-sized clinic networks running appointment reminders and intake at a scale where the savings versus a human receptionist clear the bar to overcome the HIPAA paperwork friction.

The revenue model has one structural risk worth naming. The underlying component costs are still falling. ElevenLabs has shipped lower tiers, OpenAI has shipped the Realtime API, which collapses the ASR-LLM-TTS pipeline into a single endpoint, and Twilio's pricing has been roughly flat for a decade. If a customer can wire those three things directly with a competent engineer, the value Retell adds is the orchestration layer, the dashboard, the call recording infrastructure, and the assumption of operational responsibility. That is real value, but it is the kind of value that gets pressured every six months as the build-versus-buy math shifts. The platforms that survive this category will be the ones that move up the stack into industry-specific workflows before the underlying primitives become trivial enough to glue together over a weekend.

The other structural feature worth understanding is that retention in voice agent platforms is bizarrely good once a customer is in production. Migrating a working voice agent from one platform to another is non-trivial — you have to re-test latency, re-tune the prompt for the new TTS voice quirks, re-validate barge-in behavior, re-port the phone numbers. The cost of switching is high enough that even a 20% price difference does not usually trigger a migration. This is why category leaders can afford to compete on quality and reliability rather than on price.

Tech Stack

The pipeline Retell runs is the canonical voice agent stack circa 2025, and the components are not secret because every team in the space wires them in roughly the same order.

Inbound audio comes in over Twilio or Vonage as a media stream. The audio gets forwarded to a speech-to-text endpoint, almost certainly Deepgram for the streaming case (their Nova-2 model is the de facto standard for sub-200ms partial transcripts) with Whisper as a fallback for batch transcription where latency does not matter. The partial transcripts feed an end-of-turn detection layer, typically paired with voice activity detection on the raw audio, that decides when the user has stopped talking. Getting this right is the difference between a natural-feeling agent and one that constantly interrupts the caller mid-sentence.

Once a turn is detected, the assembled transcript plus the conversation history gets sent to the LLM. GPT-4o-mini is the default for cost reasons, with Claude Haiku and a few Llama 3 variants available for customers who care about specific behaviors. The system prompt and any retrieved knowledge base chunks are stuffed in at the front, function definitions are passed in the standard tool-calling format, and the response streams back token by token.

The response stream feeds the TTS layer. ElevenLabs is the premium default — Flash v2 hits acceptable latency at acceptable quality. Cartesia Sonic is the alternative when you need lower latency and are willing to accept slightly less expressive voices. PlayHT and Coqui show up for customers who want voice cloning. The TTS output is chunked into audio frames and sent back through the same Twilio media stream to the caller.
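The ASR-to-LLM-to-TTS chain described in this section can be sketched with three async generators. This is a minimal sketch and not Retell's code: `fake_asr`, `fake_llm`, and `fake_tts` are hypothetical stand-ins for the real vendor clients, and the "audio" is just encoded text so the example runs without any network access.

```python
import asyncio
from typing import AsyncIterator


async def fake_asr(audio_frames: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Stand-in for a streaming speech-to-text client: one word per frame."""
    async for frame in audio_frames:
        yield frame.decode()


async def fake_llm(transcript: str) -> AsyncIterator[str]:
    """Stand-in for a streaming LLM: emits reply tokens one at a time."""
    for token in f"You said: {transcript}".split():
        await asyncio.sleep(0)  # yield control, as a network read would
        yield token


async def fake_tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stand-in for streaming TTS: turns each token into an audio chunk."""
    async for token in tokens:
        yield token.encode()


async def handle_turn(audio_frames: AsyncIterator[bytes]) -> list[bytes]:
    # 1. Collect the caller's words until the stream ends (end of turn).
    words = [w async for w in fake_asr(audio_frames)]
    # 2. Pipe LLM tokens straight into TTS instead of waiting for the full reply.
    return [chunk async for chunk in fake_tts(fake_llm(" ".join(words)))]


async def caller_audio() -> AsyncIterator[bytes]:
    """Hypothetical caller input, one frame per word."""
    for word in (b"book", b"me", b"Tuesday"):
        yield word


reply_chunks = asyncio.run(handle_turn(caller_audio()))
print(b" ".join(reply_chunks).decode())
```

The design point is the second step: TTS starts consuming tokens as soon as the LLM emits them, which is where most of the perceived latency savings in a real pipeline come from.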

The orchestration layer is where Retell adds proprietary value. Managing the websocket lifecycle, handling reconnects, buffering audio while the LLM is slow to respond, deciding when to start responding versus when to let the user finish (and when to stop for a barge-in), applying business logic between turns: none of this is rocket science, but all of it is fiddly enough that getting it production-ready takes months. The team has hinted in interviews that the orchestration layer runs on Go or Rust services rather than Node, which would be consistent with the latency numbers I observed.
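Barge-in handling is a good example of why this layer is fiddly. A minimal sketch, assuming playback runs as a cancellable task and some detector (not shown) sets an event when the caller starts talking; `speak`, `run_turn`, and the timings are all hypothetical and say nothing about how Retell implements it.

```python
import asyncio


async def speak(chunks: list[str], played: list[str]) -> None:
    """Play reply chunks one by one; cancellation stops playback mid-reply."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.05)  # stand-in for the time a chunk takes to play


async def run_turn(chunks: list[str], caller_speech: asyncio.Event) -> list[str]:
    """Race agent playback against the caller-speech signal."""
    played: list[str] = []
    playback = asyncio.create_task(speak(chunks, played))
    barge_in = asyncio.create_task(caller_speech.wait())
    _, pending = await asyncio.wait(
        {playback, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop talking on barge-in, or drop the unused waiter
    return played


async def demo() -> list[str]:
    caller_speech = asyncio.Event()
    # Simulate the caller cutting in 120 ms into the reply.
    asyncio.get_running_loop().call_later(0.12, caller_speech.set)
    return await run_turn(["Your", "appointment", "is", "on", "Friday"], caller_speech)


played = asyncio.run(demo())
print(played)
```

In production the list of chunks already played matters as much as the cancellation itself, because the next LLM call needs to know what the caller actually heard before interrupting.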

Storage and call recording sit on standard cloud infrastructure — likely S3 for audio, a Postgres instance for metadata, and a vector database (Pinecone or pgvector) for the knowledge base feature. The dashboard is a normal React app talking to a normal REST API. None of this is differentiated.

The interesting technical question is what happens when OpenAI's Realtime API and the equivalent from Anthropic mature to the point where they replace the ASR-LLM-TTS chain with a single speech-in-speech-out endpoint. Retell will need to be a wrapper around those primitives by then, with the value proposition shifting from "we orchestrate the components" to "we orchestrate the business logic, integrations, and operational reliability." My guess is that most of the engineering effort right now is going into staying ahead of that transition.

Distribution

The distribution playbook for Retell is not a single channel but four overlapping ones, and the weighting matters.

YC alumni distribution is the loudest in the first six months after a batch graduates. Yifei has been candid in podcast appearances that the YC batch Slack and the partner intros were the single largest source of paying customers in Q2-Q3 2024. Other YC companies need voice infrastructure, the trust is implicit, the contracts close in days not months. This channel does not scale — there are only so many YC companies and only so many of them need voice — but it gives you a base of paying customers and case studies without spending on paid acquisition.

Developer Twitter and the AI infra discourse on X is the second channel. Yifei and the Retell team have not been as aggressively present as the Vapi team, which was probably a strategic mistake in the short term and may turn out to be a feature in the long term. Vapi's loud developer evangelism brought them a huge surface area of free tier users and tinkerers, which is great for top-of-funnel but creates a support cost. Retell ran quieter, attracted fewer tire-kickers, and converted a higher percentage of signups into paying accounts.

Reddit r/voiceai and a handful of smaller subreddits where call center operators and outbound sales operators talk shop is the third channel. This is where the actual buyers hang out — the people running 50-seat call centers in Manila, the SaaS founders trying to automate outbound qualification, the real estate agencies in Mumbai that want to call 500 leads a day. The discourse here is not about model architecture, it is about which platform handles Indian English accents most reliably and which one has the lowest dropout rate over 3G connections in tier-2 cities. Retell shows up in these conversations more frequently than its raw search volume would predict, which suggests genuine word-of-mouth rather than astroturfing.

The fourth channel is the India/SEA enterprise sales motion, which is the bet that distinguishes Retell from its more US-focused competitors. The economics there are violent in the right direction — a 50-seat call center in Bangalore costs maybe $15-25K per month in salaries, and even a 20% deflection rate to AI agents saves more than the entire Retell bill. The buying process is fast because the ROI is calculable on a napkin. The reference customers compound because the buyers all know each other through the BPO industry's tight network. This channel is harder to start than developer marketing — you actually have to fly to Bangalore and shake hands — but once seeded it produces enterprise contracts that cover the burn from quieter quarters in the US.
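The napkin math can be made explicit as a break-even calculation. This is a sketch under a loud assumption: that salary savings scale linearly with the deflection rate. It uses the $0.07 retail rate and the $15-25K salary range from the text; `breakeven_minutes` is a hypothetical helper.

```python
RETELL_RATE = 0.07  # USD per minute, self-serve rate quoted in this teardown


def breakeven_minutes(monthly_salaries: float, deflection_rate: float) -> int:
    """Billed AI minutes per month at which the bill equals the salary savings.

    ASSUMPTION: savings are proportional to the share of calls deflected.
    """
    savings = monthly_salaries * deflection_rate
    return int(savings / RETELL_RATE)


low = breakeven_minutes(15_000, 0.20)   # $15K/month payroll, 20% deflection
high = breakeven_minutes(25_000, 0.20)  # $25K/month payroll, 20% deflection
print(low, high)
```

Read this way, the claim holds as long as the deflected calls consume fewer than roughly 43K to 71K billed minutes a month at retail pricing, and the headroom grows at the volume rates.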

The thing none of these channels generate is a defensible moat. None of them create network effects, none of them are exclusive, none of them prevent a well-capitalized competitor from running the same playbook a year later. The moat, if there is one, is the operational scar tissue of running production voice infrastructure at scale — the knowledge of which Twilio failure modes happen at 3am, the playbook for handling a TTS provider outage, the relationships with the underlying vendors that let you negotiate volume rates competitors cannot match. That is real but it is not magic.

Why Now, Why This Works

Three forces converged in 2024 that made the voice agent category go from interesting demo to viable business in about eighteen months.

The first is latency. Until GPT-4o, the LLM round trip was the bottleneck — you could optimize ASR and TTS down to a few hundred milliseconds each, but the LLM itself was taking 1-2 seconds to start streaming a response, and the total turn latency was unacceptable for natural conversation. GPT-4o's first-token latency in the 300-500ms range, combined with streaming TTS, made it possible to hit a total turn time under one second consistently. That is the threshold where callers stop noticing they are talking to a machine.
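The latency argument is a simple budget. In the sketch below, the 200 ms figures for ASR and TTS are my assumption for "a few hundred milliseconds each", and the LLM figures are midpoints of the ranges quoted above; none of these are measured values.

```python
def turn_latency_ms(asr_ms: int, llm_first_token_ms: int, tts_ms: int) -> int:
    """Sum the stages between end of caller speech and first agent audio."""
    return asr_ms + llm_first_token_ms + tts_ms


# ASSUMPTION: 200 ms each for ASR and TTS; LLM values are range midpoints.
before_gpt4o = turn_latency_ms(asr_ms=200, llm_first_token_ms=1500, tts_ms=200)
with_gpt4o = turn_latency_ms(asr_ms=200, llm_first_token_ms=400, tts_ms=200)
print(before_gpt4o, with_gpt4o)
```

With those inputs the turn time drops from about 1.9 seconds to about 0.8 seconds, which is the difference between the two sides of the one-second threshold.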

The second is the labor cost arbitrage. Call center labor in the US runs $15-25 per hour fully loaded. Call center labor in India or the Philippines runs $4-8 per hour. AI voice agents at $0.07 per minute work out to roughly $4 per hour of conversation, which is at parity with offshore labor and significantly below US domestic labor. This means the market for voice automation is not waiting for the technology to get cheaper — the technology is already cheap enough — it is waiting for buyers to trust the reliability enough to put it in production. The bottleneck has shifted from cost to confidence.
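The per-hour comparison is easy to reproduce. The sketch below uses only the figures from the paragraph above and, like the text, compares an hour of billed conversation with an hour of paid labor, which flatters neither side precisely.

```python
PER_MINUTE_USD = 0.07  # per-minute rate quoted in this teardown

ai_hourly = round(PER_MINUTE_USD * 60, 2)  # cost of one hour of conversation

labor_hourly = {  # fully loaded ranges from the paragraph above, USD per hour
    "US": (15, 25),
    "India / Philippines": (4, 8),
}

print(f"AI agent: ${ai_hourly:.2f} per hour of conversation")
for region, (low, high) in labor_hourly.items():
    # Ratio of AI cost to the top and bottom of each human labor range.
    print(f"{region}: AI is {ai_hourly / high:.0%} to {ai_hourly / low:.0%} of human cost")
```

The exact figure is $4.20 per hour, so "parity with offshore labor" means parity with the bottom of the offshore range, not the middle of it.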

The third is the commoditization of telephony. Twilio made it boring and reliable a decade ago, and the API surface for buying phone numbers, routing calls, and streaming media is now stable enough that you can build on top of it without worrying about the underlying carrier dynamics. This sounds obvious but matters — every previous wave of voice automation tools had to spend half their engineering effort on the telephony layer itself. Retell and its competitors can spend all of theirs on the AI layer.

The reason this category will not collapse into a single winner the way some infrastructure categories do is that voice agents are inherently vertical. A platform that is excellent for dental appointment scheduling is overengineered for real estate cold calling and underengineered for healthcare triage. The platform layer can be horizontal, but the wedge into each industry is vertical, which means there is room for multiple platforms to win in different geographic and industry slices simultaneously.

Founder

Yifei Wang has the standard YC founder profile — engineering background, prior product experience at a larger company, the kind of crisp narrative arc that gets you into the batch. The interesting thing about his public communication is the consistency of the message: Retell is not trying to be the most innovative voice platform, it is trying to be the most reliable one in production. That positioning shows up in every interview I have heard — he talks about uptime, about failure modes, about the operational discipline of running real-time infrastructure, much more than he talks about model architecture or speech quality.

In a Latent Space podcast appearance in early 2025 he made the point that the hardest engineering problem at Retell is not the AI part, it is the websocket reliability under flaky network conditions. He has also been clear that the company's early growth came from being unfashionable in the right ways — courting BPOs and call centers rather than chasing the developer audience that other YC infra companies chase. That kind of discipline about who not to sell to is rare in early-stage founders.

The YC demo day pitch reportedly led with a single number — minutes of conversation handled per month — rather than a feature list or a moat story. The pitch worked because the number was big enough to be self-evidently a real business. Operators I have talked to who have met him describe him as calm, technically credible, and unusually willing to share specific operational numbers rather than vague directional commentary. That kind of transparency is what closes enterprise deals where the buyer is putting a real production workflow on top of your infrastructure.

Retell AI Teardown — $7M ARR YC W24 Voice Agent Platform

Copyable to YOU