Why do LLM APIs charge input and output tokens separately?

Generating tokens (output) is computationally far more expensive than reading tokens (input). During inference, the GPU runs a full forward pass for every output token but only a single parallel pass for the entire input prompt. That difference in compute shows up as a 2-8x price gap across the models compared here: GPT-4o charges $2.50/M input vs $10.00/M output (4x), and Claude Sonnet 4.6 charges $3.00/M vs $15.00/M (5x). Getting this ratio wrong in your cost model leads to large underestimates on generation-heavy workloads like code review or creative writing.

How do I estimate my token count without running the API?

A rough rule of thumb: 1 token ≈ 0.75 English words, so 1,000 words ≈ 1,333 tokens. Code tokenizes less efficiently than prose: plan on 1 token ≈ 3-4 characters. OpenAI's tiktoken library (open source, works offline) gives exact counts for GPT-4o. Anthropic's tokenizer is close enough to tiktoken for planning purposes. If you have existing logs, measure a one-week sample and scale it to your monthly volume; that beats any estimate.
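The rule of thumb above can be turned into a quick estimator. This is a minimal sketch: the function name and the 3.5 characters-per-token midpoint for code are assumptions made for illustration, and an exact count still requires a real tokenizer such as tiktoken.

```python
WORDS_PER_TOKEN = 0.75        # rule of thumb for English prose
CODE_CHARS_PER_TOKEN = 3.5    # assumed midpoint of the 3-4 character range


def estimate_tokens(text: str, is_code: bool = False) -> int:
    """Rough token estimate; not a substitute for a real tokenizer."""
    if is_code:
        # Code: estimate from character count.
        return round(len(text) / CODE_CHARS_PER_TOKEN)
    # Prose: estimate from whitespace-separated word count.
    return round(len(text.split()) / WORDS_PER_TOKEN)


sample = "word " * 1000
print(estimate_tokens(sample))
```

Treat the result as a planning number only; real tokenizers can differ noticeably on non-English text, numbers, and unusual identifiers.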

Which model is cheapest for RAG (retrieval-augmented generation)?

DeepSeek V4 Pro at $0.14/M input is the clear winner for RAG workloads because RAG is input-heavy by design: you inject large retrieval chunks into the prompt, and the model writes a short summary or answer. At a 70/30 input/output split with 100M total tokens, DeepSeek costs about $18 vs $475 for GPT-4o and $660 for Claude Sonnet 4.6. The tradeoff is latency and availability: DeepSeek's API is rate-limited and slower than OpenAI for large production loads. Gemini 2.5 Pro's 1M-token context window makes it the strongest choice if you need long-document RAG without chunking.
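A minimal sketch of the arithmetic behind a 70/30 comparison, using the per-million rates listed in this page's pricing table. The dictionary layout and the function name are illustrative assumptions, not part of any vendor SDK.

```python
# USD per 1M tokens as (input, output), copied from the pricing table on this page.
RATES = {
    "DeepSeek V4 Pro": (0.14, 0.28),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "GPT-4o": (2.50, 10.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
}


def monthly_cost(model: str, input_m: float, output_m: float) -> float:
    """Cost in USD for token volumes given in millions of tokens."""
    in_rate, out_rate = RATES[model]
    return input_m * in_rate + output_m * out_rate


# RAG-style workload: 100M total tokens split 70M input / 30M output.
for name in RATES:
    print(f"{name}: ${monthly_cost(name, 70, 30):,.2f}")
```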

What are context windows, and how do they affect price?

A context window is the maximum number of tokens a model can read in one call (input + output combined). GPT-4o supports 128K tokens, Claude Sonnet 4.6 supports 200K, and Gemini 2.5 Pro supports 1M. Larger context windows let you send longer documents without chunking, but every token in the window counts toward your input cost. Sending a 100K-token document with GPT-4o costs $0.25 per call. If you run 100,000 such calls per month, that's $25,000 in input costs alone — choose model context size carefully.
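The per-call figure above follows directly from the input rate. The sketch below reproduces it and adds a simple window check; the helper names are hypothetical, and the window sizes are the ones quoted in this answer.

```python
# Context window sizes in tokens, as quoted in this answer.
CONTEXT_WINDOWS = {
    "GPT-4o": 128_000,
    "Claude Sonnet 4.6": 200_000,
    "Gemini 2.5 Pro": 1_000_000,
}


def input_cost_per_call(prompt_tokens: int, input_rate_per_m: float) -> float:
    """USD input cost of one call, given the rate per 1M input tokens."""
    return prompt_tokens / 1_000_000 * input_rate_per_m


def fits_in_window(model: str, prompt_tokens: int, max_output_tokens: int) -> bool:
    """Input and output share the window, so both must fit together."""
    return prompt_tokens + max_output_tokens <= CONTEXT_WINDOWS[model]


per_call = input_cost_per_call(100_000, 2.50)   # 100K-token prompt on GPT-4o
monthly = per_call * 100_000                    # 100,000 calls per month
print(f"${per_call:.2f} per call, ${monthly:,.0f} per month")
print(fits_in_window("GPT-4o", 100_000, 4_000))
```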

Do any of these models offer batch API discounts?

Yes. OpenAI Batch API offers 50% off for asynchronous jobs with up to 24-hour turnaround — GPT-4o drops to $1.25/M input and $5.00/M output. Anthropic's Message Batches API offers the same 50% discount for Claude models. Google Vertex AI offers committed-use discounts for Gemini but not a simple batch tier. DeepSeek does not currently offer a formal batch discount. For bulk processing workloads with relaxed latency requirements, batch mode nearly halves your bill.
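To see what the discount means when only part of your traffic can wait, here is a hedged sketch. It assumes a flat 50% discount on both input and output tokens for the batched share; `batch_share` and the function names are illustrative, not vendor parameters.

```python
BATCH_DISCOUNT = 0.50  # 50% off, as described above


def batch_rate(standard_rate_per_m: float) -> float:
    """Per-1M-token rate after the batch discount."""
    return standard_rate_per_m * (1 - BATCH_DISCOUNT)


def blended_cost(input_m: float, output_m: float,
                 in_rate: float, out_rate: float,
                 batch_share: float) -> float:
    """Monthly USD cost when `batch_share` (0..1) of traffic runs in batch mode."""
    standard = input_m * in_rate + output_m * out_rate
    return standard * (1 - batch_share * BATCH_DISCOUNT)


print(batch_rate(2.50), batch_rate(10.00))          # GPT-4o batch rates
print(blended_cost(5, 5, 2.50, 10.00, batch_share=0.4))
```

If only 40% of a workload tolerates a 24-hour turnaround, the saving is 20% of the bill rather than 50%, which is worth checking before planning around the headline discount.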

Why does the calculator show input and output cost separately?

Because the ratio between input and output tokens can swing your monthly bill by up to the model's output/input price ratio (2-8x in this comparison) at the same total token count. A "chatbot" at 10M total tokens has 5M input + 5M output; a "code review" has 2M input + 8M output. With Claude Sonnet 4.6, the chatbot costs $90 but the code review costs $126, which is 40% more at identical total volume. Separate fields let you model your actual ratio instead of guessing from a single slider.
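The chatbot and code review figures can be reproduced by holding total volume fixed and varying the output share. A minimal sketch, with a hypothetical function name and the Claude Sonnet 4.6 rates quoted on this page:

```python
def cost_at_split(total_m: float, output_share: float,
                  in_rate: float, out_rate: float) -> float:
    """Monthly USD cost for `total_m` million tokens at a given output share (0..1)."""
    output_m = total_m * output_share
    input_m = total_m - output_m
    return input_m * in_rate + output_m * out_rate


# Claude Sonnet 4.6: $3.00/M input, $15.00/M output.
chatbot = cost_at_split(10, 0.5, 3.00, 15.00)       # 5M in + 5M out
code_review = cost_at_split(10, 0.8, 3.00, 15.00)   # 2M in + 8M out
print(f"chatbot ${chatbot:.2f}, code review ${code_review:.2f}")
print(f"difference: {code_review / chatbot - 1:.0%}")
```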

How accurate are these prices? Are they current?

Prices were verified against official API pricing pages in May 2026. LLM pricing has been volatile — DeepSeek cut prices 75% in early 2026, and OpenAI regularly adjusts rates for new model versions. Always re-check vendor pricing pages before committing to a budget. Links: platform.openai.com/docs/pricing, anthropic.com/pricing, deepseek.com/api, ai.google.dev/gemini-api/docs/pricing.

What is NOT included in this calculator?

Three things are not modelled: (1) Prompt caching: Anthropic and OpenAI both offer cached-prompt discounts (up to 90% off for repeated prefixes), which can cut input costs dramatically for consistent system prompts. (2) Fine-tuning costs: training runs and fine-tuned model hosting are priced separately. (3) Embedding API calls: if you use text-embedding-3-small for RAG chunking, budget an additional $0.02/M tokens. Add a 10-20% buffer to the calculator's output for a realistic production estimate.
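If you want to fold these adjustments in by hand, the sketch below shows one way to do it. It rests on simplifying assumptions that are not vendor billing rules: cached input tokens are billed at (1 - discount) times the input rate, cache-write surcharges are ignored, and the 15% default buffer is just the midpoint of the 10-20% range. All parameter names are hypothetical.

```python
def adjusted_estimate(input_m: float, output_m: float,
                      in_rate: float, out_rate: float,
                      cached_share: float = 0.0,     # fraction of input served from cache
                      cache_discount: float = 0.90,  # assumed best-case discount
                      embedding_m: float = 0.0,      # millions of embedded tokens
                      embedding_rate: float = 0.02,  # USD per 1M embedding tokens
                      buffer: float = 0.15) -> float:
    """Monthly USD estimate including caching, embeddings, and a safety buffer."""
    # Cached tokens pay the discounted rate; the rest pay the full input rate.
    input_cost = input_m * in_rate * (1 - cached_share * cache_discount)
    output_cost = output_m * out_rate
    embedding_cost = embedding_m * embedding_rate
    return (input_cost + output_cost + embedding_cost) * (1 + buffer)


# Claude Sonnet 4.6 chatbot: half the input cached, 10M tokens embedded.
print(f"${adjusted_estimate(5, 5, 3.00, 15.00, cached_share=0.5, embedding_m=10):.2f}")
```

Because output tokens are never cached, caching helps input-heavy workloads far more than generation-heavy ones.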

LLM API Token Cost Calculator: GPT-4o vs Claude Sonnet 4.6 vs DeepSeek V4 Pro vs Gemini 2.5 Pro

Compare monthly LLM API costs by actual input/output token ratio — not just total volume. Select your use-case profile, adjust token counts, and see exactly which model saves money for your workload. Prices sourced from official API pricing pages in May 2026.

Updated May 25, 2026 · By Jim Liu

TL;DR

• Cheapest for most workloads: DeepSeek V4 Pro at $0.14/M input is about 18x cheaper than GPT-4o and 21x cheaper than Claude Sonnet 4.6 on input cost alone.
• Output costs dominate code review / generation: Output tokens cost 2-8x more than input tokens depending on the model (4x for GPT-4o, 5x for Claude Sonnet 4.6). At the same total volume, a workload that is 80% output costs roughly 1.4-2.1x more than a RAG app that is 70% input.
• Lowest input price among premium models: Gemini 2.5 Pro at $1.25/M input + $10/M output undercuts GPT-4o ($2.50) and Claude ($3.00) on input-heavy workloads and offers a 1M-token context window.
• Batch API cuts costs 50%: OpenAI Batch API and Anthropic Message Batches both offer 50% off for async jobs with 24-hour turnaround.
• Prompt caching not included: Repeated system prompts with Anthropic caching can cut input costs up to 90% — add that manually for accurate production estimates.

Configure your workload

Use-case profile (sets input/output ratio): Balanced turns, moderate context, balanced cost

Input tokens / month (M): context, system prompt, retrieved chunks
Output tokens / month (M): generated text, code, answers, completions

Total tokens / month: 10.0M (input 5.0M, 50%; output 5.0M, 50%)

Cheapest at this workload

DeepSeek V4 Pro by DeepSeek: $2.10/mo

Cheapest frontier-grade model at $0.14/M input. About 27-43x cheaper than the other three models at this workload. Best ROI for bulk processing and RAG.

Side-by-side monthly cost

Model	Provider	Input cost	Output cost	Total / mo	vs cheapest
DeepSeek V4 Pro	DeepSeek	$0.70	$1.40	$2.10	Cheapest
Gemini 2.5 Pro	Google	$6.25	$50.00	$56.25	+2579%
GPT-4o	OpenAI	$12.50	$50.00	$62.50	+2876%
Claude Sonnet 4.6	Anthropic	$15.00	$75.00	$90.00	+4186%

Based on public pricing pages reviewed May 2026. Input cost = inputM × input rate; output cost = outputM × output rate. Batch API (50% off) and prompt caching discounts not applied. Real bills may differ by 10-25% due to rounding, minimum charges, and regional pricing.
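The formula in the note above is enough to rebuild the side-by-side table. This is a sketch under the stated rates, not the calculator's actual source; the function and variable names are assumptions.

```python
# USD per 1M tokens as (input, output), from the pricing table on this page.
RATES = {
    "DeepSeek V4 Pro": (0.14, 0.28),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "GPT-4o": (2.50, 10.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
}


def compare(input_m: float, output_m: float):
    """Return (model, input cost, output cost, total, % above cheapest), cheapest first."""
    rows = []
    for name, (in_rate, out_rate) in RATES.items():
        in_cost = input_m * in_rate
        out_cost = output_m * out_rate
        rows.append((name, in_cost, out_cost, in_cost + out_cost))
    rows.sort(key=lambda row: row[3])
    cheapest = rows[0][3]
    return [(n, i, o, t, (t / cheapest - 1) * 100) for n, i, o, t in rows]


for name, in_cost, out_cost, total, pct in compare(5, 5):
    label = "Cheapest" if pct == 0 else f"+{pct:.0f}%"
    print(f"{name}\t${in_cost:.2f}\t${out_cost:.2f}\t${total:.2f}\t{label}")
```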

By the Numbers — Current API Pricing (May 2026)

Model	Provider	Input (per 1M tokens)	Output (per 1M tokens)	Output / Input ratio
GPT-4o	OpenAI	$2.50/M	$10.00/M	4.0x
Claude Sonnet 4.6	Anthropic	$3.00/M	$15.00/M	5.0x
DeepSeek V4 Pro	DeepSeek	$0.14/M	$0.28/M	2.0x
Gemini 2.5 Pro	Google	$1.25/M	$10.00/M	8.0x

DeepSeek V4 Pro redefined the cost floor in early 2026 with a 75% price cut, landing at $0.14/M input and $0.28/M output. At those rates, 100M tokens at a 50/50 input/output split costs roughly $21, a bill that would be $625 with GPT-4o or $900 with Claude Sonnet 4.6. The primary trade-off is API rate limits and higher latency from overseas routing.

Gemini 2.5 Pro has the lowest input price among premium models at $1.25/M input and $10/M output. Its 8x output/input ratio is the steepest in this comparison, so the savings apply mainly to input-heavy workloads. Its 1M-token context window is particularly valuable for long-document RAG: you can feed an entire codebase or legal document that fits in the window without chunking, which removes much of the retrieval pipeline's complexity.

GPT-4o at $2.50/M input and $10.00/M output remains the reliability benchmark. Its 99.9% uptime SLA, fine-tuning support, and consistent latency make it the default for production workloads where cost is secondary to stability. The 4x output multiplier means a balanced 50/50 chatbot spends 80% of its bill on output tokens.
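The 80% figure is simply the output cost divided by the total bill. A short sketch, with a hypothetical helper name and the rates quoted on this page:

```python
def output_share_of_bill(input_m: float, output_m: float,
                         in_rate: float, out_rate: float) -> float:
    """Fraction of the monthly bill (0..1) attributable to output tokens."""
    input_cost = input_m * in_rate
    output_cost = output_m * out_rate
    total = input_cost + output_cost
    if total == 0:
        raise ValueError("total cost is zero; share is undefined")
    return output_cost / total


# Balanced 50/50 workload at 10M total tokens.
print(f"GPT-4o: {output_share_of_bill(5, 5, 2.50, 10.00):.0%}")
print(f"Claude Sonnet 4.6: {output_share_of_bill(5, 5, 3.00, 15.00):.0%}")
```

For an even split the share depends only on the output/input price ratio r, as r / (1 + r), so a 4x multiplier gives 4/5 of the bill.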

Claude Sonnet 4.6 at $3.00/M input and $15.00/M output carries the highest output price in this comparison — 5x the input rate. It is the right choice for quality-critical generation tasks (structured documents, code, reasoning chains) where the marginal quality improvement justifies the premium. Anthropic's prompt caching can cut effective input costs by up to 90% for repeated system prompts, narrowing the gap with cheaper alternatives.

FAQ
