LLM Tokenization Explained
Tokens are the atomic units LLMs process — not words, but subword pieces. Understanding tokens helps you write better prompts and manage API costs.
What is a Token?
A token is a chunk of text, roughly 3-4 characters of English on average. "ChatGPT" is 3 tokens: "Chat", "G", "PT". Common words like "the" are a single token, while rare words may be split into many pieces. Most LLM APIs bill per token, so token count drives cost.
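To make the splitting concrete, here is a toy greedy longest-match tokenizer over a hypothetical mini-vocabulary. Real GPT tokenizers use byte-pair encoding over vocabularies of tens of thousands of entries, so this is illustrative only, but it shows how "chatgpt" can fall apart into subword pieces when the whole word is not in the vocabulary.

```python
# Toy greedy longest-match subword tokenizer (illustrative only;
# real GPT tokenizers apply learned BPE merges, not longest-match).
def tokenize(text, vocab):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: emit it alone
            i += 1
    return tokens

vocab = {"chat", "g", "pt", "the"}  # hypothetical mini-vocabulary
print(tokenize("chatgpt", vocab))   # ['chat', 'g', 'pt']
```

Because "chatgpt" is absent from the vocabulary, the tokenizer falls back to the largest known pieces, mirroring the 3-token split described above.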
Why Not Just Use Words?
Word-level tokenization struggles with rare words, typos, and other languages: any word outside the fixed vocabulary becomes an unknown token. Subword tokenization (such as Byte-Pair Encoding, or BPE, used by GPT models) handles any text by building a vocabulary of frequently co-occurring character sequences, so unseen words decompose into known pieces instead of being lost.
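The heart of BPE training fits in a few lines: start from individual characters, repeatedly find the most frequent adjacent pair, and merge it into a new vocabulary symbol. This is a minimal sketch of the training loop on a three-word corpus, not any production tokenizer.

```python
from collections import Counter

def most_frequent_pair(corpus):
    # Count adjacent symbol pairs across all words in the corpus.
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(corpus, pair):
    # Replace every occurrence of the pair with one fused symbol.
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

# Start from characters; each merge adds one vocabulary entry.
corpus = [list("lower"), list("lowest"), list("low")]
for _ in range(2):
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
print(corpus)  # after 2 merges, "low" is a single symbol
```

After two merges the shared prefix "low" becomes one symbol, while the rarer endings "er" and "est" remain split, which is exactly the frequency-driven behavior described above.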
Tokens and Context Windows
Every LLM has a maximum context window measured in tokens — GPT-4o's is 128K, Claude 3.5 Sonnet's is 200K. Both your prompt AND the model's response consume tokens from this budget.
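Because the prompt and the response draw on the same budget, a long prompt shrinks the room left for the answer. A minimal sketch of that bookkeeping (the function name and default window size are illustrative, not an API):

```python
def fits_context(prompt_tokens, max_response_tokens, window=128_000):
    # Prompt and response share one token budget (illustrative window size).
    return prompt_tokens + max_response_tokens <= window

print(fits_context(120_000, 4_000))   # True: 124K total fits in 128K
print(fits_context(120_000, 10_000))  # False: 130K exceeds the window
```

In practice, APIs reject or truncate requests that exceed the window, so checking the budget before sending avoids failed calls.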
Practical Token Tips
Use the OpenAIToolsHub Token Counter tool to estimate costs before sending large prompts. As a rule of thumb, 1,000 tokens ≈ 750 English words. Non-English text is usually less token-efficient: CJK text often costs one or more tokens per character, so the same content can consume noticeably more of your budget.
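The rule of thumb above translates directly into a rough estimator. This applies only the 1,000-tokens-per-750-words heuristic for English text; it is not a real tokenizer count, and the function name is hypothetical.

```python
def estimate_tokens(text):
    # Rule of thumb: 1,000 tokens per 750 English words,
    # i.e. roughly 4/3 tokens per word. English-only heuristic.
    words = len(text.split())
    return round(words * 1000 / 750)

print(estimate_tokens("one two three"))  # 4: three words -> ~4 tokens
```

For billing-accurate counts, use an actual tokenizer-backed counter rather than this heuristic, since token-per-word ratios vary with vocabulary, punctuation, and language.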