Tokenization
Before a model can read, it has to chop. Tokens are the atoms language models actually see — and the way they’re chosen quietly decides almost everything about how the model behaves.
The five-bullet version
- Models can’t read characters or whole words — both extremes break in practice.
- Modern tokenizers split text into subwords: pieces somewhere between a single character and a whole word.
- The vocabulary is learned from a training corpus, not designed by a linguist.
- Byte-Pair Encoding (BPE) is the canonical algorithm: greedily merge the most common adjacent pair.
- Token count is what you actually pay for, in money and in context window.
§ 00 · ATOMS OF MEANING · Why text needs to be chopped at all
Language models don’t see text. They see lists of integers. Every number is an index into a fixed table — the vocabulary: the complete set of token IDs the model is willing to read or write, typically 32k–200k entries for a modern LLM. The job of the tokenizer is to turn raw text into that list, and back. Everything else — embeddings, attention, the whole transformer — is downstream of this decision.
It looks like a clerical step. It isn’t. The choice of how to chop the text changes what the model can express. A model with a bad tokenizer is stuck the way a typist is stuck with bad keys.
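To make that concrete, here is a minimal sketch using OpenAI's tiktoken library (an assumption of this example: you've run pip install tiktoken; cl100k_base is the GPT-4-era encoding):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's vocabulary

ids = enc.encode("Language models don't see text.")
print(ids)              # a short list of integer IDs, one per token
print(enc.decode(ids))  # lossless round-trip back to the original string
print(enc.n_vocab)      # the size of the fixed table those IDs index into
```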
§ 01 · THE WHITESPACE PROBLEM · Why neither characters nor words work
Two obvious chopping rules — split on characters, split on whitespace — each break in opposite directions.
Characters. If every letter is a token, the vocabulary is tiny (~256, just the bytes), the model can write any string, and there are no out-of-vocabulary surprises. But a 1,000-word document becomes 5,000+ tokens. The model has to learn the spelling of every word before it can learn what the word means. Long-range structure — paragraphs, arguments — is drowned in letter prediction.
Words. If every whitespace-separated chunk is a token, each token carries useful meaning. But the vocabulary explodes — run, runs, running, runner, runnable, misunderstanding, unrelatable — each one a separate row. Worse, any word the model didn’t see in training becomes <UNK> at inference. Misspellings, new product names, code identifiers, German compounds — all opaque.
Subword tokenization sidesteps both problems. Common words get a single token. Rare or novel words get split into pieces — pieces that the model has seen elsewhere. Coverage is universal; the vocabulary stays bounded.
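One way to feel the trade-off is to count the units each chopping rule produces for the same sentence. A quick sketch with tiktoken (exact subword counts vary by tokenizer):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "The runner was running unrelatable marathons."

print(len(text), "characters")            # what a character-level tokenizer pays
print(len(text.split()), "words")         # whitespace words: meaningful, unbounded vocab
print(len(enc.encode(text)), "subwords")  # in between: bounded vocab, universal coverage
```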
§ 02 · BPE — MERGE THE MOST COMMON PAIR · How the pieces are chosen
The standard recipe is Byte-Pair Encoding (BPE), a subword tokenization algorithm originally invented for data compression. The algorithm is dumb in a useful way: start with every character as a token, count how often each pair of adjacent tokens occurs in the training corpus, merge the most common pair into a new token. Repeat until you have the vocab size you want — typically tens of thousands.
Watch what the merger picks: e+r, then l+o, then lo+w. The algorithm is finding the morphemes — suffixes, common stems — by counting alone, without anyone telling it what a morpheme is. After enough merges, frequent whole words (the, and) become single tokens. Less common words break into a stem plus an ending. Brand new strings break further, all the way down to individual bytes if needed.
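The whole training loop fits in a screen of code. Below is a toy sketch over the classic four-word corpus from the Sennrich et al. BPE paper; real implementations add byte-level handling, pre-tokenization, and tie-breaking rules that are omitted here:

```python
from collections import Counter

def pair_counts(corpus):
    """Count adjacent token pairs across the corpus, weighted by word frequency."""
    counts = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            counts[(a, b)] += freq
    return counts

def merge(corpus, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one new token."""
    new_corpus = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])  # the newly minted token
                i += 2
            else:
                out.append(word[i])
                i += 1
        new_corpus[tuple(out)] = new_corpus.get(tuple(out), 0) + freq
    return new_corpus

# Toy corpus: each word pre-split into characters, mapped to its frequency.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

for step in range(1, 5):
    counts = pair_counts(corpus)
    best = max(counts, key=counts.get)
    corpus = merge(corpus, best)
    print(f"merge {step}: {best[0]} + {best[1]}")
```

Run it and the merges come out in the spirit described above: common endings first (e + s, then es + t), then frequent stems (l + o, then lo + w). Nobody told it about suffixes; counting found them.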
BPE is what GPT-2, GPT-3, GPT-4, Llama, Mistral, and most open-weight models use, with small variants:
- byte-level BPE (GPT-2 onward): operate on raw bytes rather than Unicode characters, so any byte sequence is representable — emoji, multilingual text, malformed UTF-8.
- WordPiece (BERT family): same idea, slightly different merge criterion (likelihood instead of frequency).
- SentencePiece (T5, Llama): wraps BPE or unigram tokenization with whitespace baked in as a normal token, so the tokenizer doesn’t need a pre-tokenization step.
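The surface differences between families are easy to see by tokenizing the same word with two of them. A sketch using Hugging Face transformers (the model names are the standard open Hub checkpoints):

```python
from transformers import AutoTokenizer

word = "unrelatable"
for name in ["gpt2", "bert-base-uncased"]:  # byte-level BPE vs. WordPiece
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.tokenize(word))
# GPT-2 uses 'Ġ' to mark pieces that begin with a space; BERT's WordPiece
# marks word-internal continuations with '##'. Same idea, different conventions.
```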
§ 03 · THE VOCABULARY YOU END UP WITH · What the result looks like
After training, you have a fixed table of subword tokens — usually 32,000 to 128,000 entries. Run any text through the trained tokenizer and you get back a list of integer IDs. The most common English words are one token each. Less common words split: "tokenization" in GPT-4 is token + ization; "antidisestablishmentarianism" is six tokens.
Three things to notice:
- Spaces are part of the token. "hello" at the start of a sentence and " hello" mid-sentence are different tokens. The leading space is encoded.
- Capitalization matters. "The" and "the" are different. So are "API" and "api".
- Code is dense. useEffect, self.user_id, and JSON keys are tokenized very differently than prose, often into more tokens per visual character.
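All three observations are easy to verify yourself with tiktoken:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["hello", " hello", "The", "the", "API", "api",
             "tokenization", "useEffect", "self.user_id"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r:>16} -> {len(ids)} token(s): {pieces}")
```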
§ 04 · WHAT THIS COSTS YOU · Why this is worth caring about
Tokens are the unit of three things you actually pay for: API cost, context window, and latency. So tokenization quietly drives:
- Price. You pay per token. English prose averages ~1.3 tokens per word (one token is roughly 0.75 words). Code averages ~1.5–2 tokens per word. Chinese, Japanese, Korean — often 2–3x the tokens of equivalent English. The same idea costs more to express in some languages (a back-of-envelope sketch follows this list).
- Context. A "128k context" model gets you ~96,000 English words, or ~50,000 lines of code, or maybe 30,000 characters of Japanese. The same window holds wildly different amounts of meaning depending on the language and content.
- What can be generated. A model can only emit tokens in its vocabulary. If the tokenizer can't represent a string at all (a fresh emoji or rare script, in a vocabulary without byte-level fallback), the model can't produce it, period. This is one reason new releases often ship a new tokenizer alongside new languages or modalities.
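As promised above, a back-of-envelope sketch of token counting for cost and context budgeting. The per-token rate here is a made-up placeholder, not any provider's real price:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def estimate(text, usd_per_million_tokens=2.50):  # placeholder rate: an assumption
    n = len(enc.encode(text))
    return n, n / 1_000_000 * usd_per_million_tokens

prose = "The quick brown fox jumps over the lazy dog. " * 500
code = "const userId = session.user?.id ?? anonymousId;\n" * 500

for label, text in [("prose", prose), ("code", code)]:
    n, cost = estimate(text)
    print(f"{label}: {n} tokens, ~${cost:.4f} at the assumed rate")
```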
§ 05 · TAKING THIS FORWARD · Where to go next
Tokens are the input to the next thing the model does: turn each token into a vector. That’s the embedding layer, and it’s where words start to have neighbors. From there, attention lets each token decide which other tokens to look at — that’s the architecture piece. Both are next in this drip series.
Practical follow-up: when you’re budgeting context or estimating cost in your own application, use the model’s own tokenizer — don’t estimate from character counts. tiktoken for OpenAI, the Hugging Face tokenizer for everything else.
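For example, with a Hugging Face tokenizer (gpt2 is used here only because it downloads without authentication; swap in the checkpoint of the model you're actually budgeting for):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for your target model
text = "Budget your context with the model's own tokenizer, not character counts."
print(len(tok.encode(text)), "tokens")
print(tok.tokenize(text))  # the actual pieces; 'Ġ' marks a leading space
```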
§ · GOING DEEPER · Where tokenizers actually fail
Two well-known failure modes are worth knowing because they bite in production. First, multilingual unfairness: most popular tokenizers were trained on English-heavy corpora, so the same idea in Hindi, Thai, or Burmese can cost 3–5× more tokens than in English. That’s a real per-call price difference, measured and reported by Petrov et al. (2023).
Second, tokenization brittleness on numbers and code. Most tokenizers split “1234” and “1235” into similar but different token sequences with no notion that these are nearby numbers — which is why pre-2023 LLMs were famously bad at arithmetic. Newer tokenizers such as Llama 3’s and GPT-4o’s moved to digit-aware tokenization (numbers split into short, consistent digit chunks rather than arbitrary merges), which cleanly improved numeric reasoning. Worth keeping in mind: the atoms shape every downstream skill.
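The old brittleness is directly visible if you push nearby numbers through GPT-2's tokenizer, a pre-digit-alignment vocabulary:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2's original byte-level BPE

for s in ["1234", "1235", "12345"]:
    pieces = [enc.decode([i]) for i in enc.encode(s)]
    print(s, "->", pieces)  # nearby numbers split into unrelated chunks
```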
§ · FURTHER READING · References & deeper sources
- Sennrich, Haddow & Birch (2016). Neural Machine Translation of Rare Words with Subword Units (BPE) · ACL
- Kudo & Richardson (2018). SentencePiece: A simple and language independent subword tokenizer · EMNLP
- Radford et al. (2019). Language Models are Unsupervised Multitask Learners (GPT-2, byte-level BPE) · OpenAI Technical Report
- Bostrom & Durrett (2020). Byte Pair Encoding is Suboptimal for Language Model Pretraining · EMNLP Findings
- Petrov et al. (2023). Language Model Tokenizers Introduce Unfairness Between Languages · NeurIPS
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.