Tokenization
Before a model can read, it has to chop. Tokens are the atoms language models actually see — and the way they’re chosen quietly decides almost everything about how the model behaves.
The five-bullet version
- Models can’t read characters or whole words — both extremes break in practice.
- Modern tokenizers split text into subwords: pieces somewhere between a single character and a whole word.
- The vocabulary is learned from a training corpus, not designed by a linguist.
- Byte-Pair Encoding (BPE) is the canonical algorithm: greedily merge the most common adjacent pair.
- Token count is what you actually pay for, in money and in context window.
§ 00 · ATOMS OF MEANING · Why text needs to be chopped at all
Language models don’t see text. They see lists of integers. Every number is an index into a fixed table — the vocabulary: the complete set of token IDs the model is willing to read or write, typically 32k–200k entries for a modern LLM. The job of the tokenizer is to turn raw text into that list, and back. Everything else — embeddings, attention, the whole transformer — is downstream of this decision.
It looks like a clerical step. It isn’t. The choice of how to chop the text changes what the model can express. A model with a bad tokenizer is stuck the way a typist is stuck with bad keys.
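To make that concrete, here is a minimal sketch using OpenAI's tiktoken library (an assumption of this example: you've run pip install tiktoken; cl100k_base is the GPT-4-era encoding):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's vocabulary

ids = enc.encode("Language models don't see text.")
print(ids)              # a short list of integer IDs, one per token
print(enc.decode(ids))  # lossless round-trip back to the original string
print(enc.n_vocab)      # the size of the fixed table those IDs index into
```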
§ 01 · THE WHITESPACE PROBLEM · Why neither characters nor words work
Two obvious chopping rules — split on characters, split on whitespace — each break in opposite directions.
Characters. If every letter is a token, the vocabulary is tiny (~256, just the bytes), the model can write any string, and there are no out-of-vocabulary surprises. But a 1,000-word document becomes 5,000+ tokens. The model has to learn the spelling of every word before it can learn what the word means. Long-range structure — paragraphs, arguments — is drowned in letter prediction.
Words. If every whitespace-separated chunk is a token, each token carries useful meaning. But the vocabulary explodes — run, runs, running, runner, runnable, misunderstanding, unrelatable — each one a separate row. Worse, any word the model didn’t see in training becomes <UNK> at inference. Misspellings, new product names, code identifiers, German compounds — all opaque.
Subword tokenization sidesteps both problems. Common words get a single token. Rare or novel words get split into pieces — pieces that the model has seen elsewhere. Coverage is universal; the vocabulary stays bounded.
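One way to feel the trade-off is to count the units each chopping rule produces for the same sentence. A quick sketch with tiktoken (exact subword counts vary by tokenizer):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "The runner was running unrelatable marathons."

print(len(text), "characters")            # what a character-level tokenizer pays
print(len(text.split()), "words")         # whitespace words: meaningful, unbounded vocab
print(len(enc.encode(text)), "subwords")  # in between: bounded vocab, universal coverage
```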
§ 02 · BPE — MERGE THE MOST COMMON PAIR · How the pieces are chosen
The standard recipe is Byte-Pair Encoding (BPE), a subword tokenization algorithm originally invented for data compression. The algorithm is dumb in a useful way: start with every character as a token, count how often each pair of adjacent tokens occurs in the training corpus, merge the most common pair into a new token. Repeat until you have the vocab size you want — typically tens of thousands.
Watch what the merger picks: e+r, then l+o, then lo+w. The algorithm is finding the morphemes — suffixes, common stems — by counting alone, without anyone telling it what a morpheme is. After enough merges, frequent whole words (the, and) become single tokens. Less common words break into a stem plus an ending. Brand new strings break further, all the way down to individual bytes if needed.
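The whole training loop fits in a screen of code. Below is a toy sketch over the classic four-word corpus from the Sennrich et al. BPE paper; real implementations add byte-level handling, pre-tokenization, and tie-breaking rules that are omitted here:

```python
from collections import Counter

def pair_counts(corpus):
    """Count adjacent token pairs across the corpus, weighted by word frequency."""
    counts = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            counts[(a, b)] += freq
    return counts

def merge(corpus, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one new token."""
    new_corpus = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])  # the newly minted token
                i += 2
            else:
                out.append(word[i])
                i += 1
        new_corpus[tuple(out)] = new_corpus.get(tuple(out), 0) + freq
    return new_corpus

# Toy corpus: each word pre-split into characters, mapped to its frequency.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

for step in range(1, 5):
    counts = pair_counts(corpus)
    best = max(counts, key=counts.get)
    corpus = merge(corpus, best)
    print(f"merge {step}: {best[0]} + {best[1]}")
```

Run it and the merges come out in the spirit described above: common endings first (e + s, then es + t), then frequent stems (l + o, then lo + w). Nobody told it about suffixes; counting found them.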
BPE is what GPT-2, GPT-3, GPT-4, Llama, Mistral, and most open-weight models use, with small variants:
- byte-level BPE (GPT-2 onward): operate on raw bytes rather than Unicode characters, so any byte sequence is representable — emoji, multilingual text, malformed UTF-8.
- WordPiece (BERT family): same idea, slightly different merge criterion (likelihood instead of frequency).
- SentencePiece (T5, Llama): wraps BPE or unigram tokenization with whitespace baked in as a normal token, so the tokenizer doesn’t need a pre-tokenization step.
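The surface differences between families are easy to see by tokenizing the same word with two of them. A sketch using Hugging Face transformers (the model names are the standard open Hub checkpoints):

```python
from transformers import AutoTokenizer

word = "unrelatable"
for name in ["gpt2", "bert-base-uncased"]:  # byte-level BPE vs. WordPiece
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.tokenize(word))
# GPT-2 uses 'Ġ' to mark pieces that begin with a space; BERT's WordPiece
# marks word-internal continuations with '##'. Same idea, different conventions.
```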
§ 03 · THE VOCABULARY YOU END UP WITH · What the result looks like
After training, you have a fixed table of subword tokens — usually 32,000 to 128,000 entries. Run any text through the trained tokenizer and you get back a list of integer IDs. The most common English words are one token each. Less common words split: "tokenization" in GPT-4 is token + ization; "antidisestablishmentarianism" is six tokens.
Three things to notice:
- Spaces are part of the token. "hello" at the start of a sentence and " hello" mid-sentence are different tokens. The leading space is encoded.
- Capitalization matters. "The" and "the" are different. So are "API" and "api".
- Code is dense. useEffect, self.user_id, and JSON keys are tokenized very differently than prose, often into more tokens per visual character.
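All three observations are easy to verify yourself with tiktoken:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["hello", " hello", "The", "the", "API", "api",
             "tokenization", "useEffect", "self.user_id"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r:>16} -> {len(ids)} token(s): {pieces}")
```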
§ 04 · WHAT THIS COSTS YOU · Why this is worth caring about
Tokens are the unit of three things you actually pay for: API cost, context window, and latency. So tokenization quietly drives:
- Price. You pay per token. English prose averages ~1.3 tokens per word (one token is roughly 0.75 words). Code averages ~1.5–2 tokens per word. Chinese, Japanese, Korean — often 2–3x the tokens of equivalent English. The same idea costs more to express in some languages (a back-of-envelope sketch follows this list).
- Context. A "128k context" model gets you ~96,000 English words, or ~50,000 lines of code, or maybe 30,000 characters of Japanese. The same window holds wildly different amounts of meaning depending on the language and content.
- What can be generated. A model can only emit tokens in its vocabulary. If the tokenizer can't represent a string at all (a fresh emoji or rare script, in a vocabulary without byte-level fallback), the model can't produce it, period. This is one reason new releases often ship a new tokenizer alongside new languages or modalities.
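As promised above, a back-of-envelope sketch of token counting for cost and context budgeting. The per-token rate here is a made-up placeholder, not any provider's real price:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def estimate(text, usd_per_million_tokens=2.50):  # placeholder rate: an assumption
    n = len(enc.encode(text))
    return n, n / 1_000_000 * usd_per_million_tokens

prose = "The quick brown fox jumps over the lazy dog. " * 500
code = "const userId = session.user?.id ?? anonymousId;\n" * 500

for label, text in [("prose", prose), ("code", code)]:
    n, cost = estimate(text)
    print(f"{label}: {n} tokens, ~${cost:.4f} at the assumed rate")
```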
§ 05 · TAKING THIS FORWARD · Where to go next
Tokens are the input to the next thing the model does: turn each token into a vector. That’s the embedding layer, and it’s where words start to have neighbors. From there, attention lets each token decide which other tokens to look at — that’s the architecture piece. Both are next in this drip series.
Practical follow-up: when you’re budgeting context or estimating cost in your own application, use the model’s own tokenizer — don’t estimate from character counts. tiktoken for OpenAI, the Hugging Face tokenizer for everything else.
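For example, with a Hugging Face tokenizer (gpt2 is used here only because it downloads without authentication; swap in the checkpoint of the model you're actually budgeting for):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for your target model
text = "Budget your context with the model's own tokenizer, not character counts."
print(len(tok.encode(text)), "tokens")
print(tok.tokenize(text))  # the actual pieces; 'Ġ' marks a leading space
```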
§ · GOING DEEPER · Where tokenizers actually fail
Two well-known failure modes are worth knowing because they bite in production. First, multilingual unfairness: most popular tokenizers were trained on English-heavy corpora, so the same idea in Hindi, Thai, or Burmese can cost 3–5× more tokens than in English. That’s a real per-call price difference, measured and reported by Petrov et al. (2023).
Second, tokenization brittleness on numbers and code. Most tokenizers split “1234” and “1235” into similar but different token sequences with no notion that these are nearby numbers — which is why pre-2023 LLMs were famously bad at arithmetic. Newer tokenizers such as Llama 3’s and GPT-4o’s moved to digit-aware tokenization (numbers split into short, consistent digit chunks rather than arbitrary merges), which cleanly improved numeric reasoning. Worth keeping in mind: the atoms shape every downstream skill.
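The old brittleness is directly visible if you push nearby numbers through GPT-2's tokenizer, a pre-digit-alignment vocabulary:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2's original byte-level BPE

for s in ["1234", "1235", "12345"]:
    pieces = [enc.decode([i]) for i in enc.encode(s)]
    print(s, "->", pieces)  # nearby numbers split into unrelated chunks
```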
§ · FURTHER READING · References & deeper sources
- Sennrich, Haddow & Birch (2016). Neural Machine Translation of Rare Words with Subword Units (BPE) · ACL
- Kudo & Richardson (2018). SentencePiece: A simple and language independent subword tokenizer · EMNLP
- Radford et al. (2019). Language Models are Unsupervised Multitask Learners (GPT-2, byte-level BPE) · OpenAI Technical Report
- Bostrom & Durrett (2020). Byte Pair Encoding is Suboptimal for Language Model Pretraining · EMNLP Findings
- Petrov et al. (2023). Language Model Tokenizers Introduce Unfairness Between Languages · NeurIPS
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.