Core Concepts · Module 03 · 9 min read

Tokenization

Before a model can read, it has to chop. Tokens are the atoms language models actually see — and the way they’re chosen quietly decides almost everything about how the model behaves.

The five-bullet version

  • Models can’t read characters or whole words — both extremes break in practice.
  • Modern tokenizers split text into subwords: pieces between a letter and a word.
  • The vocabulary is learned from a training corpus, not designed by a linguist.
  • Byte-Pair Encoding (BPE) is the canonical algorithm: greedily merge the most common adjacent pair.
  • Token count is what you actually pay for, in money and in context window.

§ 00 · ATOMS OF MEANING · Why text needs to be chopped at all

Language models don’t see text. They see lists of integers. Every number is an index into a fixed table — the vocabulary: the complete set of token IDs the model is willing to read or write, typically 32k–200k entries for a modern LLM. The job of the tokenizer is to turn raw text into that list, and back. Everything else — embeddings, attention, the whole transformer — is downstream of this decision.

It looks like a clerical step. It isn’t. The choice of how to chop the text changes what the model can express. A model with a bad tokenizer is stuck the way a typist is stuck with bad keys.
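
To make the mechanics concrete, here is a toy sketch of the tokenizer's two directions. The five-entry vocabulary is invented purely for illustration; a real table has tens of thousands of learned entries.

```python
# Toy illustration only: a hand-made five-entry vocabulary, nothing
# like a real 32k-200k table. The point is the shape of the interface:
# text pieces -> list of integer IDs -> text. Note the leading spaces
# baked into the tokens, as in real vocabularies.
vocab = {"The": 0, " quick": 1, " brown": 2, " fox": 3, ".": 4}
id_to_token = {i: tok for tok, i in vocab.items()}

def encode(pieces: list[str]) -> list[int]:
    """Look each piece up in the fixed table."""
    return [vocab[p] for p in pieces]

def decode(ids: list[int]) -> str:
    """Map IDs back to strings and concatenate."""
    return "".join(id_to_token[i] for i in ids)

ids = encode(["The", " quick", " brown", " fox", "."])
print(ids)          # [0, 1, 2, 3, 4]
print(decode(ids))  # The quick brown fox.
```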

Lab 01 · Tokenization Playground

Type text and watch it get chopped into the units a model actually sees. Notice how leading spaces are part of the token. The sample sentence "The quick brown fox jumps over the lazy dog." comes out at 44 characters but only 11 tokens: 4.00 characters per token, roughly $0.000033 if this were GPT-4o input.

§ 01 · THE WHITESPACE PROBLEM · Why neither characters nor words work

Two obvious chopping rules — split on characters, split on whitespace — each break in opposite directions.

Characters. If every letter is a token, the vocabulary is tiny (~256, just the bytes), the model can write any string, and there are no out-of-vocabulary surprises. But a 1,000-word document becomes 5,000+ tokens. The model has to learn the spelling of every word before it can learn what the word means. Long-range structure — paragraphs, arguments — is drowned in letter prediction.

Words. If every whitespace-separated chunk is a token, each token carries useful meaning. But the vocabulary explodes — run, runs, running, runner, runnable, misunderstanding, unrelatable — each one a separate row. Worse, any word the model didn’t see in training becomes <UNK> at inference. Misspellings, new product names, code identifiers, German compounds — all opaque.
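
The gap between the two naive splits is easy to see directly. The snippet below just splits one sentence both ways; no real tokenizer is involved.

```python
# Character-level tokenization gives a tiny vocabulary but long
# sequences; word-level gives short sequences but an unbounded vocabulary.
text = "The quick brown fox jumps over the lazy dog."

char_tokens = list(text)    # one token per character
word_tokens = text.split()  # one token per whitespace-separated word

print(len(char_tokens), "character tokens")  # 44
print(len(word_tokens), "word tokens")       # 9
```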

"transformers"Characterstransformers12 tokens · always representable, but loses meaningWordstransformers1 token · meaning-dense, but breaks on novel wordsSubwords (BPE)transformers2 tokens · the working compromise
Fig 1The trade-off. Subword tokenization sits between the two extremes: longer than words, shorter than characters, and it never has to say <UNK>.

Subword tokenization sidesteps both problems. Common words get a single token. Rare or novel words get split into pieces — pieces that the model has seen elsewhere. Coverage is universal; the vocabulary stays bounded.

§ 02 · BPE — MERGE THE MOST COMMON PAIR · How the pieces are chosen

The standard recipe is Byte-Pair Encoding (BPE), a subword tokenization algorithm originally invented for data compression. The algorithm is dumb in a useful way: start with every character as a token, count how often each pair of adjacent tokens occurs in the training corpus, and merge the most common pair into a new token. Repeat until you have the vocab size you want — typically tens of thousands.

Lab · BPE training · 5 words, merge the most common pair · step 0

Corpus: low, lowest, newer, wider, lower
Most frequent adjacent pairs: l+o · 3, o+w · 3, w+e · 3, e+r · 3, e+s · 1
Greedy: pick the top pair, merge it, repeat.
Watch what the merge loop picks: e+r, then l+o, then lo+w. The algorithm is finding the morphemes — suffixes, common stems — by counting alone, without anyone telling it what a morpheme is. After enough merges, frequent whole words (the, and) become single tokens. Less common words break into a stem plus an ending. Brand new strings break further, all the way down to individual bytes if needed.
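
Here is a minimal sketch of that training loop on the same five-word corpus. It is illustrative only: real implementations (Hugging Face tokenizers, SentencePiece) work on bytes, weight words by corpus frequency, and break ties deterministically, so the order among pairs tied at count 3 may differ from the lab above.

```python
from collections import Counter

# Minimal BPE training loop on the toy corpus. Illustrative only.
corpus = ["low", "lowest", "newer", "wider", "lower"]
words = [list(w) for w in corpus]  # start: every character is a token

def pair_counts(words):
    """Count adjacent token pairs across all words."""
    counts = Counter()
    for w in words:
        for pair in zip(w, w[1:]):
            counts[pair] += 1
    return counts

def apply_merge(words, pair):
    """Replace every occurrence of `pair` with a single fused token."""
    a, b = pair
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(a + b)
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

for step in range(3):
    best, count = pair_counts(words).most_common(1)[0]  # ties: arbitrary order
    words = apply_merge(words, best)
    print(f"merge {step + 1}: {best[0]}+{best[1]} (count {count})")

print(words)  # after 3 merges, 'low' and 'er' have become single tokens
```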

BPE is what GPT-2, GPT-3, GPT-4, Llama, Mistral, and most open-weight models use, with small variants. GPT-2 introduced byte-level BPE, which starts from raw bytes so any string stays representable; SentencePiece packages BPE (plus an alternative Unigram algorithm) in a language-agnostic library used by Llama 2, Mistral, and many other open models.

§ 03 · THE VOCABULARY YOU END UP WITH · What the result looks like

After training, you have a fixed table of subword tokens — usually 32,000 to 128,000 entries. Run any text through the trained tokenizer and you get back a list of integer IDs. The most common English words are one token each. Less common words split. tokenization in GPT-4 is token + ization. antidisestablishmentarianism is six tokens.
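
You can inspect those splits against a real vocabulary yourself. The sketch below assumes the tiktoken package and GPT-4's cl100k_base encoding; the exact pieces will differ for other models.

```python
# Inspect how a trained vocabulary splits words. Assumes `tiktoken`
# is installed (pip install tiktoken); cl100k_base is GPT-4's encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["the", "tokenization", "antidisestablishmentarianism"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r}: {len(ids)} token(s) -> {pieces}")
```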

Lab · Tokenizer Playground

Watch a vocabulary grow itself, one merge at a time. The sample text "the quick brown fox jumps over the lazy dog. the fox is quick." starts at 50 tokens (28 distinct, 1.00 chars per token) before any merges. Try adding a repeated word like "the" ten times; it becomes a single token by step 2.


§ 04 · WHAT THIS COSTS YOU · Why this is worth caring about

Tokens are the unit of three things you actually pay for: API cost, context window, and latency. So tokenization quietly drives your per-request bill, how much material fits in the prompt, and how long each response takes to generate.
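
The back-of-envelope arithmetic is just token count times price. The prices below are placeholders, not current rates; substitute your provider's pricing.

```python
# Rough per-call cost from token counts. Prices are illustrative
# placeholders; check your provider's pricing page for real numbers.
PRICE_PER_MTOK_INPUT = 2.50    # USD per 1M input tokens (hypothetical)
PRICE_PER_MTOK_OUTPUT = 10.00  # USD per 1M output tokens (hypothetical)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_PER_MTOK_INPUT
            + output_tokens * PRICE_PER_MTOK_OUTPUT) / 1_000_000

print(f"${estimate_cost(1_200, 400):.4f} per call")  # 1,200 in + 400 out
```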

Check · A user types "I'm reading about transformers!" into your app. About how many tokens is this for a modern LLM?

§ 05 · TAKING THIS FORWARD · Where to go next

Tokens are the input to the next thing the model does: turn each token into a vector. That’s the embedding layer, and it’s where words start to have neighbors. From there, attention lets each token decide which other tokens to look at — that’s the architecture piece. Both are next in this drip series.

Practical follow-up: when you’re budgeting context or estimating cost in your own application, use the model’s own tokenizer — don’t estimate from character counts. tiktoken for OpenAI, the Hugging Face tokenizer for everything else.
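
A sketch of both routes, assuming tiktoken and transformers are installed; "gpt2" is used here only as an example open checkpoint, not a recommendation.

```python
# Count tokens with the model's own tokenizer rather than estimating
# from characters. Assumes `pip install tiktoken transformers`.
import tiktoken
from transformers import AutoTokenizer

text = "I'm reading about transformers!"

# OpenAI models: look up the encoding that matches the model name.
enc = tiktoken.encoding_for_model("gpt-4o")
print("tiktoken:", len(enc.encode(text)), "tokens")

# Open-weight models: load the matching Hugging Face tokenizer.
# "gpt2" is just an example checkpoint; use your own model's tokenizer.
tok = AutoTokenizer.from_pretrained("gpt2")
print("transformers:", len(tok.encode(text)), "tokens")
```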

§ · GOING DEEPER · Where tokenizers actually fail

Two well-known failure modes are worth knowing because they bite in production. First, multilingual unfairness: most popular tokenizers were trained on English-heavy corpora, so the same idea in Hindi, Thai, or Burmese can cost 3–5× more tokens than in English. That’s a real per-call price difference, measured and reported by Petrov et al. (2023).
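
The effect is easy to measure yourself. The snippet below assumes tiktoken and uses short sentences of roughly equivalent meaning; treat the specific sentences as illustrative rather than a calibrated benchmark.

```python
# Rough token-"fertility" check across scripts, assuming tiktoken.
# The sentences are only approximate equivalents of each other.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
samples = {
    "English": "The weather is nice today.",
    "Hindi":   "आज मौसम अच्छा है।",
    "Thai":    "วันนี้อากาศดี",
}
for lang, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{lang}: {n_tokens} tokens for {len(text)} characters")
```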

Second, tokenization brittleness on numbers and code. Most tokenizers split “1234” and “1235” into similar but different token sequences with no notion that these are nearby numbers — which is why pre-2023 LLMs were famously bad at arithmetic. Newer vocabularies handle digits more uniformly (Llama 2 gave every digit its own token; GPT-4o and Llama 3 group digits into consistent chunks of up to three), which helped numeric reasoning. Worth keeping in mind: the atoms shape every downstream skill.
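
You can see the digit behaviour the same way. This assumes tiktoken; other encodings will chunk the digits differently.

```python
# How nearby numbers tokenize, assuming tiktoken's cl100k_base encoding.
# Older or different vocabularies split digits less predictably.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for number in ["1234", "1235", "123456789"]:
    ids = enc.encode(number)
    print(number, "->", [enc.decode([i]) for i in ids])
```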

§ · FURTHER READING · References & deeper sources

  1. Sennrich, Haddow, Birch (2016). Neural Machine Translation of Rare Words with Subword Units (BPE) · ACL
  2. Kudo, Richardson (2018). SentencePiece: A simple and language independent subword tokenizer · EMNLP
  3. Radford et al. (2019). Language Models are Unsupervised Multitask Learners (GPT-2, byte-level BPE) · OpenAI Technical Report
  4. Bostrom, Durrett (2020). Byte Pair Encoding is Suboptimal for Language Model Pretraining · EMNLP Findings
  5. Petrov, La Malfa, Torr, Bibi (2023). Language Model Tokenizers Introduce Unfairness Between Languages · NeurIPS

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.