A Natural Introduction to Natural Language Processing
Before transformers, before embeddings, before tokens — there was the question. How do you teach a machine that doesn’t know anything to read?
The five-bullet version
- Computers are bad at language because language runs on shared context, and computers start with none.
- Modern NLP works by counting patterns in lots of text, not by encoding rules a linguist wrote down.
- Models read tokens — sub-word chunks. The vocabulary is grown from data, not designed.
- Each token becomes a vector, and similar meanings end up in nearby regions of that space.
- Attention lets every token decide which other tokens to look at — the trick every transformer is built on.
§ 00 · WHY LANGUAGE IS HARD · The trouble with words
Pick any sentence. Read it slowly. “Time flies like an arrow; fruit flies like a banana.” A six-year-old gets the joke. A computer, until very recently, did not — because natural language (the way humans actually speak and write, full of ambiguity, context, idioms, and exceptions, distinct from formal languages like mathematics or code) is built on shared context, and computers start with none.
For most of computing’s history, we treated text the way we treated numbers — as data to be matched, sorted, and counted. That works beautifully for a phone book and miserably for a poem. The interesting problems in language are almost never about the surface form of the words. They are about what those words refer to, what was assumed, and what was left out.1
This is the puzzle of NLP (Natural Language Processing: the subfield of AI concerned with reading, understanding, and generating human language, at the intersection of linguistics, statistics, and machine learning). The field has had several lifetimes. It began as a branch of linguistics, became a branch of statistics, then a branch of machine learning, and today is mostly indistinguishable from deep learning at scale.
§ 01 · FROM RULES TO STATISTICS · The shift that mattered
Early NLP systems were elaborate sets of rules. A linguist would sit with a programmer and try to encode, by hand, the grammar of English. Such systems were precise, fast, and brittle. They broke the moment a sentence drifted off-script, and most sentences do.
The shift came in the 1990s, when researchers asked a different question: instead of describing language, what if we just counted it? Given a million sentences, what’s the probability that the next word after "the cat sat on the" is "mat"? You can answer that with arithmetic, no linguist required.2
“Every time I fire a linguist, the performance of the speech recognizer goes up.” — Frederick Jelinek, IBM (apocryphal but illustrative)
Statistics over enough text beats rules — not because rules are wrong, but because language is too irregular to fit any rule set a human can write down. The statistics implicitly contain the rules, plus all the exceptions, plus the things linguists never noticed.
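The counting question posed above — what usually follows "the cat sat on the"? — can be answered with a few lines of code. This is a minimal sketch of a bigram model; the three-sentence corpus is a hypothetical stand-in for the "million sentences".

```python
# Toy bigram model: estimate P(next word | previous word) by counting.
from collections import Counter, defaultdict

corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the dog sat on the mat",
]

# Count how often each word follows each other word.
follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1

def p_next(prev, nxt):
    """P(nxt | prev) as a relative frequency: arithmetic, no linguist required."""
    total = sum(follows[prev].values())
    return follows[prev][nxt] / total if total else 0.0

print(p_next("the", "cat"))  # 2 of the 6 words following "the" are "cat"
print(p_next("on", "the"))   # "on" is always followed by "the" here -> 1.0
```

Real statistical NLP conditioned on longer histories (trigrams and beyond) and smoothed the counts, but the core move is exactly this: replace hand-written grammar with relative frequencies.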
§ 02 · TOKENS — THE ATOMS · Before a model sees text, text becomes pieces
A model does not see characters. It does not see words. It sees tokens (sub-word units; a token is roughly 3-4 characters of English text) — chunks of text that a tokenizer has decided are useful. Sometimes a token is a word. Sometimes it’s part of a word. Sometimes it’s a piece of punctuation, or a leading space.
This step is unglamorous and easy to skip past. Don’t. The tokenizer’s choices ripple through everything: how long a model’s context window really is, why certain words cost more to generate, why models stumble on rare names. The atoms determine the chemistry.
Tokenization Playground
Type below. Watch your text get chopped into the units a model actually sees. Notice how leading spaces are part of the token.
What you just played with is a stylized version of Byte-Pair Encoding (an algorithm that builds a vocabulary by repeatedly merging the most common adjacent pairs of characters), the trick behind GPT, Llama, Claude, and most modern LLMs.3 The vocabulary isn’t designed by a linguist; it’s grown from the training corpus by a greedy merging algorithm. The tokenizer is a frozen artifact of the data the model was trained on.
Below, you can watch that algorithm at work. Type any text and step through the merges one at a time. The most common adjacent pair becomes a new token. The most common adjacent pair in the new sequence becomes a new token. Repeat fifty thousand times.
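The merge loop is short enough to write out. This is a character-level sketch of the idea, not a production tokenizer (real BPE implementations operate on bytes and pre-split text into words):

```python
# Minimal BPE merge loop: repeatedly fuse the most frequent adjacent
# pair of symbols into a single new symbol.
from collections import Counter

def bpe_merges(text, num_merges):
    seq = list(text)          # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break             # nothing repeats; no merge is worth making
        merges.append(a + b)
        # Rewrite the sequence with the newly merged symbol.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, merges

seq, merges = bpe_merges("the the the the", 10)
print(merges)  # first merge fuses 't'+'h', second fuses 'th'+'e'
```

On this input the word "the" becomes a single symbol by the second merge; a real tokenizer runs the same loop tens of thousands of times over gigabytes of text.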
Try typing “the” ten times. Watch it become a single token by step 2.
§ 03 · WORDS BECOME VECTORS · The geometry of meaning
Once we have tokens, we need to give them to a neural network. A neural network speaks numbers. So each token gets a number — not a single number, but a list of them, called an embedding (a vector representation of a token; modern models use embeddings of 768 to 12,288 dimensions). Each dimension holds one real-valued number.
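Mechanically, an embedding layer is just a lookup table: one row of numbers per token id. A sketch with a made-up three-word vocabulary and 4 dimensions instead of 768, so the shapes stay readable:

```python
# A toy embedding table: one row of real numbers per token id.
# In a real model these values are learned during training.
import random

vocab = {"the": 0, "cat": 1, "sat": 2}
dim = 4
random.seed(0)
embedding_table = [[random.gauss(0, 1) for _ in range(dim)] for _ in vocab]

def embed(tokens):
    """Map each token to its row of the table: the model's first step."""
    return [embedding_table[vocab[t]] for t in tokens]

vecs = embed(["the", "cat", "sat"])
print(len(vecs), len(vecs[0]))  # 3 tokens, 4 numbers each
```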
What makes this strange and beautiful is what those numbers mean. The model adjusts them during training so that tokens with similar meaning end up close together in this high-dimensional space. The famous example: take the vector for king, subtract man, add woman, and you land near queen.4
Vector Space, in Two Dimensions
Hover any word to see its three nearest neighbors. Real embeddings live in 768+ dimensions — this is a 2D projection.
What you’re hovering on is a 2D projection of what’s really a 768-dimensional cloud. The clusters are real; the exact positions are an artifact of the projection. The geometry encodes facts the model was never explicitly told.
The strangest property of this geometry is that directions have meanings. The vector that points from man to woman is roughly the same vector that points from king to queen, or from boy to girl. Gender is a direction. So is plurality, tense, country-to-capital. You can do arithmetic on meanings.
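The king − man + woman arithmetic can be demonstrated with hand-built toy vectors. Here dimension 0 is planted as "royalty" and dimension 1 as "gender"; real embeddings learn such directions from data rather than having them assigned:

```python
# Toy 2-d vectors with planted directions: dim 0 ~ royalty, dim 1 ~ gender.
import math

vectors = {
    "king":  [0.9,  0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1,  0.8],
    "woman": [0.1, -0.8],
    "boy":   [0.0,  0.7],
}

def cosine(u, v):
    """Similarity of direction, ignoring length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# king - man + woman, componentwise
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Nearest remaining word to the result of the arithmetic
nearest = max((w for w in vectors if w not in {"king", "man", "woman"}),
              key=lambda w: cosine(vectors[w], target))
print(nearest)  # queen
```

Subtracting man removes the royalty-free gender component; adding woman puts the opposite gender back, and the result lands on queen.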
§ 04 · ATTENTION, BRIEFLY · The trick everything is built on
We have tokens. We have vectors. The last piece is the mechanism that lets the model decide, for each word, which other words to look at. This is attention (a mechanism, introduced in 2017, that lets each token in a sequence selectively focus on other tokens), and it is the heart of every transformer.
Consider the sentence “The cat sat on the mat because it was tired.” What does “it” refer to? You knew instantly. The model has to compute it. Attention is how — every word produces a query, every other word produces a key, and the dot product tells you how relevant they are to each other.
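The query-key-dot-product recipe for a single token can be written in a few lines. The 2-d vectors below are invented for illustration; in a real transformer they come from learned projections of the embeddings, and the scores are scaled and softmaxed exactly as here:

```python
# Scaled dot-product attention for one query token ("it"), in plain Python.
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    exps = [math.exp(x - max(xs)) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

words = ["cat", "mat", "tired"]
query = [1.0, 0.2]                            # hypothetical query vector for "it"
keys = [[0.9, 0.1], [0.3, 0.8], [0.2, 0.1]]   # hypothetical key per candidate word

d = len(query)
scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
weights = softmax(scores)                     # how strongly "it" attends to each word

for w, a in zip(words, weights):
    print(f"{w}: {a:.2f}")
```

With these numbers the query for "it" lines up best with the key for "cat", so "cat" gets the largest attention weight — the mechanism resolving the pronoun.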
Attention, Visualized
Click any word. The colored bars show how strongly it attends to other words. Different “heads” specialize in different patterns.
Different attention heads learn to specialize. One head might track grammatical subjects. Another might link pronouns to their referents. The model has many heads in many layers, and the patterns they learn are still being mapped — this is the active research area called mechanistic interpretability (the project of reverse-engineering trained neural networks: finding the circuits and features that implement specific computations).
§ 05 · WHERE THIS LEADS · The next ten years, in two paragraphs
Everything you have just read describes the inputs and the core operation of a transformer. The transformer is the architecture behind every frontier language model in 2026 — Claude, GPT, Gemini, Llama, Qwen. The differences between them are real but smaller than the marketing suggests.
The interesting questions are no longer about architecture. They are about data, alignment, evaluation, and interpretability. Each of those is its own course on Brain Drip.
References & further reading
- Bender, E. M. & Koller, A. (2020). Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. ACL 2020.
- Manning, C. D. & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
- Sennrich, R., Haddow, B. & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909.
- Mikolov, T. et al. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781.
- Vaswani, A. et al. (2017). Attention Is All You Need. arXiv:1706.03762.
- Olah, C. (2014). Deep Learning, NLP, and Representations. colah’s blog.
- Alammar, J. (2018). The Illustrated Transformer.
§ · GOING DEEPER · How the same operation became the field
The throughline from rules → statistics → neural networks → transformers is one operation getting incrementally cheaper to apply at scale. Statistical NLP counted token co-occurrence. Word2vec (2013) replaced the explicit counts with a low-dimensional vector trained by predicting context. ELMo (2018) made those vectors contextual by running them through a deep biLSTM. BERT (2018) and GPT (2018+) replaced recurrence with self-attention. Every step kept the contract — distributional meaning from raw text — and made each token’s representation richer and more dependent on neighbors.
If you only learn one thing about the lineage, learn this: the objective function changed less than you’d expect. Word2vec, BERT, and GPT all minimize a form of next-or-near-token log-likelihood. The architecture and the data scale moved. That’s why the same recipe — predict the next token on enough text — produced both fluent chat and the ability to write code.
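That shared objective is small enough to state in code. A sketch with made-up probabilities: the model outputs a distribution over the next token, and training minimizes the negative log-likelihood of the token that actually occurred.

```python
# The shared training objective in miniature: negative log-likelihood
# of the true next token. The distribution below is invented.
import math

# Hypothetical model output: P(next token | "the cat sat on the")
predicted = {"mat": 0.6, "sofa": 0.3, "moon": 0.1}
actual_next = "mat"

loss = -math.log(predicted[actual_next])
print(f"{loss:.3f}")  # lower loss = more probability on the right token
```

Word2vec scores nearby words, BERT scores masked-out words, GPT scores the next word — but each one is pushing a number like this `loss` down over billions of tokens.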
§ · FURTHER READING · References & deeper sources
- Mikolov, T. et al. (2013). Efficient Estimation of Word Representations in Vector Space (word2vec) · arXiv
- Pennington, J., Socher, R. & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation · EMNLP
- Vaswani, A. et al. (2017). Attention Is All You Need · NeurIPS
- Devlin, J. et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · NAACL
- Brown, T. et al. (2020). Language Models are Few-Shot Learners (GPT-3) · NeurIPS
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.