Foundations · Module 01 · 8 min read

A Natural Introduction to Natural Language Processing

Before transformers, before embeddings, before tokens — there was the question. How do you teach a machine that doesn’t know anything to read?

The five-bullet version

  • Computers are bad at language because language runs on shared context, and computers start with none.
  • Modern NLP works by counting patterns in lots of text, not by encoding rules a linguist wrote down.
  • Models read tokens — sub-word chunks. The vocabulary is grown from data, not designed.
  • Each token becomes a vector, and similar meanings end up in nearby regions of that space.
  • Attention lets every token decide which other tokens to look at — the trick every transformer is built on.

§ 00 · WHY LANGUAGE IS HARD · The trouble with words

Pick any sentence. Read it slowly. “Time flies like an arrow; fruit flies like a banana.” A six-year-old gets the joke. A computer, until very recently, did not — because natural language (the way humans actually speak and write: full of ambiguity, context, idioms, and exceptions, distinct from formal languages like mathematics or code) is built on shared context, and computers start with none.

For most of computing’s history, we treated text the way we treated numbers — as data to be matched, sorted, and counted. That works beautifully for a phone book and miserably for a poem. The interesting problems in language are almost never about the surface form of the words. They are about what those words refer to, what was assumed, and what was left out.[1]

This is the puzzle of NLP — Natural Language Processing, the subfield of AI concerned with reading, understanding, and generating human language, sitting at the intersection of linguistics, statistics, and machine learning. The field has had several lifetimes. It began as a branch of linguistics, became a branch of statistics, then a branch of machine learning, and today is mostly indistinguishable from deep learning at scale.

Fig 1 · Four broad eras of NLP: 1950s · Rules (hand-coded grammars) → 1990s · Statistical (n-grams, HMMs) → 2010s · Neural (word2vec, LSTM) → 2020s · Pretrained (BERT, GPT, LLMs). Each era unlocked roughly 10× more capability than the last — and required roughly 10× more data and compute.

§ 01 · FROM RULES TO STATISTICS · The shift that mattered

Early NLP systems were elaborate sets of rules. A linguist would sit with a programmer and try to encode, by hand, the grammar of English. Such systems were precise, fast, and brittle. They broke the moment a sentence drifted off-script — and most sentences do.

The shift came in the 1990s, when researchers asked a different question: instead of describing language, what if we just counted it? Given a million sentences, what’s the probability that the next word after "the cat sat on the" is "mat"? You can answer that with arithmetic, no linguist required.[2]
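Here is that counting, sketched in Python. The toy corpus and the resulting probabilities below are invented stand-ins for "a million sentences", not anything from a real system:

```python
from collections import Counter

# A toy corpus standing in for "a million sentences".
corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the dog sat on the mat",
]

# Count bigrams: how often does each word follow each other word?
pair_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        pair_counts[(prev, nxt)] += 1
        context_counts[prev] += 1

def next_word_probability(prev: str, nxt: str) -> float:
    """P(next word | previous word), estimated by pure counting."""
    if context_counts[prev] == 0:
        return 0.0
    return pair_counts[(prev, nxt)] / context_counts[prev]

# "What's the probability that 'mat' follows 'the'?"
print(next_word_probability("the", "mat"))  # 2 of the 6 uses of "the" -> ~0.33
```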

“Every time I fire a linguist, the performance of the speech recognizer goes up.” — Frederick Jelinek, IBM (apocryphal but illustrative)

Statistics over enough text beats rules — not because rules are wrong, but because language is too irregular to fit any rule set a human can write down. The statistics implicitly contain the rules, plus all the exceptions, plus the things linguists never noticed.

§ 02 · TOKENS — THE ATOMS · Before a model sees text, text becomes pieces

A model does not see characters. It does not see words. It sees tokens — sub-word chunks of text, roughly 3–4 characters of English each, that a tokenizer has decided are useful. Sometimes a token is a word. Sometimes it’s part of a word. Sometimes it’s a piece of punctuation, or a leading space.

This step is unglamorous and easy to skip past. Don’t. The tokenizer’s choices ripple through everything: how long a model’s context window really is, why certain words cost more to generate, why models stumble on rare names. The atoms determine the chemistry.

TRY IT · In the lab below, type your own name. Then type a common phrase. Compare the chars / token ratio. Why is your name a worse deal?
Lab 01 · Live · Tokenization Playground (interactive). Type below and watch your text get chopped into the units a model actually sees; notice how leading spaces are part of the token. Example: “The quick brown fox jumps over the lazy dog.” → 44 characters, 11 tokens, 4.00 chars/token, ~$0.000033 if this were GPT-4o input.
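If you want to reproduce the lab's ratio offline, here is a sketch using OpenAI's tiktoken library (the lab's tokenizer and pricing are stylized, so real numbers will differ slightly; the rare name below is made up):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era vocabulary

for text in ["The quick brown fox jumps over the lazy dog.", "Zyxoria Quillenbach"]:
    tokens = enc.encode(text)
    print(f"{text!r}: {len(text)} chars, {len(tokens)} tokens, "
          f"{len(text) / len(tokens):.2f} chars/token")

# Common phrases pack ~4 characters into each token; a rare name fragments
# into many short tokens, so its chars/token ratio is worse.
```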

What you just played with is a stylized version of Byte-Pair Encoding — an algorithm that builds a vocabulary by repeatedly merging the most common adjacent pairs of characters — the trick behind GPT, Llama, Claude, and most modern LLMs.[3] The vocabulary isn’t designed by a linguist; it’s grown from the training corpus by a greedy merging algorithm. The tokenizer is a frozen artifact of the data the model was trained on.

Below, you can watch that algorithm at work. Type any text and step through the merges one at a time. The most common adjacent pair becomes a new token. The most common adjacent pair in the new sequence becomes a new token. Repeat fifty thousand times.
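Here is that greedy loop as a minimal sketch: character-level start, one merge per step. Real BPE (Sennrich et al., reference 3) adds details like pre-tokenization and byte fallback that this toy skips:

```python
from collections import Counter

def bpe_merges(text: str, num_merges: int) -> list[str]:
    """Toy byte-pair encoding: repeatedly merge the most common adjacent pair."""
    tokens = list(text)  # start from individual characters
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing worth merging
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)  # the pair becomes one new token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

print(bpe_merges("the quick brown fox jumps over the lazy dog. the fox is quick.", 10))
```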

Lab 01 · Tokenizer Playground (interactive). Watch a vocabulary grow itself, one merge at a time. Starting text: “the quick brown fox jumps over the lazy dog. the fox is quick.” → 50 tokens · 28 distinct · 1.00 chars/token. Try: add a repeated word like “the” ten times and watch it become a single token by step 2.
CHECK · A model's context window is advertised as 8,000 tokens. Roughly how many words of English prose is that?
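(One hedged back-of-the-envelope: at the ~4 characters per token shown in the lab, and roughly 5–6 characters per English word counting its trailing space, 8,000 tokens ≈ 32,000 characters ≈ 5,500–6,000 words — about three-quarters of a word per token.)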

§ 03 · WORDS BECOME VECTORS · The geometry of meaning

Once we have tokens, we need to give them to a neural network. A neural network speaks numbers. So each token gets a number — not a single number, but a list of them, called an embedding: a vector representation of the token. Modern models use embeddings of 768 to 12,288 dimensions, each dimension holding one real-valued number.

What makes this strange and beautiful is what those numbers mean. The model adjusts them during training so that tokens with similar meaning end up close together in this high-dimensional space. The famous example: take the vector for king, subtract man, add woman, and you land near queen.[4]

The static embeddings popularized by word2vec and GloVe assign exactly one vector per token. Contextual embeddings — produced by every layer of a transformer — assign a different vector each time, computed as a function of the surrounding sentence. The “queen” example survives in static spaces; in contextual spaces, the analogy is fuzzier and depends on which layer you probe.
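What "close together" means, concretely, is cosine similarity. Here is a sketch over invented 4-dimensional vectors (real embeddings are learned, and hundreds of dimensions wide):

```python
import numpy as np

# Hypothetical 4-d embeddings; real ones are learned and much wider.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.9, 0.1, 0.8, 0.2]),
    "man":   np.array([0.1, 0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
    "paris": np.array([0.1, 0.1, 0.1, 0.9]),
}

def cosine(u, v):
    """Similarity of direction: 1.0 means parallel, 0.0 means unrelated."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def neighbors(word, k=3):
    scores = {w: cosine(emb[word], v) for w, v in emb.items() if w != word}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

print(neighbors("king"))  # royalty and person words rank above "paris"
```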
Lab 03 · Live · Vector Space, in Two Dimensions (interactive). Hover any word — king, queen, prince, woman, man, boy, Paris, London, Tokyo, France, England, Japan, run, ran, running, walk — to see its three nearest neighbors. Real embeddings live in 768+ dimensions; this is a 2D projection.

What you’re hovering on is a 2D projection of what’s really a 768-dimensional cloud. The clusters are real; the exact positions are an artifact of the projection. The geometry encodes facts the model was never explicitly told.

The strangest property of this geometry is that directions have meanings. The vector that points from man to woman is roughly the same vector that points from king to queen, or from boy to girl. Gender is a direction. So is plurality, tense, country-to-capital. You can do arithmetic on meanings.
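Here is that arithmetic over invented vectors in which the "gender" offset is baked in by construction — which is exactly the structure the labs here claim training discovers on its own:

```python
import numpy as np

# Invented 2-d vectors with deliberate structure:
# dimension 0 = "royalty", dimension 1 = "gender".
emb = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "boy":   np.array([-0.5, 1.0]),
}

target = emb["king"] - emb["man"] + emb["woman"]  # expect: near "queen"

def nearest(vec, exclude=()):
    """The vocabulary word whose vector is closest to vec."""
    return min(
        (w for w in emb if w not in exclude),
        key=lambda w: np.linalg.norm(emb[w] - vec),
    )

print(nearest(target, exclude={"king", "man", "woman"}))  # -> queen
```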

Lab 04 · Embedding Arithmetic (interactive). Add and subtract meanings as if they were vectors, with presets over a small vocabulary (king, queen, man, woman, boy, girl, paris, france, london, england, tokyo, japan, walk, walking, swim, swimming). Why this works: training pushes related words into parallel offsets, so the “gender” or “tense” axis becomes a real direction in the space.

§ 04 · ATTENTION, BRIEFLY · The trick everything is built on

We have tokens. We have vectors. The last piece is the mechanism that lets the model decide, for each word, which other words to look at. This is attention — a mechanism, introduced in 2017, that lets each token in a sequence selectively focus on other tokens — and it is the heart of every transformer.

Consider the sentence “The cat sat on the mat because it was tired.” What does it refer to? You knew instantly. The model has to compute it. Attention is how — every word produces a query, every other word produces a key, and the dot product tells you how relevant they are to each other.
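Here is that computation as a minimal sketch of scaled dot-product attention, following the shapes in Vaswani et al. [5]. The weight matrices below are random stand-ins for learned parameters, so the weights will not match the lab's:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 10, 16  # one position per word of the example sentence

x = rng.normal(size=(seq_len, d_model))    # one vector per token
W_q = rng.normal(size=(d_model, d_model))  # learned in a real model
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Every token's query dotted with every token's key -> relevance scores.
scores = Q @ K.T / np.sqrt(d_model)

# Softmax each row so every token's attention weights sum to 1.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

output = weights @ V  # each token's new vector: a weighted mix of all values

print(weights[7].round(2))  # row for "it": how much it attends to each word
```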

Lab 02 · Live · Attention, Visualized (interactive). Click any word; colored bars show how strongly it attends to the other words, and different “heads” specialize in different patterns. In “The cat sat on the mat because it was tired,” it attends most strongly to cat (76% of its weight), with smaller weights on sat (15%) and mat (10%).

Different attention heads learn to specialize. One head might track grammatical subjects. Another might link pronouns to their referents. The model has many heads in many layers, and the patterns they learn are still being mapped — this is the active research area called mechanistic interpretability: the project of reverse-engineering trained neural networks to find the circuits and features that implement specific computations.

Checkpoint reached. You’ve covered tokens, embeddings, and the attention mechanism — the three primitives every modern language model rests on.

§ 05 · WHERE THIS LEADS · The next ten years, in two paragraphs

Everything you have just read describes the inputs and the core operation of a transformer. The transformer is the architecture behind every frontier language model in 2026 — Claude, GPT, Gemini, Llama, Qwen. The differences between them are real but smaller than the marketing suggests.

The interesting questions are no longer about architecture. They are about data, alignment, evaluation, and interpretability. Each of those is its own course on Brain Drip.

References & further reading

  1. Bender, E. M. & Koller, A. (2020). Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. ACL 2020.
  2. Manning, C. D. & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
  3. Sennrich, R., Haddow, B. & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909.
  4. Mikolov, T. et al. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781.
  5. Vaswani, A. et al. (2017). Attention Is All You Need. arXiv:1706.03762.
  6. Olah, C. (2014). Deep Learning, NLP, and Representations. colah’s blog.
  7. Alammar, J. (2018). The Illustrated Transformer.

§ · GOING DEEPER · How the same operation became the field

The throughline from rules → statistics → neural networks → transformers is one operation getting incrementally cheaper to apply at scale. Statistical NLP counted token co-occurrence. Word2vec (2013) replaced the explicit counts with a low-dimensional vector trained by predicting context. ELMo (2018) made those vectors contextual by running them through a deep biLSTM. BERT (2018) and GPT (2018+) replaced recurrence with self-attention. Every step kept the contract — distributional meaning from raw text — and made each token’s representation richer and more dependent on neighbors.

If you only learn one thing about the lineage, learn this: the objective function changed less than you’d expect. Word2vec, BERT, and GPT all minimize a form of next-or-near-token log-likelihood. The architecture and the data scale moved. That’s why the same recipe — predict the next token on enough text — produced both fluent chat and the ability to write code.
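Here is that shared objective in one sketch: the average negative log-likelihood of the true next token under the model's predicted distribution. The logits below are random stand-ins for whatever the architecture produces:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len = 50, 8

logits = rng.normal(size=(seq_len, vocab_size))   # model outputs, one row per position
targets = rng.integers(vocab_size, size=seq_len)  # the actual next tokens

# Softmax each row into a probability distribution over the vocabulary.
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

# Negative log-likelihood of the true next token, averaged over positions.
# Word2vec, BERT, and GPT all minimize a variant of this quantity; they
# differ in which tokens are predicted and what conditions the prediction.
nll = -np.log(probs[np.arange(seq_len), targets]).mean()
print(f"mean next-token NLL: {nll:.3f}")
```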

§ · FURTHER READING · References & deeper sources

  1. Mikolov, T., Chen, K., Corrado, G. & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space (word2vec). arXiv:1301.3781.
  2. Pennington, J., Socher, R. & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. EMNLP 2014.
  3. Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 2017.
  4. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL.
  5. Brown, T. et al. (2020). Language Models are Few-Shot Learners (GPT-3). NeurIPS 2020.

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.