BERT
Half a transformer, trained to fill in blanks. BERT showed that an encoder-only model — bidirectionally aware of context — was a near-universal feature extractor for NLP tasks. The classifier that owned NLP from 2018 to 2022.
The five-bullet version
- BERT is the encoder half of a transformer, trained on a self-supervised “fill in the blank” task.
- Bidirectional: each token attends to tokens on both sides — unlike GPT, which is left-to-right.
- Pretrained once on Wikipedia + books, then fine-tuned for downstream tasks (classification, NER, QA).
- Owned NLP benchmarks from 2018 to ~2022; mostly displaced by decoder-only LLMs but still alive for embeddings and small-task classifiers.
- Sentence-BERT, RoBERTa, DeBERTa, ModernBERT — the lineage is still active.
§ 00 · THE ENCODER-ONLY TRANSFORMER · Half the architecture, different job
The original transformer had an encoder and a decoder. The encoder read the source; the decoder produced the target. GPT, in 2018, dropped the encoder — just stack decoder layers and have the model continue any text. BERT (Bidirectional Encoder Representations from Transformers), a few months later, did the opposite — drop the decoder, use only encoder layers, train them on a different objective.
The difference is more than cosmetic. A decoder-only transformer is causal: each token can only attend to tokens before it. This is necessary for next-token generation (the model has to produce one token at a time, without peeking ahead).
BERT’s encoder is bidirectional: each token attends to tokens both before and after. The output is a rich contextual embedding per token, informed by the full sentence. It can’t generate left-to-right text, but it can represent text exceptionally well.
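To make the contrast concrete, here is a minimal sketch of the two attention patterns (assuming PyTorch, which the original text does not specify): the causal mask a decoder-only model applies, versus the all-positions mask BERT’s encoder uses.

```python
# Minimal sketch contrasting the two attention patterns (PyTorch assumed).
import torch

seq_len = 5

# Causal (GPT-style): position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# Bidirectional (BERT-style): every position attends to every position;
# in practice only padding tokens would be masked out.
bidirectional_mask = torch.ones(seq_len, seq_len).bool()

print(causal_mask.int())         # lower-triangular: no peeking ahead
print(bidirectional_mask.int())  # all ones: full-sentence context
```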
§ 01 · MASKED LANGUAGE MODELING · Cloze as pretraining
Bidirectional attention is incompatible with next-token prediction (the model would just copy the next token through the attention). BERT’s training task is masked language modeling (MLM): randomly replace ~15% of tokens with a special [MASK] token, and train the model to predict what was there.
Each masked prediction uses context on both sides. The model learns: what words are plausible given the surrounding sentence? Over billions of examples, this becomes a remarkably rich linguistic skill — without any human-labeled data.
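As a rough illustration of the masking step, the toy sketch below selects ~15% of tokens and swaps them for [MASK], keeping labels only at masked positions. The real recipe also leaves some selected tokens unchanged and swaps others for random tokens, which this sketch omits.

```python
# Toy sketch of MLM masking over a plain list of word tokens.
import random

MASK, IGNORE = "[MASK]", None  # IGNORE marks positions that contribute no loss

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)      # the model must recover the original token here
        else:
            inputs.append(tok)
            labels.append(IGNORE)   # unmasked positions are not predicted
    return inputs, labels

sentence = "the doctor prescribed antibiotics for the infection".split()
print(mask_tokens(sentence))
```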
§ 02 · BIDIRECTIONAL CONTEXT · Why two sides beat one
Compare a sentence with a missing word:
- The doctor prescribed [MASK]. — Could be almost any medicine, or even something more abstract. Left context alone doesn’t narrow it much.
- The doctor prescribed [MASK] for the infection. — Right context narrows to antibiotics, antivirals, antifungals. The model has both sides.
For representing a token, two-sided context is strictly more informative than one-sided. BERT’s embeddings encode each token with awareness of the whole sentence. That richness is what made BERT dominate understanding tasks — classification, NER, sentence similarity — even though it can’t generate text.
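A quick way to see this for yourself: the sketch below runs the infection sentence from above through a fill-mask pipeline, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available.

```python
# Query BERT's masked-token predictions for the example sentence.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The doctor prescribed [MASK] for the infection."):
    # Each prediction carries the filled-in token and its probability.
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```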
§ 03 · FINE-TUNING IS THE POINT · Pretrain once, adapt many times
BERT’s pretraining is expensive — a few days on a TPU pod. Fine-tuning to a downstream task is fast and cheap. Take the pretrained BERT, add a small task-specific head (one or two linear layers), train the whole thing on labeled task data for a few hundred steps. Done.
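A minimal sketch of that workflow, assuming the Hugging Face transformers library: load the pretrained encoder, let the library attach a fresh classification head, and run one supervised step on a tiny made-up batch.

```python
# Fine-tuning sketch: pretrained encoder + new classification head, one step.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # e.g. positive / negative sentiment
)
model.train()

batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])  # toy labels for illustration only

# One training step; in practice you loop over a labeled dataset with an optimizer.
outputs = model(**batch, labels=labels)
outputs.loss.backward()
```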
Examples of what people fine-tuned BERT for:
- Classification. Sentiment, intent, topic.
- Named entity recognition. Tag spans as PERSON, ORG, DATE.
- Question answering. Given a passage and a question, predict the start and end of the answer span.
- Sentence similarity. Embed two sentences, compare.
- Relevance ranking. Score (query, document) pairs — the foundation of modern search rerankers.
The pattern set by BERT — pretrain once on abundant unlabeled text with a self-supervised objective, then fine-tune on small amounts of expensive labeled data — became the dominant paradigm in NLP for years. RoBERTa, DeBERTa, ALBERT, and countless domain-specific variants all use the same template.
§ 04 · WHERE BERT STILL WINS · Niches that didn’t fully convert to LLMs
Most NLP has shifted to decoder-only LLMs since 2022. But BERT and its descendants still hold key territory:
- Embeddings for retrieval / RAG. Sentence-BERT, BGE, and similar encoder-only models produce the chunk embeddings that power vector search. Decoder LLMs can do this but at much higher inference cost.
- Cross-encoder rerankers. The second stage of modern retrieval pipelines. A BERT-style cross-encoder scores (query, chunk) pairs with much higher accuracy than dense vector similarity alone. Both retrieval roles are sketched in code after this list.
- Production classification with strict latency budgets. A fine-tuned BERT-base does sentiment / intent / safety classification in milliseconds. LLMs would be 100× slower.
- Span tasks. NER, span extraction. Encoder-only architectures still tend to be cleaner for these than asking an LLM to emit JSON.
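A minimal sketch of the two retrieval roles, assuming the sentence-transformers library and two common public checkpoints (any comparable models would do):

```python
# Bi-encoder embeddings vs. cross-encoder reranking with sentence-transformers.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "symptoms of a bacterial infection"
chunks = [
    "Fever and localized pain often indicate a bacterial infection.",
    "The 2018 BERT paper introduced masked language modeling.",
]

# Bi-encoder: embed query and chunks independently, compare by cosine similarity.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
print("bi-encoder:", util.cos_sim(embedder.encode(query), embedder.encode(chunks)))

# Cross-encoder: score each (query, chunk) pair jointly; slower but more accurate.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print("cross-encoder:", reranker.predict([(query, c) for c in chunks]))
```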
§ 05 · TAKING THIS FORWARD · Where the encoder lives now
Encoder-only transformers are quietly more important than the LLM hype cycle suggests. Every RAG system uses one as the embedding model. Every modern reranker is a cross-encoder. Every fast in-app classifier runs on something BERT-shaped. The encoder lost the spotlight but kept the workload.
Modern descendants worth knowing: RoBERTa (cleaner training recipe), DeBERTa (improved attention), ModernBERT (2024 reset with FlashAttention, longer context, faster inference). For embeddings: Sentence-BERT, BGE, E5, Nomic Embed. The toolbox is active even when the headlines are decoder-only.
§ · GOING DEEPER · Masked language modeling, RoBERTa fixes, and where BERT lives in 2026
BERT’s 2018 paper introduced two pretraining objectives: masked language modeling (predict 15% of randomly masked tokens) and next-sentence prediction. RoBERTa (Liu et al. 2019) showed NSP didn’t help and the original training recipe was undertuned — same architecture, longer training, more data, dropped NSP, got significantly better results. That paper is a useful reminder: pretraining recipes are often as load-bearing as architectures.
BERT-family models remain dominant where they have always been strongest: embedding generation (Sentence-BERT, modern Sentence Transformers), cross-encoder rerankers for retrieval, and token-classification tasks (NER, span tagging). DeBERTa (He et al. 2020) introduced disentangled attention and remains a quietly strong choice. ModernBERT (2024) is the architectural refresh — RoPE, longer context, GeGLU activations — bringing the BERT recipe to 2024 engineering norms.
§ · FURTHER READING · References & deeper sources
- Devlin et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · NAACL
- Liu et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach · arXiv
- He et al. (2020). DeBERTa: Decoding-enhanced BERT with Disentangled Attention · ICLR
- Reimers & Gurevych (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks · EMNLP
- Warner et al. (2024). Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder (ModernBERT) · arXiv
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.