BERT
Half a transformer, trained to fill in blanks. BERT showed that an encoder-only model — bidirectionally aware of context — was a near-universal feature extractor for NLP tasks. The classifier that owned NLP from 2018 to 2022.
The five-bullet version
- BERT is the encoder half of a transformer, trained on a self-supervised “fill in the blank” task.
- Bidirectional: each token attends to tokens on both sides — unlike GPT, which is left-to-right.
- Pretrained once on Wikipedia + books, then fine-tuned for downstream tasks (classification, NER, QA).
- Owned NLP benchmarks from 2018 to ~2022; mostly displaced by decoder-only LLMs but still alive for embeddings and small-task classifiers.
- Sentence-BERT, RoBERTa, DeBERTa, ModernBERT — the lineage is still active.
§ 00 · THE ENCODER-ONLY TRANSFORMER · Half the architecture, different job
The original transformer had an encoder and a decoder. The encoder read the source; the decoder produced the target. GPT, in 2018, dropped the encoder — just stack decoder layers and have the model continue any text. BERT (Bidirectional Encoder Representations from Transformers), a few months later, did the opposite — drop the decoder, use only encoder layers, train them on a different objective.
The difference is more than cosmetic. A decoder-only transformer is causal: each token can only attend to tokens before it. This is necessary for next-token generation (the model has to produce one token at a time, without peeking ahead).
BERT’s encoder is bidirectional: each token attends to tokens both before and after. The output is a rich contextual embedding per token, informed by the full sentence. It can’t generate left-to-right text, but it can represent text exceptionally well.
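To make the contrast concrete, here is a minimal sketch of the two attention patterns (assuming PyTorch, which the original text does not specify): the causal mask a decoder-only model applies, versus the all-positions mask BERT’s encoder uses.

```python
# Minimal sketch contrasting the two attention patterns (PyTorch assumed).
import torch

seq_len = 5

# Causal (GPT-style): position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# Bidirectional (BERT-style): every position attends to every position;
# in practice only padding tokens would be masked out.
bidirectional_mask = torch.ones(seq_len, seq_len).bool()

print(causal_mask.int())         # lower-triangular: no peeking ahead
print(bidirectional_mask.int())  # all ones: full-sentence context
```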
§ 01 · MASKED LANGUAGE MODELING · Cloze as pretraining
Bidirectional attention is incompatible with next-token prediction (the model would just copy the next token through the attention). BERT’s training task is masked language modeling (MLM): randomly replace ~15% of tokens with a special [MASK] token, and train the model to predict what was there.
Each masked prediction uses context on both sides. The model learns: what words are plausible given the surrounding sentence? Over billions of examples, this becomes a remarkably rich linguistic skill — without any human-labeled data.
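As a rough illustration of the masking step, the toy sketch below selects ~15% of tokens and swaps them for [MASK], keeping labels only at masked positions. The real recipe also leaves some selected tokens unchanged and swaps others for random tokens, which this sketch omits.

```python
# Toy sketch of MLM masking over a plain list of word tokens.
import random

MASK, IGNORE = "[MASK]", None  # IGNORE marks positions that contribute no loss

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)      # the model must recover the original token here
        else:
            inputs.append(tok)
            labels.append(IGNORE)   # unmasked positions are not predicted
    return inputs, labels

sentence = "the doctor prescribed antibiotics for the infection".split()
print(mask_tokens(sentence))
```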
§ 02 · BIDIRECTIONAL CONTEXT · Why two sides beat one
Compare a sentence with a missing word:
- The doctor prescribed [MASK]. — Could be almost any medicine, or even something more abstract. Left context alone doesn’t narrow it much.
- The doctor prescribed [MASK] for the infection. — Right context narrows to antibiotics, antivirals, antifungals. The model has both sides.
For representing a token, two-sided context is strictly more informative than one-sided. BERT’s embeddings encode each token with awareness of the whole sentence. That richness is what made BERT dominate understanding tasks — classification, NER, sentence similarity — even though it can’t generate text.
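A quick way to see this for yourself: the sketch below runs the infection sentence from above through a fill-mask pipeline, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available.

```python
# Query BERT's masked-token predictions for the example sentence.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The doctor prescribed [MASK] for the infection."):
    # Each prediction carries the filled-in token and its probability.
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```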
§ 03 · FINE-TUNING IS THE POINT · Pretrain once, adapt many times
BERT’s pretraining is expensive — a few days on a TPU pod. Fine-tuning to a downstream task is fast and cheap. Take the pretrained BERT, add a small task-specific head (one or two linear layers), train the whole thing on labeled task data for a few hundred steps. Done.
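A minimal sketch of that workflow, assuming the Hugging Face transformers library: load the pretrained encoder, let the library attach a fresh classification head, and run one supervised step on a tiny made-up batch.

```python
# Fine-tuning sketch: pretrained encoder + new classification head, one step.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # e.g. positive / negative sentiment
)
model.train()

batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])  # toy labels for illustration only

# One training step; in practice you loop over a labeled dataset with an optimizer.
outputs = model(**batch, labels=labels)
outputs.loss.backward()
```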
Examples of what people fine-tuned BERT for:
- Classification. Sentiment, intent, topic.
- Named entity recognition. Tag spans as PERSON, ORG, DATE.
- Question answering. Given a passage and a question, predict the start and end of the answer span.
- Sentence similarity. Embed two sentences, compare.
- Relevance ranking. Score (query, document) pairs — the foundation of modern search rerankers.
The pattern set by BERT — pretrain once on abundant unlabeled text with a self-supervised objective, then fine-tune on small amounts of expensive labeled data — became the dominant paradigm in NLP for years. RoBERTa, DeBERTa, ALBERT, and countless domain-specific variants all use the same template.
§ 04 · WHERE BERT STILL WINS · Niches that didn’t fully convert to LLMs
Most NLP has shifted to decoder-only LLMs since 2022. But BERT and its descendants still hold key territory:
- Embeddings for retrieval / RAG. Sentence-BERT, BGE, and similar encoder-only models produce the chunk embeddings that power vector search. Decoder LLMs can do this but at much higher inference cost.
- Cross-encoder rerankers. The second stage of modern retrieval pipelines. A BERT-style cross-encoder scores (query, chunk) pairs with much higher accuracy than dense vector similarity alone. Both retrieval roles are sketched in code after this list.
- Production classification with strict latency budgets. A fine-tuned BERT-base does sentiment / intent / safety classification in milliseconds. LLMs would be 100× slower.
- Span tasks. NER, span extraction. Encoder-only architectures still tend to be cleaner for these than asking an LLM to emit JSON.
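A minimal sketch of the two retrieval roles, assuming the sentence-transformers library and two common public checkpoints (any comparable models would do):

```python
# Bi-encoder embeddings vs. cross-encoder reranking with sentence-transformers.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "symptoms of a bacterial infection"
chunks = [
    "Fever and localized pain often indicate a bacterial infection.",
    "The 2018 BERT paper introduced masked language modeling.",
]

# Bi-encoder: embed query and chunks independently, compare by cosine similarity.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
print("bi-encoder:", util.cos_sim(embedder.encode(query), embedder.encode(chunks)))

# Cross-encoder: score each (query, chunk) pair jointly; slower but more accurate.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print("cross-encoder:", reranker.predict([(query, c) for c in chunks]))
```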
§ 05 · TAKING THIS FORWARD · Where the encoder lives now
Encoder-only transformers are quietly more important than the LLM hype cycle suggests. Every RAG system uses one as the embedding model. Every modern reranker is a cross-encoder. Every fast in-app classifier runs on something BERT-shaped. The encoder lost the spotlight but kept the workload.
Modern descendants worth knowing: RoBERTa (cleaner training recipe), DeBERTa (improved attention), ModernBERT (2024 reset with FlashAttention, longer context, faster inference). For embeddings: Sentence-BERT, BGE, E5, Nomic Embed. The toolbox is active even when the headlines are decoder-only.
§ · GOING DEEPER · Masked language modeling, RoBERTa fixes, and where BERT lives in 2026
BERT’s 2018 paper introduced two pretraining objectives: masked language modeling (predict 15% of randomly masked tokens) and next-sentence prediction. RoBERTa (Liu et al. 2019) showed NSP didn’t help and the original training recipe was undertuned — same architecture, longer training, more data, dropped NSP, got significantly better results. That paper is a useful reminder: pretraining recipes are often as load-bearing as architectures.
BERT-family models remain dominant where they have always been strongest: embedding generation (Sentence-BERT, modern Sentence Transformers), cross-encoder rerankers for retrieval, and token-classification tasks (NER, span tagging). DeBERTa (He et al. 2020) introduced disentangled attention and remains a quietly strong choice. ModernBERT (2024) is the architectural refresh — RoPE, longer context, GeGLU activations — bringing the BERT recipe to 2024 engineering norms.
§ · FURTHER READING · References & deeper sources
- Devlin et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · NAACL
- Liu et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach · arXiv
- He et al. (2020). DeBERTa: Decoding-enhanced BERT with Disentangled Attention · ICLR
- Reimers & Gurevych (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks · EMNLP
- Warner et al. (2024). Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder (ModernBERT) · arXiv
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.