One-Line Summary: BERT (Devlin et al., 2018) introduced masked language modeling and bidirectional pre-training with an encoder-only Transformer, achieving state-of-the-art results on 11 NLP tasks and triggering the "BERT-ification" of the entire field — the most influential NLP paper since the Transformer itself.

Prerequisites: 01-attention-is-all-you-need.md, 02-gpt-1.md, 05-elmo-and-contextual-embeddings.md

What Is BERT?

Imagine learning a foreign language by reading sentences with random words blacked out and trying to guess the missing words. Unlike reading left to right and predicting the next word (as GPT does), you look at the words both before and after the blank to figure out what goes there. "The ___ sat on the mat" could be "cat," "dog," or "baby" — but "The ___ sat on the mat and purred" is almost certainly "cat." By using context from both directions simultaneously, you develop a richer understanding of each word's meaning in context.

This is exactly what BERT does. In October 2018, Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova at Google AI published what would become the most influential NLP paper of the decade. BERT's core innovation was deceptively simple: randomly mask 15% of input tokens, then train a bidirectional Transformer to predict the masked tokens using context from both left and right. This Masked Language Modeling (MLM) objective allowed the model to build deep bidirectional representations — something GPT-1's causal (left-to-right) objective fundamentally could not do.

The impact was immediate and overwhelming. BERT set new state-of-the-art results on 11 NLP benchmarks simultaneously. Within months, nearly every NLP system was being rebuilt on top of BERT. The term "BERT-ification" entered the vocabulary of the field. Google deployed BERT in Search in October 2019, calling it the biggest improvement to search in five years. With over 60,000 citations, BERT is one of the most cited papers in all of computer science.

How It Works

  BERT vs GPT: Bidirectional vs Unidirectional Context
 
  GPT (Decoder, Causal):            BERT (Encoder, Bidirectional):
  "The [?] sat on the mat"          "The [MASK] sat on the mat"
 
  The ──▶ ?                          The ◀──▶ [MASK] ◀──▶ sat
  (only sees "The")                  (sees ALL surrounding words)
 
  BERT Pre-training: Masked Language Modeling (MLM)
 
  Input:    The  cat  [MASK]  on  the  [MASK]
                        │                │
  Model:  ┌─────────────────────────────────────┐
          │     Bidirectional Transformer        │
          │     (every token sees every token)   │
          └────────────────┬────────────────────┘

  Predict:           "sat"          "mat"
                   (15% of tokens masked and predicted)
 
  BERT Fine-tuning: One Model, Many Tasks
 
  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
  │Classification│  │      QA      │  │     NER      │
  │              │  │              │  │              │
  │ [CLS] ──▶ +/-│  │ token₁...₅   │  │ tok: B-PER   │
  │  (one head)  │  │ ──▶ start/end│  │ ──▶ per-token│
  └──────────────┘  └──────────────┘  └──────────────┘

Figure: BERT's masked language modeling allows each token to attend to all other tokens (bidirectional). Fine-tuning adds a simple task-specific head on top of the same pre-trained model.

Architecture: Encoder-Only Transformer

BERT uses only the encoder portion of the original Transformer, discarding the decoder and cross-attention entirely. It came in two sizes:

  • BERT-Base: 12 layers, 12 attention heads, d_model=768, 110M parameters
  • BERT-Large: 24 layers, 16 attention heads, d_model=1024, 340M parameters

The architecture is virtually identical to GPT-1's Transformer layers, but with one critical difference: no causal mask. Every position can attend to every other position in the sequence, giving the model full bidirectional context. This is why BERT is encoder-only — decoders use causal masking for autoregressive generation, which BERT doesn't need.
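
To make the contrast concrete, here is a minimal sketch of single-head self-attention with and without a causal mask (toy code, not BERT's implementation: queries, keys, and values are left unprojected so only the masking differs):

  import torch
  import torch.nn.functional as F

  def toy_self_attention(x, causal=False):
      # x: (seq_len, d_model). Reusing x as queries, keys, and values
      # (no learned projections) is enough to show the effect of the mask.
      d = x.size(-1)
      scores = x @ x.transpose(-2, -1) / d ** 0.5             # (seq_len, seq_len)
      if causal:
          seq_len = x.size(0)
          future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
          scores = scores.masked_fill(future, float("-inf"))  # GPT-style: hide later positions
      weights = F.softmax(scores, dim=-1)
      return weights @ x

  x = torch.randn(6, 8)                           # 6 tokens, toy hidden size 8
  bert_like = toy_self_attention(x)               # bidirectional: no mask (encoder)
  gpt_like = toy_self_attention(x, causal=True)   # unidirectional: causal mask (decoder)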

Masked Language Modeling (MLM)

The primary pre-training objective randomly selects 15% of input tokens for prediction. Of those selected tokens, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged. This distribution was carefully chosen: if every selected token were replaced with [MASK], there would be a mismatch between pre-training and fine-tuning, where the [MASK] token never appears. The occasional random replacement adds noise that encourages robust representations.

The model predicts the original token for each selected position using a softmax over the vocabulary, trained with cross-entropy loss. Because the model doesn't know which tokens have been altered, it must maintain a good representation for every position — not just the masked ones.
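
A minimal sketch of this selection-and-corruption step (illustrative only; the mask id, vocabulary size, and function name are placeholders rather than BERT's actual WordPiece vocabulary):

  import random

  def mlm_corrupt(token_ids, mask_id, vocab_size, select_prob=0.15):
      # Select ~15% of positions; of those, 80% become [MASK], 10% become a
      # random token, 10% stay unchanged. Targets are -1 where no loss applies.
      corrupted, targets = [], []
      for tok in token_ids:
          if random.random() < select_prob:
              targets.append(tok)                       # predict the original token here
              r = random.random()
              if r < 0.8:
                  corrupted.append(mask_id)             # 80%: replace with [MASK]
              elif r < 0.9:
                  corrupted.append(random.randrange(vocab_size))  # 10%: random token
              else:
                  corrupted.append(tok)                 # 10%: keep the original token
          else:
              targets.append(-1)                        # not selected: no prediction loss
              corrupted.append(tok)
      return corrupted, targets

  ids = list(range(1000, 1010))                         # toy 10-token "sentence"
  corrupted, targets = mlm_corrupt(ids, mask_id=103, vocab_size=30_000)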

Next Sentence Prediction (NSP)

The secondary pre-training objective was Next Sentence Prediction: given two sentences A and B, predict whether B actually follows A in the original text (50% positive, 50% random negatives). The input format was: [CLS] sentence A [SEP] sentence B [SEP]. This was intended to help with tasks requiring understanding of sentence-pair relationships (entailment, question answering).
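
A small sketch of how a sentence pair is packed into this format, with the segment ids that tell the model which tokens belong to A and which to B (the helper name and plain-word tokens are illustrative; real inputs are WordPiece ids):

  def build_nsp_input(tokens_a, tokens_b):
      # [CLS] A [SEP] B [SEP], with segment id 0 for sentence A and 1 for sentence B.
      tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
      segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
      return tokens, segment_ids

  tokens, segments = build_nsp_input(["the", "cat", "sat"], ["it", "purred"])
  # tokens   = [CLS] the cat sat [SEP] it purred [SEP]
  # segments =   0    0   0   0    0    1    1     1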

NSP would later prove to be BERT's most controversial design choice. RoBERTa (01-roberta.md) demonstrated that removing NSP actually improved performance, suggesting it was either insufficiently challenging or actively harmful as a pre-training signal.

Pre-training Details

BERT was pre-trained on BooksCorpus (800M words) + English Wikipedia (2,500M words) — about 3.3 billion words total (16GB of text). Training used batches of 256 sequences of 512 tokens (roughly 128K tokens per batch) for 1M steps. BERT-Base was trained on 4 Cloud TPUs (16 TPU chips) and BERT-Large on 16 Cloud TPUs (64 TPU chips), each for approximately 4 days. The optimizer was Adam with a peak learning rate of 1e-4, warm-up over the first 10,000 steps, and linear decay.
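
That learning-rate schedule is simple enough to write down directly; the sketch below assumes linear decay all the way to zero at the final step, which is the usual reading of the paper's description:

  def bert_lr(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
      # Linear warm-up to peak_lr over the first warmup_steps, then linear
      # decay toward zero at total_steps.
      if step < warmup_steps:
          return peak_lr * step / warmup_steps
      return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

  # bert_lr(5_000)   == 5e-5  (halfway through warm-up)
  # bert_lr(505_000) == 5e-5  (halfway through the decay phase)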

Fine-tuning: Elegant Simplicity

BERT's fine-tuning approach was remarkably simple: add a single task-specific output layer and fine-tune the entire model end-to-end. For classification, use the [CLS] token's representation. For token-level tasks (NER, QA), use the corresponding token representations. Fine-tuning typically took 1-3 epochs on a single GPU, requiring only hours — making BERT accessible to the broader research community.
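
For illustration, here is roughly what that looks like with the Hugging Face Transformers library (a later ecosystem tool discussed below, not part of the original release); BertForSequenceClassification is just the pre-trained encoder plus one randomly initialized linear head on the [CLS] representation:

  import torch
  from transformers import BertTokenizer, BertForSequenceClassification

  tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
  model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

  inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
  labels = torch.tensor([1])                    # e.g. 1 = positive sentiment

  outputs = model(**inputs, labels=labels)      # forward pass returns loss and logits
  outputs.loss.backward()                       # gradients flow through the whole model,
                                                # not just the new classification head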

Why It Matters

The BERT-ification of NLP

BERT didn't just set new records — it made previous approaches obsolete overnight. Research groups worldwide abandoned their existing models and rebuilt on BERT. The GLUE benchmark, introduced as a competitive leaderboard only months earlier, was quickly saturated, prompting the creation of SuperGLUE (harder tasks). Entire subfields of NLP — NER, sentiment analysis, question answering, relation extraction — converged on "fine-tune BERT" as the default methodology.

Bidirectional Context as a Paradigm

BERT demonstrated that bidirectional context was crucial for deep language understanding. Consider "I went to the bank to deposit my check" versus "I went to the bank of the river." A left-to-right model must process "bank" before seeing "deposit" or "river" — it must make a preliminary judgment that may be wrong. BERT sees the entire sentence simultaneously, allowing it to resolve ambiguity from the start. This advantage was real and measurable across virtually every understanding task.
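
This is easy to observe with a pre-trained BERT through the Transformers fill-mask pipeline (a quick sketch; it assumes the transformers library and model weights are available, and the exact rankings may vary):

  from transformers import pipeline

  fill = pipeline("fill-mask", model="bert-base-uncased")

  # The right-hand context ("and purred") is visible when BERT fills the blank.
  for pred in fill("The [MASK] sat on the mat and purred.")[:3]:
      print(pred["token_str"], round(pred["score"], 3))
  # "cat" should rank at or near the top once "purred" appears in the context.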

Democratizing NLP Research

BERT was released as open-source with pre-trained weights, making state-of-the-art NLP accessible to anyone with a single GPU. Fine-tuning took hours, not weeks. This democratization led to an explosion of BERT-based research and applications. The Hugging Face Transformers library, initially built around BERT, grew into the dominant platform for NLP model sharing and deployment.

Key Technical Details

  • Paper: Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Oct 2018, arXiv:1810.04805, NAACL 2019)
  • BERT-Base: 12 layers, 12 heads, 768 hidden, 110M params
  • BERT-Large: 24 layers, 16 heads, 1024 hidden, 340M params
  • Pre-training data: BooksCorpus (800M words) + English Wikipedia (2.5B words) = ~3.3B words, ~16GB text
  • Training: ~4 days of pre-training; BERT-Base on 4 Cloud TPUs (16 TPU chips), BERT-Large on 16 Cloud TPUs (64 TPU v2 chips); estimated compute cost ~$50K
  • MLM masking: 15% of tokens selected; 80% masked, 10% random, 10% unchanged
  • Context window: 512 tokens (with segment embeddings for two-sentence inputs)
  • Results: SOTA on 11 tasks, including GLUE score 80.5 (+7.7 absolute), SQuAD v1.1 Test F1 93.2 (+1.5), and SQuAD v2.0 Test F1 83.1 (+5.1)
  • Citations: Over 60,000 (among the top 10 most cited CS papers of all time)
  • Google Search deployment: October 2019; called "the biggest improvement to search in five years"

Common Misconceptions

  • "BERT is a generative model." BERT cannot generate text autoregressively. It is an encoder-only model designed for understanding tasks — classification, extraction, matching. For generation, you need decoder-only models like 02-gpt-1.md or encoder-decoder models like 05-t5-text-to-text-framework.md.

  • "BERT's bidirectional attention sees the future." In a sense, yes — BERT processes the entire input simultaneously, so position 5 can attend to position 10. But this is not "seeing the future" in the autoregressive sense. BERT processes a fixed input; it doesn't generate tokens one by one. The distinction is between understanding a given text (BERT) and generating new text (GPT).

  • "BERT is the best model for all NLP tasks." BERT excels at understanding tasks (classification, NER, QA) but cannot generate text. For generation, summarization, or translation, encoder-decoder (05-t5-text-to-text-framework.md) or decoder-only models are necessary. Even for understanding tasks, later models like 01-roberta.md and 04-deberta.md outperformed BERT.

  • "Masked Language Modeling was a novel idea." The cloze task (fill in the blank) dates back to Taylor (1953) in psycholinguistics. Denoising autoencoders had been used in deep learning. BERT's innovation was applying this idea to Transformer pre-training at scale, combined with the bidirectional attention that made it powerful.

  • "Next Sentence Prediction was essential to BERT's success." 01-roberta.md showed that removing NSP improved performance. NSP was likely too easy (random negative sentence pairs were trivially distinguishable) and may have diluted the training signal.

Connections to Other Concepts

  • Built on the Transformer encoder from 01-attention-is-all-you-need.md
  • Responded to GPT-1's unidirectional limitation (02-gpt-1.md) with bidirectional context
  • Extended the contextual embedding idea from 05-elmo-and-contextual-embeddings.md using Transformers instead of LSTMs
  • Spawned an entire ecosystem: 01-roberta.md, 02-albert.md, 03-distilbert.md, 04-deberta.md, 05-electra.md
  • The encoder-only vs decoder-only debate is analyzed in 07-encoder-vs-decoder-vs-encoder-decoder.md
  • Encoder models continued to evolve via 06-modernbert-and-the-encoder-revival.md
  • For deep coverage of MLM and training objectives, see llm-concepts/pre-training-and-fine-tuning.md

Further Reading

  • Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (2018, arXiv:1810.04805) — the original paper
  • Clark et al., "What Does BERT Look At? An Analysis of BERT's Attention" (2019, arXiv:1906.04341) — analyzing what BERT's attention heads learn
  • Rogers et al., "A Primer in BERTology: What We Know About How BERT Works" (2020, arXiv:2002.12327) — comprehensive survey of BERT analysis research
  • Wang et al., "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems" (2019, arXiv:1905.00537) — the harder benchmark created because BERT saturated GLUE
  • Nayak, "Understanding searches better than ever before" (Oct 2019, Google Blog) — announcing BERT in Google Search