Named Entity Recognition

One-Line Summary: Named entity recognition (NER) identifies and classifies spans of text that refer to real-world entities such as persons, organizations, locations, dates, and other domain-specific categories.

Prerequisites: part-of-speech-tagging.md, contextual-embeddings.md, bidirectional-rnns.md, bert.md, data-annotation-and-labeling.md

What Is Named Entity Recognition?

Imagine highlighting every name, place, organization, and date in a newspaper article with different colored markers -- yellow for people, blue for locations, green for organizations, and red for dates. NER does exactly this automatically: given raw text, it identifies entity mentions and assigns each a semantic type.

Formally, NER is a sequence labeling task: given an input sequence of tokens $x_{1}, x_{2}, \dots, x_{n}$ , the model assigns a label $y_{i}$ to each token from a set that includes both entity types and a special non-entity tag ("O" for Outside). The standard label set includes PER (person), ORG (organization), LOC (location), and MISC (miscellaneous), though domain-specific schemas add types like DISEASE, CHEMICAL, GENE (biomedical), or MONEY, PERCENT, DATE (financial). NER is a foundational step in information extraction pipelines, feeding directly into relation-extraction.md, coreference-resolution.md, and knowledge-graphs-for-nlp.md.

How It Works

The BIO/IOB2 Tagging Scheme

Because entities span multiple tokens, NER uses a positional tagging scheme. In IOB2 (Inside-Outside-Beginning), each entity-starting token gets a B-TYPE tag, continuation tokens get I-TYPE, and non-entity tokens get O:

[Barack]  [Obama]  [visited]  [New]   [York]   [City]
 B-PER    I-PER    O          B-LOC   I-LOC    I-LOC

This encoding resolves adjacency ambiguity: two consecutive entities of the same type are distinguished by the B tag starting the second entity.

CRF-Based Approaches

Conditional Random Fields (CRFs) model the conditional probability of the entire label sequence given the input, capturing dependencies between adjacent labels (e.g., I-PER should not follow B-LOC). Hand-crafted features include word shapes (capitalization, digits), prefixes/suffixes, gazetteer membership, and POS tags from part-of-speech-tagging.md. CRF-based systems achieved ~89 F1 on CoNLL-2003 English NER.

BiLSTM-CRF

Lample et al. (2016) introduced the BiLSTM-CRF architecture that became the standard neural NER model for several years. The pipeline:

Character-level CNN/LSTM encodes sub-word features (capturing morphology, capitalization).
Word embeddings (pre-trained GloVe or Word2Vec) provide semantic features.
Bidirectional LSTM contextualizes each token using left and right context.
CRF layer on top models label-transition constraints, ensuring valid BIO sequences.

This architecture achieved ~91 F1 on CoNLL-2003 without hand-crafted features, demonstrating that neural representations could replace manual feature engineering.

BERT for NER

Fine-tuning BERT for NER treats each sub-word token as a classification target. Since BERT uses WordPiece tokenization, entity labels are typically assigned only to the first sub-token of each word, and the remaining sub-tokens are ignored during loss computation. BERT-large achieves ~92.8 F1 on CoNLL-2003; with document-level context and careful hyperparameter tuning, scores exceed 93 F1.

Nested and Few-Shot NER

Nested NER handles overlapping entities: in "Bank of America headquarters," both "Bank of America" (ORG) and "America" (LOC) are valid entities. Span-based models that classify all possible spans, rather than tagging individual tokens, naturally handle nesting.

Few-shot NER addresses the challenge of recognizing new entity types with minimal labeled examples. Approaches include prototype networks that learn entity-type representations from a handful of examples and prompt-based methods that reformulate NER as a text generation task using large language models.

Why It Matters

Information extraction backbone: NER is the first step in extracting structured knowledge from unstructured text, feeding relation-extraction.md and knowledge-graphs-for-nlp.md.
Search and indexing: Named entities are high-value index terms; entity-aware search dramatically improves precision for queries about specific people, places, or organizations.
Document understanding: Legal, financial, and medical documents require entity identification for automated processing (e.g., extracting party names from contracts).
Content personalization: Identifying entities in news articles enables topic linking, recommendation, and knowledge panel generation.
De-identification: In healthcare and legal domains, NER identifies protected health information (PHI) and personally identifiable information (PII) for anonymization.

Key Technical Details

CoNLL-2003 English benchmark: 4 entity types (PER, LOC, ORG, MISC); SOTA ~94 F1 (ensemble models with document context).
OntoNotes 5.0: 18 entity types; SOTA ~92 F1; more diverse and challenging than CoNLL-2003.
BiLSTM-CRF (Lample et al., 2016): ~91.0 F1 on CoNLL-2003 without external gazetteers.
BERT-large fine-tuned: ~92.8 F1 on CoNLL-2003; with additional techniques (document context, entity typing), ~93.5+ F1.
Biomedical NER (BC5CDR dataset): chemical/disease recognition at ~90 F1 using BioBERT.
Speed: CRF-based systems process ~10,000 tokens/sec on CPU; BERT NER processes ~5,000 tokens/sec on GPU.
Annotation cost: NER annotation typically requires 2--5 minutes per 100-token passage, with inter-annotator agreement (F1) around 95--97% for well-defined entity types.

Common Misconceptions

"NER is just dictionary lookup." Gazetteer-based lookup catches known entities but misses novel ones ("Elon Musk's new company Neuralink" before it was widely known), ambiguous mentions ("Apple" as company vs. fruit), and morphological variants. Context-dependent classification is essential.

"NER and entity linking are the same task." NER identifies entity mentions and their types; entity linking (or entity disambiguation) maps those mentions to specific entries in a knowledge base (e.g., mapping "Obama" to the Wikidata entity Q76). They are related but distinct pipeline stages.

"Flat NER is sufficient for all applications." Many real-world texts contain nested entities ("University of California, Berkeley" contains LOC "California" inside ORG). Flat IOB tagging cannot represent overlaps, motivating span-based and layered approaches.

"High F1 on CoNLL-2003 means NER is solved." CoNLL-2003 is a clean, news-domain benchmark. Performance drops 10--20 F1 points on noisy social media text, low-resource languages, and domains with specialized entity types (biomedical, legal). Robustness and generalization remain open challenges.

Connections to Other Concepts

NER extends the sequence labeling paradigm of part-of-speech-tagging.md to entity-level spans.
Entity mentions identified by NER feed directly into relation-extraction.md for structured knowledge extraction.
coreference-resolution.md links different NER mentions that refer to the same entity.
The BiLSTM-CRF architecture builds on bidirectional-rnns.md and long-short-term-memory.md.
Fine-tuning approaches leverage bert.md and contextual-embeddings.md.
NER is a core component of information-extraction.md and knowledge-graphs-for-nlp.md pipelines.
data-annotation-and-labeling.md covers the annotation schemes and inter-annotator agreement metrics critical for NER dataset creation.