One-Line Summary: Constituency parsing decomposes sentences into hierarchical phrase-structure trees, revealing how words group into nested constituents (noun phrases, verb phrases, etc.) according to a grammar.

Prerequisites: syntax-and-grammar.md, part-of-speech-tagging.md, n-gram-language-models.md, attention-mechanism.md

What Is Constituency Parsing?

Think of a Russian nesting doll: each doll contains smaller dolls inside it. Constituency parsing treats sentences the same way -- a sentence (S) contains a noun phrase (NP) and a verb phrase (VP); the VP contains a verb and another NP; that NP contains a determiner and a noun. Each layer of nesting is a constituent, a group of words that functions as a single syntactic unit.

Formally, a constituency parse (or phrase-structure tree) is a rooted, ordered tree where leaves are the words of the sentence, internal nodes are phrasal categories (S, NP, VP, PP, ADJP, ADVP, etc.), and pre-terminal nodes are POS tags. The tree is generated by a context-free grammar (CFG) -- a set of rewrite rules like S -> NP VP, NP -> DT NN, VP -> VBD NP PP. Unlike dependency-parsing.md, which links words directly, constituency parsing introduces an explicit phrasal hierarchy that captures how word groups behave as units (e.g., "the big red ball" moves as a whole: "What did you throw? The big red ball.").
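To make the bracketed-tree notation concrete, here is a small Penn Treebank-style parse loaded with NLTK's Tree class (a sketch for illustration; any treebank toolkit would do):

```python
from nltk import Tree

# "The dog chased the ball" in bracketed notation: pre-terminals
# are POS tags, internal nodes are phrasal categories.
tree = Tree.fromstring(
    "(S (NP (DT the) (NN dog))"
    "   (VP (VBD chased) (NP (DT the) (NN ball))))"
)
tree.pretty_print()      # renders the tree as ASCII art
print(tree.leaves())     # ['the', 'dog', 'chased', 'the', 'ball']
print(tree[1].label())   # 'VP' -- the verb phrase constituent
```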

How It Works

Context-Free Grammars (CFGs)

A CFG consists of non-terminal symbols (phrasal categories), terminal symbols (words), production rules (e.g., VP -> VBD NP), and a start symbol S. A sentence is grammatical if it can be derived from S using the production rules. The key limitation of bare CFGs is ambiguity: a single sentence may have exponentially many parse trees.
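A toy grammar makes the ambiguity problem tangible. In the sketch below (the grammar is illustrative, not drawn from any treebank), NLTK's chart parser finds two parses of the classic PP-attachment sentence -- and each additional PP multiplies the count:

```python
import nltk

# Illustrative toy grammar: NP and VP can both absorb a PP,
# which is exactly what creates attachment ambiguity.
grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> DT NN | NP PP | 'I'
    VP  -> VBD NP | VP PP
    PP  -> IN NP
    DT  -> 'an' | 'my'
    NN  -> 'elephant' | 'pajamas'
    VBD -> 'shot'
    IN  -> 'in'
""")
parser = nltk.ChartParser(grammar)
sent = "I shot an elephant in my pajamas".split()
for tree in parser.parse(sent):
    print(tree)   # two trees: the PP attaches to the NP or to the VP
```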

The CKY Algorithm

The Cocke-Kasami-Younger (CKY) algorithm is a dynamic programming chart parser that finds all valid parses of a sentence under a CFG in Chomsky Normal Form (rules restricted to A -> B C or A -> w). It fills a triangular chart bottom-up in O(n^3 |G|) time, where n is the sentence length and |G| is the grammar size. Each cell [i, j] stores the set of non-terminals that can derive the span from word i to word j.
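The chart-filling loop is compact enough to write out in full. A minimal recognizer sketch follows; the dict-based grammar encoding is an assumption made for brevity, and a full parser would also store back-pointers to recover trees:

```python
from itertools import product

def cky_recognize(words, lexical, binary):
    """CKY recognizer for a grammar in Chomsky Normal Form.

    lexical: dict mapping word -> set of non-terminals A with A -> word
    binary:  dict mapping (B, C) -> set of non-terminals A with A -> B C
    """
    n = len(words)
    # chart[i][j] holds the non-terminals that derive words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):              # length-1 spans
        chart[i][i + 1] = set(lexical.get(w, ()))
    for length in range(2, n + 1):             # longer spans, bottom-up
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):          # every split point
                for B, C in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= binary.get((B, C), set())
    return 'S' in chart[0][n]

# Unary rules are collapsed here (CNF), so 'barked' derives VP directly.
lexical = {'the': {'DT'}, 'dog': {'NN'}, 'barked': {'VP'}}
binary = {('DT', 'NN'): {'NP'}, ('NP', 'VP'): {'S'}}
print(cky_recognize("the dog barked".split(), lexical, binary))  # True
```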

Probabilistic Context-Free Grammars (PCFGs)

PCFGs augment each rule with a conditional probability P(A -> β | A), so that the probabilities of all expansions of a given non-terminal sum to 1. The probability of a tree is the product of all rule probabilities used in the derivation. The CKY algorithm extends naturally to PCFGs by keeping the highest-probability derivation at each chart cell (Viterbi-style). PCFGs trained on the Penn Treebank achieve roughly 73% labeled bracketing F1 -- reasonable but far from human performance, because they fail to capture lexical dependencies (the verb "ate" prefers NP objects, while "slept" does not).
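NLTK packages exactly this Viterbi extension as ViterbiParser. A sketch on a toy grammar, with rule probabilities invented purely for illustration:

```python
import nltk

# Toy PCFG: each non-terminal's expansion probabilities must sum to 1.
grammar = nltk.PCFG.fromstring("""
    S   -> NP VP    [1.0]
    NP  -> DT NN    [0.7]
    NP  -> NNS      [0.3]
    VP  -> VBD NP   [0.6]
    VP  -> VBD      [0.4]
    DT  -> 'the'    [1.0]
    NN  -> 'dog'    [1.0]
    NNS -> 'dogs'   [1.0]
    VBD -> 'barked' [1.0]
""")
parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("the dog barked".split()):
    print(tree)         # highest-probability tree
    print(tree.prob())  # product of rule probs: 1.0 * 0.7 * 0.4 = 0.28
```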

Lexicalized PCFGs (Collins, 1997; Charniak, 2000) propagate head words through the tree, conditioning rule probabilities on lexical items. This boosts F1 to ~89--90% but introduces data sparsity and slower parsing due to the enlarged rule space.

Neural Constituency Parsers

The breakthrough came with neural chart parsers. Stern et al. (2017) used a BiLSTM encoder to score each span and label, then applied CKY to find the highest-scoring tree. Kitaev & Klein (2018) refined this approach with self-attention (transformer) encoders, achieving ~95.1% F1 on PTB. Their parser scores each labeled span as

s(i, j, l) = v_l · f(y_i, y_j)

where y_i and y_j are token representations from the encoder, f combines them into a span representation, and v_l is a label embedding. With BERT as the encoder (Kitaev et al., 2019), F1 reaches ~95.8%, and subsequent models exceed 96%.
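A toy numpy sketch of this span-scoring idea (shapes are invented, and f here is a bare endpoint difference; the published models learn f as an MLP over directional encodings, trained end-to-end):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, num_labels = 6, 16, 4            # toy sizes: 6 tokens, 4 labels

y = rng.normal(size=(n + 1, d))        # fencepost encodings y_0 .. y_n
V = rng.normal(size=(num_labels, d))   # one scoring vector v_l per label

def span_scores(i, j):
    """s(i, j, l) = v_l . f(y_i, y_j), with f as an endpoint difference."""
    return V @ (y[j] - y[i])

# Score every span; running CKY over these scores (instead of rule
# probabilities) recovers the highest-scoring tree.
chart = {(i, j): span_scores(i, j)
         for i in range(n) for j in range(i + 1, n + 1)}
print(chart[(0, n)].round(2))          # label scores for the whole sentence
```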

Comparison to Dependency Parsing

| Aspect | Constituency | Dependency |
|---|---|---|
| Nodes | Words + phrases | Words only |
| Relations | Hierarchical containment | Head-dependent arcs |
| Grammar theory | Phrase structure (Chomsky) | Dependency grammar (Tesnière) |
| Best for | Fixed word order languages | Free word order languages |
| PTB F1/UAS | ~96% F1 | ~96% UAS |

Constituency and dependency trees are inter-convertible under certain assumptions (head rules), and both carry complementary syntactic information. Many NLP systems use whichever representation better suits the downstream task.
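The constituency-to-dependency direction is mechanical once head rules are fixed. A simplified head-percolation sketch, where the two-entry rule table is a toy stand-in for the full Collins/Stanford head tables and bare words stand in for indexed tokens:

```python
from nltk import Tree

# Toy head rules: which child category heads each phrase type.
HEAD_RULES = {'S': ['VP'], 'NP': ['NN'], 'VP': ['VBD']}

def find_head(tree, deps):
    """Return the head word of `tree`, recording (dependent, head) arcs."""
    if isinstance(tree[0], str):           # pre-terminal: POS tag over a word
        return tree[0]
    children = [find_head(child, deps) for child in tree]
    labels = [child.label() for child in tree]
    head_idx = 0                           # default: leftmost child heads
    for cat in HEAD_RULES.get(tree.label(), []):
        if cat in labels:
            head_idx = labels.index(cat)
            break
    for i, word in enumerate(children):    # non-heads depend on the head
        if i != head_idx:
            deps.append((word, children[head_idx]))
    return children[head_idx]

tree = Tree.fromstring(
    "(S (NP (DT the) (NN dog)) (VP (VBD chased) (NP (DT the) (NN ball))))")
deps = []
print(find_head(tree, deps))  # 'chased' -- the sentence's head word
print(deps)                   # arcs like ('dog', 'chased'), ('ball', 'chased')
```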

Why It Matters

  1. Grammar checking and style analysis: Constituency trees reveal whether a sentence is well-formed and identify structural patterns for style recommendations.
  2. Machine translation: Syntax-aware MT models use source-side constituency parses for reordering and hierarchical phrase extraction.
  3. Semantic parsing: Constituency structure informs compositional semantic interpretation, connecting to semantics.md.
  4. Sentiment compositionality: The Stanford Sentiment Treebank assigns sentiment labels to every node in the constituency tree, enabling recursive sentiment composition (Socher et al., 2013).
  5. Information extraction: Phrase structure identifies noun phrases that may contain entity mentions for named-entity-recognition.md.
  6. Linguistic research: Constituency parsing is the primary tool for studying phrase structure across languages and testing syntactic theories.

Key Technical Details

  • Penn Treebank (WSJ section 23): Standard English benchmark; SOTA ~96.3% labeled bracketing F1.
  • PCFG baseline: ~73% F1; lexicalized PCFGs ~89%; neural chart parsers ~95.5--96.3%.
  • Kitaev & Klein (2018): Self-attentive encoder achieving ~95.1% F1 without pre-training; ~95.8% with BERT.
  • Speed: CKY is O(n^3); neural chart parsers process ~100--300 sentences/sec on GPU for average-length sentences.
  • Cross-lingual: SPMRL shared task covers 9 morphologically rich languages; best F1 ranges from 83% (Arabic) to 92% (Swedish).
  • Tree depth: Average PTB tree depth is ~8 for a 23-word sentence; deeper trees are harder to parse correctly.
  • Annotation cost: Constituency annotation is ~3--5x more expensive than dependency annotation per sentence, due to the hierarchical structure.

Common Misconceptions

"PCFGs capture all syntactic phenomena." PCFGs assume context-free independence -- the expansion of a node does not depend on its location in the tree. This misses lexical preferences, coordination constraints, and many other context-sensitive patterns. Lexicalized grammars and neural models address this limitation.

"Constituency parsing is obsolete in the neural era." While some end-to-end systems bypass explicit parsing, constituency trees remain valuable for interpretability, linguistic analysis, and tasks where hierarchical structure matters (e.g., discourse parsing, code generation). Neural parsers have actually revived interest by making high-quality parsing fast and accurate.

"Higher F1 means better linguistic quality." Bracketing F1 measures structural overlap between predicted and gold trees, but it overweights long spans and may not penalize linguistically significant errors (e.g., PP-attachment) proportionally. Some linguistically important distinctions have minimal impact on F1.

"Constituency parsing and dependency parsing are competing approaches." They represent different theoretical perspectives on syntax and are often complementary. Stanford Dependencies are routinely converted from Penn Treebank constituency trees, and many NLP pipelines use both representations.

Connections to Other Concepts

  • Constituency parsing implements the phrase-structure grammar formalism described in syntax-and-grammar.md.
  • It builds on part-of-speech-tagging.md -- POS tags form the pre-terminal layer of the constituency tree.
  • It complements dependency-parsing.md, which captures head-modifier relations without phrasal nodes.
  • PCFGs extend n-gram-language-models.md to hierarchical structure through probabilistic rules.
  • Neural parsers leverage attention-mechanism.md and contextual-embeddings.md for span representations.
  • Constituency trees support compositional sentiment in sentiment-analysis.md (Socher et al., 2013).
  • Phrase boundaries from constituency parsing inform semantic-role-labeling.md span identification.

Further Reading

  • Jurafsky & Martin, Speech and Language Processing (Ch. 13-14), 2024 -- comprehensive textbook treatment of CFGs, CKY, and PCFGs.
  • Collins, Three Generative, Lexicalised Models for Statistical Parsing, 1997 -- introduced lexicalized PCFGs that dominated parsing for a decade.
  • Kitaev & Klein, Constituency Parsing with a Self-Attentive Encoder, 2018 -- the neural chart parser that set a new accuracy standard.
  • Kitaev et al., Multilingual Constituency Parsing with Self-Attention and Pre-Training, 2019 -- extended the self-attentive parser with BERT for multilingual constituency parsing.
  • Stern et al., A Minimal Span-Based Neural Constituency Parser, 2017 -- demonstrated that span scoring with neural encoders could replace grammar-based chart parsing.