One-Line Summary: Constituency parsing decomposes sentences into hierarchical phrase-structure trees, revealing how words group into nested constituents (noun phrases, verb phrases, etc.) according to a grammar.

Prerequisites: syntax-and-grammar.md, part-of-speech-tagging.md, n-gram-language-models.md, attention-mechanism.md

What Is Constituency Parsing?

Think of a Russian nesting doll: each doll contains smaller dolls inside it. Constituency parsing treats sentences the same way -- a sentence (S) contains a noun phrase (NP) and a verb phrase (VP); the VP contains a verb and another NP; that NP contains a determiner and a noun. Each layer of nesting is a constituent, a group of words that functions as a single syntactic unit.

Formally, a constituency parse (or phrase-structure tree) is a rooted, ordered tree where leaves are the words of the sentence, internal nodes are phrasal categories (S, NP, VP, PP, ADJP, ADVP, etc.), and pre-terminal nodes are POS tags. The tree is generated by a context-free grammar (CFG) -- a set of rewrite rules like S -> NP VP, NP -> DT NN, VP -> VBD NP PP. Unlike dependency-parsing.md, which links words directly, constituency parsing introduces an explicit phrasal hierarchy that captures how word groups behave as units (e.g., "the big red ball" moves as a whole: "What did you throw? The big red ball.").
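To make the bracketed-tree notation concrete, here is a small Penn Treebank-style parse loaded with NLTK's Tree class (a sketch for illustration; any treebank toolkit would do):

```python
from nltk import Tree

# "The dog chased the ball" in bracketed notation: pre-terminals
# are POS tags, internal nodes are phrasal categories.
tree = Tree.fromstring(
    "(S (NP (DT the) (NN dog))"
    "   (VP (VBD chased) (NP (DT the) (NN ball))))"
)
tree.pretty_print()      # renders the tree as ASCII art
print(tree.leaves())     # ['the', 'dog', 'chased', 'the', 'ball']
print(tree[1].label())   # 'VP' -- the verb phrase constituent
```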

How It Works

Context-Free Grammars (CFGs)

A CFG consists of non-terminal symbols (phrasal categories), terminal symbols (words), production rules (e.g., VP -> VBD NP), and a start symbol S. A sentence is grammatical if it can be derived from S using the production rules. The key limitation of bare CFGs is ambiguity: a single sentence may have exponentially many parse trees.
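A toy grammar makes the ambiguity problem tangible. In the sketch below (the grammar is illustrative, not drawn from any treebank), NLTK's chart parser finds two parses of the classic PP-attachment sentence -- and each additional PP multiplies the count:

```python
import nltk

# Illustrative toy grammar: NP and VP can both absorb a PP,
# which is exactly what creates attachment ambiguity.
grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> DT NN | NP PP | 'I'
    VP  -> VBD NP | VP PP
    PP  -> IN NP
    DT  -> 'an' | 'my'
    NN  -> 'elephant' | 'pajamas'
    VBD -> 'shot'
    IN  -> 'in'
""")
parser = nltk.ChartParser(grammar)
sent = "I shot an elephant in my pajamas".split()
for tree in parser.parse(sent):
    print(tree)   # two trees: the PP attaches to the NP or to the VP
```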

The CKY Algorithm

The Cocke-Kasami-Younger (CKY) algorithm is a dynamic programming chart parser that finds all valid parses of a sentence under a CFG in Chomsky Normal Form (rules restricted to A -> B C or A -> w). It fills a triangular chart bottom-up in O(n^3 |G|) time, where n is the sentence length and |G| is the grammar size. Each cell [i, j] stores the set of non-terminals that can derive the span from word i to word j.
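The chart-filling loop is compact enough to write out in full. A minimal recognizer sketch follows; the dict-based grammar encoding is an assumption made for brevity, and a full parser would also store back-pointers to recover trees:

```python
from itertools import product

def cky_recognize(words, lexical, binary):
    """CKY recognizer for a grammar in Chomsky Normal Form.

    lexical: dict mapping word -> set of non-terminals A with A -> word
    binary:  dict mapping (B, C) -> set of non-terminals A with A -> B C
    """
    n = len(words)
    # chart[i][j] holds the non-terminals that derive words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):              # length-1 spans
        chart[i][i + 1] = set(lexical.get(w, ()))
    for length in range(2, n + 1):             # longer spans, bottom-up
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):          # every split point
                for B, C in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= binary.get((B, C), set())
    return 'S' in chart[0][n]

# Unary rules are collapsed here (CNF), so 'barked' derives VP directly.
lexical = {'the': {'DT'}, 'dog': {'NN'}, 'barked': {'VP'}}
binary = {('DT', 'NN'): {'NP'}, ('NP', 'VP'): {'S'}}
print(cky_recognize("the dog barked".split(), lexical, binary))  # True
```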

Probabilistic Context-Free Grammars (PCFGs)

PCFGs augment each rule with a conditional probability P(A -> β | A), so that the probabilities of all expansions of a given non-terminal sum to 1. The probability of a tree is the product of all rule probabilities used in the derivation. The CKY algorithm extends naturally to PCFGs by keeping the highest-probability derivation at each chart cell (Viterbi-style). PCFGs trained on the Penn Treebank achieve roughly 73% labeled bracketing F1 -- reasonable but far from human performance, because they fail to capture lexical dependencies (the verb "ate" prefers NP objects, while "slept" does not).
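NLTK packages exactly this Viterbi extension as ViterbiParser. A sketch on a toy grammar, with rule probabilities invented purely for illustration:

```python
import nltk

# Toy PCFG: each non-terminal's expansion probabilities must sum to 1.
grammar = nltk.PCFG.fromstring("""
    S   -> NP VP    [1.0]
    NP  -> DT NN    [0.7]
    NP  -> NNS      [0.3]
    VP  -> VBD NP   [0.6]
    VP  -> VBD      [0.4]
    DT  -> 'the'    [1.0]
    NN  -> 'dog'    [1.0]
    NNS -> 'dogs'   [1.0]
    VBD -> 'barked' [1.0]
""")
parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("the dog barked".split()):
    print(tree)         # highest-probability tree
    print(tree.prob())  # product of rule probs: 1.0 * 0.7 * 0.4 = 0.28
```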

Lexicalized PCFGs (Collins, 1997; Charniak, 2000) propagate head words through the tree, conditioning rule probabilities on lexical items. This boosts F1 to ~89--90% but introduces data sparsity and slower parsing due to the enlarged rule space.

Neural Constituency Parsers

The breakthrough came with neural chart parsers. Stern et al. (2017) used a BiLSTM encoder to score each span and label, then applied CKY to find the highest-scoring tree. Kitaev & Klein (2018) refined this approach with self-attention (transformer) encoders, achieving ~95.1% F1 on PTB. Their parser scores each labeled span as

s(i, j, l) = v_l · f(y_i, y_j)

where y_i and y_j are token representations from the encoder, f combines them into a span representation, and v_l is a label embedding. With BERT as the encoder (Kitaev et al., 2019), F1 reaches ~95.8%, and subsequent models exceed 96%.
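A toy numpy sketch of this span-scoring idea (shapes are invented, and f here is a bare endpoint difference; the published models learn f as an MLP over directional encodings, trained end-to-end):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, num_labels = 6, 16, 4            # toy sizes: 6 tokens, 4 labels

y = rng.normal(size=(n + 1, d))        # fencepost encodings y_0 .. y_n
V = rng.normal(size=(num_labels, d))   # one scoring vector v_l per label

def span_scores(i, j):
    """s(i, j, l) = v_l . f(y_i, y_j), with f as an endpoint difference."""
    return V @ (y[j] - y[i])

# Score every span; running CKY over these scores (instead of rule
# probabilities) recovers the highest-scoring tree.
chart = {(i, j): span_scores(i, j)
         for i in range(n) for j in range(i + 1, n + 1)}
print(chart[(0, n)].round(2))          # label scores for the whole sentence
```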

Comparison to Dependency Parsing

| Aspect | Constituency | Dependency |
|---|---|---|
| Nodes | Words + phrases | Words only |
| Relations | Hierarchical containment | Head-dependent arcs |
| Grammar theory | Phrase structure (Chomsky) | Dependency grammar (Tesnière) |
| Best for | Fixed word order languages | Free word order languages |
| PTB F1/UAS | ~96% F1 | ~96% UAS |

Constituency and dependency trees are inter-convertible under certain assumptions (head rules), and both carry complementary syntactic information. Many NLP systems use whichever representation better suits the downstream task.
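The constituency-to-dependency direction is mechanical once head rules are fixed. A simplified head-percolation sketch, where the two-entry rule table is a toy stand-in for the full Collins/Stanford head tables and bare words stand in for indexed tokens:

```python
from nltk import Tree

# Toy head rules: which child category heads each phrase type.
HEAD_RULES = {'S': ['VP'], 'NP': ['NN'], 'VP': ['VBD']}

def find_head(tree, deps):
    """Return the head word of `tree`, recording (dependent, head) arcs."""
    if isinstance(tree[0], str):           # pre-terminal: POS tag over a word
        return tree[0]
    children = [find_head(child, deps) for child in tree]
    labels = [child.label() for child in tree]
    head_idx = 0                           # default: leftmost child heads
    for cat in HEAD_RULES.get(tree.label(), []):
        if cat in labels:
            head_idx = labels.index(cat)
            break
    for i, word in enumerate(children):    # non-heads depend on the head
        if i != head_idx:
            deps.append((word, children[head_idx]))
    return children[head_idx]

tree = Tree.fromstring(
    "(S (NP (DT the) (NN dog)) (VP (VBD chased) (NP (DT the) (NN ball))))")
deps = []
print(find_head(tree, deps))  # 'chased' -- the sentence's head word
print(deps)                   # arcs like ('dog', 'chased'), ('ball', 'chased')
```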

Why It Matters

  1. Grammar checking and style analysis: Constituency trees reveal whether a sentence is well-formed and identify structural patterns for style recommendations.
  2. Machine translation: Syntax-aware MT models use source-side constituency parses for reordering and hierarchical phrase extraction.
  3. Semantic parsing: Constituency structure informs compositional semantic interpretation, connecting to semantics.md.
  4. Sentiment compositionality: The Stanford Sentiment Treebank assigns sentiment labels to every node in the constituency tree, enabling recursive sentiment composition (Socher et al., 2013).
  5. Information extraction: Phrase structure identifies noun phrases that may contain entity mentions for named-entity-recognition.md.
  6. Linguistic research: Constituency parsing is the primary tool for studying phrase structure across languages and testing syntactic theories.

Key Technical Details

  • Penn Treebank (WSJ section 23): Standard English benchmark; SOTA ~96.3% labeled bracketing F1.
  • PCFG baseline: ~73% F1; lexicalized PCFGs ~89%; neural chart parsers ~95.5--96.3%.
  • Kitaev & Klein (2018): Self-attentive encoder achieving ~95.1% F1 without pre-training; ~95.8% with BERT.
  • Speed: CKY is O(n^3); neural chart parsers process ~100--300 sentences/sec on GPU for average-length sentences.
  • Cross-lingual: SPMRL shared task covers 9 morphologically rich languages; best F1 ranges from 83% (Arabic) to 92% (Swedish).
  • Tree depth: Average PTB tree depth is ~8 for a 23-word sentence; deeper trees are harder to parse correctly.
  • Annotation cost: Constituency annotation is ~3--5x more expensive than dependency annotation per sentence, due to the hierarchical structure.

Common Misconceptions

"PCFGs capture all syntactic phenomena." PCFGs assume context-free independence -- the expansion of a node does not depend on its location in the tree. This misses lexical preferences, coordination constraints, and many other context-sensitive patterns. Lexicalized grammars and neural models address this limitation.

"Constituency parsing is obsolete in the neural era." While some end-to-end systems bypass explicit parsing, constituency trees remain valuable for interpretability, linguistic analysis, and tasks where hierarchical structure matters (e.g., discourse parsing, code generation). Neural parsers have actually revived interest by making high-quality parsing fast and accurate.

"Higher F1 means better linguistic quality." Bracketing F1 measures structural overlap between predicted and gold trees, but it overweights long spans and may not penalize linguistically significant errors (e.g., PP-attachment) proportionally. Some linguistically important distinctions have minimal impact on F1.

"Constituency parsing and dependency parsing are competing approaches." They represent different theoretical perspectives on syntax and are often complementary. Stanford Dependencies are routinely converted from Penn Treebank constituency trees, and many NLP pipelines use both representations.

Connections to Other Concepts

  • Constituency parsing implements the phrase-structure grammar formalism described in syntax-and-grammar.md.
  • It builds on part-of-speech-tagging.md -- POS tags form the pre-terminal layer of the constituency tree.
  • It complements dependency-parsing.md, which captures head-modifier relations without phrasal nodes.
  • PCFGs extend n-gram-language-models.md to hierarchical structure through probabilistic rules.
  • Neural parsers leverage attention-mechanism.md and contextual-embeddings.md for span representations.
  • Constituency trees support compositional sentiment in sentiment-analysis.md (Socher et al., 2013).
  • Phrase boundaries from constituency parsing inform semantic-role-labeling.md span identification.

Further Reading

  • Jurafsky & Martin, Speech and Language Processing (Ch. 13-14), 2024 -- comprehensive textbook treatment of CFGs, CKY, and PCFGs.
  • Collins, Three Generative, Lexicalised Models for Statistical Parsing, 1997 -- introduced lexicalized PCFGs that dominated parsing for a decade.
  • Kitaev & Klein, Constituency Parsing with a Self-Attentive Encoder, 2018 -- the neural chart parser that set a new accuracy standard.
  • Kitaev et al., Multilingual Constituency Parsing with Self-Attention and Pre-Training, 2019 -- extended the self-attentive parser with BERT for multilingual constituency parsing.
  • Stern et al., A Minimal Span-Based Neural Constituency Parser, 2017 -- demonstrated that span scoring with neural encoders could replace grammar-based chart parsing.