One-Line Summary: Filtering high-frequency function words (the, is, at, which) that carry little semantic content to reduce noise and dimensionality in frequency-based text representations, though modern neural models often benefit from retaining them.
Prerequisites: tokenization-in-nlp.md, text-normalization.md, bag-of-words.md, tf-idf.md
What Is Stopword Removal?
Imagine listening to a speech and trying to take notes on the key ideas. You would naturally skip filler words like "the," "is," and "very" and write down the content-bearing words: nouns, verbs, adjectives. Stopword removal automates this for NLP: it filters out tokens from a predefined list of high-frequency, low-information words before text is fed into a model.
The term "stop list" was coined by Hans Peter Luhn in 1958 at IBM. The idea is straightforward: in English, the 100 most frequent words account for approximately 50% of all word occurrences in a typical corpus, yet they contribute almost nothing to distinguishing one document from another. By removing them, you reduce feature dimensionality and noise, allowing models to focus on the words that actually differentiate documents.
But this is a simplification. Whether a word is "low-information" depends entirely on the task. The word "not" is a stopword in most lists, yet removing it from "this movie is not good" flips the sentiment entirely. The history of stopword removal in NLP is one of gradual retreat: once universal, it is now applied selectively and sometimes not at all.
How It Works
Standard Stopword Lists
NLTK
NLTK provides stopword lists for 23 languages. The English list contains 179 words:
import nltk
nltk.download('stopwords')  # one-time download of the stopword corpus

from nltk.corpus import stopwords
stops = set(stopwords.words('english'))
# {'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves',
#  'you', 'the', 'a', 'an', 'is', 'are', 'was', ...}

spaCy
spaCy's English stopword list contains 326 words, including single-letter tokens and some additional function words. Each token in spaCy has an is_stop attribute:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a very important document")
[tok.text for tok in doc if not tok.is_stop]
# ['important', 'document']

scikit-learn
scikit-learn's CountVectorizer and TfidfVectorizer accept stop_words='english', which uses a list of 318 words derived from the Glasgow Information Retrieval Group.
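A minimal usage sketch (assuming scikit-learn 1.0+, where get_feature_names_out is available):

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(stop_words='english')
vec.fit_transform(["the cat sat on the mat"])
print(vec.get_feature_names_out())  # ['cat' 'mat' 'sat']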
The Lists Differ
The three lists above overlap significantly but disagree on edge cases. "about" is a stopword in all three. "computer" is in none. But "system" is in scikit-learn's list (a legacy decision from IR research on computing literature) and absent from NLTK and spaCy. This inconsistency means that switching libraries can silently change preprocessing behavior.
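One way to see the disagreement directly is to load all three lists and compare them (a sketch; spaCy's list is importable without loading a model, and the word counts in the comments assume current library versions):

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from spacy.lang.en.stop_words import STOP_WORDS

nltk_stops = set(stopwords.words('english'))   # 179 words
sklearn_stops = set(ENGLISH_STOP_WORDS)        # 318 words
spacy_stops = set(STOP_WORDS)                  # 326 words

print('system' in sklearn_stops)                           # True
print('system' in nltk_stops or 'system' in spacy_stops)   # False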
Frequency-Based Stopword Detection
Instead of relying on a fixed list, you can compute stopwords from the corpus itself. Words appearing in more than a threshold percentage of documents (e.g., 85%+ of documents) are candidates. This adapts to domain-specific language: in a medical corpus, "patient" appears in nearly every document and functions like a stopword, even though it is absent from general-purpose lists.
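A minimal corpus-driven detector, assuming documents arrive as lists of tokens and using the 85% threshold mentioned above:

from collections import Counter

def corpus_stopwords(tokenized_docs, df_threshold=0.85):
    """Return words whose document frequency exceeds the threshold."""
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))  # count each word at most once per document
    cutoff = df_threshold * len(tokenized_docs)
    return {word for word, count in df.items() if count > cutoff}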
Domain-Specific Stopwords
Legal text: "court," "plaintiff," "defendant" may appear in every document. Scientific abstracts: "study," "results," "method" are ubiquitous. Adding domain-specific stopwords can improve topic model coherence (see topic-modeling.md) by 10-15% as measured by NPMI scores.
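In practice this often means unioning domain terms onto a generic list. A sketch using scikit-learn, where the domain terms are the illustrative examples from above rather than a vetted list:

from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

domain_stops = {"patient", "study", "results", "method"}  # illustrative only
vec = TfidfVectorizer(stop_words=list(ENGLISH_STOP_WORDS | domain_stops))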
Implementation Pattern
A typical pipeline applies stopword removal after tokenization and lowercasing:
tokens = ["the", "cat", "sat", "on", "the", "mat"]
filtered = [t for t in tokens if t not in stops]
# ["cat", "sat", "mat"]Some implementations also remove single-character tokens and pure-digit tokens as pseudo-stopwords.
Why It Matters
- Dimensionality reduction: Removing stopwords from a Bag-of-Words representation (see bag-of-words.md) can eliminate 20-30% of vocabulary terms, reducing memory and computation for models like Naive Bayes and SVM.
- Signal-to-noise ratio: In TF-IDF (see tf-idf.md), stopwords already receive low weights, but explicit removal prevents them from consuming feature slots in fixed-size representations.
- Topic model quality: LDA (see topic-modeling.md) trained without stopword removal produces topics dominated by function words, obscuring meaningful themes.
- Index size: In information retrieval (see information-retrieval.md), stopword removal reduces index size by 25-30%, speeding up query processing.
- Historical significance: Understanding stopword removal is essential for reproducing results from pre-2018 NLP papers, which almost universally applied it.
Key Technical Details
- The 10 most frequent English words ("the," "be," "to," "of," "and," "a," "in," "that," "have," "I") account for approximately 25% of all tokens in the British National Corpus.
- Removing NLTK's 179 stopwords from a typical English news corpus reduces total token count by 40-50%.
- On the 20 Newsgroups classification benchmark, stopword removal improves SVM + TF-IDF accuracy by approximately 1-2% but has no significant effect on BERT fine-tuning (Devlin et al., 2019).
- For phrase-based queries ("to be or not to be"), stopword removal destroys the query entirely. Modern search engines like Elasticsearch keep stopwords in phrase indexes.
- Stopword lists for morphologically rich languages (Finnish, Turkish) are less effective because function words take many inflected forms. A list of 100 base forms may miss hundreds of inflected variants.
- Zipf's law predicts that word frequency follows a power law in rank: the most frequent word appears roughly twice as often as the second most frequent, three times as often as the third, and so on. This is why a short stopword list covers such a large share of running text (see the sketch below).
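To make the Zipf estimate concrete, an idealized calculation assuming an exponent of exactly 1 (real corpora only approximate this):

def zipf_coverage(k, vocab_size):
    """Share of tokens covered by the top-k word types under ideal Zipf."""
    def harmonic(m):
        return sum(1.0 / r for r in range(1, m + 1))
    return harmonic(k) / harmonic(vocab_size)

print(zipf_coverage(100, 100_000))  # ~0.43, near the ~50% figure cited earlier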
Common Misconceptions
"Stopword removal is always beneficial." For neural models like BERT (see bert.md) and GPT (see gpt-for-nlp-tasks.md), removing stopwords generally hurts performance. These models learn to use function words as syntactic scaffolding -- "not," "but," and "despite" carry important signals for understanding sentence structure and sentiment.
"There is a single correct stopword list." Every list is an engineering decision. The SMART system (1960s) used 571 stopwords; NLTK uses 179; some researchers use as few as 25. The optimal list depends on the task, domain, and model.
"Stopwords carry no information." The word "the" before a noun phrase signals definiteness. "A" vs. "the" distinguishes new vs. given information. Discourse analysis and pragmatics (see pragmatics-and-discourse.md) rely on these distinctions. Stylometric analysis (authorship attribution) actively uses function word frequencies as features because they are difficult to consciously manipulate.
"TF-IDF makes stopword removal unnecessary." While TF-IDF downweights common terms, very frequent words still consume memory and can affect computational efficiency. For large-scale systems indexing billions of documents, explicit removal remains practical.
Connections to Other Concepts
- Stopword removal follows tokenization-in-nlp.md and text-normalization.md (specifically case folding, since lists assume lowercase input).
- It directly reduces feature space for bag-of-words.md and tf-idf.md representations.
- stemming-and-lemmatization.md is often applied in conjunction: stem first, then remove stopwords, or vice versa (order matters -- stemmed stopwords may not match the list).
- For topic-modeling.md, stopword removal is nearly essential; LDA topics without it are dominated by "the," "is," "was."
- Modern neural models (bert.md, contextual-embeddings.md) typically skip stopword removal entirely, marking a paradigm shift in preprocessing philosophy.
- information-retrieval.md systems manage the stopword trade-off by removing them from inverted indexes but keeping them in positional indexes for phrase matching.
- regular-expressions-for-nlp.md can be used to build dynamic stopword filters based on pattern matching.
Further Reading
- Luhn, "The Automatic Creation of Literature Abstracts," IBM Journal of Research and Development, 1958 -- the original proposal for frequency-based term filtering.
- Silva and Ribeiro, "The Importance of Stop Word Removal on Recall Values in Text Categorization," IJCNN, 2003 -- empirical evaluation across multiple classifiers.
- Sarica and Luo, "Stopwords in Technical Language Processing," PLOS ONE, 2021 -- demonstrates that domain-specific stopword lists outperform generic lists for technical text.
- Lo, He, and Ounis, "Automatically Building a Stopword List for an Information Retrieval System," Journal of Digital Information Management (JDIM), 2005 -- corpus-based stopword detection methods.
- Manning, Raghavan, and Schütze, Introduction to Information Retrieval, 2008 -- Section 2.2.2 covers stopword removal with empirical IR results.