One-Line Summary: Text classification assigns one or more predefined category labels to a document, sentence, or passage, serving as the most widely deployed NLP capability in production systems.
Prerequisites: bag-of-words.md, tf-idf.md, word2vec.md, contextual-embeddings.md, convolutional-models-for-text.md, long-short-term-memory.md
What Is Text Classification?
Imagine a mail room clerk who reads every incoming letter and drops it into the correct department bin -- marketing, legal, support, or spam. Text classification is the automated version of that clerk: given a piece of text, the model assigns it to one or more predefined categories.
More formally, text classification maps an input text (a sequence of tokens) to a label from a fixed label set. When exactly one label is assigned, the task is multi-class classification; when multiple labels may apply simultaneously, it becomes multi-label classification. Despite its apparent simplicity, text classification underpins spam filtering, topic routing, intent detection in virtual assistants, content moderation, and medical coding -- making it arguably the single most commercially valuable NLP task.
How It Works
Traditional Approaches
Naive Bayes applies Bayes' theorem with a conditional independence assumption over features. Given a document and class :
Despite its simplifying assumption, Multinomial Naive Bayes trained on TF-IDF features achieves roughly 88--90% accuracy on standard topic classification benchmarks and remains a strong baseline for high-dimensional, sparse feature spaces.
Support Vector Machines (SVMs) with linear kernels over TF-IDF vectors were the dominant approach from roughly 2002--2014. SVMs find a maximum-margin hyperplane separating classes and handle high-dimensional feature spaces gracefully. Joachims (1998) demonstrated that linear SVMs were particularly well-suited to text, where feature dimensionality (vocabulary size) often exceeds the number of training examples.
Neural Approaches
Convolutional Neural Networks (CNNs): Kim (2014) showed that a single-layer CNN with multiple filter widths (3, 4, 5) over pre-trained word embeddings achieves strong results on sentence classification. Filters act as n-gram detectors, and max-over-time pooling extracts the most salient feature from each filter map.
Recurrent Networks (LSTMs/BiLSTMs): Bidirectional LSTMs process text left-to-right and right-to-left, then use the final hidden states (or an attention-weighted combination) as the document representation. This captures word-order and long-range dependencies that bag-of-words models miss.
Transformer Fine-Tuning (BERT and beyond): Devlin et al. (2019) introduced the fine-tuning paradigm where the [CLS] token representation from BERT is passed through a task-specific classification head. Fine-tuning BERT-base (110M parameters) on AG News achieves ~94.5% accuracy, roughly a 2-point improvement over the best CNN baselines. Larger models like RoBERTa and DeBERTa push this further.
Multi-Label and Hierarchical Classification
In multi-label settings, each label receives an independent sigmoid output rather than a shared softmax, so multiple labels can be active simultaneously. Hierarchical classification introduces a label taxonomy (e.g., Science > Physics > Quantum Mechanics) and may use hierarchical loss functions or per-level classifiers to enforce consistency across the tree.
Why It Matters
- Highest deployment volume: Text classification is the most commonly deployed NLP model in industry -- every spam filter, content moderator, and intent router is a classifier.
- Gateway task: Classification is typically the first NLP task engineers tackle, making it the entry point for applied NLP adoption.
- Modular building block: Many complex systems (e.g., dialogue systems, document processing pipelines) use classifiers internally for routing, filtering, or triggering downstream components.
- Benchmarking foundation: Classification accuracy on standard datasets is the primary yardstick for comparing text representations and pre-trained models.
- Business-critical applications: Medical coding (ICD classification), legal document triage, financial sentiment, and customer support routing all depend on robust text classification.
Key Technical Details
- AG News benchmark: 4-class news classification; BERT achieves ~94.5% accuracy, linear SVM ~92%.
- IMDB sentiment (binary): BERT-family models reach ~95.5% accuracy; a simple BiLSTM with attention achieves ~90%.
- SST-2 (Stanford Sentiment Treebank, binary): SOTA is ~97% accuracy (DeBERTa-v3).
- SST-5 (fine-grained, 5-class): SOTA around ~59% accuracy, reflecting the inherent difficulty of fine-grained distinctions.
- Training data requirements: Naive Bayes can work with as few as 100 labeled examples per class; BERT fine-tuning typically needs 1,000+ per class for strong results, though few-shot prompting can reduce this.
- Inference speed: Naive Bayes/SVM classify a document in <1 ms; BERT-base takes ~10 ms on GPU per document (batch size 1).
- Multi-label metrics: Accuracy is replaced by micro/macro F1, subset accuracy, and Hamming loss.
Common Misconceptions
"More data always beats a better algorithm." While more labeled data helps, architecture matters significantly. BERT fine-tuned on 5,000 examples often outperforms an SVM trained on 50,000, because pre-training captures distributional knowledge from billions of tokens. The pre-training data effectively acts as an enormous unlabeled dataset.
"Text classification is a solved problem." Accuracy on clean benchmarks is high, but real-world challenges -- domain shift, class imbalance (e.g., 0.1% fraud in transactions), adversarial inputs, and evolving label definitions -- mean production classifiers require continual monitoring and retraining.
"Deep learning always outperforms traditional methods." For small datasets (<1,000 examples), high-dimensional sparse features, or when interpretability is required, Naive Bayes and SVMs remain competitive. A well-tuned TF-IDF + SVM pipeline is a responsible first baseline before reaching for BERT.
"Multi-class and multi-label are interchangeable terms." Multi-class assigns exactly one label from k choices (softmax output). Multi-label allows zero, one, or many labels to be active simultaneously (independent sigmoid outputs). Confusing the two leads to incorrect loss functions and evaluation metrics.
Connections to Other Concepts
- Text classification consumes the representations produced by
tf-idf.md,word2vec.md, andcontextual-embeddings.mdas input features. sentiment-analysis.mdis a specialized form of text classification where labels represent opinion polarity or emotion.convolutional-models-for-text.mddescribes the CNN architectures (Kim, 2014) that were a breakthrough for sentence-level classification.long-short-term-memory.mdandbidirectional-rnns.mdprovide the recurrent building blocks for sequence-aware classification.bert.mdandtransfer-learning-in-nlp.mdexplain the pre-train-then-fine-tune paradigm that now dominates classification.- Intent classification within
dialogue-systems.mdis a direct application of text classification. named-entity-recognition.mdcan be viewed as token-level classification, extending the document-level framing described here.
Further Reading
- Joachims, Text Categorization with Support Vector Machines, 1998 -- foundational work establishing SVMs as the dominant text classification method for over a decade.
- Kim, Convolutional Neural Networks for Sentence Classification, 2014 -- introduced the simple yet effective single-layer CNN architecture for text.
- Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019 -- established the fine-tuning paradigm that redefined text classification baselines.
- Sun et al., How to Fine-Tune BERT for Text Classification, 2019 -- practical guide to fine-tuning strategies, learning rates, and layer freezing.
- Liu et al., RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019 -- showed that careful hyperparameter tuning and more pre-training data yield consistent gains across classification tasks.