Naive Bayes
A 250-year-old theorem plus one cheeky simplifying assumption produces a classifier that beat every alternative at spam filtering for two decades. The math is simple. The intuition is durable.
The five-bullet version
- Bayes’ theorem rearranges P(class | features) using class likelihoods and a prior.
- Naive Bayes assumes features are independent given the class — wrong, but useful.
- Training is just counting. Inference is a sum of log-likelihoods.
- Laplace smoothing prevents zero probabilities from breaking the product.
- Still a reasonable baseline for text classification, anomaly detection, and any high-feature, low-data problem.
§ 00 · BAYES’ THEOREM IN ONE LINE · Rearranging a joint probability
Bayes’ theorem is a piece of algebra you can derive in two lines: P(A | B) = P(B | A) · P(A) / P(B). It lets you invert a conditional probability, going from 'how likely is the evidence given the hypothesis' to 'how likely is the hypothesis given the evidence'. For spam filtering, the hypothesis is a class and the evidence is a set of features:
P(class | features) = P(features | class) · P(class) / P(features)
The denominator P(features) is the same for every class, so we can ignore it when comparing classes. We want the class with the highest P(features | class) · P(class).
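To make the comparison concrete, here is a minimal sketch; the probabilities are hypothetical, hand-picked numbers purely for illustration:

```python
# Hypothetical numbers, purely for illustration.
prior = {"spam": 0.4, "ham": 0.6}             # P(class)
likelihood = {"spam": 0.012, "ham": 0.0005}   # P(features | class), assumed already estimated

# Score each class by P(features | class) * P(class); the shared
# denominator P(features) cancels, so it never needs to be computed.
scores = {c: likelihood[c] * prior[c] for c in prior}
prediction = max(scores, key=scores.get)      # -> "spam"
print(scores, prediction)
```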
§ 01 · WHY ‘NAIVE’? · The independence assumption
For a message with words w₁, w₂, …, wₙ, estimating P(w₁, w₂, …, wₙ | class) directly is hopeless: the joint distribution over every possible word combination would require an astronomical amount of data.
The naive assumption: pretend the words are independent given the class. Then the joint factorizes:
P(w₁, …, wₙ | class) ≈ P(w₁ | class) · P(w₂ | class) · … · P(wₙ | class)
This is almost always false. “New York” — the second word depends on the first. “Free shipping” — same thing. The words in real messages are wildly correlated. But the assumption is useful: it turns an unworkable joint probability into a product of easy-to-estimate marginals.
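As a sketch of what the factorization buys you, the per-class likelihood of a message becomes a plain product over per-word estimates. The per-word probabilities below are toy numbers, assumed already estimated from counts:

```python
# Toy per-word conditionals, assumed already estimated from training counts.
p_word_given_spam = {"free": 0.05, "shipping": 0.02, "meeting": 0.001}

def naive_likelihood(words, p_word_given_class):
    # Naive assumption: P(w1, ..., wn | class) ≈ product of P(wi | class)
    prob = 1.0
    for w in words:
        prob *= p_word_given_class[w]
    return prob

print(naive_likelihood(["free", "shipping"], p_word_given_spam))  # ≈ 0.001
```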
§ 02 · A SPAM CLASSIFIER FROM SCRATCH · The whole algorithm
Training. For each class (spam, ham) and each word in the vocabulary, count how many times it appears across all messages of that class. Compute P(word | class) = count / (total word count in that class). Also compute the prior P(class) as the fraction of messages in each class.
Inference. For a new message, take the tokens. Compute, for each class:
log P(class) + Σ log P(wᵢ | class)
Pick the class with the highest score. That’s it. No iteration, no gradient descent, no hyperparameters. Training is one pass over the data; inference is a sum.
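The whole thing fits in a short sketch. The function and variable names below are mine, the three-message corpus is made up, and the Laplace smoothing from § 03 is already folded in so an unseen word cannot zero out a class:

```python
import math
from collections import Counter, defaultdict

def train(messages, labels, alpha=1.0):
    """Training is one pass of counting. `alpha` is Laplace smoothing (see § 03)."""
    word_counts = defaultdict(Counter)   # class -> word -> count
    class_counts = Counter(labels)       # class -> number of messages
    for words, label in zip(messages, labels):
        word_counts[label].update(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab, alpha

def predict(words, model):
    word_counts, class_counts, vocab, alpha = model
    n_messages = sum(class_counts.values())
    scores = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        score = math.log(class_counts[c] / n_messages)   # log P(class)
        for w in words:
            # log P(w | class) with add-one smoothing over the vocabulary
            score += math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
        scores[c] = score
    # Normalize the log scores into posteriors on a 0-1 scale
    m = max(scores.values())
    exp_scores = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exp_scores.values())
    posteriors = {c: v / z for c, v in exp_scores.items()}
    return max(posteriors, key=posteriors.get), posteriors

# Hypothetical three-message corpus, purely for illustration
msgs = [["free", "shipping", "now"], ["meeting", "at", "noon"], ["free", "offer"]]
labels = ["spam", "ham", "spam"]
model = train(msgs, labels)
print(predict(["free", "meeting"], model))
```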
Each token multiplies the class likelihood by its conditional probability. Logs prevent underflow; the final normalization puts the two posteriors on a 0–1 scale.
§ 03 · SMOOTHING AND ZEROS · The trap of unseen words
One subtlety: if a word never appeared in training data for some class, then P(word | class) = 0. In the product, one zero kills the entire probability — even if the other 30 words strongly suggest that class.
Fix: Laplace smoothing. Add 1 to every count, including the zero counts, before normalizing. Every word then gets a small nonzero probability in every class, and the denominator grows by the vocabulary size V to keep the distribution normalized.
P(word | class) = (count + 1) / (total + V)
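A quick sketch of the smoothed estimate with hypothetical counts, showing that an unseen word gets a small but nonzero probability:

```python
def smoothed_prob(count, total, vocab_size, alpha=1.0):
    # (count + 1) / (total + V), with alpha generalizing the "+1"
    return (count + alpha) / (total + alpha * vocab_size)

# Hypothetical: 500 word tokens in the spam class, vocabulary of 1000 words
print(smoothed_prob(count=12, total=500, vocab_size=1000))  # seen word   -> ~0.0087
print(smoothed_prob(count=0,  total=500, vocab_size=1000))  # unseen word -> ~0.00067
```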
§ 04 · WHEN THIS STILL EARNS ITS KEEP · A 250-year-old algorithm in 2026
Despite living in an era of giant neural networks, Naive Bayes is still a reasonable choice for:
- Text classification baselines. Quick, interpretable starting point. If your fancy model can’t beat Naive Bayes by 5+ points, something is wrong.
- High-feature, low-data regimes. Counting works when you have thousands of features and hundreds of examples; most ML methods don’t.
- Anomaly / novelty detection. Learn one class from examples; flag low-likelihood inputs.
- Embedded / real-time systems. Inference is a few additions. Runs on a microcontroller.
§ 05 · TAKING THIS FORWARD · Where to go from here
Naive Bayes is the gentle starting point for probabilistic classification. From here, the natural progressions are: logistic regression (drop the independence assumption, fit weights jointly), decision trees and random forests (handle nonlinearity), and eventually neural networks (learn the features). Each trades interpretability for capacity. Naive Bayes stays fast and interpretable, at the cost of asking the data to be approximately what it pretended to be.
§ · GOING DEEPER · Smoothing, the multinomial model, and the modern baseline
Two implementation details make Naive Bayes work in practice. First, Laplace smoothing: add 1 to every count before normalizing, so unseen words get a small nonzero probability and don’t zero out the entire class score. Second, log-space arithmetic: products of many small probabilities underflow; sums of log-probs don’t.
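To see the underflow concretely, a tiny sketch with an assumed per-word probability:

```python
import math

p = 1e-4   # an assumed, typical per-word conditional probability
n = 400    # words in a longish message

product = p ** n           # 1e-1600 underflows a 64-bit float to 0.0
log_sum = n * math.log(p)  # stays a perfectly ordinary float
print(product, log_sum)    # 0.0  -3684.13...
```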
The multinomial variant (McCallum & Nigam 1998), which counts word frequencies, outperforms the Bernoulli variant (presence only) on most text tasks. Wang & Manning (2012) showed that Naive Bayes with bigram features is still a remarkably strong baseline for short-text classification, competitive with early deep learning approaches on standard benchmarks. The independence assumption is wildly violated; the model wins anyway because the classifier only needs class rankings to be right, not the calibration. A useful lesson about “wrong but useful.”
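As a sketch of that baseline recipe using scikit-learn’s CountVectorizer and MultinomialNB (the four-message corpus is made up, purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical two-messages-per-class corpus, purely for illustration.
texts = ["free shipping on your order", "claim your free offer now",
         "meeting moved to noon", "see you at the noon meeting"]
labels = ["spam", "spam", "ham", "ham"]

# Unigrams + bigrams, multinomial event model; alpha=1.0 is Laplace smoothing.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB(alpha=1.0))
model.fit(texts, labels)
print(model.predict(["free offer on shipping"]))  # -> ['spam']
```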
§ · FURTHER READING · References & deeper sources
- McCallum, A. & Nigam, K. (1998). A Comparison of Event Models for Naive Bayes Text Classification · AAAI
- Wang, S. & Manning, C. D. (2012). Baselines and Bigrams: Simple, Good Sentiment and Topic Classification · ACL
- Rennie, J. D. M., Shih, L., Teevan, J. & Karger, D. R. (2003). Tackling the Poor Assumptions of Naive Bayes Text Classifiers · ICML
- Domingos, P. & Pazzani, M. (1997). On the Optimality of the Simple Bayesian Classifier under Zero-One Loss · Machine Learning
- Manning, C. D., Raghavan, P. & Schütze, H. (2008). Introduction to Information Retrieval, ch. 13 (Text classification & Naive Bayes) · Cambridge University Press