Core Concepts · Module 14 · 9 min read

Naive Bayes

A 250-year-old theorem plus one cheeky simplifying assumption produces a classifier that beat every alternative at spam filtering for two decades. The math is simple. The intuition is durable.

The five-bullet version

  • Bayes’ theorem rearranges P(class | features) using class likelihoods and a prior.
  • Naive Bayes assumes features are independent given the class — wrong, but useful.
  • Training is just counting. Inference is a sum of log-likelihoods.
  • Laplace smoothing prevents zero probabilities from breaking the product.
  • Still a reasonable baseline for text classification, anomaly detection, and any high-feature, low-data problem.

§ 00 · BAYES’ THEOREM IN ONE LINE · Rearranging a joint probability

Bayes’ theorem is a piece of algebra you can derive in two lines. It lets you invert a conditional probability: go from 'how likely is the evidence given the hypothesis' to 'how likely is the hypothesis given the evidence'. For two events A and B:

P(A | B) = P(B | A) · P(A) / P(B)

Applied to classification, A is the class and B is the observed features:

P(class | features) = P(features | class) · P(class) / P(features)

For classification, the denominator P(features) is the same for every class, so we can ignore it when comparing classes. We want the class with the highest P(features | class) · P(class).
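As a sketch, the comparison can be written out with made-up numbers (every probability below is hypothetical, chosen only to show that the shared denominator never needs computing):

```python
# Hypothetical two-class comparison; all numbers are invented
# purely for illustration.
p_features_given_spam = 0.012   # P(features | spam)
p_features_given_ham = 0.0004   # P(features | ham)
p_spam, p_ham = 0.4, 0.6        # priors P(class)

# The shared denominator P(features) cancels when comparing classes,
# so unnormalized scores are enough to pick a winner.
score_spam = p_features_given_spam * p_spam  # ≈ 0.0048
score_ham = p_features_given_ham * p_ham     # ≈ 0.00024

prediction = "spam" if score_spam > score_ham else "ham"
print(prediction)  # spam
```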

§ 01 · WHY ‘NAIVE’? · The independence assumption

For a message with words w₁, w₂, …, wₙ, computing P(w₁, w₂, …, wₙ | class) is hopeless — the joint probability of every word combination would require an astronomical amount of data.

The naive assumption: pretend the words are independent given the class. Then the joint factorizes:

P(w₁, …, wₙ | class) ≈ P(w₁ | class) · P(w₂ | class) · … · P(wₙ | class)

This is almost always false. “New York” — the second word depends on the first. “Free shipping” — same thing. The words in real messages are wildly correlated. But the assumption is useful: it turns an unworkable joint probability into a product of easy-to-estimate marginals.
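Under that assumption, the joint is just a running product of per-word estimates. A minimal sketch, with hypothetical conditional probabilities:

```python
# Hypothetical P(word | spam) values, each estimated independently
# from counts; the numbers are invented for illustration.
p_word_given_spam = {"free": 0.05, "prize": 0.02, "now": 0.03}

# Naive factorization: the joint is the product of the marginals.
joint = 1.0
for w in ("free", "prize", "now"):
    joint *= p_word_given_spam[w]  # multiply in P(w | spam)

print(joint)  # ≈ 3e-05
```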

§ 02 · A SPAM CLASSIFIER FROM SCRATCH · The whole algorithm

Training. For each class (spam, ham) and each word in the vocabulary, count how many times the word appears across all messages of that class. Compute P(word | class) = count / (total word count for that class). Also compute the prior P(class) as the fraction of messages in each class.

Inference. For a new message, take the tokens. Compute, for each class:

log P(class) + Σ log P(wᵢ | class)

Pick the class with the highest score. That’s it. No iteration, no gradient descent, no hyperparameters. Training is one pass over the data; inference is a sum.
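The whole algorithm fits in a few lines. A sketch in Python, with an invented four-message corpus (Laplace smoothing from § 03 is folded in so unseen words don't blow up the log):

```python
from collections import Counter
import math

# A sketch of the train-by-counting / score-by-summing algorithm above.
# The four-message corpus is invented for illustration.
corpus = [
    ("spam", "free money free prize"),
    ("spam", "win money now"),
    ("ham", "meeting at noon"),
    ("ham", "lunch money for the meeting"),
]

# Training: one pass of counting.
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for label, text in corpus:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}
V = len(vocab)

def score(label, tokens):
    """log P(class) + sum of log P(word | class), Laplace-smoothed."""
    total = sum(word_counts[label].values())
    s = math.log(class_counts[label] / sum(class_counts.values()))
    for w in tokens:
        s += math.log((word_counts[label][w] + 1) / (total + V))
    return s

def classify(text):
    tokens = text.split()
    return max(word_counts, key=lambda label: score(label, tokens))

print(classify("free money"))   # spam
print(classify("meeting now"))  # ham
```

No iteration anywhere: training is the counting loop, inference is the sum inside `score`.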

Lab · Naive Bayes spam filter · A tiny corpus, real probabilities

[Interactive demo: build a message from vocab words and watch P(spam | tokens) and P(ham | tokens) update (99.7% vs 0.3% for the preset token set). Click a vocab word to add it, click a chip to remove it. Each token multiplies the class likelihood by its conditional probability; logs prevent underflow, and the final normalization puts the two posteriors on a 0–1 scale.]

§ 03 · SMOOTHING AND ZEROS · The trap of unseen words

One subtlety: if a word never appeared in training data for some class, then P(word | class) = 0. In the product, one zero kills the entire probability — even if the other 30 words strongly suggest that class.

Fix: Laplace smoothing, the classic correction in Naive Bayes for text. Add a small constant (typically 1) to every count, including the zero counts, before normalizing. Every word then gets a small nonzero probability in every class, paid for by the enlarged denominator.

P(word | class) = (count + 1) / (total + V), where V is the vocabulary size.
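A sketch of the difference, with invented counts for a word ("viagra") that never appeared in the ham training data:

```python
# Hypothetical ham-class word counts; "viagra" was never seen in ham.
ham_counts = {"meeting": 5, "lunch": 3, "viagra": 0}
total = sum(ham_counts.values())  # 8 words of ham training text
V = len(ham_counts)               # vocabulary size, here 3

unsmoothed = ham_counts["viagra"] / total            # 0.0 kills the product
smoothed = (ham_counts["viagra"] + 1) / (total + V)  # 1/11, small but nonzero

print(unsmoothed, smoothed)
```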

§ 04 · WHEN THIS STILL EARNS ITS KEEP · A 250-year-old algorithm in 2026

Despite living in an era of giant neural networks, Naive Bayes is still a reasonable choice for:

  • Text classification baselines — spam, topics, sentiment.
  • Anomaly detection over categorical features.
  • Any high-feature, low-data problem where heavier models would overfit.

CHECK · Why does Naive Bayes often work despite the independence assumption being obviously wrong?

§ 05 · TAKING THIS FORWARD · Where to go from here

Naive Bayes is the gentle starting point for probabilistic classification. From here, the natural progressions are: logistic regression (drop the independence assumption, fit weights jointly), decision trees and random forests (handle nonlinearity), and eventually neural networks (learn the features). Each gives up interpretability for capacity. Naive Bayes keeps both — at the cost of asking the data to be approximately what it pretended to be.

§ · GOING DEEPER · Smoothing, the multinomial model, and the modern baseline

Two implementation details make Naive Bayes work in practice. First, Laplace smoothing: add 1 to every count before normalizing, so unseen words get a small nonzero probability and don’t zero out the entire class score. Second, log-space arithmetic: products of many small probabilities underflow; sums of log-probs don’t.
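The underflow point is easy to demonstrate. A sketch with 400 tokens at probability 0.01 each, far below what a raw 64-bit float product can represent:

```python
import math

probs = [0.01] * 400  # 400 token probabilities of 0.01 each

# Raw product: 10^-800 is far below the smallest positive double,
# so the running product collapses to exactly 0.0.
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0

# Log-space sum: the same quantity stays a perfectly usable score.
log_score = sum(math.log(p) for p in probs)
print(log_score)  # ≈ -1842.07
```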

The multinomial variant (McCallum & Nigam 1998) — count word frequencies — outperforms Bernoulli (presence only) on most text tasks. Wang & Manning (2012) showed that Naive Bayes with bigram features is still a remarkably strong baseline for short-text classification — competitive with early deep learning approaches on standard benchmarks. The independence assumption is wildly violated; the model wins anyway because the classifier only needs class rankings to be right, not the calibration. A useful lesson about “wrong but useful.”

§ · FURTHER READING · References & deeper sources

  1. McCallum, Nigam (1998). A Comparison of Event Models for Naive Bayes Text Classification · AAAI
  2. Wang, Manning (2012). Baselines and Bigrams: Simple, Good Sentiment and Topic Classification · ACL
  3. Rennie, Shih, Teevan, Karger (2003). Tackling the Poor Assumptions of Naive Bayes Text Classifiers · ICML
  4. Domingos, Pazzani (1997). On the Optimality of the Simple Bayesian Classifier under Zero-One Loss · Machine Learning
  5. Manning, Raghavan, Schütze (2008). Introduction to Information Retrieval, ch. 13 (Text classification & Naive Bayes) · Cambridge University Press

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.