One-Line Summary: Finding the maximum-margin hyperplane that separates classes -- elegant geometry with strong theoretical guarantees.
Prerequisites: Linear algebra, optimization (constrained), hyperplanes, convex optimization basics, Lagrange multipliers.
What Is a Support Vector Machine?
Imagine you have red and blue marbles on a table and want to place a ruler between them so that it separates the two colors. Many ruler positions work, but which is best? Intuitively, you want the ruler to be as far as possible from the closest marble on either side. Support Vector Machines (SVMs) formalize this idea: they find the hyperplane that maximizes the margin -- the distance to the nearest data point from either class.
Formally, given labeled training data $(x_i, y_i)$, $i = 1, \dots, n$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$, the SVM finds parameters $w$ and $b$ defining the hyperplane $w^\top x + b = 0$ that maximizes the margin while correctly classifying all training points.
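As a concrete illustration, here is a minimal sketch (scikit-learn on toy 2-D data, illustrative values only) in which the fitted `coef_` and `intercept_` play the roles of $w$ and $b$:

```python
# Minimal sketch: fit a linear SVM on toy 2-D data; coef_ and intercept_
# correspond to the hyperplane parameters w and b described above.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],       # class +1
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear").fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
```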
How It Works
Maximum Margin Intuition
The distance from a point $x_i$ to the hyperplane $w^\top x + b = 0$ is $\frac{|w^\top x_i + b|}{\|w\|}$. For correctly classified points, $y_i(w^\top x_i + b) > 0$, so the distance is $\frac{y_i(w^\top x_i + b)}{\|w\|}$. Under the canonical scaling $\min_i y_i(w^\top x_i + b) = 1$, the margin is twice the distance to the closest point:

$$\text{margin} = \frac{2}{\|w\|}.$$

Maximizing $\frac{2}{\|w\|}$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$.
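To make the geometry concrete, a short NumPy sketch (toy points and an assumed hyperplane, not values from the text) computing signed distances and the margin:

```python
# Sketch: signed distance of each point to the hyperplane w^T x + b = 0,
# and the margin 2/||w|| under the canonical scaling.
import numpy as np

w = np.array([1.0, 1.0])   # assumed hyperplane parameters (toy values)
b = -1.0
X = np.array([[2.0, 1.0], [0.0, 0.0], [1.5, 1.5]])
y = np.array([1, -1, 1])

signed_dist = y * (X @ w + b) / np.linalg.norm(w)  # positive if correctly classified
print("signed distances:", signed_dist)
print("margin 2/||w|| =", 2 / np.linalg.norm(w))
```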
Hard-Margin SVM Formulation
When the data is linearly separable, the hard-margin SVM solves:

$$\min_{w, b} \ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w^\top x_i + b) \ge 1, \quad i = 1, \dots, n.$$

This is a convex quadratic program with linear inequality constraints. The constraint $y_i(w^\top x_i + b) \ge 1$ ensures every point is on the correct side of the margin.
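A minimal sketch of this QP solved directly with cvxpy on a separable toy dataset (cvxpy and the toy data are illustrative choices, not prescribed by the text):

```python
# Sketch: solve the hard-margin SVM QP directly with cvxpy.
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))          # (1/2)||w||^2
constraints = [cp.multiply(y, X @ w + b) >= 1]             # y_i (w^T x_i + b) >= 1
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
```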
Support Vectors
The solution depends only on the training points that lie exactly on the margin boundary (where $y_i(w^\top x_i + b) = 1$). These are the support vectors. All other points could be moved or removed without changing the solution. This sparsity is a key property: the model complexity depends on the number of support vectors, not the total training set size.
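A short sketch showing how to inspect the support vectors of a fitted model, assuming scikit-learn's SVC (the attribute names are scikit-learn's, not from the text above):

```python
# Sketch: a fitted SVC exposes its support vectors directly.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("number of support vectors per class:", clf.n_support_)
print("support vector indices:", clf.support_)
# Removing the non-support points and refitting would give the same hyperplane.
```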
Soft-Margin SVM
Real data is rarely perfectly separable. The soft-margin SVM introduces slack variables $\xi_i \ge 0$ that allow violations of the margin:

$$\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0.$$
The parameter $C$ controls the tradeoff:
- Large $C$: Small margin, few violations -- risk of overfitting.
- Small $C$: Large margin, more violations tolerated -- risk of underfitting.
A point with $\xi_i = 0$ is correctly classified outside the margin. A point with $0 < \xi_i \le 1$ is correctly classified but inside the margin. A point with $\xi_i > 1$ is misclassified.
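To see the $C$ tradeoff empirically, here is a small sketch (scikit-learn, overlapping toy blobs as an illustrative dataset) counting support vectors as $C$ varies:

```python
# Sketch: smaller C tolerates more margin violations and typically keeps
# more support vectors; larger C penalizes violations harder.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: support vectors = {clf.n_support_.sum()}, "
          f"train accuracy = {clf.score(X, y):.3f}")
```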
Hinge Loss Interpretation
The soft-margin objective can be rewritten without explicit constraints as:

$$\min_{w, b} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \max\bigl(0,\, 1 - y_i(w^\top x_i + b)\bigr).$$

The function $\max(0, 1 - y_i(w^\top x_i + b))$ is the hinge loss. It is zero when $y_i(w^\top x_i + b) \ge 1$ (correct classification with margin) and increases linearly when $y_i(w^\top x_i + b) < 1$. Compare this to logistic regression's cross-entropy loss, which is smooth and never exactly zero. The hinge loss is what produces the sparsity of support vectors: points with $y_i(w^\top x_i + b) > 1$ contribute zero gradient.
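A minimal NumPy sketch of the hinge loss as written above, compared pointwise against the logistic loss over a range of margins:

```python
# Sketch: hinge loss vs. logistic loss as a function of the margin y * f(x).
import numpy as np

margins = np.linspace(-2, 3, 6)                 # values of y_i * (w^T x_i + b)
hinge = np.maximum(0.0, 1.0 - margins)          # exactly zero once margin >= 1
logistic = np.log1p(np.exp(-margins))           # smooth, never exactly zero

for m, h, l in zip(margins, hinge, logistic):
    print(f"margin={m:+.1f}  hinge={h:.3f}  logistic={l:.3f}")
```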
The Dual Formulation
Using Lagrange multipliers $\alpha_i \ge 0$, the SVM dual is:

$$\max_{\alpha} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C.$$

The dual formulation is important for two reasons. First, the data enters only through inner products $x_i^\top x_j$, which enables the kernel trick (see Kernel Methods). Second, $\alpha_i > 0$ only for support vectors, making the solution sparse. The decision function becomes:

$$f(x) = \operatorname{sign}\left( \sum_{i \in S} \alpha_i y_i \, x_i^\top x + b \right),$$

where $S$ is the set of support vectors.
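A sketch verifying this decomposition with scikit-learn's SVC, whose `dual_coef_` attribute stores the products $\alpha_i y_i$ for the support vectors (attribute names are scikit-learn's; the dataset is illustrative):

```python
# Sketch: reconstruct the SVM decision value from the dual solution.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Decision value for one point, built only from the support vectors.
x_new = X[0]
score = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new) + clf.intercept_[0]
print(np.isclose(score, clf.decision_function([x_new])[0]))  # True
```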
SVM vs. Logistic Regression
Both find linear decision boundaries, but they differ in important ways:
| Aspect | SVM | Logistic Regression |
|---|---|---|
| Loss | Hinge (flat beyond margin) | Log loss (always nonzero) |
| Probabilities | Not natively | Yes |
| Sparsity | Only support vectors matter | All points contribute |
| Optimization | Quadratic program | Unconstrained convex |
| Kernel extension | Natural via dual | Possible but less standard |
In practice, their accuracy is often comparable on linearly separable or near-separable data. Logistic regression is preferred when calibrated probabilities are needed; SVMs are preferred when the kernel trick is beneficial.
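A quick comparison sketch on a toy problem (scikit-learn; dataset and settings are illustrative only), contrasting the two linear classifiers:

```python
# Sketch: linear SVM vs. logistic regression on the same data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

svm = LinearSVC(C=1.0).fit(X_tr, y_tr)
logreg = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("SVM accuracy:   ", svm.score(X_te, y_te))
print("LogReg accuracy:", logreg.score(X_te, y_te))
print("LogReg probability for first test point:", logreg.predict_proba(X_te[:1])[0])
```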
VC Dimension and Generalization
SVMs have strong theoretical backing through VC (Vapnik-Chervonenkis) theory. With probability at least $1 - \delta$, the generalization error is bounded by:

$$R(f) \le R_{\text{emp}}(f) + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) + \ln\frac{4}{\delta}}{n}},$$

where $h$ is the VC dimension and $n$ is the number of training examples. For maximum-margin classifiers, the VC dimension depends on the margin, not the input dimensionality. A large margin implies a smaller effective VC dimension, which implies better generalization. This is why SVMs can work well even in very high dimensions -- the margin-based complexity control is independent of the input dimension $d$.
Why It Matters
SVMs were the dominant classification method from the mid-1990s through the late 2000s, before deep learning took over many tasks. They remain excellent for medium-sized datasets, high-dimensional problems (e.g., text, genomics), and situations where theoretical guarantees matter. The kernel trick extends SVMs to nonlinear boundaries without ever computing explicit feature mappings, making them remarkably flexible. SVMs also influenced modern machine learning theory profoundly: concepts like margin, structural risk minimization, and kernel methods originated or were popularized through SVM research.
Key Technical Details
- Optimization: The primal is a QP; the dual is also a QP. Efficient solvers include SMO (Sequential Minimal Optimization) and libSVM.
- Scaling: Standard kernelized SVMs require $O(n^2)$ memory for the kernel matrix and roughly $O(n^2)$ to $O(n^3)$ training time. For large datasets, linear SVMs (liblinear) train in time roughly linear in the number of samples, $O(n)$.
- Multiclass: SVMs are inherently binary. Multiclass is handled via one-vs-rest or one-vs-one (see Multi-Class Classification).
- Feature scaling: Critical -- SVMs are sensitive to feature magnitudes because the margin depends on distances (see the pipeline sketch after this list).
- Probability estimates: Platt scaling fits a sigmoid on top of SVM scores to produce calibrated probabilities, but this is a post-hoc approximation, not a native output of the model.
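The sketch below illustrates the feature-scaling and probability-estimate points: a scikit-learn pipeline that standardizes features before the SVM and enables Platt scaling via `probability=True` (all names are scikit-learn's; the dataset is illustrative):

```python
# Sketch: scale features before the SVM, and use Platt scaling (probability=True)
# to obtain post-hoc probability estimates from SVM scores.
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, probability=True))
model.fit(X, y)

print("accuracy:", model.score(X, y))
print("P(class 1) for first sample:", model.predict_proba(X[:1])[0, 1])
```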
Common Misconceptions
- "SVMs always find the globally optimal solution." The primal and dual are convex, so the optimizer does find the global minimum of the SVM objective. However, the choice of and kernel parameters still requires tuning.
- "More support vectors means a worse model." Many support vectors may indicate a complex boundary, but it can also reflect noisy data. The number of support vectors alone is not diagnostic.
- "SVMs are obsolete because of deep learning." SVMs remain competitive on tabular data, small datasets, and high-dimensional sparse data (e.g., text with TF-IDF features). They are far from obsolete.
- "The SVM decision boundary always passes through the middle of the two classes." It maximizes the margin, which is the distance to the nearest points. It does not bisect the class means.
Connections to Other Concepts
- kernel-methods.md: The dual formulation enables replacing $x_i^\top x_j$ with a kernel $k(x_i, x_j)$, mapping data to high-dimensional spaces implicitly. This is the foundation of kernel SVMs.
- logistic-regression.md: Both learn linear boundaries; the key difference is hinge loss vs. log loss, leading to sparse vs. dense solutions.
- decision-trees.md: Trees are nonlinear and interpretable but have no margin concept. SVMs are linear (in feature space) with strong generalization theory.
- naive-bayes.md: A generative classifier with different inductive bias. In high-dimensional text classification, both SVMs and Naive Bayes perform well.
Further Reading
- Vapnik, "The Nature of Statistical Learning Theory" (1995) -- The foundational book on SVMs and VC theory.
- Cortes, Vapnik, "Support-Vector Networks" (1995) -- The original soft-margin SVM paper.
- Burges, "A Tutorial on Support Vector Machines for Pattern Recognition" (1998) -- Highly accessible introduction to SVM theory.
- Scholkopf, Smola, "Learning with Kernels" (2002) -- Comprehensive treatment of SVMs and kernel methods.