One-Line Summary: Finding the maximum-margin hyperplane that separates classes -- elegant geometry with strong theoretical guarantees.
Prerequisites: Linear algebra, optimization (constrained), hyperplanes, convex optimization basics, Lagrange multipliers.
What Is a Support Vector Machine?
Imagine you have red and blue marbles on a table and want to place a ruler between them so that it separates the two colors. Many ruler positions work, but which is best? Intuitively, you want the ruler to be as far as possible from the closest marble on either side. Support Vector Machines (SVMs) formalize this idea: they find the hyperplane that maximizes the margin -- the distance to the nearest data point from either class.
Formally, given labeled training data $(x_i, y_i)$, $i = 1, \dots, n$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$, the SVM finds parameters $w$ and $b$ defining the hyperplane $w^\top x + b = 0$ that maximizes the margin while correctly classifying all training points.
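As a concrete illustration, here is a minimal sketch (scikit-learn on toy 2-D data, illustrative values only) in which the fitted `coef_` and `intercept_` play the roles of $w$ and $b$:

```python
# Minimal sketch: fit a linear SVM on toy 2-D data; coef_ and intercept_
# correspond to the hyperplane parameters w and b described above.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],       # class +1
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear").fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
```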
How It Works
Maximum Margin Intuition
The distance from a point $x_i$ to the hyperplane $w^\top x + b = 0$ is $\frac{|w^\top x_i + b|}{\|w\|}$. For correctly classified points, $y_i(w^\top x_i + b) > 0$, so the distance is $\frac{y_i(w^\top x_i + b)}{\|w\|}$. Under the canonical scaling $\min_i y_i(w^\top x_i + b) = 1$, the margin is twice the distance to the closest point:

$$\text{margin} = \frac{2}{\|w\|}.$$

Maximizing $\frac{2}{\|w\|}$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$.
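To make the geometry concrete, a short NumPy sketch (toy points and an assumed hyperplane, not values from the text) computing signed distances and the margin:

```python
# Sketch: signed distance of each point to the hyperplane w^T x + b = 0,
# and the margin 2/||w|| under the canonical scaling.
import numpy as np

w = np.array([1.0, 1.0])   # assumed hyperplane parameters (toy values)
b = -1.0
X = np.array([[2.0, 1.0], [0.0, 0.0], [1.5, 1.5]])
y = np.array([1, -1, 1])

signed_dist = y * (X @ w + b) / np.linalg.norm(w)  # positive if correctly classified
print("signed distances:", signed_dist)
print("margin 2/||w|| =", 2 / np.linalg.norm(w))
```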
Hard-Margin SVM Formulation
When the data is linearly separable, the hard-margin SVM solves:

$$\min_{w, b} \ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w^\top x_i + b) \ge 1, \quad i = 1, \dots, n.$$

This is a convex quadratic program with linear inequality constraints. The constraint $y_i(w^\top x_i + b) \ge 1$ ensures every point is on the correct side of the margin.
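A minimal sketch of this QP solved directly with cvxpy on a separable toy dataset (cvxpy and the toy data are illustrative choices, not prescribed by the text):

```python
# Sketch: solve the hard-margin SVM QP directly with cvxpy.
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))          # (1/2)||w||^2
constraints = [cp.multiply(y, X @ w + b) >= 1]             # y_i (w^T x_i + b) >= 1
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
```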
Support Vectors
The solution depends only on the training points that lie exactly on the margin boundary (where $y_i(w^\top x_i + b) = 1$). These are the support vectors. All other points could be moved or removed without changing the solution. This sparsity is a key property: the model complexity depends on the number of support vectors, not the total training set size.
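A short sketch showing how to inspect the support vectors of a fitted model, assuming scikit-learn's SVC (the attribute names are scikit-learn's, not from the text above):

```python
# Sketch: a fitted SVC exposes its support vectors directly.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("number of support vectors per class:", clf.n_support_)
print("support vector indices:", clf.support_)
# Removing the non-support points and refitting would give the same hyperplane.
```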
Soft-Margin SVM
Real data is rarely perfectly separable. The soft-margin SVM introduces slack variables $\xi_i \ge 0$ that allow violations of the margin:

$$\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0.$$
The parameter $C$ controls the tradeoff:
- Large $C$: Small margin, few violations -- risk of overfitting.
- Small $C$: Large margin, more violations tolerated -- risk of underfitting.
A point with $\xi_i = 0$ is correctly classified outside the margin. A point with $0 < \xi_i \le 1$ is correctly classified but inside the margin. A point with $\xi_i > 1$ is misclassified.
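To see the $C$ tradeoff empirically, here is a small sketch (scikit-learn, overlapping toy blobs as an illustrative dataset) counting support vectors as $C$ varies:

```python
# Sketch: smaller C tolerates more margin violations and typically keeps
# more support vectors; larger C penalizes violations harder.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: support vectors = {clf.n_support_.sum()}, "
          f"train accuracy = {clf.score(X, y):.3f}")
```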
Hinge Loss Interpretation
The soft-margin objective can be rewritten without explicit constraints as:

$$\min_{w, b} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \max\bigl(0,\, 1 - y_i(w^\top x_i + b)\bigr).$$

The function $\max(0, 1 - y_i(w^\top x_i + b))$ is the hinge loss. It is zero when $y_i(w^\top x_i + b) \ge 1$ (correct classification with margin) and increases linearly when $y_i(w^\top x_i + b) < 1$. Compare this to logistic regression's cross-entropy loss, which is smooth and never exactly zero. The hinge loss is what produces the sparsity of support vectors: points with $y_i(w^\top x_i + b) > 1$ contribute zero gradient.
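A minimal NumPy sketch of the hinge loss as written above, compared pointwise against the logistic loss over a range of margins:

```python
# Sketch: hinge loss vs. logistic loss as a function of the margin y * f(x).
import numpy as np

margins = np.linspace(-2, 3, 6)                 # values of y_i * (w^T x_i + b)
hinge = np.maximum(0.0, 1.0 - margins)          # exactly zero once margin >= 1
logistic = np.log1p(np.exp(-margins))           # smooth, never exactly zero

for m, h, l in zip(margins, hinge, logistic):
    print(f"margin={m:+.1f}  hinge={h:.3f}  logistic={l:.3f}")
```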
The Dual Formulation
Using Lagrange multipliers $\alpha_i \ge 0$, the SVM dual is:

$$\max_{\alpha} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C.$$

The dual formulation is important for two reasons. First, the data enters only through inner products $x_i^\top x_j$, which enables the kernel trick (see Kernel Methods). Second, $\alpha_i > 0$ only for support vectors, making the solution sparse. The decision function becomes:

$$f(x) = \operatorname{sign}\left( \sum_{i \in S} \alpha_i y_i \, x_i^\top x + b \right),$$

where $S$ is the set of support vectors.
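A sketch verifying this decomposition with scikit-learn's SVC, whose `dual_coef_` attribute stores the products $\alpha_i y_i$ for the support vectors (attribute names are scikit-learn's; the dataset is illustrative):

```python
# Sketch: reconstruct the SVM decision value from the dual solution.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Decision value for one point, built only from the support vectors.
x_new = X[0]
score = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new) + clf.intercept_[0]
print(np.isclose(score, clf.decision_function([x_new])[0]))  # True
```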
SVM vs. Logistic Regression
Both find linear decision boundaries, but they differ in important ways:
| Aspect | SVM | Logistic Regression |
|---|---|---|
| Loss | Hinge (flat beyond margin) | Log loss (always nonzero) |
| Probabilities | Not natively | Yes |
| Sparsity | Only support vectors matter | All points contribute |
| Optimization | Quadratic program | Unconstrained convex |
| Kernel extension | Natural via dual | Possible but less standard |
In practice, their accuracy is often comparable on linearly separable or near-separable data. Logistic regression is preferred when calibrated probabilities are needed; SVMs are preferred when the kernel trick is beneficial.
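A quick comparison sketch on a toy problem (scikit-learn; dataset and settings are illustrative only), contrasting the two linear classifiers:

```python
# Sketch: linear SVM vs. logistic regression on the same data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

svm = LinearSVC(C=1.0).fit(X_tr, y_tr)
logreg = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("SVM accuracy:   ", svm.score(X_te, y_te))
print("LogReg accuracy:", logreg.score(X_te, y_te))
print("LogReg probability for first test point:", logreg.predict_proba(X_te[:1])[0])
```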
VC Dimension and Generalization
SVMs have strong theoretical backing through VC (Vapnik-Chervonenkis) theory. With probability at least $1 - \delta$, the generalization error is bounded by:

$$R(f) \le R_{\text{emp}}(f) + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) + \ln\frac{4}{\delta}}{n}},$$

where $h$ is the VC dimension and $n$ is the number of training examples. For maximum-margin classifiers, the VC dimension depends on the margin, not the input dimensionality. A large margin implies a smaller effective VC dimension, which implies better generalization. This is why SVMs can work well even in very high dimensions -- the margin-based complexity control is independent of the input dimension $d$.
Why It Matters
SVMs were the dominant classification method from the mid-1990s through the late 2000s, before deep learning took over many tasks. They remain excellent for medium-sized datasets, high-dimensional problems (e.g., text, genomics), and situations where theoretical guarantees matter. The kernel trick extends SVMs to nonlinear boundaries without ever computing explicit feature mappings, making them remarkably flexible. SVMs also influenced modern machine learning theory profoundly: concepts like margin, structural risk minimization, and kernel methods originated or were popularized through SVM research.
Key Technical Details
- Optimization: The primal is a QP; the dual is also a QP. Efficient solvers include SMO (Sequential Minimal Optimization) and libSVM.
- Scaling: Standard kernelized SVMs require $O(n^2)$ memory for the kernel matrix and roughly $O(n^2)$ to $O(n^3)$ training time. For large datasets, linear SVMs (liblinear) train in time roughly linear in the number of samples, $O(n)$.
- Multiclass: SVMs are inherently binary. Multiclass is handled via one-vs-rest or one-vs-one (see Multi-Class Classification).
- Feature scaling: Critical -- SVMs are sensitive to feature magnitudes because the margin depends on distances (see the pipeline sketch after this list).
- Probability estimates: Platt scaling fits a sigmoid on top of SVM scores to produce calibrated probabilities, but this is a post-hoc approximation, not a native output of the model.
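The sketch below illustrates the feature-scaling and probability-estimate points: a scikit-learn pipeline that standardizes features before the SVM and enables Platt scaling via `probability=True` (all names are scikit-learn's; the dataset is illustrative):

```python
# Sketch: scale features before the SVM, and use Platt scaling (probability=True)
# to obtain post-hoc probability estimates from SVM scores.
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, probability=True))
model.fit(X, y)

print("accuracy:", model.score(X, y))
print("P(class 1) for first sample:", model.predict_proba(X[:1])[0, 1])
```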
Common Misconceptions
- "SVMs always find the globally optimal solution." The primal and dual are convex, so the optimizer does find the global minimum of the SVM objective. However, the choice of and kernel parameters still requires tuning.
- "More support vectors means a worse model." Many support vectors may indicate a complex boundary, but it can also reflect noisy data. The number of support vectors alone is not diagnostic.
- "SVMs are obsolete because of deep learning." SVMs remain competitive on tabular data, small datasets, and high-dimensional sparse data (e.g., text with TF-IDF features). They are far from obsolete.
- "The SVM decision boundary always passes through the middle of the two classes." It maximizes the margin, which is the distance to the nearest points. It does not bisect the class means.
Connections to Other Concepts
- kernel-methods.md: The dual formulation enables replacing $x_i^\top x_j$ with a kernel $k(x_i, x_j)$, mapping data to high-dimensional spaces implicitly. This is the foundation of kernel SVMs.
- logistic-regression.md: Both learn linear boundaries; the key difference is hinge loss vs. log loss, leading to sparse vs. dense solutions.
- decision-trees.md: Trees are nonlinear and interpretable but have no margin concept. SVMs are linear (in feature space) with strong generalization theory.
- naive-bayes.md: A generative classifier with different inductive bias. In high-dimensional text classification, both SVMs and Naive Bayes perform well.
Further Reading
- Vapnik, "The Nature of Statistical Learning Theory" (1995) -- The foundational book on SVMs and VC theory.
- Cortes, Vapnik, "Support-Vector Networks" (1995) -- The original soft-margin SVM paper.
- Burges, "A Tutorial on Support Vector Machines for Pattern Recognition" (1998) -- Highly accessible introduction to SVM theory.
- Scholkopf, Smola, "Learning with Kernels" (2002) -- Comprehensive treatment of SVMs and kernel methods.