Core Concepts · Module 16 · 8 min read

Support Vector Machines

Of all the lines you could draw to separate two clouds of points, the SVM picks one specific line: the one that’s as far as possible from both clouds. The geometric insight behind that choice is half the lesson.

The five-bullet version

  • An SVM finds the linear decision boundary that maximizes the margin between two classes.
  • Only the points closest to the boundary (the “support vectors”) matter. The rest could disappear and the line wouldn’t move.
  • A soft-margin variant allows some violations, traded off by a hyperparameter C.
  • The kernel trick lets the same linear math draw curved decision boundaries — by projecting into a high-dimensional space without actually computing the projection.
  • Dominated text and small-dataset ML for ~15 years. Now mostly displaced by gradient boosting and neural nets, but still useful in niche cases.

§ 00 · DRAWING THE LINE · Which line is the right one?

For two well-separated clouds of points, there are infinitely many lines that perfectly separate them. Each picks a different compromise. The question SVMs answer: which line should you pick?

The SVM’s answer: pick the line that is maximally far from the closest points on either side. Push the line out from both clouds equally, until you can’t push further without bumping into a point. That line is the one with the largest margin: the distance from the decision boundary to the nearest training point on either side.
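In symbols, this is the classic hard-margin optimization (a sketch in standard notation, not something the lab spells out; the xᵢ are training points and yᵢ ∈ {−1, +1} their labels):

```latex
% Maximize the margin 2/||w|| by minimizing ||w||^2, while keeping every
% training point on the correct side of its margin line.
\min_{\mathbf{w},\, b} \ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \,(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \qquad i = 1, \dots, n
```

The resulting margin width is 2/‖w‖, and the points where the constraint holds with equality are exactly the support vectors of § 02.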

§ 01 · MAXIMUM MARGIN · Why the widest gap is the safest gap

Two intuitions for why maximum margin is the principled choice:

  • Robustness. Test points are noisy versions of training points. The wider the margin, the more a nearby point can be perturbed before it crosses the boundary and gets the wrong label.
  • A unique, data-driven answer. Among the infinitely many separating lines, the max-margin one is pinned down entirely by the geometry of the data, and classical learning theory ties generalization to the size of the margin rather than to the number of dimensions.

Lab · max-margin classifier · Two classes in 2D · the line that maximizes the gap

Two well-separated clouds. The SVM finds the line that maximizes the gap. Solid line = decision boundary. Dashed lines = margins. Filled band = the no-fly zone. Points on the margin are support vectors — the ones that define the line.
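The same picture in code, as a minimal sketch assuming scikit-learn and NumPy (the cloud centers and the near-hard-margin C are illustrative choices, not canonical ones):

```python
# Two well-separated Gaussian clouds; a linear SVM recovers the max-margin line.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-2, -2], size=(50, 2)),    # class 0 cloud
               rng.normal(loc=[+2, +2], size=(50, 2))])   # class 1 cloud
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~= hard margin

w, b = clf.coef_[0], clf.intercept_[0]
print(f"boundary: {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")
print(f"margin width (distance between the dashed lines): {2 / np.linalg.norm(w):.2f}")
```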

§ 02 · SUPPORT VECTORS · Why most of the data doesn’t matter

Here’s the elegant part. Once the optimal line is found, you can throw away most of the training data without changing the answer. Only the points that sit exactly on the margin — the closest points on each side — determine the line. These are the support vectors: the only points that influence the decision boundary. Every other point is interior to its class and could be removed without changing the model.

For a typical separable problem, there might be 3–10 support vectors out of thousands of training examples. That’s a major compression: the “trained model” is just a list of support vectors and their weights.
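Continuing the sketch from the lab above (the attribute names are scikit-learn’s; the removal test is only there to illustrate the claim):

```python
# Only the points on the margin define the line; everything else is inert.
print("training points:", len(X))
print("support vectors:", clf.support_vectors_.shape[0])   # typically a handful
print("per class:      ", clf.n_support_)

# Drop one non-support point and refit: the boundary barely moves
# (any difference is down to solver tolerance).
interior = np.setdiff1d(np.arange(len(X)), clf.support_)[0]
keep = np.ones(len(X), dtype=bool)
keep[interior] = False
clf2 = SVC(kernel="linear", C=1e6).fit(X[keep], y[keep])
print("max change in w:", np.abs(clf.coef_ - clf2.coef_).max())
```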

§ 03 · SOFT MARGIN & KERNELS · Two extensions that made SVMs usable

Real data isn’t perfectly separable. Two extensions handle this:

Soft margin. Allow some points to be inside the margin (or even on the wrong side), but pay a penalty proportional to how badly. A hyperparameter C controls the trade-off: large C punishes violations harshly (back toward hard margin), small C lets the boundary be loose. A lone outlier sitting deep inside the other class is handled this way — the SVM accepts the misclassification rather than warping the line to chase it.
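A sketch of the C trade-off, assuming scikit-learn (the planted outlier and the C values are illustrative):

```python
# One class-1 point planted deep inside the class-0 cloud; vary C and watch the
# margin. Small C keeps a wide margin and simply tolerates the misclassification;
# large C pushes toward the hard-margin solution on everything else, shrinking
# the margin without ever being able to fix the outlier.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=[-2, -2], size=(50, 2)),
               rng.normal(loc=[+2, +2], size=(50, 2)),
               [[-2.0, -2.0]]])                    # the outlier
y = np.array([0] * 50 + [1] * 50 + [1])

for C in (0.1, 1.0, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    errors = int((clf.predict(X) != y).sum())
    margin = 2 / np.linalg.norm(clf.coef_)
    print(f"C={C:>6}: training errors={errors}, margin width={margin:.2f}")
```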

Kernels. What if the right boundary is curved? SVMs only know lines. The kernel trick, a shortcut that lets a linear method operate in a high-dimensional feature space without ever explicitly computing the high-dimensional features, is the move that handled this. Idea: project your data into a higher-dimensional space where it is linearly separable, then run the linear SVM there. The cleverness: you don’t actually do the projection. You replace every dot product in the SVM math with a kernel function that returns what the dot product would have been in the high-dimensional space.

Common kernels (a code sketch follows the list):

  • Linear: k(x, z) = x · z. No projection at all; this is the plain linear SVM.
  • Polynomial: k(x, z) = (x · z + c)^d. Boundaries that are degree-d curves.
  • RBF (Gaussian): k(x, z) = exp(−γ‖x − z‖²). Smooth, local boundaries; the usual default when you don’t know the right shape.
  • String kernels: similarity via shared substrings, useful for text and biological sequences.
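A sketch of the RBF kernel in action, assuming scikit-learn (the concentric-circles dataset and the γ value are illustrative):

```python
# Concentric circles are not linearly separable in 2D. An RBF-kernel SVM separates
# them without ever materializing the high-dimensional feature map: it only needs
# k(x, z) = exp(-gamma * ||x - z||^2) evaluated between pairs of points.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("linear kernel, training accuracy:", linear.score(X, y))   # roughly chance
print("RBF kernel,    training accuracy:", rbf.score(X, y))      # close to 1.0
```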

§ 04 · WHAT 2026 DOES WITH THIS · Niche but not extinct

SVMs were the dominant classifier from ~1995 to ~2012, especially in text and small-sample regimes. Neural networks and gradient boosting have since taken most of their territory. But SVMs remain useful in specific situations:

  • Small labeled datasets, especially when the features vastly outnumber the examples (the regime where a neural network would overfit).
  • High-dimensional sparse inputs such as TF-IDF text vectors, where a linear SVM is a fast, strong baseline.
  • When you want a compact model (a short list of support vectors and their weights) and a convex training problem with a single global optimum and no architecture search.

Beyond direct use, the SVM’s ideas — margin maximization, the kernel trick, support vectors — show up in surprising places: Gaussian processes, contrastive learning, and metric learning all owe something to the SVM framework.

CHECK · You’re classifying images of cats vs dogs with raw pixel features (~50k dimensions). You have 200 labeled examples. Which classifier should you try first?

§ 05 · TAKING THIS FORWARD · Where the ideas reappear

The geometric instinct — find the boundary that pushes both classes as far apart as possible — is one of the most reusable in machine learning. Contrastive learning uses an analogous principle (pull same-class examples close in embedding space, push different-class ones apart). Triplet loss and metric learning are direct descendants. Reading SVMs gives you the mental model that makes those modern methods feel natural.

§ · GOING DEEPER · Kernels, soft margins, and SVMs in 2026

Two extensions made SVMs practical. The kernel trick (Schölkopf & Smola) replaces the explicit dot product of feature vectors with a kernel function — letting the SVM operate in a very high-dimensional feature space without ever computing it. RBF, polynomial, and string kernels all fit this framework. The soft margin formulation (Cortes & Vapnik 1995) introduces slack variables that allow some misclassification, controlled by a regularization parameter C — essential for non-separable data, which is most data.

SVMs were the dominant text classifier from roughly 1995 to 2015. The current state of practice: SVMs lost the deep-learning battle on big data, but they remain the right tool for small, high-dimensional labeled datasets where neural networks would overfit. Hsu, Chang & Lin’s “Practical Guide” (2003) is still a reasonable starting point if you find yourself in that regime.
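A sketch of that regime, assuming scikit-learn; the toy spam corpus stands in for whatever small labeled set you actually have:

```python
# TF-IDF features are high-dimensional and sparse; a linear SVM is a reasonable
# first baseline when labeled examples are scarce.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = [
    "win cash now", "cheap meds online", "claim your free prize", "urgent: wire money",
    "lunch at noon?", "meeting moved to 3pm", "draft attached for review", "see you at standup",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]   # 1 = spam, 0 = not spam

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC(C=1.0))
print("cross-validated accuracy:", cross_val_score(model, docs, labels, cv=4).mean())
```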

§ · FURTHER READING · References & deeper sources

  1. Cortes, Vapnik (1995). Support-Vector Networks · Machine Learning
  2. Schölkopf et al. (1998). Kernel PCA and De-Noising in Feature Spaces · NeurIPS
  3. Joachims (1999). Making Large-Scale SVM Learning Practical · Advances in Kernel Methods
  4. Hsu, Chang, Lin (2003). A Practical Guide to Support Vector Classification · Technical Report
  5. Schölkopf, Smola (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond · MIT Press

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.