One-Line Summary: The fundamental data structures of ML -- representing data as points in high-dimensional space and transformations as matrices.
Prerequisites: Basic algebra, coordinate geometry.
What Are Vectors and Matrices?
Imagine you are describing a house to a buyer. You might list its square footage, number of bedrooms, age, and price. Each of these numbers is a feature, and together they form a vector -- an ordered list of numbers that locates the house as a single point in a four-dimensional "feature space." Now imagine describing ten thousand houses: you stack their feature vectors into rows and get a matrix, a rectangular grid of numbers that encodes an entire dataset in one object.
Formally, a vector $x \in \mathbb{R}^n$ is an element of an $n$-dimensional real vector space. A matrix $A \in \mathbb{R}^{m \times n}$ is a rectangular array with $m$ rows and $n$ columns. In ML the convention is almost universal: each row of a data matrix $X \in \mathbb{R}^{m \times n}$ is one sample and each column is one feature.
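To make the convention concrete, here is a minimal NumPy sketch with made-up house features; rows are samples and columns are features:

```python
import numpy as np

# Each row is one house: [square footage, bedrooms, age (years), price ($)]
X = np.array([
    [1400, 3, 20, 250_000],
    [2100, 4,  5, 420_000],
    [ 900, 2, 35, 180_000],
])

print(X.shape)   # (3, 4): 3 samples (rows), 4 features (columns)
print(X[0])      # the feature vector for the first house
print(X[:, 0])   # the square-footage column across all houses
```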
How It Works
Vector Spaces and Operations
A vector space over $\mathbb{R}$ is a set equipped with vector addition and scalar multiplication satisfying closure, associativity, commutativity, and the existence of an additive identity and inverses. The canonical example is $\mathbb{R}^n$.
Key operations on vectors:
- Addition: $u + v = (u_1 + v_1, \ldots, u_n + v_n)$
- Scalar multiplication: $\alpha v = (\alpha v_1, \ldots, \alpha v_n)$
- Dot product: $u \cdot v = \sum_{i=1}^{n} u_i v_i = \|u\|\,\|v\| \cos\theta$
The dot product deserves special attention. It simultaneously measures (a) the projection of one vector onto another, and (b) how "aligned" two vectors are. When $u \cdot v = 0$ the vectors are orthogonal -- completely unrelated directions. This idea powers everything from cosine similarity in NLP to the normal equations in linear regression.
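A small NumPy sketch of these operations, with arbitrary example vectors, including cosine similarity built directly from the dot product:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 0.0, -1.0])

print(u + v)          # vector addition
print(2.5 * u)        # scalar multiplication
print(np.dot(u, v))   # dot product: sum of elementwise products

# Cosine similarity: how "aligned" two vectors are, independent of length
cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos_sim)

# Orthogonal vectors have a dot product of zero
print(np.dot(np.array([1.0, 0.0]), np.array([0.0, 5.0])))  # 0.0
```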
Matrix Multiplication as Linear Transformation
Matrix multiplication is not just a computational recipe; it is the algebraic encoding of a linear transformation. If $A \in \mathbb{R}^{m \times n}$, then the map $x \mapsto Ax$ sends vectors in $\mathbb{R}^n$ to vectors in $\mathbb{R}^m$. This single idea unifies:
- Rotation and scaling (geometric transformations)
- Projection (dimensionality reduction via PCA)
- Neural network layers (a dense layer computes $Wx + b$ followed by a nonlinearity)
The product $C = AB$, where $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, is defined element-wise as:

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$

This requires the inner dimensions ($n$) to match and yields $C \in \mathbb{R}^{m \times p}$.
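The sketch below (with placeholder weights and inputs) shows the same operation viewed two ways: a rotation acting on a 2D point, and a hypothetical dense layer mapping 3 inputs to 2 outputs; it also checks the shape rule:

```python
import numpy as np

# A 2x2 rotation matrix: a linear map from R^2 to R^2
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
x = np.array([1.0, 0.0])
print(R @ x)                     # the point rotated by 45 degrees

# A dense layer as a matrix-vector product: W is (2, 3), so it maps R^3 -> R^2
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))  # placeholder weights
b = rng.standard_normal(2)       # placeholder bias
h = rng.standard_normal(3)       # placeholder input
print(W @ h + b)                 # pre-activation output of the layer

# Shapes must agree: (m, n) @ (n, p) -> (m, p)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))
print((A @ B).shape)             # (4, 5)
```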
Transpose and Symmetry
The transpose $A^\top$ is obtained by swapping rows and columns: $(A^\top)_{ij} = A_{ji}$. A matrix is symmetric if $A = A^\top$. Covariance matrices, Hessians, and kernel matrices are all symmetric, which grants computational advantages such as guaranteed real eigenvalues.
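A quick numerical check of symmetry, using a Gram-style matrix $X^\top X$ built from random placeholder data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 4))   # 100 samples, 4 features

print(X.T.shape)                    # (4, 100): rows and columns swapped

S = X.T @ X                         # Gram / covariance-style matrix
print(np.allclose(S, S.T))          # True: S is symmetric
print(np.linalg.eigvalsh(S))        # real (and here nonnegative) eigenvalues
```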
Inverse and Rank
A square matrix $A$ is invertible if there exists $A^{-1}$ such that $A A^{-1} = A^{-1} A = I$. The inverse exists if and only if $\det(A) \neq 0$, equivalently when $A$ has full rank.
The rank of a matrix is the dimension of its column space (equivalently, its row space). For $A \in \mathbb{R}^{m \times n}$:

$$\operatorname{rank}(A) \le \min(m, n)$$

When the rank is less than $\min(m, n)$, the matrix is rank-deficient -- some features are linearly dependent. This signals multicollinearity in regression and motivates regularization techniques.
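A sketch of how rank deficiency shows up numerically when one feature is an exact linear combination of others (hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))           # 50 samples, 3 independent features

# Append a redundant feature: an exact linear combination of columns 0 and 1
redundant = 2.0 * X[:, 0] - 0.5 * X[:, 1]
X_bad = np.column_stack([X, redundant])    # shape (50, 4)

print(np.linalg.matrix_rank(X))            # 3: full column rank
print(np.linalg.matrix_rank(X_bad))        # 3: rank-deficient (4 columns, rank 3)

# Consequence: X_bad.T @ X_bad is singular, so it has no inverse
print(np.linalg.det(X_bad.T @ X_bad))      # ~0 (up to floating-point error)
```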
Column Space and Null Space
The column space $\mathcal{C}(A)$ is the span of $A$'s columns -- the set of all vectors $b$ for which $Ax = b$ has a solution. The null space $\mathcal{N}(A)$ is the set of all $x$ satisfying $Ax = 0$. Together they satisfy the rank-nullity theorem:

$$\operatorname{rank}(A) + \dim \mathcal{N}(A) = n$$
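A numerical illustration of the rank-nullity theorem, using SVD to extract a null-space basis for a small made-up matrix with dependent columns:

```python
import numpy as np

# A 3x4 matrix whose last two columns are combinations of the first two,
# so the columns are linearly dependent and the null space is nontrivial.
A = np.array([
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 3.0, 2.0],
])
m, n = A.shape

rank = np.linalg.matrix_rank(A)
_, s, Vt = np.linalg.svd(A)

# Right singular vectors beyond the rank span the null space
null_basis = Vt[rank:]                      # shape (n - rank, n)
nullity = null_basis.shape[0]

print(rank, nullity, rank + nullity == n)   # rank-nullity: rank + nullity = n
print(np.allclose(A @ null_basis.T, 0))     # each basis vector satisfies Ax = 0
```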
Why It Matters
Nearly every ML algorithm begins by organizing data into a matrix. Linear regression solves the normal equations $X^\top X \beta = X^\top y$. PCA finds eigenvectors of the covariance matrix $X^\top X$ (after centering). Neural networks chain matrix multiplications with nonlinearities. Understanding how matrices encode transformations, when systems are solvable, and what rank reveals about data redundancy is prerequisite knowledge for almost everything that follows in ML.
Key Technical Details
- Matrix multiplication is not commutative: $AB \neq BA$ in general.
- $(AB)^\top = B^\top A^\top$ -- the transpose reverses the order of multiplication.
- The Gram matrix $X^\top X$ encodes pairwise dot products between features; $X X^\top$ encodes pairwise dot products between samples.
- Computational cost of naive matrix multiplication of two $n \times n$ matrices is $O(n^3)$; Strassen's algorithm achieves $O(n^{2.807})$.
- Sparse matrices (most entries zero) arise in NLP bag-of-words and graph adjacency matrices, enabling specialized storage formats (CSR, CSC) that reduce memory from $O(mn)$ to roughly the number of nonzero entries.
- An orthogonal matrix satisfies $Q^\top Q = Q Q^\top = I$, meaning its columns are orthonormal. Orthogonal matrices preserve lengths and angles, which is why they appear in SVD and QR decomposition (see the sketch after this list).
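As a quick numerical check of that last point, the sketch below builds an orthogonal matrix via QR decomposition (on a random placeholder matrix) and verifies that it preserves lengths:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))

Q, _ = np.linalg.qr(A)                    # Q has orthonormal columns

print(np.allclose(Q.T @ Q, np.eye(4)))    # Q^T Q = I
x = rng.standard_normal(4)
print(np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x)))  # lengths preserved
```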
Common Misconceptions
- "A matrix is just a table of numbers." A matrix is an operator. The same grid of numbers can represent a dataset, a linear map, a covariance structure, or a graph adjacency. Interpreting it correctly depends on context.
- "Inverse always exists for square matrices." Only if the determinant is nonzero. Singular matrices (rank-deficient) have no inverse, which is precisely when the system may have no solution or infinitely many solutions.
- "Higher-dimensional vectors can't be visualized, so intuition fails." Many properties -- orthogonality, projection, span -- generalize perfectly from 2D/3D. Building geometric intuition in low dimensions transfers reliably.
Connections to Other Concepts
- matrix-decompositions.md: Eigendecomposition and SVD factor matrices to expose latent structure and rank, and to enable compression.
- derivatives-and-gradients.md: Gradients are vectors; Jacobians and Hessians are matrices. Backpropagation is a sequence of matrix-vector products.
- norms-and-distance-metrics.md: The L2 norm is defined via the dot product; the Mahalanobis distance uses the inverse covariance matrix.
- probability-fundamentals.md: Covariance matrices encode the joint variability of random variables.
- cost-latency-optimization.md: The Hessian matrix determines the curvature of the loss surface and the conditioning of optimization.
Further Reading
- Strang, Introduction to Linear Algebra (2016) -- The gold-standard textbook for building geometric intuition about vector spaces.
- Boyd & Vandenberghe, Introduction to Applied Linear Algebra (2018) -- Focused on applications in data science and ML, freely available online.
- Goodfellow et al., Deep Learning, Chapter 2 (2016) -- A concise review of the linear algebra needed specifically for deep learning.