One-Line Summary: The fundamental data structures of ML -- representing data as points in high-dimensional space and transformations as matrices.
Prerequisites: Basic algebra, coordinate geometry.
What Are Vectors and Matrices?
Imagine you are describing a house to a buyer. You might list its square footage, number of bedrooms, age, and price. Each of these numbers is a feature, and together they form a vector -- an ordered list of numbers that locates the house as a single point in a four-dimensional "feature space." Now imagine describing ten thousand houses: you stack their feature vectors into rows and get a matrix, a rectangular grid of numbers that encodes an entire dataset in one object.
Formally, a vector $x \in \mathbb{R}^n$ is an element of an $n$-dimensional real vector space. A matrix $A \in \mathbb{R}^{m \times n}$ is a rectangular array with $m$ rows and $n$ columns. In ML the convention is almost universal: each row of a data matrix $X \in \mathbb{R}^{m \times n}$ is one sample and each column is one feature.
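To make the convention concrete, here is a minimal NumPy sketch with made-up house features; rows are samples and columns are features:

```python
import numpy as np

# Each row is one house: [square footage, bedrooms, age (years), price ($)]
X = np.array([
    [1400, 3, 20, 250_000],
    [2100, 4,  5, 420_000],
    [ 900, 2, 35, 180_000],
])

print(X.shape)   # (3, 4): 3 samples (rows), 4 features (columns)
print(X[0])      # the feature vector for the first house
print(X[:, 0])   # the square-footage column across all houses
```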
How It Works
Vector Spaces and Operations
A vector space over $\mathbb{R}$ is a set equipped with vector addition and scalar multiplication satisfying closure, associativity, commutativity, and the existence of an additive identity and inverses. The canonical example is $\mathbb{R}^n$.
Key operations on vectors:
- Addition: $u + v = (u_1 + v_1, \ldots, u_n + v_n)$
- Scalar multiplication: $\alpha v = (\alpha v_1, \ldots, \alpha v_n)$
- Dot product: $u \cdot v = \sum_{i=1}^{n} u_i v_i = \|u\|\,\|v\| \cos\theta$
The dot product deserves special attention. It simultaneously measures (a) the projection of one vector onto another, and (b) how "aligned" two vectors are. When $u \cdot v = 0$ the vectors are orthogonal -- completely unrelated directions. This idea powers everything from cosine similarity in NLP to the normal equations in linear regression.
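A small NumPy sketch of these operations, with arbitrary example vectors, including cosine similarity built directly from the dot product:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 0.0, -1.0])

print(u + v)          # vector addition
print(2.5 * u)        # scalar multiplication
print(np.dot(u, v))   # dot product: sum of elementwise products

# Cosine similarity: how "aligned" two vectors are, independent of length
cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos_sim)

# Orthogonal vectors have a dot product of zero
print(np.dot(np.array([1.0, 0.0]), np.array([0.0, 5.0])))  # 0.0
```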
Matrix Multiplication as Linear Transformation
Matrix multiplication is not just a computational recipe; it is the algebraic encoding of a linear transformation. If $A \in \mathbb{R}^{m \times n}$, then the map $x \mapsto Ax$ sends vectors in $\mathbb{R}^n$ to vectors in $\mathbb{R}^m$. This single idea unifies:
- Rotation and scaling (geometric transformations)
- Projection (dimensionality reduction via PCA)
- Neural network layers (a dense layer computes $Wx + b$ followed by a nonlinearity)
The product $C = AB$, where $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, is defined element-wise as:

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$

This requires the inner dimensions ($n$) to match and yields $C \in \mathbb{R}^{m \times p}$.
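The sketch below (with placeholder weights and inputs) shows the same operation viewed two ways: a rotation acting on a 2D point, and a hypothetical dense layer mapping 3 inputs to 2 outputs; it also checks the shape rule:

```python
import numpy as np

# A 2x2 rotation matrix: a linear map from R^2 to R^2
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
x = np.array([1.0, 0.0])
print(R @ x)                     # the point rotated by 45 degrees

# A dense layer as a matrix-vector product: W is (2, 3), so it maps R^3 -> R^2
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))  # placeholder weights
b = rng.standard_normal(2)       # placeholder bias
h = rng.standard_normal(3)       # placeholder input
print(W @ h + b)                 # pre-activation output of the layer

# Shapes must agree: (m, n) @ (n, p) -> (m, p)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))
print((A @ B).shape)             # (4, 5)
```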
Transpose and Symmetry
The transpose $A^\top$ is obtained by swapping rows and columns: $(A^\top)_{ij} = A_{ji}$. A matrix is symmetric if $A = A^\top$. Covariance matrices, Hessians, and kernel matrices are all symmetric, which grants computational advantages such as guaranteed real eigenvalues.
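A quick numerical check of symmetry, using a Gram-style matrix $X^\top X$ built from random placeholder data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 4))   # 100 samples, 4 features

print(X.T.shape)                    # (4, 100): rows and columns swapped

S = X.T @ X                         # Gram / covariance-style matrix
print(np.allclose(S, S.T))          # True: S is symmetric
print(np.linalg.eigvalsh(S))        # real (and here nonnegative) eigenvalues
```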
Inverse and Rank
A square matrix $A$ is invertible if there exists $A^{-1}$ such that $A A^{-1} = A^{-1} A = I$. The inverse exists if and only if $\det(A) \neq 0$, equivalently when $A$ has full rank.
The rank of a matrix is the dimension of its column space (equivalently, its row space). For $A \in \mathbb{R}^{m \times n}$:

$$\operatorname{rank}(A) \le \min(m, n)$$

When the rank is less than $\min(m, n)$, the matrix is rank-deficient -- some features are linearly dependent. This signals multicollinearity in regression and motivates regularization techniques.
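A sketch of how rank deficiency shows up numerically when one feature is an exact linear combination of others (hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))           # 50 samples, 3 independent features

# Append a redundant feature: an exact linear combination of columns 0 and 1
redundant = 2.0 * X[:, 0] - 0.5 * X[:, 1]
X_bad = np.column_stack([X, redundant])    # shape (50, 4)

print(np.linalg.matrix_rank(X))            # 3: full column rank
print(np.linalg.matrix_rank(X_bad))        # 3: rank-deficient (4 columns, rank 3)

# Consequence: X_bad.T @ X_bad is singular, so it has no inverse
print(np.linalg.det(X_bad.T @ X_bad))      # ~0 (up to floating-point error)
```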
Column Space and Null Space
The column space $\mathcal{C}(A)$ is the span of $A$'s columns -- the set of all vectors $b$ for which $Ax = b$ has a solution. The null space $\mathcal{N}(A)$ is the set of all $x$ satisfying $Ax = 0$. Together they satisfy the rank-nullity theorem:

$$\operatorname{rank}(A) + \dim \mathcal{N}(A) = n$$
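A numerical illustration of the rank-nullity theorem, using SVD to extract a null-space basis for a small made-up matrix with dependent columns:

```python
import numpy as np

# A 3x4 matrix whose last two columns are combinations of the first two,
# so the columns are linearly dependent and the null space is nontrivial.
A = np.array([
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 3.0, 2.0],
])
m, n = A.shape

rank = np.linalg.matrix_rank(A)
_, s, Vt = np.linalg.svd(A)

# Right singular vectors beyond the rank span the null space
null_basis = Vt[rank:]                      # shape (n - rank, n)
nullity = null_basis.shape[0]

print(rank, nullity, rank + nullity == n)   # rank-nullity: rank + nullity = n
print(np.allclose(A @ null_basis.T, 0))     # each basis vector satisfies Ax = 0
```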
Why It Matters
Nearly every ML algorithm begins by organizing data into a matrix. Linear regression solves the normal equations $X^\top X \beta = X^\top y$. PCA finds eigenvectors of the covariance matrix $X^\top X$ (after centering). Neural networks chain matrix multiplications with nonlinearities. Understanding how matrices encode transformations, when systems are solvable, and what rank reveals about data redundancy is prerequisite knowledge for almost everything that follows in ML.
Key Technical Details
- Matrix multiplication is not commutative: $AB \neq BA$ in general.
- $(AB)^\top = B^\top A^\top$ -- the transpose reverses the order of multiplication.
- The Gram matrix $X^\top X$ encodes pairwise dot products between features; $X X^\top$ encodes pairwise dot products between samples.
- Computational cost of naive matrix multiplication of two $n \times n$ matrices is $O(n^3)$; Strassen's algorithm achieves $O(n^{2.807})$.
- Sparse matrices (most entries zero) arise in NLP bag-of-words and graph adjacency matrices, enabling specialized storage formats (CSR, CSC) that reduce memory from $O(mn)$ to roughly the number of nonzero entries.
- An orthogonal matrix satisfies $Q^\top Q = Q Q^\top = I$, meaning its columns are orthonormal. Orthogonal matrices preserve lengths and angles, which is why they appear in SVD and QR decomposition (see the sketch after this list).
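As a quick numerical check of that last point, the sketch below builds an orthogonal matrix via QR decomposition (on a random placeholder matrix) and verifies that it preserves lengths:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))

Q, _ = np.linalg.qr(A)                    # Q has orthonormal columns

print(np.allclose(Q.T @ Q, np.eye(4)))    # Q^T Q = I
x = rng.standard_normal(4)
print(np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x)))  # lengths preserved
```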
Common Misconceptions
- "A matrix is just a table of numbers." A matrix is an operator. The same grid of numbers can represent a dataset, a linear map, a covariance structure, or a graph adjacency. Interpreting it correctly depends on context.
- "Inverse always exists for square matrices." Only if the determinant is nonzero. Singular matrices (rank-deficient) have no inverse, which is precisely when the system may have no solution or infinitely many solutions.
- "Higher-dimensional vectors can't be visualized, so intuition fails." Many properties -- orthogonality, projection, span -- generalize perfectly from 2D/3D. Building geometric intuition in low dimensions transfers reliably.
Connections to Other Concepts
- matrix-decompositions.md: Eigendecomposition and SVD factor matrices to expose latent structure and rank, and to enable compression.
- derivatives-and-gradients.md: Gradients are vectors; Jacobians and Hessians are matrices. Backpropagation is a sequence of matrix-vector products.
- norms-and-distance-metrics.md: The L2 norm is defined via the dot product; the Mahalanobis distance uses the inverse covariance matrix.
- probability-fundamentals.md: Covariance matrices encode the joint variability of random variables.
- cost-latency-optimization.md: The Hessian matrix determines the curvature of the loss surface and the conditioning of optimization.
Further Reading
- Strang, Introduction to Linear Algebra (2016) -- The gold-standard textbook for building geometric intuition about vector spaces.
- Boyd & Vandenberghe, Introduction to Applied Linear Algebra (2018) -- Focused on applications in data science and ML, freely available online.
- Goodfellow et al., Deep Learning, Chapter 2 (2016) -- A concise review of the linear algebra needed specifically for deep learning.