One-Line Summary: The mathematical machinery for measuring how outputs change with inputs -- the foundation of all learning algorithms.

Prerequisites: Vectors and Matrices, single-variable calculus (limits, derivatives).

What Are Derivatives and Gradients?

Imagine you are standing on a hilly landscape and you want to walk downhill as efficiently as possible. At any point you can feel the steepness of the slope in every direction. The gradient is the vector that points in the direction of the steepest uphill climb, and its magnitude tells you how steep that climb is. Walk in the opposite direction and you descend most rapidly. This is, in essence, what every gradient-based learning algorithm does: it evaluates the gradient of a loss function with respect to model parameters and steps in the negative gradient direction.

Formally, the derivative of a scalar function $f: \mathbb{R} \to \mathbb{R}$ at a point $x$ is:

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

When $f$ depends on multiple variables, we generalize to partial derivatives and collect them into a gradient vector.
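
As a quick numerical sanity check, a shrinking finite-difference step approximates this limit; the function $f(x) = x^2$ and the step sizes below are illustrative choices, not anything prescribed by the text.

```python
# Approximate f'(x) for f(x) = x^2 by shrinking the step h in the limit definition.
def f(x):
    return x ** 2

x = 3.0
for h in (1e-1, 1e-3, 1e-6):
    approx = (f(x + h) - f(x)) / h
    print(f"h={h:g}: approx f'({x}) = {approx:.6f}")  # exact value is 2x = 6
```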

How It Works

Partial Derivatives

For $f: \mathbb{R}^n \to \mathbb{R}$, the partial derivative with respect to $x_i$ measures how $f$ changes when only $x_i$ varies:

$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_i + h, \ldots, x_n) - f(x_1, \ldots, x_n)}{h}$$

The Gradient Vector

The gradient $\nabla f$ collects all partial derivatives into a single vector:

$$\nabla f(\mathbf{x}) = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right]^\top$$

Two critical properties: (1) $\nabla f$ points in the direction of steepest ascent, and (2) $\nabla f$ is orthogonal to the level sets of $f$ (the contours where $f$ is constant).
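
A minimal numpy sketch, using an arbitrary illustrative function $f(x) = x_0^2 + 3x_1^2$: the analytic gradient is checked against a central finite-difference approximation.

```python
import numpy as np

# Illustrative function and its analytic gradient.
def f(x):
    return x[0] ** 2 + 3 * x[1] ** 2

def grad_f(x):
    return np.array([2 * x[0], 6 * x[1]])

def numerical_gradient(f, x, h=1e-5):
    """Central-difference approximation of the gradient."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.array([1.0, -2.0])
print(grad_f(x))                 # [  2. -12.]
print(numerical_gradient(f, x))  # close to the analytic gradient
```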

Directional Derivatives

The rate of change of $f$ in an arbitrary direction $\mathbf{u}$ (a unit vector) is:

$$D_{\mathbf{u}} f(\mathbf{x}) = \nabla f(\mathbf{x}) \cdot \mathbf{u}$$

This is maximized when $\mathbf{u}$ is parallel to $\nabla f$ (confirming the steepest-ascent interpretation) and zero when $\mathbf{u}$ is perpendicular to it.
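
A short numpy illustration of both facts, reusing the gradient value $[2, -12]$ from the sketch above (any gradient vector would do):

```python
import numpy as np

grad = np.array([2.0, -12.0])                # gradient of f at the point from the sketch above
u_steepest = grad / np.linalg.norm(grad)     # unit vector along the gradient
u_perp = np.array([grad[1], -grad[0]])
u_perp = u_perp / np.linalg.norm(u_perp)     # unit vector orthogonal to the gradient

print(grad @ u_steepest)  # equals ||grad||: the largest possible rate of increase
print(grad @ u_perp)      # ~0: no change along a level set
```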

The Chain Rule

The chain rule is the single most important calculus result for ML. If $h(x) = f(g(x))$, then:

$$h'(x) = f'(g(x)) \, g'(x)$$

In the multivariate case, if $\mathbf{h} = \mathbf{f} \circ \mathbf{g}$ where $\mathbf{g}: \mathbb{R}^n \to \mathbb{R}^m$ and $\mathbf{f}: \mathbb{R}^m \to \mathbb{R}^k$:

$$J_{\mathbf{h}}(\mathbf{x}) = J_{\mathbf{f}}(\mathbf{g}(\mathbf{x})) \, J_{\mathbf{g}}(\mathbf{x})$$

This is a product of Jacobian matrices. Backpropagation in neural networks is precisely the repeated application of this multivariate chain rule, propagating derivatives from the loss backward through each layer.
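
A minimal numpy sketch of this composition, with a small hand-picked layer $\mathbf{g}(\mathbf{x}) = \tanh(W\mathbf{x})$ and a squared-norm "loss" on top (both illustrative, not from the text): the gradient of the composite equals the product of the per-stage Jacobians.

```python
import numpy as np

W = np.array([[1.0, 2.0, 0.0],
              [0.5, -1.0, 3.0]])   # illustrative weights for g(x) = tanh(W x)

def g(x):
    return np.tanh(W @ x)

def f(z):
    return 0.5 * np.sum(z ** 2)    # scalar "loss" on top of g

x = np.array([0.3, -0.7, 1.1])
z = g(x)

# Per-stage Jacobians (rows = outputs, columns = inputs).
J_g = np.diag(1 - z ** 2) @ W      # Jacobian of tanh(Wx): diag(1 - z^2) W, shape (2, 3)
J_f = z.reshape(1, -1)             # gradient of 0.5*||z||^2 as a (1, 2) row vector

chain_grad = (J_f @ J_g).ravel()   # chain rule: product of Jacobians

# Check against a finite-difference gradient of the composite f(g(x)).
h = 1e-6
fd = np.array([(f(g(x + h * e)) - f(g(x - h * e))) / (2 * h) for e in np.eye(3)])
print(chain_grad)
print(fd)
```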

The Jacobian Matrix

For a vector-valued function $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian collects all first-order partial derivatives:

$$J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n}$$

The Jacobian generalizes the gradient to functions with vector outputs. In a neural network, the Jacobian of a layer's output with respect to its input determines how errors propagate.
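
A brief sketch of estimating a layer's Jacobian column by column with finite differences (the softplus layer below is an arbitrary illustrative choice); each column records how every output responds to a nudge in one input.

```python
import numpy as np

def layer(x):
    """Illustrative layer: elementwise softplus of an affine map, R^2 -> R^3."""
    W = np.array([[1.0, -0.5],
                  [0.3,  2.0],
                  [-1.0, 1.0]])
    return np.log1p(np.exp(W @ x))

def numerical_jacobian(f, x, h=1e-6):
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)   # column j: sensitivity to x_j
    return J

x = np.array([0.4, -1.2])
print(numerical_jacobian(layer, x))   # 3x2 matrix of partial derivatives
```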

The Hessian Matrix

For $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian $H$ contains all second-order partial derivatives:

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}, \qquad H \in \mathbb{R}^{n \times n}$$

The Hessian encodes the curvature of $f$. If $H$ is positive definite at a critical point, that point is a local minimum. If its eigenvalues have mixed signs, the point is a saddle point. Newton's method uses the Hessian to take curvature-informed steps:

$$\mathbf{x}_{t+1} = \mathbf{x}_t - H^{-1} \nabla f(\mathbf{x}_t)$$
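
A minimal Newton's-method sketch on an illustrative objective $f(\mathbf{x}) = x_0^4 + 2x_1^2$ (the function and starting point are made up for the example); the analytic gradient and Hessian drive each update.

```python
import numpy as np

# Illustrative objective f(x) = x_0^4 + 2*x_1^2, minimized at the origin.
def grad(x):
    return np.array([4 * x[0] ** 3, 4 * x[1]])

def hessian(x):
    return np.array([[12 * x[0] ** 2, 0.0],
                     [0.0,            4.0]])

x = np.array([2.0, -3.0])
for t in range(6):
    step = np.linalg.solve(hessian(x), grad(x))  # solves H step = grad; avoids forming H^{-1}
    x = x - step                                 # curvature-informed Newton update
    print(t, x)
```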

Automatic Differentiation

Computing derivatives in practice uses one of three approaches:

  • Symbolic differentiation: Applies calculus rules to produce an exact derivative expression. Can lead to expression swell for complex functions.
  • Numerical differentiation: Approximates derivatives via finite differences, e.g. $f'(x) \approx \frac{f(x+h) - f(x)}{h}$. Simple but suffers from floating-point errors and scales poorly with dimension.
  • Automatic differentiation (AD): Decomposes the function into elementary operations and applies the chain rule mechanically. Forward mode computes one directional derivative per pass; reverse mode (backpropagation) computes the full gradient in one pass regardless of the number of parameters.

Modern ML frameworks (PyTorch, JAX, TensorFlow) all implement reverse-mode AD, enabling gradients of loss functions with millions of parameters to be computed with roughly the same cost as a single forward evaluation.
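
A short PyTorch sketch of reverse-mode AD (any of the named frameworks would work similarly; the loss below is an arbitrary illustration): one backward pass yields the full gradient of a scalar loss with respect to every parameter.

```python
import torch

# Illustrative scalar loss of a parameter vector x: L(x) = sum(tanh(A x)^2).
A = torch.randn(50, 1000)
x = torch.randn(1000, requires_grad=True)

loss = (torch.tanh(A @ x) ** 2).sum()   # forward pass records the computation graph
loss.backward()                         # reverse-mode AD: one backward sweep

print(x.grad.shape)   # torch.Size([1000]) -- the full gradient from a single pass
```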

Why It Matters

Every parameter update in supervised, unsupervised, and reinforcement learning relies on gradient computation. Without efficient, accurate gradient calculation, training deep networks with billions of parameters would be impossible. The gradient tells the optimizer which direction to move and how far -- it is the feedback signal that enables learning.

Key Technical Details

  • The gradient $\nabla f(\mathbf{x})$ lives in the same space as $\mathbf{x}$ (parameter space), not in the output space.
  • For $f: \mathbb{R}^n \to \mathbb{R}$, reverse-mode AD computes the full gradient in $O(1)$ backward passes (relative to the forward-pass cost). Forward-mode requires $n$ passes.
  • The Hessian has $n^2$ entries, making it impractical to store for large models. Approximations include diagonal Hessians, Hessian-vector products (see the sketch after this list), and the Fisher information matrix.
  • Clairaut's theorem: If the second partials are continuous, $\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$, so the Hessian is symmetric.
  • Vanishing and exploding gradients: In deep networks, repeated chain rule applications can cause gradients to shrink toward zero or grow unboundedly. Architectural choices (residual connections, layer normalization) and careful initialization mitigate this.
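
A minimal PyTorch sketch of a Hessian-vector product via double backward (the cubic objective is illustrative); it costs roughly two gradient evaluations and never forms the $n \times n$ Hessian.

```python
import torch

def f(x):
    return (x ** 3).sum()   # illustrative scalar objective; Hessian is diag(6x)

x = torch.randn(5, requires_grad=True)
v = torch.randn(5)          # direction for the Hessian-vector product

g = torch.autograd.grad(f(x), x, create_graph=True)[0]  # gradient, kept on the graph
hvp = torch.autograd.grad(g @ v, x)[0]                   # d/dx (grad . v) = H v

print(hvp)                  # equals 6 * x * v for this objective
print(6 * x.detach() * v)   # analytic check
```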

Common Misconceptions

  • "The gradient IS the derivative." The gradient is the derivative of a scalar function with respect to a vector. For vector-valued functions, the appropriate generalization is the Jacobian, not the gradient.
  • "Zero gradient means global minimum." A zero gradient () is a necessary condition for any critical point: local minimum, local maximum, or saddle point. The Hessian is needed to distinguish among these.
  • "Automatic differentiation is just numerical differentiation." AD computes exact derivatives (up to floating-point precision) using the chain rule, not finite differences. It is fundamentally different in both accuracy and computational cost.

Connections to Other Concepts

  • cost-latency-optimization.md: Gradients provide the update direction; the Hessian informs second-order methods and learning rate selection.
  • vectors-and-matrices.md: Gradients are vectors; Jacobians and Hessians are matrices. Matrix calculus is the language of multivariable derivatives.
  • maximum-likelihood-estimation.md: MLE requires differentiating the log-likelihood with respect to parameters and setting the gradient to zero.
  • information-theory.md: The gradient of cross-entropy loss is directly related to the KL divergence between predicted and true distributions.

Further Reading

  • Rudin, Principles of Mathematical Analysis (1976) -- Rigorous treatment of limits, derivatives, and the multivariate chain rule.
  • Griewank & Walther, Evaluating Derivatives (2008) -- The definitive reference on automatic differentiation.
  • Baydin et al., "Automatic Differentiation in Machine Learning: a Survey" (2018) -- Accessible overview of AD modes and their use in ML frameworks.