

You can add vectors, scale them, and solve linear systems with them. But you still cannot answer the most basic geometric questions: how long is a vector? what angle do two vectors make? are these two vectors “pointing in the same direction”?

These questions seem obvious in 2D and 3D - you’ve been answering them since high school. But what happens when your vector lives in $\mathbb{R}^{10000}$? A data scientist working on recommendation systems represents each user as a 10,000-dimensional vector of ratings. How do you measure which users are “similar”? A physicist represents a quantum state as a vector in an infinite-dimensional function space. What does “orthogonal” mean there? An image recognition model compares feature vectors from two photos. What’s the right notion of “distance”?

The answer to all of these questions traces back to one structure: the inner product. This post builds it from scratch - starting with the familiar dot product, extracting its essential properties, and extending them to any vector space.


Section 1: The Dot Product in $\mathbb{R}^n$

Definition and Basic Properties

For vectors $\mathbf{u} = (u_1, u_2, \ldots, u_n)$ and $\mathbf{v} = (v_1, v_2, \ldots, v_n)$ in $\mathbb{R}^n$, the dot product (standard inner product) is:

$$\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^n u_i v_i = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n$$

In matrix notation, treating $\mathbf{u}$ and $\mathbf{v}$ as column vectors: $\mathbf{u} \cdot \mathbf{v} = \mathbf{u}^T \mathbf{v}$.

The dot product has three essential properties that we will later abstract into axioms:

Commutativity (Symmetry): $\mathbf{u} \cdot \mathbf{v} = \mathbf{v} \cdot \mathbf{u}$. This is immediate from the definition.

Bilinearity: The dot product is linear in each argument separately. That means:

$$(\alpha\mathbf{u} + \beta\mathbf{w}) \cdot \mathbf{v} = \alpha(\mathbf{u} \cdot \mathbf{v}) + \beta(\mathbf{w} \cdot \mathbf{v})$$
and the same holds in the second argument by symmetry. This is what makes the dot product a "product" in the algebraic sense.

Positive definiteness: $\mathbf{v} \cdot \mathbf{v} = \sum v_i^2 \geq 0$, with equality if and only if every $v_i = 0$, i.e., $\mathbf{v} = \mathbf{0}$.
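The three properties can be spot-checked numerically. The following is a minimal sketch in plain Python; the helper name `dot` and the sample vectors and scalars are illustrative choices, not part of the definitions above.

```python
def dot(u, v):
    """Standard dot product of two equal-length sequences."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(ui * vi for ui, vi in zip(u, v))

u, v, w = [1, 2, 3], [4, -1, 0], [2, 0, -5]
alpha, beta = 3, -2

# Symmetry: u . v == v . u
symmetric = dot(u, v) == dot(v, u)

# Linearity in the first argument: (alpha*u + beta*w) . v
combo = [alpha * ui + beta * wi for ui, wi in zip(u, w)]
linear = dot(combo, v) == alpha * dot(u, v) + beta * dot(w, v)

# Positive definiteness: v . v > 0 for nonzero v, and 0 for the zero vector
positive = dot(v, v) > 0 and dot([0, 0, 0], [0, 0, 0]) == 0

print(symmetric, linear, positive)
```

Integer inputs keep every comparison exact; with floating-point entries the equality checks would need a small tolerance.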

The Geometric Formula

Here is the connection between algebra and geometry that makes everything work. For nonzero vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^n$ with angle $\theta$ between them:

$$\mathbf{u} \cdot \mathbf{v} = \lVert\mathbf{u}\lVert \lVert\mathbf{v}\lVert \cos\theta$$

where $\lVert\mathbf{u}\lVert = \sqrt{\mathbf{u} \cdot \mathbf{u}}$ is the Euclidean length of $\mathbf{u}$.

Proof sketch using the law of cosines. In the triangle formed by $\mathbf{u}$, $\mathbf{v}$, and $\mathbf{v} - \mathbf{u}$, the law of cosines gives:

$$\lVert\mathbf{v} - \mathbf{u}\lVert^2 = \lVert\mathbf{u}\lVert^2 + \lVert\mathbf{v}\lVert^2 - 2\lVert\mathbf{u}\lVert\lVert\mathbf{v}\lVert\cos\theta$$

Now expand the left side using the dot product:

$$\lVert\mathbf{v} - \mathbf{u}\lVert^2 = (\mathbf{v} - \mathbf{u}) \cdot (\mathbf{v} - \mathbf{u}) = \lVert\mathbf{v}\lVert^2 - 2\mathbf{u} \cdot \mathbf{v} + \lVert\mathbf{u}\lVert^2$$

Equating the two expressions and cancelling $\lVert\mathbf{u}\lVert^2 + \lVert\mathbf{v}\lVert^2$ from both sides:

$$-2\mathbf{u} \cdot \mathbf{v} = -2\lVert\mathbf{u}\lVert\lVert\mathbf{v}\lVert\cos\theta$$

Dividing by $-2$ gives the formula. $\square$

This formula gives us an immediate geometric interpretation of the sign of the dot product:

  • $\mathbf{u} \cdot \mathbf{v} > 0$: the angle $\theta < 90°$ - the vectors point in roughly the same direction.
  • $\mathbf{u} \cdot \mathbf{v} = 0$: the angle $\theta = 90°$ - the vectors are orthogonal (perpendicular).
  • $\mathbf{u} \cdot \mathbf{v} < 0$: the angle $\theta > 90°$ - the vectors point in roughly opposite directions.

Worked Example

Let $\mathbf{u} = (1, 2, 3)$ and $\mathbf{v} = (4, -1, 0)$.

Dot product:

$$\mathbf{u} \cdot \mathbf{v} = (1)(4) + (2)(-1) + (3)(0) = 4 - 2 + 0 = 2$$

Norms:

$$\lVert\mathbf{u}\lVert = \sqrt{1^2 + 2^2 + 3^2} = \sqrt{14}, \qquad \lVert\mathbf{v}\lVert = \sqrt{4^2 + (-1)^2 + 0^2} = \sqrt{17}$$

Angle:

$$\cos\theta = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert\mathbf{u}\lVert\lVert\mathbf{v}\lVert} = \frac{2}{\sqrt{14} \cdot \sqrt{17}} = \frac{2}{\sqrt{238}} \approx 0.1296$$
$$\theta = \arccos(0.1296) \approx 82.6°$$

These vectors are nearly but not quite orthogonal.
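The same computation can be reproduced in a few lines of plain Python. This is only a sketch; the variable names are illustrative.

```python
import math

u = [1, 2, 3]
v = [4, -1, 0]

dot_uv = sum(a * b for a, b in zip(u, v))      # 1*4 + 2*(-1) + 3*0
norm_u = math.sqrt(sum(a * a for a in u))      # sqrt(14)
norm_v = math.sqrt(sum(b * b for b in v))      # sqrt(17)

cos_theta = dot_uv / (norm_u * norm_v)
theta_deg = math.degrees(math.acos(cos_theta))

print(dot_uv, round(cos_theta, 4), round(theta_deg, 1))
```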

Cauchy-Schwarz inequality. For any vectors $\mathbf{u}, \mathbf{v}$ in $\mathbb{R}^n$:

$$|\mathbf{u} \cdot \mathbf{v}| \leq \lVert\mathbf{u}\lVert\lVert\mathbf{v}\lVert$$

Proof. For any real $t$, the squared norm $\lVert\mathbf{u} - t\mathbf{v}\lVert^2 \geq 0$. Expanding:

$$\lVert\mathbf{u}\lVert^2 - 2t(\mathbf{u} \cdot \mathbf{v}) + t^2\lVert\mathbf{v}\lVert^2 \geq 0$$

This is a quadratic in $t$ that is always non-negative, so its discriminant must be $\leq 0$:

$$4(\mathbf{u} \cdot \mathbf{v})^2 - 4\lVert\mathbf{u}\lVert^2\lVert\mathbf{v}\lVert^2 \leq 0$$

Rearranging gives $|\mathbf{u} \cdot \mathbf{v}| \leq \lVert\mathbf{u}\lVert\lVert\mathbf{v}\lVert$. $\square$

Discomfort check. The formula $\cos\theta = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert\mathbf{u}\lVert\lVert\mathbf{v}\lVert}$ requires $|\cos\theta| \leq 1$, which in turn requires $|\mathbf{u} \cdot \mathbf{v}| \leq \lVert\mathbf{u}\lVert\lVert\mathbf{v}\lVert$. The box above proves exactly this. Section 3 revisits the proof in the abstract inner product setting with full detail on the equality condition.


Section 2: Norms - Measuring Size

The Euclidean norm $\lVert\mathbf{v}\lVert_2 = \sqrt{\mathbf{v} \cdot \mathbf{v}}$ is natural and familiar. But it is not the only useful way to measure the size of a vector. Different norms encode different geometric intuitions and lead to different behaviors in applications.

The Lp Family

For $\mathbf{v} \in \mathbb{R}^n$ and $1 \leq p < \infty$, the $L^p$ norm is:

$$\lVert\mathbf{v}\lVert_p = \left(\sum_{i=1}^n |v_i|^p\right)^{1/p}$$

The limiting case $p \to \infty$ gives the $L^\infty$ norm (infinity norm):

$$\lVert\mathbf{v}\lVert_\infty = \max_{1 \leq i \leq n} |v_i|$$

The three most important cases:

  • $L^1$ (Manhattan norm): $\lVert\mathbf{v}\lVert_1 = \sum_{i=1}^n |v_i|$. The distance you travel in a grid city - only north/south/east/west moves, no diagonals.
  • $L^2$ (Euclidean norm): $\lVert\mathbf{v}\lVert_2 = \sqrt{\sum_{i=1}^n v_i^2}$. Straight-line distance. The only $L^p$ norm that comes from an inner product.
  • $L^\infty$ (Chebyshev norm): $\lVert\mathbf{v}\lVert_\infty = \max_i |v_i|$. The size of the largest component. Controls the worst-case error.

For $\mathbf{v} = (3, -4)$:

$$\lVert\mathbf{v}\lVert_1 = 7, \qquad \lVert\mathbf{v}\lVert_2 = 5, \qquad \lVert\mathbf{v}\lVert_\infty = 4$$
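These values are easy to confirm in code. The function below is a small sketch of the $L^p$ formula in plain Python; the name `lp_norm` and the convention of passing `math.inf` for the max norm are assumptions made for this example.

```python
import math

def lp_norm(v, p):
    """L^p norm for p >= 1; pass p=math.inf for the max norm."""
    if p == math.inf:
        return max(abs(x) for x in v)
    if p < 1:
        raise ValueError("p must be >= 1 for a norm")
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

v = [3, -4]
print(lp_norm(v, 1), lp_norm(v, 2), lp_norm(v, math.inf))

# For large finite p the value approaches the max norm.
print(lp_norm(v, 50))
```

The last line illustrates the limiting case: as $p$ grows, the largest component dominates the sum.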

Geometry of Unit Balls

The unit ball in norm $\lVert\cdot\lVert$ is the set of all vectors with $\lVert\mathbf{v}\lVert \leq 1$. Its shape reveals the geometry of the norm:

  • $L^1$ ($p=1$): diamond, corners on the axes.
  • $L^2$ ($p=2$): circle, rotationally symmetric.
  • $L^\infty$ ($p=\infty$): square, corners off the axes.

The $L^1$ ball has corners exactly on the coordinate axes. The $L^\infty$ ball has corners off the axes. The $L^2$ ball is the only one with perfect rotational symmetry.

Why Different Norms Matter

$L^2$ promotes smoothness. The $L^2$ unit ball is smooth - it has no preferred directions. When you minimize a loss subject to $\lVert\mathbf{w}\lVert_2 \leq r$ (ridge regression), the weights are shrunk toward zero, but generically none of them is exactly zero.

$L^1$ promotes sparsity. The $L^1$ unit ball has corners on the coordinate axes. When an optimization contour first touches the $L^1$ ball, it often hits a corner - a corner where one or more coordinates are exactly zero. This is why $L^1$ regularization (LASSO) produces sparse solutions: it actively pushes some weights to exactly zero.

$L^\infty$ controls the worst-case component. Useful when you care about the maximum error across any single coordinate.

Norm Axioms

What makes something a legitimate “norm”? Three axioms characterize the concept:

  1. Non-negativity and definiteness: $\lVert\mathbf{v}\lVert \geq 0$ for all $\mathbf{v}$, with $\lVert\mathbf{v}\lVert = 0$ if and only if $\mathbf{v} = \mathbf{0}$.
  2. Absolute homogeneity: $\lVert c\mathbf{v}\lVert = |c|\lVert\mathbf{v}\lVert$ for all scalars $c$.
  3. Triangle inequality: $\lVert\mathbf{u} + \mathbf{v}\lVert \leq \lVert\mathbf{u}\lVert + \lVert\mathbf{v}\lVert$.

Every $L^p$ norm for $p \geq 1$ satisfies these axioms. The triangle inequality for general $p$ follows from Minkowski’s inequality, which is itself proved using Hölder’s inequality - the $L^p$ generalization of Cauchy-Schwarz.

Discomfort check. Why does $p$ need to be $\geq 1$? Consider the “$L^{0.5}$ quasi-norm” in $\mathbb{R}^2$: $\lVert(1,0)\lVert_{0.5} = 1$, $\lVert(0,1)\lVert_{0.5} = 1$, but $\lVert(1,1)\lVert_{0.5} = (1^{0.5} + 1^{0.5})^2 = 4 > 2$. The triangle inequality is violated. For $p < 1$, the unit “ball” is not convex, and the triangle inequality fails. The requirement $p \geq 1$ is precisely the condition that makes the unit ball convex.
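The failure for $p < 1$ can be checked directly. This is a sketch in plain Python; `lp_quasi` is a hypothetical helper that applies the $L^p$ formula without requiring $p \geq 1$.

```python
def lp_quasi(v, p):
    """Same formula as the L^p norm, but allowed for any p > 0."""
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

e1, e2 = [1, 0], [0, 1]
total = [a + b for a, b in zip(e1, e2)]        # the vector (1, 1)

lhs = lp_quasi(total, 0.5)                     # "norm" of the sum
rhs = lp_quasi(e1, 0.5) + lp_quasi(e2, 0.5)    # sum of the "norms"
print(lhs, rhs, lhs <= rhs)

# Non-convexity: the midpoint of two unit vectors lies outside the unit "ball".
mid = [0.5, 0.5]
print(lp_quasi(mid, 0.5) > 1)
```

The second check shows the convexity failure: both endpoints have size 1, yet their midpoint does not.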


Section 3: The Cauchy-Schwarz Inequality

Statement

For any vectors $\mathbf{u}, \mathbf{v}$ in an inner product space:

$$|\mathbf{u} \cdot \mathbf{v}| \leq \lVert\mathbf{u}\lVert \lVert\mathbf{v}\lVert$$

Equality holds if and only if $\mathbf{u}$ and $\mathbf{v}$ are proportional (i.e., one is a scalar multiple of the other).

This is the most important inequality in linear algebra. It guarantees that $\cos\theta = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert\mathbf{u}\lVert\lVert\mathbf{v}\lVert}$ always lies in $[-1, 1]$, so the angle formula is well-defined. It implies the triangle inequality. And it underpins enormous swaths of analysis and probability.

The Proof

Proof. If $\mathbf{v} = \mathbf{0}$, both sides are zero and the inequality holds trivially.

Assume $\mathbf{v} \neq \mathbf{0}$. For any real number $t$, the vector $\mathbf{u} - t\mathbf{v}$ has a non-negative squared norm:

$$\lVert\mathbf{u} - t\mathbf{v}\lVert^2 \geq 0$$

Expand the left side using bilinearity:

$$\lVert\mathbf{u}\lVert^2 - 2t(\mathbf{u} \cdot \mathbf{v}) + t^2\lVert\mathbf{v}\lVert^2 \geq 0$$

This is a quadratic function of $t$: $f(t) = \lVert\mathbf{v}\lVert^2 \cdot t^2 - 2(\mathbf{u} \cdot \mathbf{v}) \cdot t + \lVert\mathbf{u}\lVert^2 \geq 0$ for all $t$.

A quadratic that is always $\geq 0$ has non-positive discriminant:

$$\Delta = 4(\mathbf{u} \cdot \mathbf{v})^2 - 4\lVert\mathbf{v}\lVert^2 \lVert\mathbf{u}\lVert^2 \leq 0$$

Rearranging: $(\mathbf{u} \cdot \mathbf{v})^2 \leq \lVert\mathbf{u}\lVert^2 \lVert\mathbf{v}\lVert^2$. Taking square roots:

$$|\mathbf{u} \cdot \mathbf{v}| \leq \lVert\mathbf{u}\lVert \lVert\mathbf{v}\lVert$$

Equality condition. The discriminant equals zero exactly when $f(t)$ touches zero at its minimum, i.e., when $\lVert\mathbf{u} - t\mathbf{v}\lVert^2 = 0$ for $t = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert\mathbf{v}\lVert^2}$. This means $\mathbf{u} = t\mathbf{v}$: the vectors are proportional. $\square$

Discomfort check. The proof above never picks a specific $t$; the discriminant condition does that work implicitly. Equivalently, you can substitute the minimizer $t = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert\mathbf{v}\lVert^2}$ (found by setting $f'(t) = 0$) directly into $f(t) \geq 0$ and rearrange. The discriminant condition packages the same computation more neatly.
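Written out, the direct substitution looks like this. It is a worked step using only quantities from the proof; the name $t^*$ for the minimizer is introduced here for convenience. With $t^* = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert\mathbf{v}\lVert^2}$:

$$f(t^*) = \lVert\mathbf{u}\lVert^2 - 2\frac{(\mathbf{u} \cdot \mathbf{v})^2}{\lVert\mathbf{v}\lVert^2} + \frac{(\mathbf{u} \cdot \mathbf{v})^2}{\lVert\mathbf{v}\lVert^4}\lVert\mathbf{v}\lVert^2 = \lVert\mathbf{u}\lVert^2 - \frac{(\mathbf{u} \cdot \mathbf{v})^2}{\lVert\mathbf{v}\lVert^2} \geq 0$$

Multiplying through by $\lVert\mathbf{v}\lVert^2 > 0$ gives $(\mathbf{u} \cdot \mathbf{v})^2 \leq \lVert\mathbf{u}\lVert^2 \lVert\mathbf{v}\lVert^2$, the same inequality as before.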

Consequences

The triangle inequality for $L^2$. From Cauchy-Schwarz:

$$\begin{aligned} \lVert\mathbf{u} + \mathbf{v}\lVert^2 &= (\mathbf{u} + \mathbf{v}) \cdot (\mathbf{u} + \mathbf{v}) \\ &= \lVert\mathbf{u}\lVert^2 + 2(\mathbf{u} \cdot \mathbf{v}) + \lVert\mathbf{v}\lVert^2 \\ &\leq \lVert\mathbf{u}\lVert^2 + 2|\mathbf{u} \cdot \mathbf{v}| + \lVert\mathbf{v}\lVert^2 \\ &\leq \lVert\mathbf{u}\lVert^2 + 2\lVert\mathbf{u}\lVert\lVert\mathbf{v}\lVert + \lVert\mathbf{v}\lVert^2 \\ &= (\lVert\mathbf{u}\lVert + \lVert\mathbf{v}\lVert)^2 \end{aligned}$$

Taking square roots: $\lVert\mathbf{u} + \mathbf{v}\lVert \leq \lVert\mathbf{u}\lVert + \lVert\mathbf{v}\lVert$. The triangle inequality follows from Cauchy-Schwarz - it is not assumed but derived.

Cosine similarity in machine learning. Define:

$$\text{cosine\_sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert\mathbf{u}\lVert\lVert\mathbf{v}\lVert}$$

By Cauchy-Schwarz, this always lies in $[-1, 1]$. It equals $\cos\theta$ - it measures directional agreement while ignoring magnitude. Word vectors in NLP are compared by cosine similarity: the direction encodes meaning, while the magnitude often correlates with word frequency. “King” and “Queen” typically have high cosine similarity. “King” and “Banana” do not.
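A minimal sketch of cosine similarity in plain Python follows. The vectors are made-up toy data, not real word embeddings, and the function name `cosine_sim` simply mirrors the formula above.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for the zero vector")
    return dot / (norm_u * norm_v)

# Toy vectors, made up for illustration.
a = [1.0, 2.0, 0.5]
b = [2.0, 4.0, 1.0]       # same direction as a, twice the length
c = [-1.0, -2.0, -0.5]    # opposite direction to a

print(cosine_sim(a, b), cosine_sim(a, c))
```

Scaling a vector leaves the result unchanged, which is the sense in which magnitude is ignored. In floating point the value can land a hair outside $[-1, 1]$, so it is worth clamping before passing it to `math.acos`.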


Section 4: Abstract Inner Products

The Leap to Abstraction

The dot product on $\mathbb{R}^n$ was defined by a formula. But the three properties - symmetry, bilinearity, positive definiteness - are more fundamental than the formula. We can use them as axioms and build an entire theory on top, valid in any vector space.

Definition. An inner product on a real vector space $V$ is a function $\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}$ satisfying:

  1. Bilinearity: $\langle \alpha\mathbf{u} + \beta\mathbf{w}, \mathbf{v} \rangle = \alpha\langle\mathbf{u},\mathbf{v}\rangle + \beta\langle\mathbf{w},\mathbf{v}\rangle$ for all $\alpha, \beta \in \mathbb{R}$ (and the same in the second argument by symmetry).
  2. Symmetry: $\langle\mathbf{u},\mathbf{v}\rangle = \langle\mathbf{v},\mathbf{u}\rangle$ for all $\mathbf{u}, \mathbf{v} \in V$.
  3. Positive definiteness: $\langle\mathbf{v},\mathbf{v}\rangle \geq 0$ for all $\mathbf{v}$, with $\langle\mathbf{v},\mathbf{v}\rangle = 0$ if and only if $\mathbf{v} = \mathbf{0}$.

A vector space equipped with an inner product is called an inner product space (or a pre-Hilbert space).

Discomfort check. Over $\mathbb{C}$, we must modify axiom 2 to conjugate symmetry: $\langle\mathbf{u},\mathbf{v}\rangle = \overline{\langle\mathbf{v},\mathbf{u}\rangle}$. This is forced by positive definiteness: we need $\langle\mathbf{v},\mathbf{v}\rangle$ to be real and non-negative for all $\mathbf{v}$, but with plain symmetry and bilinearity we would get $\langle i\mathbf{v}, i\mathbf{v}\rangle = -\langle\mathbf{v},\mathbf{v}\rangle$. The standard inner product on $\mathbb{C}^n$ is $\langle\mathbf{u},\mathbf{v}\rangle = \sum_i \bar{u}_i v_i$, and it is conjugate-linear in the first argument. Which argument carries the conjugate is a convention: the form above matches physicists' Dirac notation $\langle\phi|\psi\rangle$, which is linear in the second argument, while many mathematics texts conjugate the second argument instead. This post works over $\mathbb{R}$ for simplicity.

Example: Functions as Vectors

Let $V = C[0,1]$, the vector space of continuous functions on $[0,1]$. Define:

$$\langle f, g \rangle = \int_0^1 f(x)g(x)dx$$

Let us verify all three axioms.

Bilinearity: $\langle \alpha f + \beta h, g \rangle = \int_0^1 (\alpha f(x) + \beta h(x))g(x)dx = \alpha\int_0^1 f(x)g(x)dx + \beta\int_0^1 h(x)g(x)dx = \alpha\langle f, g\rangle + \beta\langle h, g\rangle$. Check.

Symmetry: $\langle f, g\rangle = \int_0^1 f(x)g(x)dx = \int_0^1 g(x)f(x)dx = \langle g, f\rangle$. Check.

Positive definiteness: $\langle f, f\rangle = \int_0^1 f(x)^2dx \geq 0$. When does equality hold? Since $f$ is continuous and $f(x)^2 \geq 0$, the integral is zero only if $f(x) = 0$ for all $x$, i.e., $f \equiv 0$. Check.

So continuous functions are vectors with a legitimate inner product. Two functions $f$ and $g$ are orthogonal when $\int_0^1 f(x)g(x)dx = 0$. For example, $\sin(2\pi m x)$ and $\cos(2\pi n x)$ are orthogonal in this sense for all positive integers $m, n$, because their product integrates to zero over whole periods - this is the foundation of Fourier analysis.

The induced norm is $\lVert f\lVert = \sqrt{\int_0^1 f(x)^2dx}$, the $L^2$ norm for functions.
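The function inner product can be approximated numerically. The sketch below uses a midpoint rule in plain Python; the helper name `inner`, the number of sample points, and the choice of $f(x) = \sin(2\pi x)$, $g(x) = \cos(2\pi x)$ are assumptions made for illustration.

```python
import math

def inner(f, g, n=1000):
    """Approximate the integral of f(x) * g(x) over [0, 1] by the midpoint rule."""
    h = 1.0 / n
    return h * sum(f((k + 0.5) * h) * g((k + 0.5) * h) for k in range(n))

f = lambda x: math.sin(2 * math.pi * x)
g = lambda x: math.cos(2 * math.pi * x)

print(inner(f, g))              # approximately 0: f and g are orthogonal
print(math.sqrt(inner(f, f)))   # approximately sqrt(1/2), the norm of f
```

The midpoint rule is only one possible quadrature; any reasonable numerical integrator would serve for this check.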

Example: The Frobenius Inner Product for Matrices

Let $V = \mathbb{R}^{m \times n}$, the vector space of $m \times n$ matrices. Define:

$$\langle A, B \rangle = \text{tr}(A^T B) = \sum_{i=1}^m \sum_{j=1}^n A_{ij} B_{ij}$$

This is essentially the dot product, but applied entry-by-entry to the matrix. The induced norm is the Frobenius norm:

$$\lVert A\lVert_F = \sqrt{\text{tr}(A^T A)} = \sqrt{\sum_{i,j} A_{ij}^2}$$

The Frobenius norm is to matrices what the $L^2$ norm is to vectors. It appears as the data-fit term in matrix completion objectives (alongside a nuclear norm penalty), in measuring the convergence of iterative algorithms, and as the objective in low-rank approximation (where the best rank-$k$ approximation to $A$ in Frobenius norm is given by the truncated SVD).
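The two expressions for the Frobenius inner product can be compared on a small example. This sketch uses nested Python lists as matrices; the helpers `transpose`, `matmul`, and `trace` are written out here only to keep the example self-contained.

```python
def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

A = [[1, 2, 3],
     [4, 5, 6]]
B = [[0, 1, -1],
     [2, 0, 3]]

via_trace = trace(matmul(transpose(A), B))                              # tr(A^T B)
via_entries = sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))  # sum of A_ij * B_ij
frobenius_sq = sum(a * a for row in A for a in row)                     # squared Frobenius norm of A

print(via_trace, via_entries, frobenius_sq)
```

In practice the entrywise sum is the cheaper route, since it avoids forming the full product $A^T B$.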

Example: Weighted Inner Product

Fix a symmetric positive definite matrix $W \in \mathbb{R}^{n \times n}$ (symmetric, with all eigenvalues positive). Define:

$$\langle \mathbf{u}, \mathbf{v} \rangle_W = \mathbf{u}^T W \mathbf{v}$$

This is a valid inner product (the symmetry of $W$ gives axiom 2, and its positive definiteness gives axiom 3). Different choices of $W$ weight coordinates differently. Setting $W = \Sigma^{-1}$ (inverse covariance matrix) gives the Mahalanobis distance: it measures distance in “standard deviation units,” accounting for correlations between features.
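A small sketch of the weighted inner product in plain Python follows. The $2 \times 2$ matrix `W` is a hypothetical example chosen to be symmetric with positive trace and determinant, hence positive definite.

```python
import math

def weighted_inner(u, v, W):
    """u^T W v for a symmetric positive definite matrix W (given as a list of rows)."""
    Wv = [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]
    return sum(u_i * x for u_i, x in zip(u, Wv))

# Hypothetical weight matrix for illustration.
W = [[2.0, 0.5],
     [0.5, 1.0]]

u = [1.0, 0.0]
v = [0.0, 1.0]

print(weighted_inner(u, v, W))                               # 0.5: u and v are not W-orthogonal
print(weighted_inner(u, v, W) == weighted_inner(v, u, W))    # symmetry check

diff = [a - b for a, b in zip(u, v)]
print(math.sqrt(weighted_inner(diff, diff, W)))              # distance between u and v under W
```

The standard basis vectors are orthogonal under the dot product but not under this $W$, which shows that orthogonality depends on the choice of inner product.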

Deriving the Norm and Orthogonality

Once you have an inner product, you automatically get:

A norm: $\lVert\mathbf{v}\lVert = \sqrt{\langle\mathbf{v},\mathbf{v}\rangle}$.

A notion of orthogonality: $\mathbf{u} \perp \mathbf{v}$ means $\langle\mathbf{u},\mathbf{v}\rangle = 0$.

A notion of angle: $\cos\theta = \frac{\langle\mathbf{u},\mathbf{v}\rangle}{\lVert\mathbf{u}\lVert\lVert\mathbf{v}\lVert}$ (Cauchy-Schwarz guarantees this lies in $[-1,1]$).

The entire geometric vocabulary of Euclidean space - length, angle, perpendicularity, distance - extends to any inner product space, including infinite-dimensional function spaces.


Section 5: Orthogonality and Its Power

Pythagorean Theorem in Any Inner Product Space

If $\langle\mathbf{u},\mathbf{v}\rangle = 0$, then:

$$\lVert\mathbf{u} + \mathbf{v}\lVert^2 = \langle\mathbf{u} + \mathbf{v}, \mathbf{u} + \mathbf{v}\rangle = \lVert\mathbf{u}\lVert^2 + 2\langle\mathbf{u},\mathbf{v}\rangle + \lVert\mathbf{v}\lVert^2 = \lVert\mathbf{u}\lVert^2 + \lVert\mathbf{v}\lVert^2$$

This is the Pythagorean theorem - valid in any inner product space, for vectors in $\mathbb{R}^{10000}$ or functions in $L^2[0,1]$.

For $k$ mutually orthogonal vectors $\mathbf{v_1}, \ldots, \mathbf{v_k}$ (meaning $\langle\mathbf{v_i},\mathbf{v_j}\rangle = 0$ for $i \neq j$), the theorem extends:

$$\left\lVert\sum_{i=1}^k \mathbf{v_i}\right\lVert^2 = \sum_{i=1}^k \lVert\mathbf{v_i}\lVert^2$$

Orthonormal Sets and Bases

A set $\{\mathbf{e_1}, \mathbf{e_2}, \ldots, \mathbf{e_k}\}$ is orthonormal if:

  • Each vector is a unit vector: $\lVert\mathbf{e_i}\lVert = 1$.
  • Any two distinct vectors are orthogonal: $\langle\mathbf{e_i},\mathbf{e_j}\rangle = 0$ for $i \neq j$.

This can be written compactly as $\langle\mathbf{e_i},\mathbf{e_j}\rangle = \delta_{ij}$ where $\delta_{ij}$ is the Kronecker delta ($1$ if $i = j$, $0$ otherwise).

Orthonormal bases make coordinates trivially easy to compute. If $\{\mathbf{e_1}, \ldots, \mathbf{e_n}\}$ is an orthonormal basis of $V$, then for any vector $\mathbf{v} \in V$:

$$\mathbf{v} = \sum_{i=1}^n \langle\mathbf{v}, \mathbf{e_i}\rangle \mathbf{e_i}$$

The $i$-th coordinate is just $c_i = \langle\mathbf{v}, \mathbf{e_i}\rangle$. No solving linear systems required - just take inner products with each basis vector.

Discomfort check. Why does this formula work? Each $c_i = \langle\mathbf{v}, \mathbf{e_i}\rangle$ extracts the component of $\mathbf{v}$ in the direction $\mathbf{e_i}$ - it “projects” $\mathbf{v}$ onto $\mathbf{e_i}$. The orthogonality of the basis vectors ensures these projections don’t interfere with each other. If the basis were not orthonormal, you’d need to solve a linear system (with the Gram matrix as its coefficient matrix) to find the coordinates. Orthonormality is the “free lunch” of linear algebra.
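The coordinate formula can be seen in action on a small example. This sketch uses a rotated basis of $\mathbb{R}^2$ in plain Python; the basis and the test vector are illustrative choices.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# An orthonormal basis of R^2: the standard basis rotated by 45 degrees.
s = 1 / math.sqrt(2)
e1 = [s, s]
e2 = [-s, s]

v = [3.0, 1.0]

# Coordinates are plain inner products with the basis vectors.
c1, c2 = dot(v, e1), dot(v, e2)

# Rebuild v from its coordinates: v = c1*e1 + c2*e2.
rebuilt = [c1 * a + c2 * b for a, b in zip(e1, e2)]

print(c1, c2, rebuilt)
print(c1 ** 2 + c2 ** 2, dot(v, v))   # squared coordinates sum to the squared norm
```

No linear system is solved anywhere: two dot products give both coordinates.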

Why Orthogonal Sets Are Linearly Independent

Claim. Any set of nonzero pairwise orthogonal vectors is linearly independent.

Proof. Suppose $\sum c_i \mathbf{v_i} = \mathbf{0}$ with $\langle\mathbf{v_i},\mathbf{v_j}\rangle = 0$ for $i \neq j$. Take the inner product of both sides with $\mathbf{v_k}$:

$$\left\langle \mathbf{v_k}, \sum_i c_i \mathbf{v_i} \right\rangle = c_k \lVert\mathbf{v_k}\lVert^2 = 0$$

Since $\mathbf{v_k} \neq \mathbf{0}$, we have $\lVert\mathbf{v_k}\lVert^2 > 0$, so $c_k = 0$. This holds for every $k$. $\square$

This is a striking result: just knowing vectors are mutually orthogonal immediately gives linear independence, without any calculation.

Preview of Gram-Schmidt

Given any basis $\{\mathbf{v_1}, \ldots, \mathbf{v_n}\}$ (not necessarily orthonormal), the Gram-Schmidt process constructs an orthonormal basis $\{\mathbf{e_1}, \ldots, \mathbf{e_n}\}$ spanning the same space. The idea: process the vectors one at a time, subtracting the component of each new vector that lies along the previously processed (orthogonal) directions, then normalize. This is the subject of the next post.
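As a preview, the idea described above can be sketched in a few lines of plain Python. This is the classical variant with the standard dot product, written for illustration only; the function name `gram_schmidt` and the tolerance `tol` are assumptions of this sketch.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vectors, tol=1e-12):
    """Classical Gram-Schmidt: return an orthonormal list spanning the same space.

    Raises ValueError if the input vectors are (numerically) linearly dependent.
    """
    basis = []
    for v in vectors:
        w = list(v)
        # Subtract the components of v along the directions found so far.
        for e in basis:
            coeff = dot(v, e)
            w = [wi - coeff * ei for wi, ei in zip(w, e)]
        norm = math.sqrt(dot(w, w))
        if norm < tol:
            raise ValueError("input vectors are linearly dependent")
        basis.append([wi / norm for wi in w])
    return basis

e = gram_schmidt([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])

# The matrix of pairwise inner products should be close to the identity.
for i in range(3):
    print([round(dot(e[i], e[j]), 10) for j in range(3)])
```

The classical variant shown here can lose orthogonality in floating point on badly conditioned inputs; the modified variant, which subtracts projections from the running vector `w` rather than from the original `v`, is usually preferred numerically.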


Section 6: The Gram Matrix

Definition

Given a collection of vectors $\mathbf{v_1}, \ldots, \mathbf{v_k}$ in an inner product space, the Gram matrix $G$ is the $k \times k$ matrix with entries:

$$G_{ij} = \langle\mathbf{v_i}, \mathbf{v_j}\rangle$$

For vectors in $\mathbb{R}^n$: let $V$ be the $n \times k$ matrix whose columns are $\mathbf{v_1}, \ldots, \mathbf{v_k}$. Then:

$$G = V^T V$$

because $(V^T V)_{ij} = \mathbf{v_i}^T \mathbf{v_j} = \langle\mathbf{v_i}, \mathbf{v_j}\rangle$.

Positive Semidefiniteness

The Gram matrix is always positive semidefinite: for any vector $\mathbf{c} \in \mathbb{R}^k$:

$$\mathbf{c}^T G \mathbf{c} = \mathbf{c}^T V^T V \mathbf{c} = \lVert V\mathbf{c}\lVert^2 \geq 0$$

When is $G$ positive definite (all eigenvalues strictly positive)? Exactly when $\mathbf{c}^T G \mathbf{c} = 0 \implies \mathbf{c} = \mathbf{0}$, which means $V\mathbf{c} = \mathbf{0} \implies \mathbf{c} = \mathbf{0}$, i.e., $V$ has trivial null space, i.e., the vectors $\mathbf{v_1}, \ldots, \mathbf{v_k}$ are linearly independent.

Key theorem. The Gram matrix $G$ is:

  • Positive semidefinite always.
  • Positive definite if and only if $\mathbf{v_1}, \ldots, \mathbf{v_k}$ are linearly independent.
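The theorem can be illustrated with two vectors, where the Gram matrix is $2 \times 2$ and its determinant decides between the two cases. This is a sketch in plain Python; the helper names and sample vectors are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram(vectors):
    """Gram matrix G[i][j] = <v_i, v_j> for the standard dot product."""
    return [[dot(vi, vj) for vj in vectors] for vi in vectors]

def det2(G):
    return G[0][0] * G[1][1] - G[0][1] * G[1][0]

independent = [[1, 0, 1], [0, 1, 1]]
dependent = [[1, 2, 3], [2, 4, 6]]      # second vector is 2 * first

G_ind = gram(independent)
G_dep = gram(dependent)

print(G_ind, det2(G_ind))   # positive determinant: positive definite
print(G_dep, det2(G_dep))   # zero determinant: only positive semidefinite
```

For two vectors the determinant is $\lVert\mathbf{v_1}\lVert^2\lVert\mathbf{v_2}\lVert^2 - \langle\mathbf{v_1},\mathbf{v_2}\rangle^2$, so its non-negativity is Cauchy-Schwarz again, with equality exactly in the dependent case.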

The Gram Matrix Appears Everywhere

Least squares normal equations. Given an overdetermined system $A\mathbf{x} \approx \mathbf{b}$, the least squares solution satisfies $A^T A \mathbf{x} = A^T \mathbf{b}$. The matrix $A^T A$ is exactly the Gram matrix of the columns of $A$. The normal equations say: project $\mathbf{b}$ onto the column space of $A$, and find the coefficients.

Kernel methods. In support vector machines and Gaussian processes, the kernel matrix $K_{ij} = K(\mathbf{x_i}, \mathbf{x_j})$ is a Gram matrix in some feature space. The positive semidefiniteness of the kernel matrix is not an assumption - it follows automatically from the Gram structure.

Covariance matrices. If the rows of the $n \times d$ matrix $X$ are centered data points, the sample covariance matrix is $\frac{1}{n-1} X^T X$, a scaled Gram matrix of the columns of $X$. Its eigenvalues are the variances along the principal components (PCA directions).


Section 7: Applications

Word Embeddings and Semantic Similarity

Modern NLP represents each word as a vector in $\mathbb{R}^d$ (typically $d = 100$ to $d = 1000$). These vectors are trained so that semantically related words have high dot products (or cosine similarities). The intuition:

$$\text{sim}(\text{"King"}, \text{"Queen"}) = \frac{\mathbf{w}_{\text{King}} \cdot \mathbf{w}_{\text{Queen}}}{\lVert\mathbf{w}_{\text{King}}\lVert\lVert\mathbf{w}_{\text{Queen}}\lVert} \gg \text{sim}(\text{"King"}, \text{"Banana"})$$

The famous “King - Man + Woman $\approx$ Queen” arithmetic is really arithmetic in the inner product space: the inner product structure encodes semantic relationships as geometric relationships.

Least Squares as Orthogonal Projection

When you fit a linear model $A\mathbf{x} \approx \mathbf{b}$, the least squares solution $\hat{\mathbf{x}}$ produces the orthogonal projection of $\mathbf{b}$ onto the column space of $A$. The residual $\mathbf{b} - A\hat{\mathbf{x}}$ is orthogonal to every column of $A$ - this is precisely the normal equations.

The inner product formalizes what “projection” means: projecting $\mathbf{b}$ onto a vector $\mathbf{a}$ gives $\frac{\langle\mathbf{a},\mathbf{b}\rangle}{\langle\mathbf{a},\mathbf{a}\rangle}\mathbf{a}$. The numerator is the dot product; the denominator is the squared norm. Everything is inner products.
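The projection picture can be checked on a tiny line-fitting problem. The sketch below solves the $2 \times 2$ normal equations by Cramer's rule in plain Python; the data values are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Fit b ~ x0 * a1 + x1 * a2, where a1 and a2 are the columns of A.
a1 = [1.0, 1.0, 1.0]          # intercept column
a2 = [0.0, 1.0, 2.0]          # slope column
b = [1.0, 2.0, 2.0]

# Normal equations G x = A^T b, with G the 2x2 Gram matrix of the columns.
g11, g12, g22 = dot(a1, a1), dot(a1, a2), dot(a2, a2)
r1, r2 = dot(a1, b), dot(a2, b)

# Solve the 2x2 system by Cramer's rule.
det = g11 * g22 - g12 * g12
x0 = (r1 * g22 - g12 * r2) / det
x1 = (g11 * r2 - g12 * r1) / det

projection = [x0 * p + x1 * q for p, q in zip(a1, a2)]
residual = [bi - pi for bi, pi in zip(b, projection)]

print(x0, x1)
print(dot(residual, a1), dot(residual, a2))   # both approximately 0
```

Cramer's rule is used only because the system is $2 \times 2$; for larger problems a QR- or SVD-based solver is the usual choice, since forming $A^T A$ squares the condition number.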

Principal Component Analysis

The covariance matrix $\Sigma$ of a dataset is a Gram matrix (after centering). Its eigenvectors are the principal components - the directions in which the data varies the most. The eigenvalue $\lambda_i$ is the variance along eigenvector $\mathbf{e_i}$.

The inner product structure is essential: we want directions that are orthogonal (uncorrelated), and “variance along direction $\mathbf{u}$” is $\mathbf{u}^T \Sigma \mathbf{u}$, a weighted inner product.

Signal Processing: Correlation of Signals

Two signals $f, g : [0, T] \to \mathbb{R}$ are strongly correlated when their inner product $\langle f, g\rangle = \int_0^T f(t)g(t)dt$ is large relative to $\lVert f\lVert\lVert g\lVert$. The cross-correlation $R_{fg}(\tau) = \int_{-\infty}^\infty f(t)g(t+\tau)dt$ (with the signals extended by zero outside $[0, T]$) slides one signal past the other and measures the inner product at each offset - used to detect signals in noise.

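In Fourier analysis, the coefficients $\hat{f}(n) = \int_0^{2\pi} f(x)e^{-inx}dx$ (up to a normalization factor such as $\frac{1}{2\pi}$, depending on convention) are inner products of $f$ with the complex exponentials $e^{inx}$, using the complex inner product that conjugates one argument. Fourier analysis is orthogonal decomposition in an infinite-dimensional inner product space.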

Kernel Methods: Inner Products in Feature Space

In support vector machines, the key observation is: the decision boundary depends on the data only through inner products $\langle\mathbf{x_i},\mathbf{x_j}\rangle$. If we map the data to a high-dimensional feature space $\phi : \mathbb{R}^d \to \mathbb{R}^D$ (possibly $D = \infty$), we can compute inner products in feature space via a kernel function:

$$K(\mathbf{x}, \mathbf{y}) = \langle\phi(\mathbf{x}), \phi(\mathbf{y})\rangle$$

The kernel function computes the inner product in feature space without ever explicitly computing the feature map $\phi$. The RBF kernel $K(\mathbf{x},\mathbf{y}) = \exp\left(-\frac{\lVert\mathbf{x}-\mathbf{y}\lVert^2}{2\sigma^2}\right)$ corresponds to an infinite-dimensional feature space. Kernel methods are inner products in disguise.
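The identity $K(\mathbf{x}, \mathbf{y}) = \langle\phi(\mathbf{x}), \phi(\mathbf{y})\rangle$ can be verified by hand for a kernel whose feature map is finite-dimensional. The sketch below swaps in the homogeneous degree-2 polynomial kernel on $\mathbb{R}^2$ instead of the RBF kernel, precisely because its feature map can be written down explicitly; the names `poly_kernel` and `phi` are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel on R^2: K(x, y) = (x . y)^2."""
    return dot(x, y) ** 2

def phi(x):
    """Explicit feature map R^2 -> R^3 whose dot product reproduces poly_kernel."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

x = [1.0, 2.0]
y = [3.0, -1.0]

# The two numbers agree up to floating-point rounding.
print(poly_kernel(x, y), dot(phi(x), phi(y)))
```

The kernel side costs one dot product in $\mathbb{R}^2$, while the feature-map side works in $\mathbb{R}^3$. For higher degrees and dimensions that gap grows quickly, which is the practical point of the kernel trick.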


Section 8: The Rigorous Underpinning

Not Every Norm Comes from an Inner Product

Every inner product gives a norm via $\lVert\mathbf{v}\lVert = \sqrt{\langle\mathbf{v},\mathbf{v}\rangle}$. But the converse is false: many norms cannot be “de-polarized” back into an inner product.

Jordan-von Neumann theorem. A norm $\lVert\cdot\lVert$ on a vector space comes from an inner product if and only if it satisfies the parallelogram law:

$$\lVert\mathbf{u} + \mathbf{v}\lVert^2 + \lVert\mathbf{u} - \mathbf{v}\lVert^2 = 2\lVert\mathbf{u}\lVert^2 + 2\lVert\mathbf{v}\lVert^2 \quad \text{for all } \mathbf{u}, \mathbf{v}$$

Geometrically: the sum of the squares of the diagonals of a parallelogram equals twice the sum of the squares of its sides.

For the $L^p$ norms on $\mathbb{R}^n$ with $n \geq 2$, only $p = 2$ satisfies the parallelogram law. Let us verify the $L^1$ norm fails in $\mathbb{R}^2$ with $\mathbf{u} = (1,0)$, $\mathbf{v} = (0,1)$:

$$\lVert\mathbf{u} + \mathbf{v}\lVert_1^2 + \lVert\mathbf{u} - \mathbf{v}\lVert_1^2 = \lVert(1,1)\lVert_1^2 + \lVert(1,-1)\lVert_1^2 = 4 + 4 = 8$$
$$2\lVert\mathbf{u}\lVert_1^2 + 2\lVert\mathbf{v}\lVert_1^2 = 2(1)^2 + 2(1)^2 = 4$$

Since $8 \neq 4$, the $L^1$ norm does not come from any inner product. This is why LASSO behaves differently from ridge regression at a fundamental structural level - it is working in a space without a natural inner product.

If the parallelogram law holds, the inner product can be recovered from the norm via the polarization identity:

$$\langle\mathbf{u},\mathbf{v}\rangle = \frac{1}{4}\left(\lVert\mathbf{u}+\mathbf{v}\lVert^2 - \lVert\mathbf{u}-\mathbf{v}\lVert^2\right)$$
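Both the parallelogram test and the polarization identity are easy to check numerically. This is a sketch in plain Python; the helper names are illustrative, and the $L^\infty$ case is included as an extra data point beyond the $L^1$ computation in the text.

```python
import math

def norm(v, p):
    if p == math.inf:
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def parallelogram_gap(u, v, p):
    """Left side minus right side of the parallelogram law for the L^p norm."""
    plus = [a + b for a, b in zip(u, v)]
    minus = [a - b for a, b in zip(u, v)]
    lhs = norm(plus, p) ** 2 + norm(minus, p) ** 2
    rhs = 2 * norm(u, p) ** 2 + 2 * norm(v, p) ** 2
    return lhs - rhs

def polarization(u, v):
    """Recover <u, v> from the L^2 norm alone."""
    plus = [a + b for a, b in zip(u, v)]
    minus = [a - b for a, b in zip(u, v)]
    return 0.25 * (norm(plus, 2) ** 2 - norm(minus, 2) ** 2)

u, v = [1.0, 0.0], [0.0, 1.0]
print(parallelogram_gap(u, v, 1))          # 4.0, matching 8 - 4 above
print(parallelogram_gap(u, v, 2))          # approximately 0
print(parallelogram_gap(u, v, math.inf))   # -2.0

a, b = [1.0, 2.0, 3.0], [4.0, -1.0, 0.0]
print(polarization(a, b))                  # approximately 2, the dot product of a and b
```

A zero gap on one pair of vectors does not prove the law, but a single nonzero gap is enough to rule out an inner product.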

Hilbert Spaces

An inner product space is a vector space with an inner product. A Hilbert space is an inner product space that is also complete: every Cauchy sequence converges to a point in the space.

Completeness is the difference between $\mathbb{Q}$ and $\mathbb{R}$: in $\mathbb{Q}$, the sequence $3, 3.1, 3.14, 3.141, \ldots$ is Cauchy but does not converge in $\mathbb{Q}$. In $\mathbb{R}$, it converges to $\pi$.

For functional analysis and quantum mechanics, the natural Hilbert space is $L^2([a,b])$ - the space of square-integrable functions with inner product $\langle f,g\rangle = \int_a^b f(x)g(x)dx$. (Technically, we identify functions that agree almost everywhere; this is why we need Lebesgue integration.) This is an infinite-dimensional Hilbert space, and it is complete.

All finite-dimensional inner product spaces are automatically Hilbert spaces (finite-dimensional normed spaces are always complete).

Orthogonal Complement and Direct Sum

For any subspace $U$ of an inner product space $V$, define the orthogonal complement:

$$U^\perp = \{\mathbf{v} \in V : \langle\mathbf{u},\mathbf{v}\rangle = 0 \text{ for all } \mathbf{u} \in U\}$$

$U^\perp$ is always a subspace. When $U$ is a closed subspace of a Hilbert space (in finite dimensions every subspace is closed), we have the orthogonal decomposition:

$$V = U \oplus U^\perp$$

Every vector $\mathbf{v} \in V$ decomposes uniquely as $\mathbf{v} = \mathbf{u} + \mathbf{w}$ with $\mathbf{u} \in U$ and $\mathbf{w} \in U^\perp$. The vector $\mathbf{u}$ is the orthogonal projection of $\mathbf{v}$ onto $U$. This is the fundamental theorem underlying least squares, Fourier series, and every projection-based algorithm in applied mathematics.

The Riesz Representation Theorem

In a Hilbert space $\mathcal{H}$, every continuous linear functional $\phi : \mathcal{H} \to \mathbb{R}$ has the form:

$$\phi(\mathbf{v}) = \langle\mathbf{u},\mathbf{v}\rangle$$

for a unique vector $\mathbf{u} \in \mathcal{H}$.

This theorem is remarkable: it says that inner products are the “canonical” continuous linear functionals. It is the mathematical foundation of the bra-ket formalism in quantum mechanics: the bra $\langle\phi|$ is the linear functional that maps a state $|\psi\rangle$ to $\langle\phi|\psi\rangle$, and the Riesz theorem guarantees that every continuous linear functional on the state space is a bra of this kind. (Observables themselves are self-adjoint operators, not functionals.)

In finite dimensions, the theorem is elementary (just linear algebra). In infinite dimensions, it requires the completeness of the Hilbert space - the “Cauchy sequences converge” property - and its proof uses the orthogonal decomposition theorem above.

Discomfort check. Why is completeness necessary in infinite dimensions? Because the projection $\mathbf{u}$ (the element of $\mathcal{H}$ representing the functional) is constructed as a limit of a Cauchy sequence. If the space were not complete, that limit might not exist in $\mathcal{H}$. The theorem fails for incomplete inner product spaces. This is the technical reason we work with Hilbert spaces (complete) rather than just pre-Hilbert spaces (inner product spaces that might not be complete).

Norm Equivalence in Finite Dimensions

In a finite-dimensional space, all norms are equivalent: for any two norms $\lVert\cdot\lVert_\alpha$ and $\lVert\cdot\lVert_\beta$, there exist constants $0 < c \leq C < \infty$ such that:

$$c\lVert\mathbf{v}\lVert_\alpha \leq \lVert\mathbf{v}\lVert_\beta \leq C\lVert\mathbf{v}\lVert_\alpha \quad \text{for all } \mathbf{v}$$

This means all norms on $\mathbb{R}^n$ induce the same topology - the same notion of convergence. A sequence converges in one norm if and only if it converges in all norms.

Proof sketch. The unit sphere $S = \{\mathbf{v} : \lVert\mathbf{v}\lVert_\alpha = 1\}$ is compact (closed and bounded in $\mathbb{R}^n$). The function $\mathbf{v} \mapsto \lVert\mathbf{v}\lVert_\beta$ is continuous on $S$. A continuous function on a compact set attains its minimum and maximum, say $c$ and $C$. Positive definiteness forces $c > 0$. By homogeneity, the inequality extends to all $\mathbf{v}$. $\square$

Explicit equivalence constants between $L^p$ norms on $\mathbb{R}^n$:

$$\lVert\mathbf{v}\lVert_2 \leq \lVert\mathbf{v}\lVert_1 \leq \sqrt{n}\lVert\mathbf{v}\lVert_2$$
$$\lVert\mathbf{v}\lVert_\infty \leq \lVert\mathbf{v}\lVert_2 \leq \sqrt{n}\lVert\mathbf{v}\lVert_\infty$$
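These explicit constants can be sanity-checked on random vectors. The sketch below uses plain Python; the dimension, sample count, seed, and rounding slack `eps` are arbitrary choices for the example, and a passing check is evidence, not a proof.

```python
import math
import random

def norm(v, p):
    if p == math.inf:
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

random.seed(0)   # fixed seed so the check is reproducible
n = 50
eps = 1e-9       # slack for floating-point rounding

ok = True
for _ in range(1000):
    v = [random.uniform(-1, 1) for _ in range(n)]
    n1, n2, ninf = norm(v, 1), norm(v, 2), norm(v, math.inf)
    ok = ok and (n2 <= n1 + eps) and (n1 <= math.sqrt(n) * n2 + eps)
    ok = ok and (ninf <= n2 + eps) and (n2 <= math.sqrt(n) * ninf + eps)

print(ok)
```

Note that the upper constants grow like $\sqrt{n}$, so they are attained only by special vectors (for example, all components equal) and degrade as the dimension increases.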

In infinite dimensions, this theorem fails entirely: different norms can give genuinely different topologies, and the correct choice of norm is a substantive analytic question.


Summary

| Concept | Definition | Key property |
| --- | --- | --- |
| Dot product on $\mathbb{R}^n$ | $\mathbf{u} \cdot \mathbf{v} = \sum u_i v_i$ | $\mathbf{u} \cdot \mathbf{v} = \lVert\mathbf{u}\lVert\lVert\mathbf{v}\lVert\cos\theta$ |
| Euclidean norm | $\lVert\mathbf{v}\lVert_2 = \sqrt{\mathbf{v}\cdot\mathbf{v}}$ | Triangle inequality holds |
| $L^p$ norms | $\lVert\mathbf{v}\lVert_p = (\sum \lvert v_i\rvert^p)^{1/p}$ | $L^1$: sparsity; $L^2$: smoothness; $L^\infty$: worst-case |
| Orthogonality | $\langle\mathbf{u},\mathbf{v}\rangle = 0$ | Pythagorean theorem: $\lVert u+v\lVert^2 = \lVert u\lVert^2 + \lVert v\lVert^2$ |
| Cauchy-Schwarz | $\lvert\langle u,v\rangle\rvert \leq \lVert u\lVert\lVert v\lVert$ | Equality iff $u \parallel v$; implies triangle inequality |
| Inner product space | $(V, \langle\cdot,\cdot\rangle)$: bilinear, symmetric, positive definite | Geometry (lengths, angles) in any vector space |
| Gram matrix | $G_{ij} = \langle\mathbf{v_i},\mathbf{v_j}\rangle$; $G = V^T V$ | Positive semidefinite; positive definite iff vectors are independent |
| Hilbert space | Complete inner product space | Riesz representation; $V = U \oplus U^\perp$ |
| Parallelogram law | $\lVert u+v\lVert^2 + \lVert u-v\lVert^2 = 2\lVert u\lVert^2 + 2\lVert v\lVert^2$ | Characterizes norms from inner products |
