Inner Products & Norms
Prerequisite:
Inner Products
An inner product on a vector space $V$ over a field $\mathbb{F}$ (either $\mathbb{R}$ or $\mathbb{C}$) is a function $\langle \cdot, \cdot \rangle : V \times V \to \mathbb{F}$ satisfying three axioms for all $u, v, w \in V$ and $\alpha \in \mathbb{F}$:
- Conjugate symmetry: $\langle u, v \rangle = \overline{\langle v, u \rangle}$
- Linearity in the first argument: $\langle \alpha u + w, v \rangle = \alpha \langle u, v \rangle + \langle w, v \rangle$
- Positive definiteness: $\langle v, v \rangle \geq 0$, with $\langle v, v \rangle = 0 \iff v = 0$
Note that axiom 1 forces $\langle v, v \rangle \in \mathbb{R}$ for all $v$, so axiom 3 is well-posed. A vector space equipped with an inner product is called an inner product space.
Real vs Complex
Over $\mathbb{R}$, conjugate symmetry reduces to ordinary symmetry: $\langle u, v \rangle = \langle v, u \rangle$. Linearity in the first argument then propagates to bilinearity - linearity in both arguments simultaneously.
Over $\mathbb{C}$, conjugate symmetry plus linearity in the first argument forces conjugate linearity (antilinearity) in the second argument: $$\langle u, \alpha v + w \rangle = \bar{\alpha} \langle u, v \rangle + \langle u, w \rangle$$
This is a critical asymmetry. Axler (LADR) adopts the convention of linearity in the first slot; physics literature often reverses this (Dirac notation $\langle \phi | \psi \rangle$ is linear in $|\psi\rangle$). Check the convention before reading any formula.
Examples
Standard dot product on $\mathbb{R}^n$: $$\langle x, y \rangle = x^T y = \sum_{i=1}^n x_i y_i$$
Standard Hermitian inner product on $\mathbb{C}^n$ (written to be linear in the first slot, matching the axioms above): $$\langle x, y \rangle = y^\ast x = \sum_{i=1}^n x_i \bar{y}_i$$ where $y^\ast = \bar{y}^T$ denotes the conjugate transpose. (The physics convention instead uses $x^\ast y = \sum_i \bar{x}_i y_i$, which is linear in the second slot.)
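A quick numerical sanity check of this convention, as a minimal numpy sketch (the function name `inner` is my own): with linearity in the first slot, $\langle x, y \rangle = \sum_i x_i \bar{y}_i$, which is `np.vdot(y, x)` since numpy's `vdot` conjugates its first argument.

```python
import numpy as np

def inner(x, y):
    # Hermitian inner product on C^n, linear in the FIRST slot:
    # <x, y> = sum_i x_i * conj(y_i).  np.vdot conjugates its first
    # argument, hence the swapped order.
    return np.vdot(y, x)

rng = np.random.default_rng(0)
u = rng.normal(size=3) + 1j * rng.normal(size=3)
v = rng.normal(size=3) + 1j * rng.normal(size=3)
w = rng.normal(size=3) + 1j * rng.normal(size=3)
a = 2.0 - 1.5j

# Conjugate symmetry: <u, v> = conj(<v, u>)
assert np.isclose(inner(u, v), np.conj(inner(v, u)))
# Linearity in the first argument
assert np.isclose(inner(a * u + w, v), a * inner(u, v) + inner(w, v))
# Conjugate linearity in the second argument
assert np.isclose(inner(u, a * v + w), np.conj(a) * inner(u, v) + inner(u, w))
# Positive definiteness: <v, v> is real and positive for v != 0
assert np.isclose(inner(v, v).imag, 0) and inner(v, v).real > 0
```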
Function space $L^2([a,b])$: For continuous (or square-integrable) functions $f, g : [a,b] \to \mathbb{R}$, $$\langle f, g \rangle = \int_a^b f(x)\, g(x)\, dx$$ Positive definiteness holds because $\int_a^b f(x)^2\, dx = 0$ implies $f = 0$ (almost everywhere). This is the inner product used in Fourier analysis - the orthogonality of $\sin(mx)$ and $\cos(nx)$ on $[-\pi, \pi]$ is exactly $\langle \sin(mx), \cos(nx) \rangle = 0$.
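This orthogonality is easy to check numerically. A sketch, assuming the interval $[-\pi, \pi]$ and approximating the integral with a Riemann sum:

```python
import numpy as np

# Riemann-sum approximation of <f, g> = integral of f(x) g(x) over [-pi, pi]
x = np.linspace(-np.pi, np.pi, 200_001)
dx = x[1] - x[0]

def l2_inner(f, g):
    return np.sum(f(x) * g(x)) * dx

m, n = 3, 5
print(l2_inner(lambda t: np.sin(m * t), lambda t: np.cos(n * t)))  # ~0 (orthogonal)
print(l2_inner(lambda t: np.sin(m * t), lambda t: np.sin(m * t)))  # ~pi
```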
Weighted inner product: Fix a positive-definite matrix $W$. Then $\langle x, y \rangle_W = x^T W y$ is an inner product on $\mathbb{R}^n$. The Mahalanobis distance is induced by $W = \Sigma^{-1}$ where $\Sigma$ is a covariance matrix.
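A small numpy check (the construction of $W$ here is illustrative): any matrix of the form $A^T A + \epsilon I$ is symmetric positive definite, and the weighted form then satisfies the inner product axioms.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
W = A.T @ A + 1e-3 * np.eye(4)       # symmetric positive definite by construction

def w_inner(x, y):
    return x @ W @ y                 # <x, y>_W = x^T W y

x, y = rng.normal(size=4), rng.normal(size=4)
assert np.isclose(w_inner(x, y), w_inner(y, x))   # symmetry, since W = W^T
assert w_inner(x, x) > 0                          # positive definiteness for x != 0
```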
Norms Induced by Inner Products
Definition. Given an inner product space $(V, \langle \cdot, \cdot \rangle)$, the induced norm is $$|v| = \sqrt{\langle v, v \rangle}$$
This is well-defined and non-negative by positive definiteness. Unpacking the norm axioms from the inner product axioms is an exercise; the non-obvious one is the triangle inequality, which requires Cauchy-Schwarz.
The Cauchy-Schwarz Inequality
Theorem (Cauchy-Schwarz). For any $u, v$ in an inner product space, $$|\langle u, v \rangle| \leq |u| \cdot |v|$$ with equality if and only if $u$ and $v$ are linearly dependent.
Proof. If $v = 0$ both sides are zero; equality holds. Assume $v \neq 0$. For any scalar $t \in \mathbb{F}$, positive definiteness gives $|u - tv|^2 \geq 0$. Expand: $$\langle u - tv, u - tv \rangle = |u|^2 - t\langle v, u \rangle - \bar{t}\langle u, v \rangle + |t|^2 |v|^2 \geq 0$$
Choose $t = \langle u, v \rangle / |v|^2$ (this is the minimiser over $t$). Substituting: $$|u|^2 - \frac{|\langle u, v \rangle|^2}{|v|^2} \geq 0$$
Rearranging gives $|\langle u, v \rangle|^2 \leq |u|^2 |v|^2$, and taking square roots completes the proof. Equality holds iff $|u - tv|^2 = 0$, i.e., $u = tv$. $\square$
Geometric meaning. For $\mathbb{R}^n$ with the dot product, Cauchy-Schwarz says $|x \cdot y| \leq |x||y|$, which is exactly the statement that $\cos\theta = \frac{x \cdot y}{|x||y|}$ lies in $[-1, 1]$. This defines the angle between vectors. In abstract inner product spaces, Cauchy-Schwarz defines what we mean by angle.
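A minimal numpy illustration (the vectors are arbitrary): Cauchy-Schwarz guarantees the cosine ratio lies in $[-1, 1]$, so the angle is always well-defined.

```python
import numpy as np

rng = np.random.default_rng(2)
x, y = rng.normal(size=5), rng.normal(size=5)

lhs = abs(x @ y)                                 # |<x, y>|
rhs = np.linalg.norm(x) * np.linalg.norm(y)      # |x| |y|
assert lhs <= rhs + 1e-12                        # Cauchy-Schwarz

cos_theta = (x @ y) / rhs
theta = np.arccos(np.clip(cos_theta, -1.0, 1.0)) # angle between x and y, in radians
print(np.degrees(theta))
```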
Polarization identity. The inner product is recoverable from the norm. Over $\mathbb{R}$: $$\langle u, v \rangle = \frac{1}{4}\left(|u+v|^2 - |u-v|^2\right)$$
Over $\mathbb{C}$: $$\langle u, v \rangle = \frac{1}{4}\left(|u+v|^2 - |u-v|^2 + i|u+iv|^2 - i|u-iv|^2\right)$$
This shows that an inner product space structure is completely determined by its norm - not every norm comes from an inner product, but those that do are characterised by the parallelogram law.
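A numeric check of the complex polarization identity under the linear-in-the-first-slot convention (a sketch; `inner` is the $\mathbb{C}^n$ product $\sum_i x_i \bar{y}_i$):

```python
import numpy as np

inner = lambda x, y: np.vdot(y, x)      # <x, y> = sum_i x_i * conj(y_i)
norm_sq = lambda x: np.vdot(x, x).real  # |x|^2 = <x, x>

rng = np.random.default_rng(3)
u = rng.normal(size=4) + 1j * rng.normal(size=4)
v = rng.normal(size=4) + 1j * rng.normal(size=4)

# Recover <u, v> from norms alone via polarization
recovered = 0.25 * (norm_sq(u + v) - norm_sq(u - v)
                    + 1j * norm_sq(u + 1j * v) - 1j * norm_sq(u - 1j * v))
assert np.isclose(recovered, inner(u, v))
```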
Triangle Inequality and Parallelogram Law
Theorem (Triangle Inequality). $|u + v| \leq |u| + |v|$.
Proof. Compute $|u+v|^2 = \langle u+v, u+v \rangle = |u|^2 + 2\,\text{Re}\langle u, v \rangle + |v|^2 \leq |u|^2 + 2|\langle u,v \rangle| + |v|^2 \leq |u|^2 + 2|u||v| + |v|^2 = (|u| + |v|)^2$, where the second inequality is Cauchy-Schwarz. Take square roots. $\square$
Theorem (Parallelogram Law). For any $u, v$ in an inner product space: $$|u + v|^2 + |u - v|^2 = 2(|u|^2 + |v|^2)$$
Proof. Expand both squared norms using $|w|^2 = \langle w, w \rangle$ and add; the cross terms cancel. $\square$
The parallelogram law is both a theorem (for inner product spaces) and a characterisation: a normed space satisfying the parallelogram law for all $u, v$ admits a unique inner product inducing the given norm (Jordan-von Neumann theorem). This is why the $L^p$ spaces for $p \neq 2$ are not inner product spaces.
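A quick numeric illustration (the vectors are arbitrary): the 2-norm satisfies the parallelogram law while the 1-norm generally does not, consistent with the Jordan-von Neumann characterisation.

```python
import numpy as np

def parallelogram_gap(u, v, p):
    # |u+v|^2 + |u-v|^2 - 2(|u|^2 + |v|^2) in the p-norm; zero iff the law holds
    n = lambda w: np.linalg.norm(w, p)
    return n(u + v)**2 + n(u - v)**2 - 2 * (n(u)**2 + n(v)**2)

u, v = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(parallelogram_gap(u, v, 2))   # ~0  (law holds: norm comes from an inner product)
print(parallelogram_gap(u, v, 1))   # 4.0 (law fails: no inner product induces |.|_1)
```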
$p$-Norms on $\mathbb{R}^n$
For $x \in \mathbb{R}^n$ and $1 \leq p < \infty$: $$|x|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}$$
The limiting case $p \to \infty$: $$|x|_\infty = \max_{1 \leq i \leq n} |x_i|$$
Special cases: $|x|_1$ is the Manhattan/taxicab norm, $|x|_2$ is the Euclidean norm (the unique $p$-norm induced by an inner product), $|x|_\infty$ is the Chebyshev norm. Verifying the triangle inequality for general $p$ requires Minkowski’s inequality (the $p$-norm analogue of Cauchy-Schwarz is Hölder’s inequality: $\sum |x_i y_i| \leq |x|_p |y|_q$ for $1/p + 1/q = 1$).
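A short numpy check (the example vectors are arbitrary): `np.linalg.norm` computes $|x|_p$ for any $p \geq 1$, and Hölder’s inequality can be verified directly for a conjugate pair such as $p = 3$, $q = 3/2$.

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])
y = np.array([0.5, 4.0, -1.0])

print(np.linalg.norm(x, 1), np.linalg.norm(x, 2), np.linalg.norm(x, np.inf))

# Hoelder: sum |x_i y_i| <= |x|_p * |y|_q  with 1/p + 1/q = 1
p, q = 3.0, 1.5
assert np.sum(np.abs(x * y)) <= np.linalg.norm(x, p) * np.linalg.norm(y, q)
```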
Unit balls in $\mathbb{R}^2$: the $|x|_1$ ball is a diamond (vertices on the coordinate axes), the $|x|_2$ ball is a circle, and the $|x|_\infty$ ball is a square.
Equivalence of Norms in Finite Dimensions
Theorem. On a finite-dimensional vector space, all norms are equivalent: for any two norms $|\cdot|_\alpha$ and $|\cdot|_\beta$, there exist constants $c, C > 0$ such that $$c|v|_\alpha \leq |v|_\beta \leq C|v|_\alpha \quad \text{for all } v$$
Proof sketch. The unit sphere under $|\cdot|_\alpha$ is compact (since $\mathbb{R}^n$ is finite-dimensional), and $|\cdot|_\beta$ is continuous with respect to $|\cdot|_\alpha$. A continuous function on a compact set attains its max and min; positive definiteness forces these to be positive. $\square$
Equivalence means all norms induce the same topology and the same notion of convergence in finite dimensions. In infinite dimensions, this fails dramatically - different norms can produce genuinely different topologies.
For $\mathbb{R}^n$, the explicit equivalence constants between $|\cdot|_p$ norms are: $$|x|_2 \leq |x|_1 \leq \sqrt{n}\,|x|_2, \quad |x|_\infty \leq |x|_2 \leq \sqrt{n}\,|x|_\infty$$
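These bounds are easy to spot-check numerically (a sketch with random vectors):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10
for _ in range(1000):
    x = rng.normal(size=n)
    l1, l2, linf = (np.linalg.norm(x, p) for p in (1, 2, np.inf))
    assert l2 <= l1 <= np.sqrt(n) * l2 + 1e-12
    assert linf <= l2 <= np.sqrt(n) * linf + 1e-12
```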
Examples
Cosine similarity. The angle formula from Cauchy-Schwarz gives: $$\text{cosine\_sim}(x, y) = \frac{\langle x, y \rangle}{|x|_2 |y|_2} \in [-1, 1]$$ This measures directional agreement, ignoring magnitude. In NLP, word embeddings are compared by cosine similarity rather than Euclidean distance because the direction of the embedding captures semantic meaning while the magnitude is an artifact of word frequency.
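A minimal implementation (function and variable names are my own; in practice library routines are used):

```python
import numpy as np

def cosine_sim(x, y, eps=1e-12):
    # <x, y> / (|x|_2 |y|_2), always in [-1, 1] by Cauchy-Schwarz
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y) + eps)

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, different magnitude
c = np.array([-2.0, 1.0, 0.0])  # orthogonal to a
print(cosine_sim(a, b))  # ~1.0
print(cosine_sim(a, c))  # 0.0
```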
L2 regularisation (Ridge). Penalises $|\theta|_2^2$ in the loss function. The level sets of $|\theta|_2^2$ are spheres - the regulariser pushes weights toward zero smoothly from all directions. The solution is $\hat{\theta} = (X^T X + \lambda I)^{-1} X^T y$.
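A direct implementation of this closed form (a sketch; in practice one solves the linear system rather than forming the inverse, and the intercept is usually left unpenalised):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge solution (X^T X + lam I)^{-1} X^T y,
    # computed by solving the linear system instead of inverting
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)
print(ridge_fit(X, y, lam=1.0))
```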
L1 regularisation (Lasso). Penalises $|\theta|_1$. The level sets of $|\theta|_1$ are diamonds (in 2D), with corners on the coordinate axes. When the contours of the loss function meet a corner, the optimal $\theta$ has exactly zero components. This is why L1 promotes sparsity while L2 does not: the geometry of the L1 ball has corners where coordinate axes sit, but the L2 ball is smooth everywhere with no preferred directions.
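The sparsity mechanism is visible in the proximal operator of the L1 penalty, the soft-thresholding map. A sketch (illustrative only, not a full Lasso solver):

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * |.|_1: shrinks each component toward 0 and
    # sets it exactly to 0 when |z_i| <= t -- the source of Lasso's sparsity
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

z = np.array([3.0, -0.4, 0.05, -2.0, 0.3])
print(soft_threshold(z, t=0.5))   # [ 2.5 -0.   0.  -1.5  0. ]
```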
Distances in metric learning. The Mahalanobis distance $d_\Sigma(x,y) = \sqrt{(x-y)^T \Sigma^{-1} (x-y)}$ is the norm induced by the inner product $\langle u, v \rangle = u^T \Sigma^{-1} v$. It accounts for correlations between features and is scale-invariant - it measures distance in standard deviation units, which is why it underlies Gaussian likelihoods and the Mahalanobis anomaly score.
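A minimal sketch (the sample data and names are illustrative): the distance is computed by solving with $\Sigma$ rather than inverting it explicitly.

```python
import numpy as np

def mahalanobis(x, y, cov):
    # sqrt((x - y)^T Sigma^{-1} (x - y)), solving with Sigma instead of inverting it
    d = x - y
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))

rng = np.random.default_rng(6)
data = rng.multivariate_normal([0.0, 0.0], [[2.0, 1.2], [1.2, 1.0]], size=500)
Sigma = np.cov(data, rowvar=False)

x, mu = np.array([2.0, 1.0]), data.mean(axis=0)
print(mahalanobis(x, mu, Sigma))   # distance in "standard deviation" units
print(np.linalg.norm(x - mu))      # plain Euclidean distance, for contrast
```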
Read Next: