Convex Geometry - Shapes Where the Straight Path Is Always the Shortest // Megha Bose

Helpful context:

Vectors, Matrices & Linear Systems - The Language of Simultaneous Equations

A convex set is one where local structure reflects global structure. In optimization, this translates into the single most important fact about convex sets: local optima are global optima. Every other theorem in convex geometry - separation, support, duality - is ultimately a consequence of or a tool for exploiting this property.

To see why this matters, contrast with the non-convex case. A non-convex problem can have exponentially many local optima, none of which need be globally optimal. Gradient descent gets trapped in whichever basin it starts in. There is no certificate that a found solution is globally best. Convex problems have exactly one basin. Gradient descent converges to the global minimum regardless of initialization. The geometry enforces it.

The concrete picture: minimize $f$ over a convex set $C$. Any point where you cannot improve by moving in any feasible direction is the global minimum - because convexity rules out the possibility of hidden better points elsewhere. On a non-convex set, you might be trapped in a valley with a lower valley across an impassable ridge.

This post builds the geometry that makes this precise. The theorems here - separating hyperplane, supporting hyperplane, Farkas' lemma - are the tools that make optimization over convex sets tractable. They give certificates of optimality, characterize feasibility, and underpin LP duality. Understanding them is understanding why convex optimization is a fundamentally different (and tractable) class of problems.

Affine Sets

A set $C \subseteq \mathbb{R}^n$ is affine if for any $x, y \in C$ and any $\theta \in \mathbb{R}$, the point $\theta x + (1 - \theta)y \in C$.

The key difference from convexity (defined below) is that $\theta$ ranges over all of $\mathbb{R}$, not just $[0,1]$. An affine set contains not just the segment between two points but the entire line through them.

Examples. The empty set $\emptyset$ is affine vacuously. Any single point $\{x_0\}$ is affine. Any line through the origin (or not) is affine. Any hyperplane $\{x \mid a^\top x = b\}$ is affine: if $a^\top x = b$ and $a^\top y = b$, then $a^\top(\theta x + (1-\theta)y) = \theta b + (1-\theta)b = b$. The full space $\mathbb{R}^n$ is affine. In general, every affine subspace (translated subspace) is affine, and nothing else is.

Structure theorem. Every nonempty affine set $C$ can be written as $C = x_0 + V$ where $x_0 \in C$ and $V$ is a subspace. Specifically, $V = C - x_0 = \{x - x_0 \mid x \in C\}$.

Proof. Fix any $x_0 \in C$. Let $V = C - x_0$. We verify $V$ is a subspace. If $v, w \in V$ then $x_0 + v, x_0 + w \in C$. Since $C$ is affine, for any $\alpha \in \mathbb{R}$:

$$\alpha(x_0 + v) + (1 - \alpha)(x_0 + w) = x_0 + \alpha v + (1-\alpha)w \in C$$

so $\alpha v + (1-\alpha)w \in V$. Taking $\alpha = 1$ and $\alpha = 0$ gives $v \in V$ and $w \in V$; taking general $\alpha$ shows $V$ is closed under affine combinations. To verify $V$ is a subspace: $0 \in V$ (since $x_0 \in C$); if $v \in V$ then $2 \cdot 0 + (-1) \cdot v = -v \in V$ (affine with $\alpha = -1$); and scaling and addition follow from the affine combination closure. $\square$

The dimension of $C$ is defined as $\dim(V)$.

Affine combinations and affine hull. A point $\sum_{i=1}^k \theta_i x_i$ with $\sum_{i=1}^k \theta_i = 1$ is an affine combination of $x_1, \ldots, x_k$. An affine set contains all affine combinations of its points (and this is an equivalent characterization). The affine hull of a set $S$ is the smallest affine set containing $S$:

$$\text{aff}(S) = \left\{ \sum_{i=1}^k \theta_i x_i ;\Big|; k \geq 1,; x_i \in S,; \sum_{i=1}^k \theta_i = 1 \right\}$$

This is the intersection of all affine sets containing $S$, and equivalently the set of all affine combinations of points in $S$.

Convex Sets

A set $C \subseteq \mathbb{R}^n$ is convex if for any $x, y \in C$ and any $\theta \in [0,1]$, the point $\theta x + (1-\theta)y \in C$.

Geometrically: the line segment between any two points in $C$ lies entirely inside $C$. Every affine set is convex (since $[0,1] \subseteq \mathbb{R}$), but convex sets need not be affine.

Convex combinations. A convex combination of points $x_1, \ldots, x_k$ is any point $\sum_{i=1}^k \theta_i x_i$ with $\theta_i \geq 0$ and $\sum_{i=1}^k \theta_i = 1$. A set is convex if and only if it contains all convex combinations of its points (proved by induction: combine two points, then induct to combine $k$ points).

Examples.

Balls. The Euclidean ball $B(x_c, r) = \{x \mid |x - x_c| \leq r\}$ is convex. If $|x - x_c| \leq r$ and $|y - x_c| \leq r$, then by the triangle inequality: $|\theta x + (1-\theta)y - x_c| = |\theta(x - x_c) + (1-\theta)(y - x_c)| \leq \theta r + (1-\theta)r = r$.
Hyperplanes. $\{x \mid a^\top x = b\}$ is affine, hence convex.
Half-spaces. $H = \{x \mid a^\top x \leq b\}$ is convex: if $a^\top x \leq b$ and $a^\top y \leq b$, then $a^\top(\theta x + (1-\theta)y) = \theta a^\top x + (1-\theta)a^\top y \leq \theta b + (1-\theta)b = b$.
Polyhedra. $\{x \mid Ax \leq b\}$ is convex as an intersection of finitely many half-spaces (shown below: intersections of convex sets are convex).
Norm balls. $\{x \mid |x|_p \leq 1\}$ for any norm $|\cdot|_p$ is convex, since norms satisfy the triangle inequality.

Convexity-preserving operations.

Intersection. If $C_1, C_2$ are convex, so is $C_1 \cap C_2$. Proof: if $x, y \in C_1 \cap C_2$ and $\theta \in [0,1]$, then $\theta x + (1-\theta)y \in C_1$ (since $x, y \in C_1$ and $C_1$ is convex) and likewise for $C_2$. By induction, any finite or infinite intersection of convex sets is convex.
Affine image. If $C$ is convex and $f(x) = Ax + b$ is affine, then $f(C) = \{Ax + b \mid x \in C\}$ is convex. Proof: if $y_1 = Ax_1 + b$ and $y_2 = Ax_2 + b$ with $x_1, x_2 \in C$, then $\theta y_1 + (1-\theta)y_2 = A(\theta x_1 + (1-\theta)x_2) + b \in f(C)$ since $\theta x_1 + (1-\theta)x_2 \in C$.
Inverse affine image. If $C$ is convex and $f$ is affine, then $f^{-1}(C) = \{x \mid f(x) \in C\}$ is convex.
Minkowski sum. If $C_1$ and $C_2$ are convex, so is $C_1 + C_2 = \{x + y \mid x \in C_1, y \in C_2\}$.

Convex Hull

The convex hull of a set $S \subseteq \mathbb{R}^n$, written $\text{conv}(S)$, is the smallest convex set containing $S$. Equivalently:

$$\text{conv}(S) = \left\{ \sum_{i=1}^k \theta_i x_i ;\Big|; k \geq 1,; x_i \in S,; \theta_i \geq 0,; \sum_{i=1}^k \theta_i = 1 \right\}$$

This is the set of all convex combinations of points in $S$, and it equals the intersection of all convex sets containing $S$. The convex hull of a finite set of points is a polytope - a bounded polyhedron.

Caratheodory’s theorem. If $S \subseteq \mathbb{R}^n$, then every point $x \in \text{conv}(S)$ can be written as a convex combination of at most $n+1$ points from $S$.

Proof. Suppose $x = \sum_{i=1}^k \theta_i x_i$ with $\theta_i > 0$, $\sum \theta_i = 1$, $x_i \in S$, and $k > n+1$. We show we can reduce to fewer points. Since $k - 1 > n$, the $k-1$ vectors $x_2 - x_1, \ldots, x_k - x_1 \in \mathbb{R}^n$ are linearly dependent: there exist $\lambda_2, \ldots, \lambda_k$, not all zero, with $\sum_{i=2}^k \lambda_i(x_i - x_1) = 0$. Set $\lambda_1 = -\sum_{i=2}^k \lambda_i$, so $\sum_{i=1}^k \lambda_i x_i = 0$ and $\sum_{i=1}^k \lambda_i = 0$.

For any $t \in \mathbb{R}$: $x = \sum_{i=1}^k (\theta_i - t\lambda_i) x_i$, and $\sum_{i=1}^k (\theta_i - t\lambda_i) = 1$. Choose $t^* = \min\{\theta_i / \lambda_i \mid \lambda_i > 0\}$. Then all coefficients $\theta_i - t^*\lambda_i \geq 0$, and at least one equals zero (the minimizer). This expresses $x$ as a convex combination of at most $k-1$ points. Repeating until $k \leq n+1$ gives the result. $\square$

Caratheodory’s theorem has a striking consequence: to understand the convex hull of any set in $\mathbb{R}^n$, you only ever need to consider $(n+1)$-point combinations, regardless of how large $S$ is.

Convex Cones

A set $C \subseteq \mathbb{R}^n$ is a cone if for any $x \in C$ and $\theta \geq 0$, we have $\theta x \in C$. A set is a convex cone if it is both convex and a cone - equivalently, if $\theta_1 x + \theta_2 y \in C$ for all $x, y \in C$ and $\theta_1, \theta_2 \geq 0$.

The conic hull of $S$ is the smallest convex cone containing $S$:

$$\text{cone}(S) = \left\{ \sum_{i=1}^k \theta_i x_i ;\Big|; k \geq 1,; x_i \in S,; \theta_i \geq 0 \right\}$$

Examples.

Nonnegative orthant. $\mathbb{R}^n_+ = \{x \in \mathbb{R}^n \mid x_i \geq 0 \text{ for all } i\}$ is a convex cone.
Second-order cone. $\{(x, t) \in \mathbb{R}^{n+1} \mid |x|_2 \leq t\}$ is a convex cone. Convexity follows from the triangle inequality for norms.
Positive semidefinite cone. $\mathbb{S}^n_+ = \{A \in \mathbb{R}^{n \times n} \mid A = A^\top,; v^\top A v \geq 0 \text{ for all } v\}$ is a convex cone: if $A, B \succeq 0$ and $\theta_1, \theta_2 \geq 0$, then $v^\top(\theta_1 A + \theta_2 B)v = \theta_1 v^\top Av + \theta_2 v^\top Bv \geq 0$.

Dual cone. The dual cone of $C$ is:

$$C^* = \{y \in \mathbb{R}^n \mid y^\top x \geq 0 \text{ for all } x \in C\}$$

The dual cone is always a closed convex cone (as an intersection of closed half-spaces through the origin). Self-dual cones satisfy $C^* = C$; the nonnegative orthant and the PSD cone are self-dual. Dual cones appear naturally in optimality conditions and duality theory.

Polyhedra

A polyhedron is a set of the form:

$$\mathcal{P} = \{x \in \mathbb{R}^n \mid Ax \leq b,; Cx = d\}$$

where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, $C \in \mathbb{R}^{p \times n}$, $d \in \mathbb{R}^p$. It is the intersection of finitely many half-spaces (from $Ax \leq b$) and hyperplanes (from $Cx = d$), hence always convex.

A polytope is a bounded polyhedron - equivalently, the convex hull of finitely many points.

Standard forms. The standard form of a polyhedron is $\{x \mid Ax = b,; x \geq 0\}$. Any polyhedron can be converted to standard form by introducing slack variables: $Ax \leq b$ becomes $Ax + s = b$, $s \geq 0$. The feasible set of a linear program is always a polyhedron, since LP constraints are exactly inequalities $Ax \leq b$ and equalities $Cx = d$.

Every vertex (extreme point) of a polyhedron $\{x \mid Ax \leq b\}$ corresponds to a basic feasible solution: $n$ linearly independent active constraints ($a_i^\top x = b_i$). The simplex method moves between adjacent vertices, and the optimal value of a linear program (when finite) is always attained at a vertex. This is the geometric content behind LP theory.

The Closest Point Theorem

This is the foundational existence-and-uniqueness theorem of convex analysis. Every other theorem in this post descends from it.

Theorem (Closest Point / Projection). Let $C \subseteq \mathbb{R}^n$ be a nonempty, closed, convex set and let $y \notin C$. Then:

There exists a unique point $x^{\ast} \in C$ minimizing $\|x - y\|^2$ over $C$. We write $x^{\ast} = P_C(y)$ and call it the projection of $y$ onto $C$.
$x^{\ast} = P_C(y)$ if and only if $(y - x^{\ast})^\top(x - x^{\ast}) \leq 0$ for all $x \in C$.

Proof of existence. The function $f(x) = \|x - y\|^2$ is continuous and coercive ($f(x) \to \infty$ as $\|x\| \to \infty$). Fix any $z \in C$ and let $d_0 = \|z - y\|$. The minimum over $C$ equals the minimum over the compact set $C \cap \overline{B}(y, d_0)$ (points outside this ball have $f > d_0^2$ and cannot be optimal). A continuous function on a compact set attains its minimum (Weierstrass theorem), so the minimum is attained.

Proof of uniqueness. Suppose $x_1^{\ast}, x_2^{\ast} \in C$ both achieve the minimum value $d^2 = \min_{x \in C}\|x - y\|^2$, so $\|x_1^{\ast} - y\| = \|x_2^{\ast} - y\| = d$. By convexity, the midpoint $m = (x_1^{\ast} + x_2^{\ast})/2 \in C$. Apply the parallelogram identity:

$$\|u + v\|^2 + \|u - v\|^2 = 2\|u\|^2 + 2\|v\|^2$$

with $u = y - x_1^{\ast}$ and $v = y - x_2^{\ast}$. Then $u + v = 2(y - m)$ and $u - v = x_2^{\ast} - x_1^{\ast}$:

$$4\|y - m\|^2 + \|x_2^{\ast} - x_1^{\ast}\|^2 = 2\|y - x_1^{\ast}\|^2 + 2\|y - x_2^{\ast}\|^2 = 4d^2$$

So $\|y - m\|^2 = d^2 - \frac{1}{4}\|x_2^{\ast} - x_1^{\ast}\|^2$. Since $m \in C$ and $d$ is the minimum, $\|y - m\|^2 \geq d^2$. This forces $\|x_2^{\ast} - x_1^{\ast}\|^2 \leq 0$, so $x_1^{\ast} = x_2^{\ast}$. $\square$

Proof of the characterization (part 2). Fix $x^{\ast} = P_C(y)$ and any $x \in C$. For $t \in [0,1]$, the point $x^{\ast} + t(x - x^{\ast}) = (1-t)x^{\ast} + tx \in C$ by convexity. Let $g(t) = \|y - x^{\ast} - t(x - x^{\ast})\|^2$. Since $x^{\ast}$ minimizes over $C$, $g(0) \leq g(t)$ for all $t \in [0,1]$. Expanding:

$$g(t) = \|y - x^{\ast}\|^2 - 2t(y - x^{\ast})^\top(x - x^{\ast}) + t^2\|x - x^{\ast}\|^2$$

The derivative at $t = 0$ is $g'(0) = -2(y - x^{\ast})^\top(x - x^{\ast})$. Since $g(0)$ is a minimum on $[0,1]$, we need $g'(0) \geq 0$, i.e., $(y - x^{\ast})^\top(x - x^{\ast}) \leq 0$.

Conversely, if $(y - x^{\ast})^\top(x - x^{\ast}) \leq 0$ for all $x \in C$, then for any $x \in C$:

$$\|y - x\|^2 = \|y - x^{\ast} + x^{\ast} - x\|^2 = \|y - x^{\ast}\|^2 + 2(y - x^{\ast})^\top(x^{\ast} - x) + \|x^{\ast} - x\|^2 \geq \|y - x^{\ast}\|^2$$

since $(y - x^{\ast})^\top(x^{\ast} - x) = -(y - x^{\ast})^\top(x - x^{\ast}) \geq 0$ and $\|x^{\ast} - x\|^2 \geq 0$. So $x^{\ast}$ is the minimizer. $\square$

Geometric meaning of the characterization. The vector $y - x^{\ast}$ (pointing from $x^{\ast}$ toward $y$) makes an obtuse angle with every direction $x - x^{\ast}$ from $x^{\ast}$ into $C$. Equivalently: $y$ and $C$ are on opposite sides of the hyperplane tangent to $C$ at $x^{\ast}$. This is the geometric insight behind the separating hyperplane theorem.

The Separating Hyperplane Theorem

Theorem. Let $C$ and $D$ be nonempty, disjoint, convex sets in $\mathbb{R}^n$. Then there exists $a \neq 0$ and $b \in \mathbb{R}$ such that $a^\top x \leq b$ for all $x \in C$ and $a^\top x \geq b$ for all $x \in D$.

The hyperplane $\{x \mid a^\top x = b\}$ separates $C$ from $D$.

Proof. Consider the Minkowski difference $E = D - C = \{y - x \mid y \in D, x \in C\}$. Since $C$ and $D$ are convex, $E$ is convex. Since $C \cap D = \emptyset$, $0 \notin E$.

Let $\bar{E}$ denote the closure of $E$. Since $0 \notin E$, either $0 \notin \bar{E}$ (easy case) or $0 \in \partial \bar{E}$ (boundary case); in both cases we can apply the projection argument. For clarity, assume $C$ and $D$ are closed (the general case follows by passing to the closure of $E$).

Apply the Closest Point Theorem to $E$ and the point $0$: there exists a unique nearest point $z^{\ast} = P_E(0)$ with $z^{\ast} \neq 0$. By the characterization:

$$(0 - z^{\ast})^\top(z - z^{\ast}) \leq 0 \quad \text{for all } z \in E$$

i.e., $-(z^{\ast})^\top z + \|z^{\ast}\|^2 \leq 0$, so $(z^{\ast})^\top z \geq \|z^{\ast}\|^2 > 0$ for all $z \in E$.

Write $z = y - x$ for $y \in D$, $x \in C$. The condition $(z^{\ast})^\top(y - x) \geq \|z^{\ast}\|^2 > 0$ for all $x \in C$, $y \in D$ means:

$$(z^{\ast})^\top y \geq (z^{\ast})^\top x + \|z^{\ast}\|^2 > (z^{\ast})^\top x \quad \text{for all } x \in C, y \in D$$

Set $a = z^{\ast}$ and $b = \sup_{x \in C} a^\top x$. Then $a^\top x \leq b$ for all $x \in C$ and $a^\top y \geq \inf_{y \in D} a^\top y \geq b + \|a\|^2 > b$ for all $y \in D$. $\square$

Strict separation. If $C$ is closed and $D = \{y_0\}$ is a single point not in $C$, the projection theorem gives strict separation: the nearest point $x^{\ast} = P_C(y_0)$ satisfies $a^\top y_0 > b > a^\top x$ strictly for all $x$ in the interior of $C$ relative to the boundary. Strict separation also holds when $C$ and $D$ are closed and disjoint and one of them is compact.

Converse. Convex sets that cannot be separated exist only when they share boundary points. The separating hyperplane theorem is, in this sense, tight.

The Supporting Hyperplane Theorem

A supporting hyperplane to a convex set $C$ at a boundary point $x_0 \in \partial C$ is a hyperplane $\{x \mid a^\top x = a^\top x_0\}$ (with $a \neq 0$) such that $a^\top x \leq a^\top x_0$ for all $x \in C$ - the entire set $C$ lies in one closed half-space.

Theorem. Let $C \subseteq \mathbb{R}^n$ be a nonempty convex set and $x_0 \in \partial C$ a boundary point. Then there exists a supporting hyperplane to $C$ at $x_0$.

Proof. Since $x_0 \in \partial C$, there exists a sequence $y_k \notin C$ with $y_k \to x_0$. For each $k$, apply the Closest Point Theorem to obtain $x_k^{\ast} = P_C(y_k)$, the nearest point in $C$ to $y_k$. Define the unit normal:

$$a_k = \frac{y_k - x_k^{\ast}}{\|y_k - x_k^{\ast}\|}$$

Since $C$ is closed and $y_k \to x_0 \in \partial C$, the projections satisfy $x_k^{\ast} \to x_0$. By the characterization of projection:

$$(y_k - x_k^{\ast})^\top(x - x_k^{\ast}) \leq 0 \quad \text{for all } x \in C$$

Dividing by $\|y_k - x_k^{\ast}\| > 0$:

$$a_k^\top(x - x_k^*) \leq 0 \quad \text{for all } x \in C$$

The sequence $(a_k)$ lies on the unit sphere $\mathbb{S}^{n-1}$, which is compact. Extract a convergent subsequence $a_{k_j} \to a$ with $|a| = 1$. Taking the limit in the inequality:

$$a^\top(x - x_0) \leq 0 \quad \text{for all } x \in C$$

That is, $a^\top x \leq a^\top x_0$ for all $x \in C$, so $\{x \mid a^\top x = a^\top x_0\}$ is a supporting hyperplane at $x_0$. $\square$

Smoothness and uniqueness. If $C$ has a smooth boundary at $x_0$ (the boundary is locally a smooth manifold), the supporting hyperplane is unique - it is the tangent hyperplane. At corners or edges, multiple supporting hyperplanes may exist.

Converse. If every boundary point of a closed set $C$ admits a supporting hyperplane, then $C$ is convex. This gives a characterization of convexity in terms of supporting hyperplanes.

Farkas' Lemma

Farkas' lemma is the precise statement of when a linear system with nonnegativity constraints is feasible. It is the cornerstone of linear programming duality.

Theorem (Farkas' Lemma). Let $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. Exactly one of the following holds:

(I) There exists $x \geq 0$ such that $Ax = b$.
(II) There exists $y \in \mathbb{R}^m$ such that $A^\top y \geq 0$ and $b^\top y < 0$.

Proof. Let $K = \{Ax \mid x \geq 0\} = \text{cone}(a_1, \ldots, a_n)$ be the conic hull of the columns $a_1, \ldots, a_n$ of $A$. This is a closed convex cone (a fact we use without full proof: $K$ is closed because it is the image of $\mathbb{R}^n_+$ under $A$, and while images of closed sets under linear maps need not be closed, the image of $\mathbb{R}^n_+$ - a closed polyhedral cone - under a linear map is always closed).

(I) holds $\Rightarrow$ (II) fails. If $b = Ax_0$ for some $x_0 \geq 0$, and if $A^\top y \geq 0$, then $b^\top y = x_0^\top A^\top y = \sum_i (x_0)_i (A^\top y)_i \geq 0$. So (II) cannot hold.

(II) holds $\Rightarrow$ (I) fails. If $b^\top y < 0$ and $A^\top y \geq 0$, then for any $x \geq 0$: $(Ax - b)^\top y = x^\top A^\top y - b^\top y \geq 0 - b^\top y > 0$. So $Ax \neq b$ for all $x \geq 0$, meaning (I) fails.

(I) fails $\Rightarrow$ (II) holds. This is the main content. If $b \notin K$, apply the Separating Hyperplane Theorem to $\{b\}$ and $K$: there exists $y$ with $y^\top b < y^\top z$ for all $z \in K$. Since $0 \in K$ (take $x = 0$), we get $y^\top b < 0$. Since $K$ is a cone: for each column $a_i$, and any $\lambda \geq 0$, $\lambda a_i \in K$, so $y^\top(\lambda a_i) > y^\top b$ for all $\lambda \geq 0$. If $y^\top a_i < 0$ for some $i$, then $\lambda \to \infty$ gives $y^\top(\lambda a_i) \to -\infty < y^\top b$, a contradiction. So $y^\top a_i \geq 0$ for all $i$, i.e., $A^\top y \geq 0$. Together with $b^\top y < 0$, this is (II). $\square$

Interpretation via duality. Farkas' lemma says: $b$ is representable as a nonneg combination of columns of $A$ (primal feasibility) if and only if every halfspace through the origin containing all columns of $A$ also contains $b$. This is a complete primal-dual description. In LP duality terms: the primal LP $\min c^\top x$ s.t. $Ax = b$, $x \geq 0$ has a dual, and Farkas' lemma characterizes exactly when primal and dual feasibility hold simultaneously.

Extreme Points

A point $x \in C$ is an extreme point of a convex set $C$ if it cannot be expressed as a strict convex combination of two distinct points in $C$: if $x = \lambda y + (1-\lambda)z$ with $y, z \in C$ and $\lambda \in (0,1)$, then necessarily $y = z = x$.

Equivalently, $x$ is extreme if and only if $C \setminus \{x\}$ is still convex.

Extreme points are the “corners” or “vertices” of the convex set - the points with no slack, the ones that cannot be expressed as averages of other points in the set.

Extreme points of polyhedra. For the polyhedron $\mathcal{P} = \{x \mid Ax \leq b\}$, a point $x^{\ast} \in \mathcal{P}$ is extreme if and only if the active constraints $\{a_i^\top x = b_i\}$ at $x^{\ast}$ include $n$ linearly independent rows. These correspond exactly to the vertices of the polyhedron. For a polytope (bounded polyhedron), every point can be written as a convex combination of vertices.

Krein-Milman theorem. Every nonempty compact convex set $C \subseteq \mathbb{R}^n$ is the convex hull of its extreme points:

$$C = \text{conv}(\text{ext}(C))$$

where $\text{ext}(C)$ denotes the set of extreme points of $C$. This theorem (stated here without proof) is fundamental: it says that compact convex sets are completely determined by their “corners.” For polytopes, $\text{ext}(C)$ is finite and equals the vertex set, and the theorem reduces to the statement that every polytope is the convex hull of its vertices.

The Krein-Milman theorem has a key consequence for optimization: the minimum of any linear function over a compact convex set is attained at an extreme point. This is the geometric foundation of the simplex method - when optimizing a linear objective over a polytope, it suffices to search over vertices.

Why Convexity: Local Equals Global

The entire edifice of convex analysis rests on one fact. Let $f: \mathbb{R}^n \to \mathbb{R}$ be a convex function and $C \subseteq \mathbb{R}^n$ a convex set. Any local minimum of $f$ over $C$ is a global minimum.

Proof. Suppose $x^{\ast}$ is a local minimum: $f(x^{\ast}) \leq f(x)$ for all $x$ in some neighborhood $B(x^{\ast}, \varepsilon) \cap C$. Suppose for contradiction that there exists $y \in C$ with $f(y) < f(x^{\ast})$. Consider the point $z_\lambda = (1-\lambda)x^{\ast} + \lambda y$ for small $\lambda > 0$. By convexity of $C$, $z_\lambda \in C$. By convexity of $f$:

$$f(z_\lambda) \leq (1-\lambda) f(x^{\ast}) + \lambda f(y) < (1-\lambda) f(x^{\ast}) + \lambda f(x^{\ast}) = f(x^{\ast})$$

For sufficiently small $\lambda$, $z_\lambda \in B(x^{\ast}, \varepsilon)$. So $f(z_\lambda) < f(x^{\ast})$ with $z_\lambda$ in the local neighborhood - contradicting local minimality. $\square$

This is why convex optimization is a fundamentally different class of problem from non-convex optimization. Gradient descent on a convex function provably converges to the global minimum. On a non-convex function, it finds a local minimum - which may be arbitrarily worse than the global.

The geometric picture. A convex function has exactly one “bowl.” You can think of gradient descent as rolling a ball down a bowl - it always reaches the bottom. Non-convex functions have multiple bowls, and a ball starting in the wrong basin gets stuck.

When convexity breaks down. Neural networks have non-convex loss surfaces. The composition of a convex output loss with a non-linear network is generally non-convex. Deep learning practitioners work in a regime where the function has many local minima - and, empirically, many of them are nearly as good as the global minimum. This is a lucky accident of overparameterization, not a guarantee. Theoretical understanding of why non-convex deep learning works remains one of the central open problems in machine learning.

The Separating Hyperplane as a certificate. The separating hyperplane theorem is the mechanism behind convex duality. If $x^{\ast}$ is optimal, the supporting hyperplane at $x^{\ast}$ (tangent to the feasible set) is a certificate: it says “no feasible point does better than $x^{\ast}$” without having to check the entire feasible set. This is why LP duality works - the dual solution is exactly this certificate.

Summary

Concept	Definition	Key Property
Affine set	Contains all affine combinations	$C = x_0 + V$, $V$ a subspace
Convex set	Contains all convex combinations	Closed under $\cap$, affine images, Minkowski sum
Affine hull $\text{aff}(S)$	Smallest affine set containing $S$	All affine combinations of points in $S$
Convex hull $\text{conv}(S)$	Smallest convex set containing $S$	By Caratheodory: at most $n+1$ points suffice
Convex cone	Closed under nonneg scaling and addition	Dual cone $C^* = \{y \mid y^\top x \geq 0; \forall x \in C\}$
Polyhedron	$\{x \mid Ax \leq b, Cx = d\}$	Finite intersection of halfspaces; always convex
Projection $P_C(y)$	Unique nearest point in closed convex $C$	$(y - x^{\ast})^\top(x - x^{\ast}) \leq 0$ for all $x \in C$
Separating hyperplane	Separates disjoint convex sets	Follows from projection theorem
Supporting hyperplane	Touches $C$ at boundary; $C$ on one side	Exists at every boundary point
Farkas' lemma	$Ax = b$, $x \geq 0$ feasible iff no separator	Foundation of LP duality
Extreme point	Cannot be strict convex combination	Krein-Milman: compact convex = conv(extreme points)

Read next: