Convexity & Optimization
Convexity is the single most important structural property in optimization. A convex problem has no bad local minima - every local minimum is global. This transforms optimization from a search problem (where are all the valleys?) into something far more tractable. Most of the machinery in machine learning - gradient descent, regularization, SVM, logistic regression - either exploits convexity directly or borrows intuitions from it.
Convex Sets
Definition. A set $C \subseteq \mathbb{R}^n$ is convex if for every $x, y \in C$ and every $\lambda \in [0, 1]$,
$$\lambda x + (1 - \lambda) y \in C.$$
The point $\lambda x + (1 - \lambda) y$ is a convex combination of $x$ and $y$: a weighted average that lies on the line segment between them. Convexity says the segment itself lies inside $C$.
Common examples: any affine subspace (lines, planes, hyperplanes), halfspaces $\{x : a^T x \leq b\}$, balls $\{x : \|x\| \leq r\}$, the positive semidefinite cone $\mathbf{S}^n_+$, and the probability simplex $\{\lambda \geq 0 : \sum_i \lambda_i = 1\}$.
Non-examples: a ring (annulus) or a non-convex polygon - any shape with a dent or a hole.
Operations that preserve convexity: intersection of convex sets, image under an affine map, convex hull of any set.
Convex Functions
Definition. A function $f : C \to \mathbb{R}$ (with $C$ convex) is convex if for all $x, y \in C$ and $\lambda \in [0, 1]$,
$$f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y).$$
The right-hand side is the value on the chord from $(x, f(x))$ to $(y, f(y))$. Convexity says the function lies below every chord - the chord condition.
[Figure: the chord condition. Convex: the curve between (x, f(x)) and (y, f(y)) stays below the chord, so f(midpoint) lies under it. Non-convex: the curve rises above the chord somewhere between x and y.]
$f$ is strictly convex if the inequality is strict for $x \neq y$ and $\lambda \in (0, 1)$.
$f$ is concave if $-f$ is convex.
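To make the chord condition concrete, here is a small numerical sketch (assuming NumPy is available) that tests the inequality at randomly sampled pairs and weights. It can only gather evidence for convexity, but it reliably exposes non-convexity when a violating chord exists.

```python
import numpy as np

def satisfies_chord_condition(f, dim=3, trials=10_000, seed=0, tol=1e-9):
    """Test f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y) at random points."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, y = rng.normal(size=dim), rng.normal(size=dim)
        lam = rng.uniform()
        lhs = f(lam * x + (1 - lam) * y)
        rhs = lam * f(x) + (1 - lam) * f(y)
        if lhs > rhs + tol:
            return False          # found a violating chord: not convex
    return True                   # no violation found: evidence, not a proof

print(satisfies_chord_condition(lambda z: float(np.sum(z ** 2))))      # quadratic -> True
print(satisfies_chord_condition(lambda z: float(np.sum(np.cos(z)))))   # sum of cosines -> typically False
```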
Equivalent Characterizations
Theorem (First-Order Criterion). Suppose $f$ is differentiable on the open convex set $C$. Then $f$ is convex if and only if for all $x, y \in C$,
$$f(y) \geq f(x) + \nabla f(x)^T (y - x).$$
This says the graph lies above every tangent hyperplane. The tangent plane at any point is a global underestimator. This is an extraordinarily powerful fact: a single local piece of information (the gradient at $x$) gives a globally valid lower bound on $f$.
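As a sanity check on the first-order criterion, the sketch below (NumPy, with an illustrative hand-written gradient for $\log(1 + e^{a^T w})$; the vector $a$ is a synthetic placeholder) verifies the tangent-plane lower bound at random pairs of points.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=5)                      # fixed data vector (illustrative)

def f(w):                                   # f(w) = log(1 + exp(a^T w)), convex in w
    return np.log1p(np.exp(a @ w))

def grad_f(w):                              # gradient: sigmoid(a^T w) * a
    return a / (1.0 + np.exp(-(a @ w)))

for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    # the tangent plane at x is a global underestimator of f
    assert f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-9
print("first-order inequality held at all sampled pairs")
```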
Theorem (Second-Order Criterion). Suppose $f$ is twice differentiable on the open convex set $C$. Then $f$ is convex if and only if its Hessian is positive semidefinite everywhere:
$$\nabla^2 f(x) \succeq 0 \quad \text{for all } x \in C.$$
Strict convexity follows from $\nabla^2 f(x) \succ 0$ (positive definite), though this condition is sufficient, not necessary: $f(x) = x^4$ is strictly convex even though $f''(0) = 0$.
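The second-order criterion is easy to probe numerically. For the logistic loss, for instance, the Hessian is $X^T D X$ with a nonnegative diagonal $D$, so its smallest eigenvalue should be (numerically) nonnegative. The data below is synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                 # synthetic design matrix
w = rng.normal(size=4)

p = 1.0 / (1.0 + np.exp(-X @ w))             # sigmoid(Xw)
D = p * (1.0 - p)                            # diagonal weights, each in (0, 1/4]
H = X.T @ (D[:, None] * X)                   # Hessian of the logistic loss: X^T diag(D) X
print(np.linalg.eigvalsh(H).min() >= -1e-10) # True: the Hessian is positive semidefinite
```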
Strongly Convex Functions
Definition. $f$ is $\mu$-strongly convex (with parameter $\mu > 0$) if for all $x, y \in C$,
$$f(y) \geq f(x) + \nabla f(x)^T (y - x) + \frac{\mu}{2} \|y - x\|^2.$$
Strong convexity adds a quadratic lower bound to the gradient inequality. Equivalently, $f(x) - \frac{\mu}{2}\|x\|^2$ is convex, or $\nabla^2 f(x) \succeq \mu I$ everywhere.
Strong convexity means the function curves upward at least as fast as a quadratic. It also implies a unique minimizer: if $x^\ast$ is the minimizer, then
$$f(x) \geq f(x^\ast) + \frac{\mu}{2}\|x - x^\ast\|^2,$$
so $f$ grows quadratically away from $x^\ast$.
Example: $f(x) = \frac{\mu}{2}\|x\|^2 + g(x)$ for any convex $g$ is $\mu$-strongly convex. Adding an L2 regularizer $\frac{\mu}{2}\|w\|^2$ to any convex loss makes the overall objective strongly convex.
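A quick numerical illustration (synthetic data, NumPy): adding $\frac{\mu}{2}\|w\|^2$ to the mean squared error shifts every Hessian eigenvalue up by $\mu$, which is exactly the statement $\nabla^2 f \succeq \mu I$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
mu = 0.1

H_loss  = 2.0 * (X.T @ X) / X.shape[0]       # Hessian of the mean squared error (PSD)
H_ridge = H_loss + mu * np.eye(5)            # Hessian after adding (mu/2) * ||w||^2

print(np.linalg.eigvalsh(H_loss).min())      # >= 0: the loss alone is convex
print(np.linalg.eigvalsh(H_ridge).min())     # previous value + mu: mu-strongly convex
```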
The Key Theorem: Local = Global
Theorem. If $f$ is convex and $x^\ast$ is a local minimum of $f$, then $x^\ast$ is a global minimum.
Proof. Suppose for contradiction there exists $y$ with $f(y) < f(x^\ast)$. For small $\lambda > 0$, the point $z = x^\ast + \lambda(y - x^\ast)$ satisfies $f(z) \leq (1-\lambda)f(x^\ast) + \lambda f(y) < f(x^\ast)$, contradicting local minimality since $z \to x^\ast$ as $\lambda \to 0$. $\square$
Corollary. For differentiable convex $f$, $x^\ast$ is a global minimizer if and only if $\nabla f(x^\ast) = 0$.
This is why convex optimization is tractable: you can find the global minimum by finding any critical point.
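For a convex quadratic this is more than a slogan: setting the gradient of the least-squares objective to zero gives a linear system (the normal equations), and its solution is the global minimizer. A minimal sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# Gradient of ||Xw - y||^2 is 2 X^T (Xw - y); setting it to zero gives the
# normal equations X^T X w = X^T y.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
grad = 2.0 * X.T @ (X @ w_star - y)
print(np.allclose(grad, 0.0, atol=1e-6))     # True: critical point, hence global minimizer
```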
Gradient Descent and Convergence Rates
The gradient descent update is
$$x_{k+1} = x_k - \alpha \nabla f(x_k)$$
where $\alpha > 0$ is the step size (learning rate). Assume $f$ has an $L$-Lipschitz gradient: $\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|$, which for twice-differentiable $f$ is equivalent to $\nabla^2 f \preceq LI$.
Theorem (Convergence for Convex $f$). If $f$ is convex with $L$-Lipschitz gradient, then gradient descent with step size $\alpha = 1/L$ satisfies
$$f(x_k) - f(x^\ast) \leq \frac{L \|x_0 - x^\ast\|^2}{2k}.$$
The convergence rate is $O(1/k)$ - to get $\epsilon$-accuracy requires $O(1/\epsilon)$ iterations.
Theorem (Convergence for Strongly Convex $f$). If $f$ is $\mu$-strongly convex with $L$-Lipschitz gradient, then gradient descent with step size $\alpha = 2/(\mu + L)$ satisfies
$$\|x_k - x^\ast\|^2 \leq \left(\frac{L - \mu}{L + \mu}\right)^{2k} \|x_0 - x^\ast\|^2.$$
The convergence is linear (geometric): $\|x_k - x^\ast\|^2 \leq \rho^k \|x_0 - x^\ast\|^2$ for $\rho = \left(\frac{\kappa - 1}{\kappa + 1}\right)^2 < 1$, where $\kappa = L/\mu$ is the condition number. This is roughly $O(e^{-k/\kappa})$, so $O(\kappa \log(1/\epsilon))$ iterations suffice for $\epsilon$-accuracy. The condition number measures how elongated the level sets are - a poorly conditioned problem converges slowly.
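The sketch below runs gradient descent on a simple strongly convex quadratic ($\mu = 1$, $L = 10$) with the step size from the theorem and compares the error to the geometric bound; for this diagonal quadratic the bound is met with near equality.

```python
import numpy as np

A = np.diag([1.0, 10.0])                     # f(x) = 0.5 x^T A x, so mu = 1, L = 10
mu, L = 1.0, 10.0
alpha = 2.0 / (mu + L)                       # step size from the theorem

x = np.array([1.0, 1.0])                     # x* = 0 is the unique minimizer
for k in range(1, 51):
    x = x - alpha * (A @ x)                  # gradient of 0.5 x^T A x is A x
    if k % 10 == 0:
        bound = ((L - mu) / (L + mu)) ** k * np.sqrt(2.0)   # theorem, after taking square roots
        print(k, np.linalg.norm(x), "<=", bound)
```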
Common Convex Functions
| Function | Domain | Convex? |
|---|---|---|
| $e^{ax}$ | $\mathbb{R}$ | Yes (for any $a$) |
| $x \log x$ | $x > 0$ | Yes |
| $-\log x$ | $x > 0$ | Yes |
| $\|x\|_p$ for $p \geq 1$ | $\mathbb{R}^n$ | Yes |
| All norms | $\mathbb{R}^n$ | Yes |
| $x^2$ | $\mathbb{R}$ | Yes |
| $\log(1 + e^x)$ | $\mathbb{R}$ | Yes |
| $\max(x_1, \ldots, x_n)$ | $\mathbb{R}^n$ | Yes |
The key convexity-preserving operations: if $f, g$ are convex, then $f + g$ is convex; $\alpha f$ is convex for $\alpha \geq 0$; $f(Ax + b)$ is convex (composition with affine map); $\max(f, g)$ is convex; and $\sup_{\alpha} f_\alpha(x)$ over any index set is convex.
Subgradients and Non-Smooth Functions
Many useful convex functions are not differentiable - the absolute value $|x|$ has a kink at zero, and $\|x\|_1 = \sum_i |x_i|$ has kinks along every coordinate hyperplane. Gradient descent cannot be applied directly.
Definition. A vector $g \in \mathbb{R}^n$ is a subgradient of $f$ at $x$ if for all $y$,
$$f(y) \geq f(x) + g^T (y - x).$$
The subdifferential $\partial f(x)$ is the set of all subgradients at $x$. For differentiable $f$, $\partial f(x) = \{\nabla f(x)\}$.
For $f(x) = |x|$: $\partial f(0) = [-1, 1]$, and $\partial f(x) = \{\text{sign}(x)\}$ for $x \neq 0$.
Optimality condition. $x^\ast$ minimizes $f$ if and only if $0 \in \partial f(x^\ast)$.
Subgradient methods replace $\nabla f(x_k)$ with any $g_k \in \partial f(x_k)$. They converge at $O(1/\sqrt{k})$ rate, slower than smooth gradient descent.
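A minimal sketch of the subgradient method on $f(w) = \|w\|_1$, assuming the standard diminishing step size $\alpha_k \propto 1/\sqrt{k}$; `np.sign` (which returns 0 at 0) picks one valid subgradient at each point.

```python
import numpy as np

w = np.array([3.0, -2.0, 1.0])
for k in range(1, 2001):
    g = np.sign(w)                           # one valid subgradient of ||.||_1 at w
    w = w - (0.5 / np.sqrt(k)) * g           # diminishing step size
print(w)                                     # entries close to 0, the minimizer of ||.||_1
```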
Proximal Operators
For non-smooth but “structured” convex problems, proximal methods are the tool of choice.
Definition. The proximal operator of $h$ with parameter $\alpha$ is
$$\text{prox}_{\alpha h}(v) = \arg\min_x \left\{ h(x) + \frac{1}{2\alpha} \|x - v\|^2 \right\}.$$
It finds the point that minimizes $h$ while staying close to $v$.
For $h(x) = \|x\|_1$, the proximal operator is the soft-thresholding function:
$$\text{prox}_{\alpha \|\cdot\|_1}(v)_i = \text{sign}(v_i) \max(|v_i| - \alpha, 0).$$
This is the basis of LASSO and sparse recovery algorithms.
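The soft-thresholding formula translates directly into a few lines of NumPy; the example values below are arbitrary.

```python
import numpy as np

def soft_threshold(v, alpha):
    """prox of alpha * ||.||_1: shrink each entry toward zero by alpha, clipping at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - alpha, 0.0)

print(soft_threshold(np.array([3.0, -0.2, 0.7]), alpha=0.5))   # [2.5  0.   0.2]
```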
Proximal Gradient Descent. For $f = g + h$ where $g$ is smooth convex and $h$ is convex but possibly non-smooth:
$$x_{k+1} = \text{prox}_{\alpha h}\left(x_k - \alpha \nabla g(x_k)\right).$$
This takes a gradient step on the smooth part, then applies the proximal operator for the non-smooth part. It converges at $O(1/k)$ for convex $g$ and $O(1/k^2)$ with Nesterov acceleration.
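Putting the pieces together, here is a sketch of proximal gradient descent (ISTA) for the LASSO objective $\frac{1}{2n}\|Xw - y\|^2 + \lambda\|w\|_1$ on synthetic data; the step size $1/L$ uses the largest eigenvalue of $X^T X / n$ as the Lipschitz constant of the smooth part.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.0, 0.5]                          # sparse ground truth
y = X @ w_true + 0.05 * rng.normal(size=n)

lam = 0.1
L = np.linalg.eigvalsh(X.T @ X / n).max()              # Lipschitz constant of grad g
alpha = 1.0 / L

w = np.zeros(d)
for _ in range(500):
    grad = X.T @ (X @ w - y) / n                       # gradient step on the smooth part g
    w = soft_threshold(w - alpha * grad, alpha * lam)  # prox step on the non-smooth part h
print(np.round(w, 2))                                  # mostly zeros, nonzeros near w_true
```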
Examples
Loss functions in ML are often chosen to be convex in the model parameters.
Mean squared error: $\ell(w) = \frac{1}{n}\sum_{i=1}^n (w^T x_i - y_i)^2$ is convex (and quadratic) in $w$.
Logistic loss / cross-entropy: $\ell(w) = -y \log \sigma(w^T x) - (1-y)\log(1 - \sigma(w^T x))$ is convex in $w$, where $\sigma$ is the sigmoid. This is because $-\log \sigma(t) = \log(1 + e^{-t})$ and $-\log(1 - \sigma(t)) = \log(1 + e^{t})$ are both convex in $t$, composition with the linear map $w \mapsto w^T x$ preserves convexity, and the weights $y$ and $1 - y$ are nonnegative.
Hinge loss (SVM): $\ell(w) = \max(0, 1 - y w^T x)$ is convex as a pointwise max of linear functions.
L1 regularization $\lambda\|w\|_1$ is convex (a norm), promotes sparsity, and leads to the LASSO problem. L2 regularization $\frac{\lambda}{2}\|w\|_2^2$ is strongly convex, shrinks weights toward zero, and leads to ridge regression. Both preserve convexity of the overall objective when added to a convex loss.
Neural network training is, in general, not convex in the weights - the composition of nonlinear activations creates a highly non-convex landscape. However, the tools developed for convex optimization (gradient descent, momentum, adaptive methods) are applied anyway, often with impressive empirical success. Understanding convexity provides the theoretical baseline that makes the difficulty of deep learning precise.