Convex Optimization
Prerequisite:
Convex optimization is the study of minimizing convex functions over convex sets. It sits at the intersection of linear algebra, analysis, and algorithms - and it is the theoretical engine behind a large fraction of practical machine learning. This post covers the standard problem formulations, the duality theory that underlies solvers, and the main algorithmic families used in practice.
Standard Form Convex Program
The canonical convex optimization problem is
$$\min_{x \in \mathbb{R}^n} f_0(x) \quad \text{subject to} \quad f_i(x) \leq 0, \; i = 1, \ldots, m, \quad Ax = b,$$
where $f_0, f_1, \ldots, f_m$ are convex functions and the equality constraints $Ax = b$ are affine. The feasible set $\mathcal{F} = \{x : f_i(x) \leq 0, \; Ax = b\}$ is convex.
Because the objective and feasible set are convex, every local minimum is a global minimum (the local-equals-global property of convex problems). This is what makes the problem tractable: any descent algorithm that reduces $f_0$ and terminates at a stationary point has found the global optimum.
Problem Classes
Convex programs form a hierarchy of increasing expressiveness.
Linear Programming (LP)
$$\min c^T x \quad \text{subject to} \quad Ax \leq b, \quad Cx = d.$$
Objective and constraints are all linear (hence convex). Optimal solutions lie at vertices of the feasible polytope. Solvable in polynomial time via interior point methods; simplex is exponential worst-case but fast in practice.
Applications: resource allocation, network flow, shortest paths as LP, LP relaxations of integer programs.
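As a quick illustration (not from the original post), here is a minimal LP solved with SciPy's `linprog`; the cost vector and constraints are made-up toy data for a two-variable resource-allocation problem.

```python
import numpy as np
from scipy.optimize import linprog

# Toy LP: maximize x1 + 2*x2, i.e. minimize -(x1 + 2*x2),
# subject to x1 + x2 <= 4, x1 + 3*x2 <= 6, x >= 0.
c = np.array([-1.0, -2.0])
A_ub = np.array([[1.0, 1.0],
                 [1.0, 3.0]])
b_ub = np.array([4.0, 6.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, None), (0, None)], method="highs")
print(res.x, res.fun)   # optimal point and optimal value
```

The solver returns $x = (3, 1)$ with value $-5$, a vertex of the feasible polytope, as LP theory predicts.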
Quadratic Programming (QP)
$$\min \frac{1}{2}x^T P x + q^T x \quad \text{subject to} \quad Ax \leq b,$$
with $P \succeq 0$. The SVM primal and dual are QPs. LASSO can be recast as a QP. Solvable by active-set methods or interior point.
Second-Order Cone Programming (SOCP)
Constraints of the form $\|Ax + b\|_2 \leq c^T x + d$. Subsumes LP and QP. Arises in robust optimization and in problems with Lorentz (second-order) cone constraints.
Semidefinite Programming (SDP)
$$\min C \bullet X \quad \text{subject to} \quad A_i \bullet X = b_i, \quad X \succeq 0,$$
where $X$ is a symmetric positive semidefinite matrix and $\bullet$ denotes the trace inner product. SDPs are the most general class in this hierarchy; they appear in spectral graph theory, sum-of-squares relaxations, and control theory.
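A minimal CVXPY sketch of the SDP form above, with a made-up symmetric cost matrix and a single trace constraint; it assumes CVXPY (and a conic solver such as SCS) is installed.

```python
import numpy as np
import cvxpy as cp

n = 3
rng = np.random.default_rng(0)
C = rng.standard_normal((n, n))
C = (C + C.T) / 2                          # symmetric cost matrix

X = cp.Variable((n, n), PSD=True)          # X constrained to be symmetric PSD
objective = cp.Minimize(cp.trace(C @ X))   # C . X, the trace inner product
constraints = [cp.trace(np.eye(n) @ X) == 1]
prob = cp.Problem(objective, constraints)
prob.solve()                               # dispatches to a conic solver
print(prob.value)
```

For this particular toy SDP the optimal value is the smallest eigenvalue of $C$, which makes a handy correctness check.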
Duality Theory
Every convex program has a dual problem obtained via the Lagrangian. The Lagrangian for the standard form is
$$\mathcal{L}(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \nu^T(Ax - b).$$
The dual function is
$$g(\lambda, \nu) = \inf_{x} \mathcal{L}(x, \lambda, \nu),$$
and the dual problem is $\max_{\lambda \geq 0, \nu} g(\lambda, \nu)$.
Theorem (Weak Duality). For any primal feasible $x$ and dual feasible $(\lambda, \nu)$ (with $\lambda \geq 0$),
$$g(\lambda, \nu) \leq f_0(x).$$
The duality gap $f_0(x^\ast) - g(\lambda^\ast, \nu^\ast)$ is always non-negative.
Theorem (Strong Duality - Slater’s Condition). If the primal problem is convex and there exists a strictly feasible point $\tilde{x}$ (with $f_i(\tilde{x}) < 0$ for all $i$ and $A\tilde{x} = b$), then strong duality holds: $f_0(x^\ast) = g(\lambda^\ast, \nu^\ast)$.
Strong duality enables the dual approach: solve the dual (often simpler - it is always concave regardless of the primal) and recover the primal solution from the dual via complementary slackness $\lambda_i^\ast f_i(x^\ast) = 0$.
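To make duality concrete, here is a hedged sketch (toy data, assuming CVXPY): solve a small inequality-constrained QP, read off the dual variables CVXPY reports, and verify that complementary slackness holds at the optimum.

```python
import numpy as np
import cvxpy as cp

# Toy QP: minimize ||x - x0||^2 subject to sum(x) <= 1, x >= 0.
x0 = np.array([0.8, 0.9, -0.3])
x = cp.Variable(3)
constraints = [cp.sum(x) <= 1, x >= 0]
prob = cp.Problem(cp.Minimize(cp.sum_squares(x - x0)), constraints)
prob.solve()

lam = constraints[0].dual_value   # multiplier for sum(x) <= 1
mu = constraints[1].dual_value    # multipliers for x >= 0
# Complementary slackness: lambda_i * f_i(x*) should vanish for every constraint.
print("lambda * (sum(x) - 1):", lam * (np.sum(x.value) - 1))
print("mu_i * (-x_i):        ", mu * (-x.value))
```

Both products come out numerically zero: active constraints can carry positive multipliers, inactive ones must have multiplier zero.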
Interior Point Methods
Interior point (barrier) methods are the workhorse for LP, QP, SOCP, and SDP - they achieve polynomial-time complexity with excellent practical performance.
The key idea: replace inequality constraints with a log-barrier term that goes to $+\infty$ as any constraint becomes active.
Barrier function: For constraints $f_i(x) \leq 0$,
$$\phi(x) = -\sum_{i=1}^m \log(-f_i(x)).$$
$\phi$ is convex, defined only for strictly feasible $x$, and $\phi(x) \to +\infty$ as $x$ approaches the boundary of the feasible set.
Barrier method: Solve the sequence of unconstrained problems (equality constraints $Ax = b$ handled separately via elimination or KKT):
$$\min_x \; t \cdot f_0(x) + \phi(x) \quad \text{(the central path problem)}$$
for increasing $t > 0$. As $t \to \infty$, the solution $x^\ast(t)$ converges to the primal optimal $x^\ast$. The path $\{x^\ast(t) : t > 0\}$ is the central path.
At each $t$, the barrier problem is an unconstrained smooth convex problem, solved by Newton’s method. Newton’s method has quadratic convergence near the solution (error squares each step), so very few Newton steps are needed per outer iteration.
Complexity: Interior point methods solve LP/QP to $\epsilon$-accuracy in $O(\sqrt{m} \log(1/\epsilon))$ Newton steps, each costing $O(n^3)$ for the linear system solve. Total: $O(n^3 \sqrt{m} \log(1/\epsilon))$.
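The following is a bare-bones NumPy sketch of the barrier method for an inequality-constrained LP ($\min c^T x$ s.t. $Ax \leq b$), assuming a strictly feasible starting point is supplied; it is illustrative, not a production solver (no infeasible-start phase, no safeguards for rank-deficient data).

```python
import numpy as np

def barrier_lp(c, A, b, x0, t0=1.0, mu=10.0, eps=1e-8):
    """Log-barrier method for min c^T x s.t. A x <= b (x0 strictly feasible)."""
    x, t, m = x0.astype(float), t0, len(b)
    while m / t > eps:                           # m/t bounds the duality gap
        # Centering step: Newton's method on t*c^T x - sum(log(b - A x)).
        for _ in range(50):
            s = b - A @ x                        # slacks, strictly positive
            grad = t * c + A.T @ (1.0 / s)
            hess = A.T @ (A / s[:, None] ** 2)   # A^T diag(1/s^2) A
            dx = np.linalg.solve(hess, -grad)
            if -grad @ dx < 1e-10:               # Newton decrement small: centered
                break
            step = 1.0                           # backtrack to stay strictly feasible
            while np.any(b - A @ (x + step * dx) <= 0):
                step *= 0.5
            x = x + step * dx
        t *= mu                                  # tighten the barrier
    return x
```

The stopping rule uses the classical fact that the central-path point $x^\ast(t)$ is at most $m/t$ suboptimal, which is what drives the $O(\sqrt{m}\log(1/\epsilon))$ iteration bound quoted above.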
Gradient Methods
For large-scale problems (large $n$, many data points), Newton methods are too expensive - the $n \times n$ Hessian cannot be formed or factored. Gradient methods access only first-order information.
Gradient Descent
$$x_{k+1} = x_k - \alpha_k \nabla f_0(x_k).$$
Theorem (Convergence - $L$-smooth convex $f$). With $\alpha = 1/L$,
$$f(x_k) - f^\ast \leq \frac{L\|x_0 - x^\ast\|^2}{2k} = O(1/k).$$
Theorem (Convergence - $\mu$-strongly convex $f$). With $\alpha = 2/(\mu + L)$,
$$\|x_k - x^\ast\|^2 \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^{2k} \|x_0 - x^\ast\|^2 = O(\rho^k),$$
where $\kappa = L/\mu$ is the condition number. This is linear convergence - the log-error decreases by a constant each step.
The condition number $\kappa$ is the fundamental obstacle: elongated level sets (ill-conditioned Hessian) cause slow convergence. Preconditioning (approximately applying $H^{-1}$ to the gradient) reduces the effective condition number.
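A toy NumPy sketch (made-up two-dimensional quadratic with $\kappa = 100$) of plain gradient descent using the step size from the strongly convex theorem; the observed error matches the $((\kappa-1)/(\kappa+1))^k$ contraction exactly on this example.

```python
import numpy as np

# f(x) = 0.5 * x^T H x, with H eigenvalues mu and L, so kappa = L/mu and x* = 0.
mu, L = 1.0, 100.0
H = np.diag([mu, L])
kappa = L / mu

x = np.array([1.0, 1.0])
alpha = 2.0 / (mu + L)                     # step size from the strongly convex theorem
for k in range(200):
    x = x - alpha * (H @ x)                # grad f(x) = H x

rate = (kappa - 1) / (kappa + 1)
print(np.linalg.norm(x), np.sqrt(2) * rate ** 200)   # observed error vs. theoretical bound
```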
Nesterov Accelerated Gradient
Nesterov (1983) showed that a simple momentum-based modification achieves $O(1/k^2)$ for smooth convex problems - the optimal rate for first-order methods.
The update uses a look-ahead point:
$$y_{k+1} = x_k - \alpha \nabla f(x_k)$$ $$x_{k+1} = y_{k+1} + \frac{k-1}{k+2}(y_{k+1} - y_k).$$
Theorem (Nesterov, optimal first-order rate). For convex $f$ with $L$-Lipschitz gradient,
$$f(x_k) - f^\ast \leq \frac{2L\|x_0 - x^\ast\|^2}{(k+1)^2} = O(1/k^2).$$
This is optimal: no first-order method can achieve better than $O(1/k^2)$ on the class of smooth convex functions (Nesterov’s lower bound). For strongly convex $f$, the rate becomes $O(\rho^k)$ with $\rho = 1 - 1/\sqrt{\kappa}$ instead of $1 - 1/\kappa$ - a square-root improvement in condition number dependence.
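A minimal NumPy sketch of the accelerated update exactly as written above, on the same toy quadratic used for plain gradient descent; the quadratic and its Lipschitz constant are assumptions for illustration.

```python
import numpy as np

H = np.diag([1.0, 100.0])                      # toy quadratic f(x) = 0.5 x^T H x
L = 100.0                                      # Lipschitz constant of the gradient
alpha = 1.0 / L

x = y_prev = np.array([1.0, 1.0])
for k in range(1, 201):
    y = x - alpha * (H @ x)                    # gradient step at the look-ahead point x_k
    x = y + (k - 1) / (k + 2) * (y - y_prev)   # momentum extrapolation
    y_prev = y

print(0.5 * y @ H @ y)                         # f(y_k) - f* decays like O(1/k^2)
```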
Proximal Gradient Method
Many ML objectives have the form $f = g + h$ where $g$ is smooth convex (data fidelity) and $h$ is convex but non-smooth (regularizer). Example: LASSO has $g(x) = \frac{1}{2}\|Ax - b\|^2$ and $h(x) = \lambda\|x\|_1$.
The proximal gradient update is
$$x_{k+1} = \text{prox}_{\alpha h}\left(x_k - \alpha \nabla g(x_k)\right),$$
where $\text{prox}_{\alpha h}(v) = \arg\min_x \left\{ h(x) + \frac{1}{2\alpha}\|x - v\|^2 \right\}$.
For $h = \lambda\|\cdot\|_1$, $\text{prox}_{\alpha h}$ is entry-wise soft-thresholding: $(\text{prox}_{\alpha h}(v))_i = \text{sign}(v_i)\max(|v_i| - \alpha\lambda, 0)$.
Convergence: $O(1/k)$ for convex $g$, improved to $O(1/k^2)$ by FISTA (Fast ISTA - Nesterov acceleration applied to proximal gradient).
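Putting the pieces together, here is a hedged ISTA sketch for the LASSO objective above, on random toy data, with step size $1/L$ where $L$ is the largest eigenvalue of $A^T A$.

```python
import numpy as np

def soft_threshold(v, tau):
    """Entry-wise prox of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, b, lam, iters=500):
    """Proximal gradient (ISTA) for 0.5*||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.eigvalsh(A.T @ A).max()        # Lipschitz constant of grad g
    alpha = 1.0 / L
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                 # gradient of the smooth part g
        x = soft_threshold(x - alpha * grad, alpha * lam)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100); x_true[:5] = 3.0         # sparse ground truth
b = A @ x_true + 0.01 * rng.standard_normal(50)
print(np.nonzero(ista(A, b, lam=1.0))[0][:10])   # support concentrates on the true coords
```

FISTA adds the same $(k-1)/(k+2)$ extrapolation from the Nesterov section on top of this loop, nothing else changes.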
ADMM
The Alternating Direction Method of Multipliers (ADMM) solves problems of the form
$$\min_{x, z} f(x) + g(z) \quad \text{subject to} \quad Ax + Bz = c.$$
The augmented Lagrangian is $\mathcal{L}_\rho = f(x) + g(z) + \nu^T(Ax + Bz - c) + \frac{\rho}{2}\|Ax + Bz - c\|^2$.
ADMM alternates:
$$x_{k+1} = \arg\min_x \mathcal{L}_\rho(x, z_k, \nu_k)$$ $$z_{k+1} = \arg\min_z \mathcal{L}_\rho(x_{k+1}, z, \nu_k)$$ $$\nu_{k+1} = \nu_k + \rho(Ax_{k+1} + Bz_{k+1} - c).$$
Each subproblem is often much easier than the original (separable structure, closed form, etc.). ADMM is especially suited to distributed optimization: if the problem separates across data shards or network nodes, the $x$- and $z$-updates can be parallelized with only the dual variable $\nu$ communicated.
Convergence is $O(1/k)$ for convex problems; practical convergence is often fast and the method is robust to parameter tuning.
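A compact ADMM sketch for LASSO in the form above, with $f(x) = \frac{1}{2}\|Ax-b\|^2$, $g(z) = \lambda\|z\|_1$, and the constraint $x - z = 0$. It uses the scaled dual variable $u = \nu/\rho$, the common implementation convention; data is toy and the code is illustrative only.

```python
import numpy as np

def admm_lasso(A, b, lam, rho=1.0, iters=200):
    """ADMM for 0.5*||Ax-b||^2 + lam*||z||_1  s.t.  x - z = 0 (scaled dual u)."""
    n = A.shape[1]
    AtA, Atb = A.T @ A, A.T @ b
    x = z = u = np.zeros(n)
    # The x-update is a ridge-like linear solve; factor the matrix once.
    Lc = np.linalg.cholesky(AtA + rho * np.eye(n))
    for _ in range(iters):
        rhs = Atb + rho * (z - u)
        x = np.linalg.solve(Lc.T, np.linalg.solve(Lc, rhs))                # x-update
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)    # z-update (soft threshold)
        u = u + x - z                                                      # scaled dual update
    return z

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100); x_true[:5] = 3.0
b = A @ x_true + 0.01 * rng.standard_normal(50)
print(np.count_nonzero(admm_lasso(A, b, lam=1.0)))   # sparse solution
```

Note how each subproblem is cheap: a cached linear solve for $x$ and an entry-wise soft threshold for $z$, exactly the "easier subproblems" structure the method is designed around.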
Subgradient Methods
For non-smooth $f$ with no exploitable structure (no proximal operator), subgradient methods apply:
$$x_{k+1} = x_k - \alpha_k g_k, \quad g_k \in \partial f(x_k).$$
With diminishing step sizes $\alpha_k = c/\sqrt{k}$, the averaged iterate satisfies
$$f\left(\frac{1}{k}\sum_{t=1}^k x_t\right) - f^\ast \leq \frac{c \cdot G}{\sqrt{k}},$$
where $G = \sup_k \|g_k\|$. Rate: $O(1/\sqrt{k})$, slower than smooth gradient descent.
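A minimal sketch of the subgradient method on a non-smooth toy objective with no convenient prox, the $\ell_1$ regression loss $\|Ax - b\|_1$, whose subgradient is $A^T \text{sign}(Ax - b)$; the data and step-size constant are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
b = A @ rng.standard_normal(10) + 0.1 * rng.standard_normal(30)

def f(x):
    return np.abs(A @ x - b).sum()             # non-smooth objective ||Ax - b||_1

x = np.zeros(10)
x_avg = np.zeros(10)
for k in range(1, 2001):
    g = A.T @ np.sign(A @ x - b)               # a subgradient of f at x
    x = x - (0.1 / np.sqrt(k)) * g             # diminishing step size c / sqrt(k)
    x_avg += (x - x_avg) / k                   # running average of the iterates
print(f(x_avg))                                # averaged iterate is near-optimal at O(1/sqrt(k))
```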
Coordinate Descent
Coordinate descent minimizes over one coordinate (or block) at a time, cycling through all coordinates:
$$x_i^{k+1} = \arg\min_{x_i} f(x_1^{k+1}, \ldots, x_{i-1}^{k+1}, x_i, x_{i+1}^k, \ldots, x_n^k).$$
For separable or block-separable objectives, coordinate minimization is cheap. For LASSO, the coordinate update has a closed form (soft-thresholding), giving the fast coordinate descent LASSO algorithm (Friedman et al., 2010) that outperforms general solvers at large scale.
Convergence: $O(1/k)$ rate for smooth convex $f$ (with appropriate step sizes).
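A sketch of cyclic coordinate descent for LASSO with the closed-form soft-threshold update per coordinate; the residual is updated incrementally so each coordinate step costs one column pass (toy usage, illustrative only).

```python
import numpy as np

def cd_lasso(A, b, lam, sweeps=100):
    """Cyclic coordinate descent for 0.5*||Ax - b||^2 + lam*||x||_1."""
    n = A.shape[1]
    x = np.zeros(n)
    col_sq = (A ** 2).sum(axis=0)                     # ||a_j||^2 for each column
    r = b - A @ x                                     # residual, kept up to date
    for _ in range(sweeps):
        for j in range(n):
            rho_j = A[:, j] @ r + col_sq[j] * x[j]    # fit with coordinate j removed
            x_new = np.sign(rho_j) * max(abs(rho_j) - lam, 0.0) / col_sq[j]
            r += A[:, j] * (x[j] - x_new)             # incremental residual update
            x[j] = x_new
    return x
```

This mirrors the structure of the inner loop in the Friedman et al. (2010) coordinate descent approach cited above, minus the warm starts and active-set screening that make the real implementation fast.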
Applications
| Problem | Form | Algorithm |
|---|---|---|
| LASSO | $\min \Vert Ax - b\Vert^2 + \lambda\Vert x\Vert_1$ | Proximal gradient, coord. descent |
| Ridge regression | $\min \Vert Ax - b\Vert^2 + \lambda\Vert x\Vert^2$ | Closed form, gradient descent |
| SVM | $\min \Vert w\Vert^2$ s.t. margin constraints | QP, dual with kernels |
| Logistic regression | $\min$ cross-entropy | Gradient descent, L-BFGS |
| Portfolio optimization | $\min x^T \Sigma x - \mu^T x$ s.t. $\mathbf{1}^T x = 1$, $x \geq 0$ | QP, interior point |
Examples
Convex solvers in practice. CVXPY (Python) and OSQP are the standard tools. CVXPY provides a disciplined convex programming interface: you express the problem using a library of convex atoms, and CVXPY verifies convexity via composition rules, then transforms the problem to a standard form and calls an appropriate solver (OSQP for QPs, SCS for cones, Mosek for SDPs). This removes the need to manually derive dual problems or implement algorithms.
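As a concrete illustration of this disciplined-convex-programming workflow, here is a short hedged CVXPY snippet for LASSO on toy data; CVXPY checks convexity via the composition rules and picks a solver itself.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
x_true = np.r_[np.full(5, 3.0), np.zeros(95)]          # sparse ground truth
b = A @ x_true + 0.01 * rng.standard_normal(50)

x = cp.Variable(100)
lam = 1.0
objective = cp.Minimize(0.5 * cp.sum_squares(A @ x - b) + lam * cp.norm1(x))
prob = cp.Problem(objective)       # DCP rules verify convexity automatically
prob.solve()                       # canonicalize, then call an appropriate solver
print(prob.value, np.count_nonzero(np.abs(x.value) > 1e-6))
```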
Neural network training is not convex. The loss of a deep network is highly non-convex in the weights. The optimization landscape has saddle points (which gradient descent escapes with noise), flat regions, and possibly many local minima of similar quality. Despite this, the first-order tools from convex optimization (gradient descent, momentum, adaptive methods) work remarkably well. The convex theory provides the vocabulary - convergence rates, condition numbers, Lipschitz constants - that practitioners use to diagnose and fix training problems, even outside the regime where the theory is formally valid.
Read Next: