Convex Optimization - The Landscape Where Gradient Descent Always Wins // Megha Bose

Helpful context:

Here is a fact that sounds surprising at first: a large fraction of classical machine learning is convex optimization in disguise.

Logistic regression - convex. Support vector machines - convex. Ridge regression - convex. LASSO - convex. Maximum likelihood estimation of a Gaussian mixture (with known covariances and fixed assignments) - convex. Most of probabilistic graphical model inference via variational methods - convex. Many of the most important algorithms in statistics and machine learning are, at their core, solving a convex optimization problem, even if this is not immediately apparent from the way they are described.

Neural networks are not convex. But understanding convex optimization gives you the benchmark: you know what a reliably solvable problem looks like, you know what guarantees are available, and you know what you are giving up when you step into the non-convex world.

Section 1: Standard Form

Every convex optimization problem can be written in a standard form that makes its structure explicit.

Standard form convex program:

$$\min_{x \in \mathbb{R}^n} f_0(x) \quad \text{subject to} \quad f_i(x) \leq 0 ; (i = 1, \ldots, m), \quad Ax = b.$$

Three ingredients make this a convex program:

The objective $f_0$ is convex.
Each inequality constraint function $f_i$ is convex. (Convex functions satisfy $f_i(x) \leq 0$ on convex sets - the feasible region for each constraint is convex.)
The equality constraints $Ax = b$ are affine. (Affine functions are simultaneously convex and concave.)

The feasible set $\mathcal{F} = \{x : f_i(x) \leq 0 \text{ for all } i, Ax = b\}$ is the intersection of convex sets, hence convex. Combined with the convex objective: any local minimum is a global minimum.

This may seem too restrictive. What problems actually fit this template? A large and important class.

Section 2: A Hierarchy of Problem Classes

Convex programs form a hierarchy from most restricted to most general. Each level has specialized algorithms that exploit the additional structure. Knowing which class a problem belongs to determines which tools to reach for.

Linear Programs (LP). Minimize a linear objective subject to linear inequality and equality constraints:

$$\min_{x} c^T x \quad \text{subject to} \quad Ax \leq b, \quad Cx = d.$$

The objective $f_0(x) = c^T x$ is linear (hence convex). The constraints $a_i^T x \leq b_i$ are linear inequality constraints. The feasible set is a polytope - the intersection of finitely many halfspaces, a convex polyhedron.

The optimal solution of an LP always occurs at a vertex of the polytope (if a finite optimum exists). The Simplex method exploits this: it moves from vertex to vertex, decreasing the objective at each step, and terminates at the optimal vertex. Interior-point methods traverse the interior instead. Both run in polynomial time in theory; the simplex method is usually faster in practice.

Linear programming is the backbone of operations research: resource allocation, transportation and network problems, production planning, diet problems, least-cost scheduling. The airline industry uses LP solvers constantly.

Quadratic Programs (QP). A quadratic (convex) objective, linear constraints:

$$\min_{x} \frac{1}{2}x^T P x + q^T x + r \quad \text{subject to} \quad Gx \leq h, \quad Ax = b,$$

where $P \succeq 0$ (positive semidefinite). The squared loss $|y - Xw|^2$ with linear constraints is a QP in $w$. The SVM (with the objective $\frac{1}{2}|w|^2$ and linear margin constraints) is a QP. Portfolio optimization (minimize variance $w^T \Sigma w$ subject to return $\mu^T w \geq r$ and budget $\sum w_i = 1$) is a QP.

QPs are more expressive than LPs (the optimal solution is not necessarily at a vertex of the feasible polytope), and solvers are slower per problem, but still efficient.

Second-Order Cone Programs (SOCP). Allow constraints involving second-order (Euclidean) norms. More general than QP. Used in robust optimization (worst-case performance over an uncertainty set), antenna beam forming, and portfolio problems with transaction costs.

Semidefinite Programs (SDP). The most expressive standard convex class. Constraints can involve positive semidefinite matrix inequalities $X \succeq 0$. LPs, QPs, and SOCPs are all special cases of SDPs.

SDPs arise in:

Control theory (Lyapunov stability conditions)
Combinatorial approximation (the Goemans-Williamson MAX-CUT relaxation achieves a 0.878 approximation ratio by relaxing an integer QP to an SDP)
Sum-of-squares polynomial optimization (bounding the minimum of a polynomial)
Matrix completion (nuclear norm minimization as a convex relaxation of rank minimization)

The hierarchy matters computationally. LP is the fastest class - interior-point LP solvers handle millions of variables. SDP solvers are orders of magnitude slower - problems with more than a few thousand variables require specialized structure. Knowing your problem class is knowing your computational budget.

Section 3: Duality

Every convex optimization problem has a companion - the dual problem. Understanding duality is understanding one of the deepest structural features of optimization.

Building the dual. Start with the primal:

$$p^\ast = \min_x f_0(x) \quad \text{subject to} \quad f_i(x) \leq 0, ; Ax = b.$$

Form the Lagrangian with dual variables $\lambda \geq 0$ (for inequality constraints) and $\nu$ (unconstrained, for equality constraints):

$$\mathcal{L}(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \nu^T(Ax - b).$$

The dual function minimizes the Lagrangian over $x$:

$$g(\lambda, \nu) = \inf_x \mathcal{L}(x, \lambda, \nu).$$

Why $g(\lambda, \nu) \leq p^\ast$. For any feasible $x$ (satisfying all constraints) and any $\lambda \geq 0$:

$$\mathcal{L}(x, \lambda, \nu) = f_0(x) + \underbrace{\sum_i \lambda_i f_i(x)}{\leq 0 \text{ since } f_i \leq 0, \lambda_i \geq 0} + \underbrace{\nu^T(Ax-b)}{= 0 \text{ since feasible}} \leq f_0(x).$$

So $g(\lambda, \nu) = \inf_x \mathcal{L} \leq \mathcal{L}(x^\ast, \lambda, \nu) \leq f_0(x^\ast) = p^\ast$.

The dual function provides a lower bound on the primal optimum for any choice of $\lambda \geq 0$ and $\nu$. The dual problem finds the best (tightest) lower bound:

$$d^\ast = \max_{\lambda \geq 0, \nu} g(\lambda, \nu).$$

Weak duality: $d^\ast \leq p^\ast$. Always true.

Strong duality: $d^\ast = p^\ast$. Holds for convex problems under Slater’s condition.

Discomfort check. Weak duality always holds. Strong duality requires conditions. What conditions, and why?

Slater’s condition: there exists a strictly feasible point - a point $\hat{x}$ with $f_i(\hat{x}) < 0$ for all $i$ and $A\hat{x} = b$ (strictly satisfying all inequality constraints, with slack). Strict feasibility means the feasible region is not degenerate (not just a single boundary point). When Slater’s condition holds, the dual gap $p^\ast - d^\ast$ is zero.

Why does strict feasibility close the gap? The intuition comes from separation theorems: at the optimum of the primal, if the only feasible point is a corner of the feasible region, the dual might not be able to “see” the optimum from below. Strict feasibility gives the feasible region enough interior that the dual function can accurately track the primal.

For linear programs, Slater’s condition reduces to primal feasibility (since all constraints are linear, any feasible point is strictly feasible after perturbation). Strong duality holds for any feasible LP with a finite optimum.

Section 4: Why Duality Matters

Duality is not just an abstract theorem. It has three concrete applications.

The dual is sometimes easier. Consider an LP with $n = 10^6$ variables and $m = 100$ constraints. The dual LP has $m = 100$ variables and $n = 10^6$ constraints - much smaller. Solving the dual and reading off the primal solution is often faster.

Duality provides certificates. Suppose you find a primal-feasible point $\bar{x}$ with $f_0(\bar{x}) = p$ and a dual-feasible point $(\bar{\lambda}, \bar{\nu})$ with $g(\bar{\lambda}, \bar{\nu}) = d$, and $p - d \leq \epsilon$. This is a duality gap certificate: the primal solution is within $\epsilon$ of optimal. No trust required. The gap $p - d$ bounds the suboptimality without knowing the true $p^\ast$.

This is crucial in practice. When you run an interior-point solver, it computes both a primal and dual iterate and monitors the duality gap. When the gap falls below a threshold, the solver terminates with a provably near-optimal solution.

The dual encodes the shadow prices. The optimal dual variables $\lambda_i^\ast$ are the Lagrange multipliers at the primal optimum. As discussed in the previous post, $\lambda_i^\ast = -\partial p^\ast / \partial c_i$ (the derivative of the optimal value with respect to the constraint level). The dual solution tells you the marginal value of relaxing each constraint - the shadow prices. This is essential for sensitivity analysis: “if I increase my budget by $\delta$, how much does my optimal solution improve?”

Section 5: Subgradients

All of the theory so far has assumed smooth objectives. But many important convex functions are not differentiable.

The L1 norm $|x|_1$ is non-differentiable along coordinate hyperplanes.
The hinge loss $\max(0, 1-t)$ has a kink at $t = 1$.
The maximum of smooth functions is smooth almost everywhere but has kinks where the argmax changes.
The nuclear norm $|A|_* = \sum_i \sigma_i(A)$ (sum of singular values) is non-differentiable.

For non-smooth functions, gradients do not exist everywhere. But a weaker object does: the subgradient.

Definition. A vector $g \in \mathbb{R}^n$ is a subgradient of $f$ at $x$ if for all $y$:

$$f(y) \geq f(x) + g \cdot (y - x).$$

The subgradient condition says: $g$ gives a supporting hyperplane at $x$ that lies everywhere below the graph of $f$. This is the first-order convexity condition for smooth functions, but extended to non-differentiable points.

The set of all subgradients at $x$ is the subdifferential $\partial f(x)$.

Example: $f(x) = |x|$. At $x > 0$: $f$ is differentiable with $f'(x) = 1$, so $\partial f(x) = \{1\}$.

At $x < 0$: $f'(x) = -1$, so $\partial f(x) = \{-1\}$.

At $x = 0$: any $g \in [-1, 1]$ satisfies $|y| \geq 0 + g \cdot (y - 0) = gy$ for all $y$ (since $|y| \geq |g||y| \geq gy$ when $|g| \leq 1$). So $\partial f(0) = [-1, 1]$ - the entire interval.

The optimality condition. For a convex function, $x^\ast$ is a global minimum if and only if $0 \in \partial f(x^\ast)$. At smooth points, this is $\nabla f(x^\ast) = 0$. At non-smooth points, the zero vector must be in the subdifferential.

For $f(x) = |x|$: $0 \in \partial f(0) = [-1, 1]$. Yes - so $x^\ast = 0$ is the global minimum. Correct.

Discomfort check. The subdifferential at a non-differentiable point is an entire set, not a single vector. Which subgradient do you use in an algorithm?

Any subgradient works for the optimality check: $0 \in \partial f(x^\ast)$ tests whether the zero vector is anywhere in the subdifferential. For optimization algorithms, you pick any element of the subdifferential at each step. The subgradient method is: pick any $g_t \in \partial f(x_t)$, take the step $x_{t+1} = x_t - \eta_t g_t$. Convergence is guaranteed (with decaying step sizes $\eta_t$) but the rate is $O(1/\sqrt{t})$ - slower than gradient descent’s $O(1/t)$ for smooth objectives. Non-smoothness has a real computational cost.

Section 6: Proximal Methods

For objectives of the form $h(x) = f(x) + g(x)$, where $f$ is smooth and convex but $g$ is convex and non-smooth, neither pure gradient descent (does not handle non-smoothness) nor the subgradient method (ignores the smooth structure of $f$) is ideal. Proximal gradient methods split the problem and handle each part differently.

The central operation is the proximal operator of $g$:

$$\text{prox}_{g}(v) = \underset{x}{\arg\min} \left\{ \frac{1}{2}|x - v|^2 + g(x) \right\}.$$

The proximal operator finds the point minimizing a tradeoff: stay close to $v$ (the quadratic penalty) while also keeping $g$ small. It is a “soft projection” - like projecting $v$ onto a lower level set of $g$, but softly rather than hard.

The proximal gradient step is:

Gradient step on $f$: $v = x_t - \eta \nabla f(x_t)$.
Proximal step on $g$: $x_{t+1} = \text{prox}_{\eta g}(v)$.

The proximal step is applied with a scaled version of $g$: $\text{prox}_{\eta g}(v) = \arg\min_x \{ \frac{1}{2}|x-v|^2 + \eta g(x) \}$.

The key insight: step 2 solves a small optimization problem exactly (in closed form, for many important choices of $g$). Non-smooth objectives with structure can be handled exactly rather than approximately.

Convergence. For convex $f$ (smooth, $L$-smooth) and convex $g$ (possibly non-smooth), proximal gradient with step size $1/L$ converges at $O(1/t)$ - the same rate as gradient descent on smooth objectives. The non-smooth part does not slow things down, because the proximal operator resolves it exactly at each step.

Section 7: Soft Thresholding and LASSO

The most important proximal operator in machine learning is the proximal operator of the L1 norm: $g(x) = \lambda|x|_1$.

For a scalar $v$, the proximal problem is:

$$\text{prox}_{\eta\lambda|x|}(v) = \underset{x}{\arg\min} \left\{ \frac{1}{2}(x - v)^2 + \eta\lambda|x| \right\}.$$

The objective is: pay $(x-v)^2/2$ to move away from $v$, and pay $\eta\lambda|x|$ for the size of $x$. The optimal tradeoff has a closed form.

For $v > 0$: the minimum is at $x = v - \eta\lambda$ if $v > \eta\lambda$, and at $x = 0$ if $v \leq \eta\lambda$. For $v < 0$: by symmetry, $x = v + \eta\lambda$ if $|v| > \eta\lambda$, and $x = 0$ if $|v| \leq \eta\lambda$. For $v = 0$: $x = 0$.

In one formula:

$$\text{prox}_{\eta\lambda|\cdot|}(v) = \text{sign}(v) \cdot \max(|v| - \eta\lambda, 0).$$

This is soft thresholding. Values with magnitude less than $\eta\lambda$ are set to exactly zero. Values with magnitude larger than $\eta\lambda$ are moved toward zero by exactly $\eta\lambda$. For a vector $v \in \mathbb{R}^n$, soft thresholding applies coordinate-wise.

The LASSO problem is:

$$\min_w \frac{1}{2}|y - Xw|^2 + \lambda|w|_1.$$

This is $h(w) = f(w) + g(w)$ with $f(w) = \frac{1}{2}|y-Xw|^2$ (smooth, strongly convex) and $g(w) = \lambda|w|_1$ (non-smooth). Proximal gradient gives:

Gradient step on $f$: $v = w - \eta X^T(Xw - y)$.
Soft threshold: $w_{\text{new}} = \text{sign}(v) \cdot \max(|v| - \eta\lambda, 0)$ (coordinate-wise).

Repeat. This is the iterative shrinkage thresholding algorithm (ISTA), the canonical algorithm for LASSO. Each iteration is $O(mn)$ for an $m \times n$ design matrix $X$, and convergence is $O(1/t)$ in objective error.

Section 8: Why L1 Promotes Sparsity

L1 and L2 regularization are both convex. Both shrink coefficients toward zero. But the L1 penalty produces exact zeros (sparse solutions) while L2 does not. Understanding why requires looking at the geometry.

The constraint form. The regularized problem $\min_w \ell(w) + \lambda R(w)$ is equivalent to $\min_w \ell(w)$ subject to $R(w) \leq t$ (for appropriate $t$ depending on $\lambda$). The regularizer acts as a constraint on the norm of $w$.

For L2: the feasible set $\{w : |w|^2 \leq t\}$ is a ball. Level curves of a generic loss function $\ell$ are smooth ellipsoids. The point where the loss is minimized on the ball boundary is generically a smooth point on the sphere, not on any coordinate axis. The optimal $w$ has all nonzero components.

For L1: the feasible set $\{w : |w|_1 \leq t\}$ is a diamond (in 2D) or a cross-polytope (in higher dimensions). This set has corners on the coordinate axes - points like $(t, 0, \ldots, 0)$ and permutations. When you minimize a generic smooth loss on this set, the minimum generically occurs at a corner, because the loss gradient can be “parallel” to the flat faces of the diamond (which forces the minimum to the adjacent corner). Corners have many coordinates equal to zero. The solution is sparse.

Algebraically: the subdifferential of $|w_j|$ at $w_j = 0$ is the entire interval $[-1, 1]$. The optimality condition $0 \in \partial h$ allows the gradient of the loss with respect to $w_j$ to be any value in $[-\lambda, \lambda]$ while still giving $w_j^\ast = 0$. A whole range of loss gradients is compatible with exact zeros. For L2, the subdifferential at $0$ is just $\{0\}$, so only exactly zero loss gradient produces $w_j^\ast = 0$, which generically does not happen.

Discomfort check. If L1 produces sparse solutions, should you always prefer L1 over L2?

No. The right choice depends on your beliefs about the true model:

If you believe many features are irrelevant (the true weight vector is sparse): use L1. LASSO will zero out the irrelevant features, producing an interpretable, sparse model.

If you believe all features contribute but no single feature dominates (the true weight vector is dense): use L2. Ridge regression shrinks all weights proportionally, giving stable predictions even with correlated features.

If you have correlated features: L2 tends to give similar weights to correlated features; L1 tends to pick one and ignore the others (arbitrary choice). Elastic net ($\alpha$L1 + $(1-\alpha)$L2) combines both.

The sparsity of L1 is a feature when you want model selection (which features matter?), and a drawback when you want stable prediction with correlated features.

Section 9: Coordinate Descent

Gradient descent updates all variables simultaneously. Coordinate descent updates one variable at a time:

$$x_j^{\text{new}} = \underset{x_j}{\arg\min} ; h(x_1, \ldots, x_{j-1}, x_j, x_{j+1}, \ldots, x_n),$$

with all other coordinates held fixed. Cycle through $j = 1, 2, \ldots, n, 1, 2, \ldots$ and repeat.

For some problems, each coordinate update has a simple closed form. The LASSO coordinate update for coordinate $j$:

$$w_j^{\text{new}} = \text{sign}(r_j) \cdot \max(|r_j| - \lambda, 0),$$

where $r_j = y_j^T (y - X_{-j} w_{-j}) / |x_j|^2$ is the partial residual after removing the contribution of all other coefficients ($X_{-j}$ is $X$ without column $j$, $w_{-j}$ is $w$ without $w_j$).

Each coordinate update is $O(m)$ (one dot product). A full cycle through all $n$ coordinates is $O(mn)$ - the same cost as one iteration of ISTA, but coordinate descent often converges in far fewer cycles for sparse solutions (it exploits the structure of which coordinates are already at zero).

For strictly convex differentiable objectives: coordinate descent converges to the global minimum.

For separable non-smooth terms like L1: coordinate descent with soft thresholding at each step converges to the global minimum (the objective $|y - Xw|^2 + \lambda|w|_1$ is convex; the non-smooth part $\lambda|w|_1 = \lambda\sum_j |w_j|$ is separable).

Coordinate descent is the method of choice for large-scale LASSO and Elastic Net. Popular implementations (glmnet in R, sklearn’s Lasso in Python) use coordinate descent with warm starting and active set tracking to skip coordinates that are already at zero.

Section 10: First-Order Method Convergence Summary

For large-scale problems where second-order methods are too expensive, first-order methods are the practical choice. Here is a summary of the main convergence rates:

Method	Problem Class	Rate
Gradient descent	Convex, $L$-smooth	$O(1/t)$
Gradient descent	$\mu$-strongly convex, $L$-smooth	$O(\rho^t)$, $\rho = 1 - \mu/L$
Nesterov accelerated	Convex, $L$-smooth	$O(1/t^2)$
Nesterov accelerated	$\mu$-strongly convex, $L$-smooth	$O(\rho^t)$, $\rho = 1 - \sqrt{\mu/L}$
Subgradient method	Convex, non-smooth	$O(1/\sqrt{t})$
Proximal gradient (ISTA)	Composite, $L$-smooth + non-smooth	$O(1/t)$
Accelerated proximal (FISTA)	Composite, $L$-smooth + non-smooth	$O(1/t^2)$

The $O(1/t^2)$ rate of Nesterov acceleration is optimal among first-order methods for smooth convex objectives - no algorithm using only gradient information can do better. This is a lower bound result: Nesterov proved that any first-order method requires at least $\Omega(1/t^2)$ iterations on some smooth convex function.

The comparison:

To achieve error $\epsilon$ with gradient descent: $O(1/\epsilon)$ steps.
To achieve error $\epsilon$ with Nesterov: $O(1/\sqrt{\epsilon})$ steps.
For $\epsilon = 10^{-6}$: gradient descent needs $10^6$ steps, Nesterov needs $10^3$ steps.

In practice, the difference is large. FISTA (Fast ISTA, which applies Nesterov acceleration to proximal gradient) is the standard first-order method for large-scale L1 regularization.

Section 11: The Convex Optimization Workflow

When you face an optimization problem, here is the structured approach:

Step 1: Is it convex? Verify the objective and constraint functions. Use the composition rules: non-negative combination, composition with affine, pointwise max. If convex - you are in the regime with guarantees.

Step 2: What class is it? Identify the problem as LP, QP, SOCP, or SDP. Use a modeling framework (CVX, CVXPY, JuMP) that automatically selects the right solver.

Step 3: What is the scale? Small ($n < 1000$): use interior-point methods. Interior-point solvers converge to machine precision in around 50 iterations using Newton steps. Large ($n > 10^6$): use first-order methods. Slower per unit accuracy but cheap per iteration.

Step 4: Is there special structure? Composite objective ($f + g$)? Use proximal methods. Separable structure? Use coordinate descent. Sparsity in data ($X$ is sparse)? Exploit this in gradient and proximal computations.

Step 5: If non-convex: Name the non-convexity. Is there a convex relaxation (replace rank constraint with nuclear norm, replace $\{0,1\}$ with $[0,1]$)? Is the problem “almost convex” (nearly tight in strong duality)? Can you initialize well enough that a local method finds a good solution?

Convex optimization is both a theoretical framework and an engineering discipline. The theory tells you which problems are tractable and what guarantees are available. The engineering tells you which algorithms, implemented how, with which solver, will actually solve your problem in reasonable time. Mastery requires both.

Read Next:

The Modern Optimizer Zoo - Every Variant of Gradient Descent You Didn’t Ask For