Helpful context:


Train a linear regression model on any dataset. Gradient descent finds the minimum of the squared loss every single time, no matter where you start. Now train a neural network. Same algorithm, same basic idea - but now it matters enormously where you start, what your learning rate is, and whether you get unlucky with initialization. Sometimes it converges to a poor solution. Sometimes it does not converge at all.

Why the difference? Both problems involve minimizing a loss function over parameters. Both use gradient descent. The answer is one word: convexity. Linear regression has a convex loss. Neural networks do not. Convexity is the property that transforms a search problem - where are all the valleys, and which one is deepest? - into something you can solve reliably and certifiably.

This post is about what convexity is, why it is the right condition to care about, and what guarantees it provides.


Section 1: The Shape of a Bowl

Before any formulas, look at two functions: $f(x) = x^2$ and $g(x) = \sin(x)$.

The parabola $x^2$ has one valley. It goes down, reaches the bottom at $x = 0$, and comes back up forever. If you put a ball anywhere on this curve, it rolls to the same place. There is one minimum and only one.

The function $\sin(x)$ waves up and down forever. Every trough is a local minimum - a place where the function is smaller than its immediate neighbors - but none of them is the global minimum, because the sine function has no global minimum at all on $\mathbb{R}$. If you put a ball on $\sin(x)$, it rolls into whichever trough happens to be nearby. Different starting positions, different resting places.

The distinction is convexity. The parabola is convex. The sine curve is not.

Here is the geometric test that separates them. Pick any two points on the parabola - say $(-2, 4)$ and $(2, 4)$. Draw the straight line segment connecting them. The segment passes through $(0, 4)$. But the parabola at $x = 0$ is at height $0$. The segment lies above the curve between the two chosen points.

Now try the same test with $\sin(x)$. Pick $(0, 0)$ and $(\pi, 0)$. The line segment connecting them is the horizontal line at height $0$. The sine curve between those $x$-values rises up to $1$ at $x = \pi/2$. The curve lies above the segment - the chord dips below the function. Convexity fails.

This is the convexity condition in geometric form: a function is convex if every chord (line segment connecting any two points on the graph) lies above or on the graph. Convex means bowl-shaped, with no interior dips. Non-convex means there exists at least one chord that passes below the function somewhere between its endpoints.


Section 2: The Formal Definition

Now that the geometry is clear, here is the precise condition.

Definition. A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if for all $x, y$ in its domain and all $\lambda \in [0, 1]$:

$$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y).$$

Read this carefully. The expression $\lambda x + (1-\lambda)y$ is a convex combination of $x$ and $y$: when $\lambda = 1$ you get $x$; when $\lambda = 0$ you get $y$; when $\lambda = 1/2$ you get the midpoint. As $\lambda$ varies over $[0, 1]$, you trace out the entire line segment from $x$ to $y$.

The left side is $f$ evaluated at that intermediate point. The right side is the corresponding point on the chord: a weighted average of $f(x)$ and $f(y)$ with the same weights. The inequality says: the function value at any intermediate point is at most as large as the corresponding height on the chord. The chord lies above (or on) the function.

The function is strictly convex if the inequality is strict for $x \neq y$ and $\lambda \in (0, 1)$: the chord lies strictly above the function, never touching except at the endpoints. Strict convexity rules out flat regions and ensures a unique minimum.

Discomfort check. Convex function vs. convex set - different things, related but distinct.

A convex set $C$ is a set where the line segment between any two points in $C$ stays inside $C$. A disk is convex. A crescent shape is not - you can find two points such that the segment between them passes outside the crescent. A convex set is a geometric containment property.

A convex function is a function where chords lie above the graph. The two concepts connect in one precise way: a function $f$ is convex if and only if its epigraph - the set of points on or above the graph, $\{(x, t) : t \geq f(x)\}$ - is a convex set. But in general, a convex function and a convex set are different kinds of mathematical objects. You can have a convex function defined on a non-convex domain, and a convex set that has nothing to do with any function. Keep them separate.


Section 3: Three Equivalent Conditions for Smooth Functions

For smooth functions, two additional characterizations of convexity are often more useful than the definition.

The gradient inequality. Suppose $f$ is differentiable. Then $f$ is convex if and only if, for all $x$ and $y$:

$$f(y) \geq f(x) + \nabla f(x) \cdot (y - x).$$

In words: the function lies above all its tangent planes (or tangent lines in one dimension). If you build the linear approximation to $f$ at any point $x$ - the tangent plane - then $f$ is everywhere at or above that plane. The tangent plane supports the function from below.

This characterization is powerful for optimization. If $\nabla f(x^\ast) = 0$ and $f$ is convex, then the gradient inequality gives $f(y) \geq f(x^\ast) + 0 \cdot (y - x^\ast) = f(x^\ast)$ for all $y$. A stationary point of a convex function is a global minimum. No second-derivative check needed. No additional verification. The zero gradient combined with convexity is sufficient.

The Hessian condition. Suppose $f$ is twice differentiable. Then $f$ is convex if and only if the Hessian matrix $H_f(x)$ is positive semidefinite at every point $x$:

$$H_f(x) \succeq 0 \quad \text{for all } x.$$

In one dimension, this reduces to $f''(x) \geq 0$ for all $x$. The second derivative measures how the slope is changing - positive means the slope is increasing, which means the function bends upward. Non-negative curvature everywhere is the analytical version of bowl-shapedness.

For a vector-valued input, the Hessian is the matrix of all second-order partial derivatives. Positive semidefinite means every directional second derivative is non-negative: $v^T H_f(x) v \geq 0$ for all directions $v$. The function bends upward (or stays flat) in every direction.

Checking the examples:

$f(x) = x^2$: $f''(x) = 2 > 0$ everywhere. Convex.

$f(x) = \sin(x)$: $f''(x) = -\sin(x)$, which equals $-1$ at $x = \pi/2$. Not convex on all of $\mathbb{R}$.

$f(x) = e^x$: $f''(x) = e^x > 0$ everywhere. Convex.

$f(x) = -\ln(x)$ for $x > 0$: $f''(x) = 1/x^2 > 0$. Convex.

$f(x) = |x|$: not differentiable at $0$, but the chord condition holds everywhere. Convex.


Section 4: Why Local Equals Global - the Central Theorem

This is the key theorem of convex optimization. Everything else - convergence guarantees, the value of convex formulations, the tractability of these problems - follows from this.

Theorem. If $f$ is convex and $x^\ast$ is a local minimum - meaning $f(x^\ast) \leq f(x)$ for all $x$ in some open neighborhood of $x^\ast$ - then $x^\ast$ is a global minimum: $f(x^\ast) \leq f(x)$ for all $x$.

Before proving it, make sure you feel why this should not be obvious. A local minimum is just a place where nearby points are worse. Nothing in the definition of local minimum prevents a faraway valley from being much deeper. That is exactly what happens with $\sin(x)$ or with neural network losses - local minima exist that are not global minima. The claim is that for convex functions, this cannot happen.

Proof. Suppose for contradiction that $x^\ast$ is a local minimum but not a global minimum. Then there exists $y$ with $f(y) < f(x^\ast)$.

Consider points along the segment from $x^\ast$ to $y$: let $x_\lambda = (1-\lambda)x^\ast + \lambda y$ for $\lambda \in [0, 1]$.

By the convexity inequality:

$$f(x_\lambda) = f((1-\lambda)x^\ast + \lambda y) \leq (1-\lambda)f(x^\ast) + \lambda f(y).$$

Since $f(y) < f(x^\ast)$, we have $(1-\lambda)f(x^\ast) + \lambda f(y) < (1-\lambda)f(x^\ast) + \lambda f(x^\ast) = f(x^\ast)$.

So $f(x_\lambda) < f(x^\ast)$ for every $\lambda \in (0, 1]$.

Now, for small enough $\lambda > 0$, the point $x_\lambda = (1-\lambda)x^\ast + \lambda y$ is arbitrarily close to $x^\ast$. The line segment from $x^\ast$ toward $y$ passes through the neighborhood of $x^\ast$ that certifies the local minimum, and every point on this segment (other than $x^\ast$ itself) has smaller function value than $x^\ast$.

This contradicts $x^\ast$ being a local minimum. The assumption that a better point $y$ exists is false. So $x^\ast$ is a global minimum. $\square$

The proof shows exactly what convexity contributes. The chord condition forces the function to decrease along the segment from $x^\ast$ toward any supposedly better point $y$. This creates a descent path starting at $x^\ast$ - but local minima have no descent direction in any nearby direction. The only resolution: no better point $y$ exists.

Discomfort check. Why does this fail for non-convex functions?

For $g(x) = x^4 - x^2$: compute $g'(x) = 4x^3 - 2x = 2x(2x^2 - 1)$. Critical points at $x = 0$ and $x = \pm 1/\sqrt{2}$. The second derivative $g''(x) = 12x^2 - 2$ gives $g''(0) = -2 < 0$ (local max at $0$) and $g''(\pm 1/\sqrt{2}) = 4 > 0$ (local minima at $\pm 1/\sqrt{2}$). Both local minima give $g(\pm 1/\sqrt{2}) = 1/4 - 1/2 = -1/4$. Equal by symmetry, so neither is strictly worse.

If you perturb the function slightly - say $g(x) = x^4 - x^2 + 0.1x$ - the symmetry breaks and one local minimum becomes strictly better than the other. Gradient descent starting from a positive initial point finds the right-side minimum; starting from a negative point finds the left-side minimum. The chord condition fails for this function (you can find two points where the chord dips below), which is why multiple local minima exist and the theorem does not apply.


Section 5: Examples of Convex Functions

Building fluency with convexity means knowing the main examples and understanding why each one is convex.

$f(x) = x^2$. Convex. Second derivative is $2 > 0$ everywhere. The prototypical convex function. The squared loss in linear regression is a sum of terms of this form, which is why linear regression has a convex loss.

$f(x) = e^x$. Convex. $f''(x) = e^x > 0$ everywhere. The exponential curves upward at every point. Exponentials appear in softmax outputs, cross-entropy loss, and log-sum-exp, all of which inherit this convexity.

$f(x) = |x|$. Convex but not differentiable at $x = 0$. Piecewise linear with slopes $-1$ and $+1$. The chord condition holds: for any two points on the graph of $|x|$, the segment connecting them lies above the graph. This is the L1 norm in one dimension, and it is the key to LASSO and sparse optimization.

$f(x) = \max(0, x)$ (ReLU). Convex. The maximum of two convex functions ($0$ and $x$) is convex. Piecewise linear: flat for $x < 0$, slope $1$ for $x > 0$. The fact that ReLU is convex does not make neural networks with ReLU activations convex - the composition with weight matrices breaks convexity.

$f(x) = -\ln(x)$ for $x > 0$. Strictly convex. Second derivative $1/x^2 > 0$. Appears in maximum likelihood estimation: maximizing $\log p$ is equivalent to minimizing $-\log p$, and the negative log is convex. Cross-entropy loss is a sum of terms of this form.

$f(\mathbf{x}) = |\mathbf{x}|^2 = \sum_i x_i^2$. Strictly convex. The Hessian is $2I$, which is positive definite. This is the L2 regularization term in ridge regression. Its strict convexity is why ridge regression always has a unique solution.

$f(\mathbf{x}) = |\mathbf{x}|_1 = \sum_i |x_i|$. Convex but non-differentiable along coordinate hyperplanes. The L1 norm is a sum of convex functions, hence convex. LASSO adds this to the squared loss and the sum is still convex.


Section 6: Examples of Non-Convex Functions

$f(x) = \sin(x)$ on $\mathbb{R}$. Non-convex. Second derivative $-\sin(x)$ is negative at $x = \pi/2$. Oscillates, creating infinitely many local minima.

$f(x) = x^4 - x^2$. Non-convex. Has two symmetric local minima near $x = \pm 1/\sqrt{2}$, with a local maximum at $x = 0$. Second derivative $12x^2 - 2$ is negative in the interval $(-1/\sqrt{6}, 1/\sqrt{6})$.

$f(x) = x^3$. Neither convex nor concave on all of $\mathbb{R}$. Convex for $x > 0$ (second derivative $6x > 0$) and concave for $x < 0$ (second derivative $6x < 0$).

Neural network losses. The quintessential non-convex problem in modern machine learning. Why?

A neural network computes a composition of affine maps and nonlinearities:

$$F(\mathbf{x}; W_1, \ldots, W_L) = W_L \sigma(\cdots \sigma(W_2 \sigma(W_1 \mathbf{x})) \cdots).$$

The loss is then some function of this output compared to the labels. Even a single hidden layer creates non-convexity. Consider a two-layer network with one hidden unit: the output is $w_2 \sigma(w_1 x)$. Swapping $w_1 \to -w_1$ and $w_2 \to -w_2$ gives the same output if $\sigma$ is an odd function. This means the loss has two identical global minima with different weight configurations - violating strict convexity immediately.

More generally, permuting the hidden neurons (and correspondingly adjusting the output weights) leaves the function invariant. An $n$-neuron hidden layer has $n!$ equivalent weight configurations giving the same loss value. The loss landscape has a complicated symmetry structure with exponentially many equivalent local minima and vast numbers of saddle points.

Yet gradient descent works on neural networks in practice. How? In high dimensions, most local minima are approximately equally good, and saddle points are much more common than strict local minima. The optimization landscape is non-convex but not adversarial - gradient descent gets stuck at saddle points less often than feared, and the local minima it finds are usually close in value to the global minimum.


Section 7: Operations That Preserve Convexity

You rarely need to verify convexity from scratch. Instead, you recognize convex building blocks and use operations that preserve convexity.

Non-negative linear combinations. If $f_1, \ldots, f_k$ are convex and $\alpha_1, \ldots, \alpha_k \geq 0$, then $\sum_i \alpha_i f_i$ is convex. The squared loss $\sum_i (y_i - \hat{y}_i)^2$ is convex because it is a sum (non-negative combination) of functions $r_i^2$, each of which is convex in $r_i$. Combined with the affine dependence on parameters, the entire squared loss is convex in the parameters.

Composition with an affine map. If $f$ is convex and $A\mathbf{x} + b$ is an affine function of $\mathbf{x}$, then $g(\mathbf{x}) = f(A\mathbf{x} + b)$ is convex in $\mathbf{x}$. The squared loss $|\mathbf{y} - X\mathbf{w}|^2$ is convex in $\mathbf{w}$ because it is the composition of the convex function $|\cdot|^2$ with the affine map $\mathbf{w} \mapsto \mathbf{y} - X\mathbf{w}$.

Pointwise maximum. If $f_1, \ldots, f_k$ are convex, then $h(\mathbf{x}) = \max(f_1(\mathbf{x}), \ldots, f_k(\mathbf{x}))$ is convex. The hinge loss $\max(0, 1 - y\hat{y})$ is the max of two linear (hence convex) functions, so it is convex in $\hat{y}$.

Monotone composition. If $f$ is convex and nondecreasing, and $g$ is convex, then $f(g(\mathbf{x}))$ is convex. Example: since $e^t$ is convex and nondecreasing, and $f(\mathbf{x})$ is convex, the function $e^{f(\mathbf{x})}$ is convex.

Using these rules, you can read off the convexity of logistic regression’s cross-entropy loss in seconds: it is a sum of terms $-y_i \log \sigma(\mathbf{w}^T \mathbf{x_i}) - (1-y_i)\log(1-\sigma(\mathbf{w}^T \mathbf{x_i}))$. Each term is a composition of convex functions with an affine map - convex by the composition rule. The sum of convex functions is convex. No Hessian computation needed.


Section 8: Convex Sets

Optimization problems rarely ask you to minimize over all of $\mathbb{R}^n$. They impose constraints: the parameters must satisfy some feasibility condition. The set of feasible points is the feasible set. For the theory of convex optimization to work cleanly, this set must be convex.

Definition. A set $C \subseteq \mathbb{R}^n$ is convex if for any two points $x, y \in C$ and any $\lambda \in [0, 1]$, the convex combination $\lambda x + (1-\lambda)y \in C$.

Geometrically: any line segment connecting two points inside the set stays inside the set. A disk is convex. A crescent, a doughnut, or any set with a dent or hole is not.

Why convex feasible sets matter. If you are minimizing a convex function over a convex feasible set, the local-equals-global theorem still holds. Any point that is locally optimal (no feasible improvement nearby) is globally optimal. Gradient descent and its variants still converge to the global optimum when both the objective and the feasible set are convex.

If the feasible set is non-convex - integers, for instance, or a union of disconnected regions - then even a convex objective can have local optima that are not global. Integer programming is NP-hard precisely because the feasible set (integer lattice) is not convex.

Examples of convex sets:

  • Halfspaces: $\{x : a^T x \leq b\}$. The feasible region of each linear constraint in an LP is a halfspace.

  • Balls: $\{x : |x| \leq r\}$. The constraint $|\mathbf{w}| \leq r$ used in regularized optimization.

  • The positive orthant: $\{x : x_i \geq 0 \text{ for all } i\}$. Non-negativity constraints.

  • The probability simplex: $\{p \geq 0 : \sum_i p_i = 1\}$. The feasible set for probability distributions.

  • Intersections of convex sets. The intersection of any collection of convex sets is convex. This is why linear programming feasible regions (intersections of halfspaces) are convex.

Examples of non-convex sets:

  • The integer lattice $\mathbb{Z}^n$. No line segment between two different lattice points stays on the lattice.

  • A sphere surface: $\{x : |x| = r\}$. Two antipodal points have a segment passing through the interior.

  • The set of rank-$k$ matrices: $\{A \in \mathbb{R}^{m \times n} : \text{rank}(A) \leq k\}$. Non-convex, which makes low-rank matrix completion hard.


Section 9: Strong Convexity

Convexity guarantees that any local minimum is global. But it says nothing about how quickly gradient descent finds that minimum, or even whether the minimum is unique.

Consider $f(x) = \max(0, |x| - 1)$ - a convex function that is exactly zero on the entire interval $[-1, 1]$. The minimum is achieved by infinitely many points. Gradient descent starting near $x = 0$ has no gradient signal telling it to move at all - it is already at a minimum, but so is every other point in $[-1, 1]$.

More subtly: consider a convex function whose level curves form a very elongated ellipse. In the steep direction, gradient descent makes fast progress. In the flat direction, gradient descent crawls. The flatness creates slow convergence even though the function is convex.

Strong convexity is the condition that rules out both problems.

Definition. A function $f$ is $\mu$-strongly convex (with parameter $\mu > 0$) if for all $x, y$ and $\lambda \in [0, 1]$:

$$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y) - \frac{\mu}{2}\lambda(1-\lambda)|x-y|^2.$$

The additional term $-\frac{\mu}{2}\lambda(1-\lambda)|x-y|^2$ is always non-positive (since $\lambda, 1-\lambda \geq 0$). This is a strictly stronger condition: the chord lies above the function, but by at least a quadratic amount that grows with the distance between $x$ and $y$.

Equivalently, for twice-differentiable $f$: $\mu$-strong convexity holds if and only if $H_f(x) \succeq \mu I$ for all $x$ - the Hessian is bounded below (in the positive semidefinite order) by $\mu$ times the identity. In one dimension: $f''(x) \geq \mu > 0$ everywhere.

Consequences of strong convexity:

  1. There is a unique global minimum $x^\ast$.

  2. The gradient inequality strengthens: $f(y) \geq f(x) + \nabla f(x) \cdot (y-x) + \frac{\mu}{2}|y-x|^2$. The function grows at least quadratically away from any point.

  3. Gradient descent converges geometrically fast - see the next section.

Example. Ridge regression minimizes $|y - Xw|^2 + \lambda|w|^2$. The Hessian is $2X^T X + 2\lambda I$. Even if $X^T X$ is singular (as happens when there are more features than samples), the regularization term adds $2\lambda I$ to the Hessian, making the smallest eigenvalue at least $2\lambda > 0$. Ridge regression is $2\lambda$-strongly convex. It always has a unique solution, and gradient descent always converges geometrically fast.


Section 10: Gradient Descent Convergence Rates

With convexity and strong convexity defined, we can state precisely what gradient descent achieves.

The setup: function $f$ is differentiable, and its gradient is $L$-Lipschitz continuous:

$$|\nabla f(x) - \nabla f(y)| \leq L|x - y| \quad \text{for all } x, y.$$

This condition ($L$-smoothness) means the gradient cannot change too abruptly. It implies the function cannot curve up more steeply than a parabola with parameter $L$. The gradient descent update with step size $1/L$ is guaranteed to decrease the function:

$$x_{t+1} = x_t - \frac{1}{L}\nabla f(x_t).$$

Discomfort check. Why step size exactly $1/L$?

The $L$-smoothness condition gives an upper bound on $f$: $f(y) \leq f(x) + \nabla f(x) \cdot (y-x) + \frac{L}{2}|y-x|^2$. Setting $y = x - \eta \nabla f(x)$ (a gradient step) and optimizing over $\eta$, the best choice is $\eta = 1/L$. Larger steps risk overshooting; the $L$-smoothness guarantees that $1/L$ is always safe in the sense that the function decreases by at least $\frac{1}{2L}|\nabla f(x)|^2$ at each step.

Theorem (Convex case). If $f$ is convex and $L$-smooth, gradient descent with step size $1/L$ satisfies:

$$f(x_t) - f(x^\ast) \leq \frac{L|x_0 - x^\ast|^2}{2t}.$$

The error after $t$ steps is at most $C/t$ for a constant $C$ depending on the initial distance to the optimum and the smoothness. This is $O(1/t)$ convergence. To achieve error $\epsilon$, you need $O(L/\epsilon)$ iterations.

Theorem (Strongly convex case). If $f$ is $\mu$-strongly convex and $L$-smooth, gradient descent with step size $1/L$ satisfies:

$$|x_t - x^\ast|^2 \leq \left(1 - \frac{\mu}{L}\right)^t |x_0 - x^\ast|^2.$$

The error decays by a constant factor $\rho = 1 - \mu/L < 1$ at each iteration. This is geometric (linear) convergence: the error halves every $O(\log(1/\rho))$ steps, regardless of how close you are to the minimum. To achieve error $\epsilon$, you need $O(\kappa \log(1/\epsilon))$ iterations, where $\kappa = L/\mu$ is the condition number.

The condition number $\kappa$ is the ratio of the maximum curvature to the minimum curvature. A perfect sphere ($\kappa = 1$) converges in one step. A very elongated ellipsoid ($\kappa = 10^6$) converges in $10^6 \log(1/\epsilon)$ steps - orders of magnitude more.

The practical implications:

  • Set the learning rate to $1/L$ (or equivalently, estimate $L$ and use it).
  • If convergence is slow, investigate the condition number. Ill-conditioned problems call for preconditioning (changing the geometry) or second-order methods.
  • The gap between $O(1/t)$ convex convergence and $O(\rho^t)$ strongly convex convergence is enormous. Adding even a tiny amount of L2 regularization ($\lambda > 0$) turns a convex problem into a strongly convex one with $\mu = 2\lambda$, changing convergence from polynomial to geometric.

Discomfort check. The strongly convex convergence rate is $O(\rho^t)$ - better than $O(1/t)$ convex convergence. Does that mean strongly convex problems are always easier?

In terms of convergence, yes - geometric beats polynomial. But the condition number $\kappa = L/\mu$ appears in the exponent. If $\kappa$ is large (even though $\mu > 0$), the decay factor $\rho = 1 - 1/\kappa$ is close to $1$ and convergence is slow. A function that is strongly convex with a tiny $\mu$ is technically in the geometric convergence regime but the constant makes it nearly as slow as $O(1/t)$. Strong convexity guarantees geometric convergence; the condition number determines how fast that geometric convergence actually is.


Section 11: Reading the Landscape

With all these tools, you can now approach any optimization problem with a diagnostic mindset.

First question: is the problem convex? Check whether the objective is convex (is it a non-negative combination of convex building blocks, composed with affine maps?) and whether the feasible set is convex (intersections of halfspaces, balls, simplex?). If both hold - you are in the good world. Any algorithm that decreases the objective and terminates at a stationary point has found the global optimum.

Second question: is it strongly convex? Check whether the Hessian has a bounded positive minimum eigenvalue. If so, gradient descent converges geometrically fast. If not, you are in the weaker $O(1/t)$ convergence regime. Adding L2 regularization is a simple fix when strong convexity is needed.

Third question: what is the condition number? Even for strongly convex problems, a large $\kappa$ makes gradient descent slow. Second-order methods (Newton, L-BFGS) are not affected by condition number the same way and are worth considering for small-to-medium problems.

If the problem is non-convex: you lose the local-equals-global guarantee. You need heuristics: random restarts, momentum-based methods that help escape flat regions, learning rate schedules. The convergence guarantees degrade. But understanding what the convex world looks like is what lets you recognize when you have left it and understand what you have lost.

The convex case is not a special case - it is the benchmark. Logistic regression, SVMs, linear regression, LASSO, ridge regression, and most of classical machine learning live here. Neural networks do not. The deep learning revolution required accepting that we would work outside the convex world and developing heuristics that work in practice even without theoretical guarantees.


Read Next: