Lagrange Multipliers & KKT - Optimization When You Can't Move Freely
Helpful context:
- Gradients & Partial Derivatives - Slopes in Every Direction at Once
- Convexity - When Local Minimum Means Global Minimum
You manage a small factory. You want to maximize production output. You can hire more workers or buy more machines - but your total budget is fixed at $100,000. Output is a function of workers and machines, but not just any combination is affordable.
You want to find the fastest route between two cities. You want to minimize travel time. But you cannot fly in a straight line - you must stay on the road network.
You want to find the point on the unit circle that is farthest northeast. The objective is clear (maximize $x + y$) and the constraint is clear (stay on $x^2 + y^2 = 1$). But substituting the constraint into the objective produces $\sqrt{1-y^2} + y$, which while solvable, has obscured all the geometric insight.
These are all constrained optimization problems: optimize $f$ subject to $g = c$. The naive approach - substitute the constraint directly into the objective and reduce dimensions - often produces messy algebra that hides what is really happening. There is a better approach, one that adds a new variable instead of eliminating one, and in doing so reveals the deep relationship between the objective and the constraint at the optimum.
Section 1: The Geometry of the Constrained Problem
Start with the simplest version: maximize $f(x, y)$ subject to the equality constraint $g(x, y) = c$.
Think about what the constraint means geometrically. The equation $g(x, y) = c$ defines a curve in the plane - the feasible set. You cannot optimize freely over all of $\mathbb{R}^2$; you must stay on this curve.
Now think about the objective. The level curves of $f$ - the sets $\{(x,y) : f(x,y) = k\}$ for various values of $k$ - form a family of curves filling the plane. As $k$ increases, these level curves shift (in a direction that depends on $f$). You want to find the largest $k$ for which the level curve $f = k$ still intersects the constraint curve $g = c$.
At the optimal point, something geometric happens: the level curve of $f$ and the constraint curve must be tangent. They touch but do not cross.
Why must they be tangent? Suppose they cross at the supposedly optimal point $(x^\ast, y^\ast)$. Then near $(x^\ast, y^\ast)$, the constraint curve passes through regions where $f > k$ (on one side of the crossing) and regions where $f < k$ (on the other side). Moving along the constraint curve in the direction where $f$ increases gives a feasible point with a higher objective value. So $(x^\ast, y^\ast)$ cannot be optimal.
Only when the two curves are tangent - when they are locally parallel and do not cross - is there no improving direction along the constraint.
Tangency of two curves at a point means their normal vectors (perpendiculars to the tangent lines) are parallel. The normal to the level curve $f = k$ at $(x^\ast, y^\ast)$ is $\nabla f(x^\ast, y^\ast)$ (the gradient is always perpendicular to level curves). The normal to the constraint curve $g = c$ is $\nabla g(x^\ast, y^\ast)$. Parallel normals means:
$$\nabla f(x^\ast, y^\ast) = \lambda \nabla g(x^\ast, y^\ast)$$
for some scalar $\lambda$.
This is the Lagrange condition.
Section 2: The Lagrange Multiplier
At the constrained optimum $(x^\ast, y^\ast)$, there exists a scalar $\lambda$ - the Lagrange multiplier - such that:
$$\nabla f(x^\ast, y^\ast) = \lambda \nabla g(x^\ast, y^\ast).$$
This says $\nabla f$ and $\nabla g$ are parallel at the optimum. Parallel vectors point in the same or opposite directions, meaning one is a scalar multiple of the other. The scalar is $\lambda$.
Writing out the two component equations:
$$\frac{\partial f}{\partial x}(x^\ast, y^\ast) = \lambda \frac{\partial g}{\partial x}(x^\ast, y^\ast),$$
$$\frac{\partial f}{\partial y}(x^\ast, y^\ast) = \lambda \frac{\partial g}{\partial y}(x^\ast, y^\ast).$$
Together with the constraint $g(x^\ast, y^\ast) = c$, this is a system of three equations in three unknowns: $x^\ast$, $y^\ast$, and $\lambda$.
Discomfort check. Parallel gradients - but what does that mean geometrically, and why is $\lambda$ not necessarily positive?
The gradient $\nabla f$ points in the direction of steepest ascent of $f$. The gradient $\nabla g$ points in the direction of steepest ascent of $g$. At the constrained optimum, moving along the constraint (in any direction tangent to $g = c$) must not improve $f$. This happens precisely when $\nabla f$ has no component along the constraint - when $\nabla f$ is entirely in the direction perpendicular to the constraint, which is the direction of $\nabla g$. Hence they are parallel.
The sign of $\lambda$ depends on whether $\nabla f$ and $\nabla g$ point the same way or opposite ways. When maximizing $f$ subject to a constraint that is “in the way” (the unconstrained maximum is outside the feasible set and the constraint pulls you inward), $\lambda$ can be positive or negative depending on the geometry. Unlike in the inequality constraint case (later in this post), there is no sign restriction on $\lambda$ for equality constraints.
Section 3: The Lagrangian
The Lagrange conditions $\nabla f = \lambda \nabla g$ plus the constraint $g = c$ come from a single elegant object.
Definition. The Lagrangian is the function:
$$\mathcal{L}(x, y, \lambda) = f(x, y) - \lambda(g(x, y) - c).$$
The Lagrangian treats $\lambda$ as a new variable alongside $x$ and $y$. Notice the structure: $f$ is the objective we want to optimize; $-\lambda(g - c)$ penalizes violation of the constraint. When the constraint is satisfied, $g = c$ and the second term is zero; the Lagrangian equals $f$. When the constraint is violated, the penalty is active.
Compute the gradient of $\mathcal{L}$ with respect to all three variables:
$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial f}{\partial x} - \lambda \frac{\partial g}{\partial x},$$
$$\frac{\partial \mathcal{L}}{\partial y} = \frac{\partial f}{\partial y} - \lambda \frac{\partial g}{\partial y},$$
$$\frac{\partial \mathcal{L}}{\partial \lambda} = -(g(x, y) - c).$$
Setting all three to zero:
- Setting $\partial \mathcal{L}/\partial x = 0$ and $\partial \mathcal{L}/\partial y = 0$ gives $\nabla f = \lambda \nabla g$ - exactly the Lagrange condition.
- Setting $\partial \mathcal{L}/\partial \lambda = 0$ gives $g(x, y) = c$ - exactly the constraint.
The Lagrangian encodes both the optimality condition and the constraint as stationarity of a single function in a higher-dimensional space. A constrained optimization problem in $(x, y)$ becomes an unconstrained stationarity problem in $(x, y, \lambda)$. We have not made the problem simpler - we have changed its form in a way that often makes it more tractable.
Section 4: Two Complete Examples
Example 1: Farthest point on the unit circle.
Maximize $f(x, y) = x + y$ subject to $g(x, y) = x^2 + y^2 = 1$.
We want the point on the unit circle where the sum $x + y$ is largest - the point farthest in the northeast direction.
The Lagrangian: $\mathcal{L} = x + y - \lambda(x^2 + y^2 - 1)$.
Setting the partial derivatives to zero:
$$\frac{\partial \mathcal{L}}{\partial x} = 1 - 2\lambda x = 0 \implies x = \frac{1}{2\lambda}.$$
$$\frac{\partial \mathcal{L}}{\partial y} = 1 - 2\lambda y = 0 \implies y = \frac{1}{2\lambda}.$$
So $x = y$. Substituting into the constraint: $x^2 + x^2 = 1$, giving $x = y = \frac{1}{\sqrt{2}}$.
The maximum value: $f = \frac{1}{\sqrt{2}} + \frac{1}{\sqrt{2}} = \sqrt{2}$.
The multiplier: $\lambda = \frac{1}{2x} = \frac{\sqrt{2}}{2}$.
Sanity check: $\nabla f = (1, 1)$ points northeast. $\nabla g = (2x, 2y) = (\sqrt{2}, \sqrt{2})$ also points northeast (radially outward from the origin). They are parallel with ratio $\lambda = \frac{\sqrt{2}}{2}$. The Lagrange condition holds.
There is also a local minimum of $f$ on the circle, at $(-1/\sqrt{2}, -1/\sqrt{2})$, giving $f = -\sqrt{2}$. The Lagrange conditions find all stationary points; you must check which are maxima and which are minima.
Example 2: Closest point on a line to the origin.
Minimize $f(x, y) = x^2 + y^2$ subject to $g(x, y) = x + y = 1$.
We want the point on the line $x + y = 1$ closest to the origin. Minimizing the squared distance is equivalent to minimizing distance (the square root is monotone).
The Lagrangian: $\mathcal{L} = x^2 + y^2 - \lambda(x + y - 1)$.
Setting partial derivatives to zero:
$$\frac{\partial \mathcal{L}}{\partial x} = 2x - \lambda = 0 \implies x = \frac{\lambda}{2}.$$
$$\frac{\partial \mathcal{L}}{\partial y} = 2y - \lambda = 0 \implies y = \frac{\lambda}{2}.$$
So $x = y$. From the constraint: $x + y = 1$ gives $2x = 1$, so $x = y = \frac{1}{2}$.
Minimum value: $f = (1/2)^2 + (1/2)^2 = 1/2$. Minimum distance to the origin: $1/\sqrt{2}$.
The geometric check: the closest point on a line to the origin is where the perpendicular from the origin meets the line. The line $x + y = 1$ has direction vector $(1, -1)$. The perpendicular from the origin to this line passes through $(1/2, 1/2)$ - consistent with our answer.
Section 5: Multiple Constraints
The framework extends naturally to multiple equality constraints. With constraints $g_1(x) = c_1, \ldots, g_m(x) = c_m$, introduce one multiplier per constraint:
$$\mathcal{L}(x, \lambda_1, \ldots, \lambda_m) = f(x) - \sum_{i=1}^m \lambda_i (g_i(x) - c_i).$$
Setting $\nabla_x \mathcal{L} = 0$ gives the stationarity condition:
$$\nabla f(x) = \sum_{i=1}^m \lambda_i \nabla g_i(x).$$
At the optimum, the gradient of the objective is a linear combination of the constraint gradients. Geometrically: the gradient of $f$ has no component in any direction tangent to the constraint manifold (the intersection of all constraint surfaces). Whatever improvement direction $\nabla f$ suggests is entirely in the “normal” direction to the feasible set - there is no feasible improving direction.
Setting $\partial \mathcal{L}/\partial \lambda_i = 0$ recovers each constraint $g_i(x) = c_i$.
The system of equations is: $n$ components from $\nabla_x \mathcal{L} = 0$, plus $m$ constraints, for $n + m$ equations in $n + m$ unknowns $(x, \lambda_1, \ldots, \lambda_m)$.
Regularity condition. The Lagrange conditions are necessary for optimality provided the constraint gradients $\nabla g_1(x^\ast), \ldots, \nabla g_m(x^\ast)$ are linearly independent at the optimum. If they are linearly dependent, the conditions may fail. This regularity assumption is called a constraint qualification.
Section 6: Inequality Constraints and the KKT Conditions
Equality constraints are the clean case: the feasible set is a surface, and the optimum is a boundary point. Real problems usually have inequality constraints: $g_i(x) \leq 0$.
The key new feature is that an inequality constraint can be active (holding as equality: $g_i(x^\ast) = 0$, the optimum is on the boundary) or inactive (strict inequality: $g_i(x^\ast) < 0$, the optimum is in the interior relative to this constraint). When a constraint is inactive, it is not restricting the optimum - we could relax it slightly and the optimum would not change. When it is active, it is binding: the optimum is right on the boundary of the feasible region for that constraint.
The Karush-Kuhn-Tucker (KKT) conditions characterize the optimum when inequality constraints are present.
Setup. Minimize $f(x)$ subject to $g_i(x) \leq 0$ for $i = 1, \ldots, m$.
KKT conditions. At a local minimum $x^\ast$ (under a constraint qualification), the following must hold:
Stationarity:
$$\nabla f(x^\ast) = \sum_{i=1}^m \lambda_i \nabla g_i(x^\ast).$$
Primal feasibility:
$$g_i(x^\ast) \leq 0 \quad \text{for all } i.$$
Dual feasibility:
$$\lambda_i \geq 0 \quad \text{for all } i.$$
Complementary slackness:
$$\lambda_i g_i(x^\ast) = 0 \quad \text{for all } i.$$
Let us examine each condition.
Stationarity is the generalized Lagrange condition: the objective gradient is a combination of constraint gradients. For inactive constraints, complementary slackness forces $\lambda_i = 0$, so inactive constraints drop out of the stationarity condition automatically.
Primal feasibility simply says the point is feasible - it satisfies all constraints. Nothing new here.
Dual feasibility requires $\lambda_i \geq 0$ for inequality constraints. This is the key difference from equality constraints, where $\lambda$ can be any sign. For a minimization problem with $g_i \leq 0$ constraints: the constraint gradient $\nabla g_i$ points away from the feasible region (in the direction of increasing $g_i$, which is the infeasible direction). The stationarity condition says $\nabla f$ is a combination of these “outward” gradients. For this to make geometric sense at a minimum (where the feasible region is constraining $f$ from getting smaller), the multipliers must be non-negative.
Complementary slackness says $\lambda_i g_i(x^\ast) = 0$. Since $g_i(x^\ast) \leq 0$ (primal feasibility) and $\lambda_i \geq 0$ (dual feasibility), the product $\lambda_i g_i(x^\ast) \leq 0$. The condition that it equals zero means exactly one of two things holds for each $i$:
- $\lambda_i = 0$: the multiplier is zero. The constraint may be active or inactive, but it plays no role in the stationarity condition.
- $g_i(x^\ast) = 0$: the constraint is active (the optimum is on its boundary).
Both can simultaneously be zero. But you cannot have $\lambda_i > 0$ and $g_i(x^\ast) < 0$ simultaneously.
Discomfort check. Why can’t you have $\lambda_i > 0$ and $g_i(x^\ast) < 0$?
An inactive constraint ($g_i(x^\ast) < 0$) means there is slack - the feasible region allows movement in all directions near $x^\ast$ without violating this particular constraint. If $\lambda_i > 0$ for such a constraint, it would contribute to the stationarity condition even though the constraint is not binding. This would be incorrect: you would be attributing some of the “force” balancing $\nabla f$ to a constraint that is not actually pushing on you. The KKT conditions capture only the active constraints' influence on the optimal direction, and inactive constraints have zero influence ($\lambda_i = 0$).
Section 7: A Worked Example with Inequalities
Minimize $f(x, y) = (x-3)^2 + (y-2)^2$ subject to $g_1(x, y) = x^2 + y^2 - 4 \leq 0$ and $g_2(x, y) = y - 1 \leq 0$.
This is: find the point closest to $(3, 2)$ that lies inside the disk of radius $2$ centered at the origin, with $y \leq 1$.
The unconstrained minimum is $(3, 2)$ itself, with $f = 0$. But $(3, 2)$ has $x^2 + y^2 = 13 > 4$, so it violates the disk constraint. The optimum must be on the boundary of the feasible region.
Check: does $(3, 2)$ violate $y \leq 1$? Yes, $2 > 1$. So potentially both constraints are active.
Let us try the case where both are active: $g_1 = 0$ and $g_2 = 0$, meaning $x^2 + y^2 = 4$ and $y = 1$. From $y = 1$: $x^2 = 3$, so $x = \sqrt{3}$ (taking the positive root since $(3, 2)$ is in the right half-plane).
At $(x^\ast, y^\ast) = (\sqrt{3}, 1)$, check stationarity:
$$\nabla f = (2(x-3), 2(y-2)) = (2\sqrt{3}-6, -2).$$
$$\nabla g_1 = (2x, 2y) = (2\sqrt{3}, 2), \quad \nabla g_2 = (0, 1).$$
Stationarity: $(2\sqrt{3}-6, -2) = \lambda_1(2\sqrt{3}, 2) + \lambda_2(0, 1)$.
From the first component: $2\sqrt{3} - 6 = 2\sqrt{3}\lambda_1$, giving $\lambda_1 = 1 - \sqrt{3} \approx -0.73$.
But dual feasibility requires $\lambda_1 \geq 0$. Contradiction. So both constraints cannot simultaneously be active at the optimum.
Try only $g_1$ active: $x^2 + y^2 = 4$, $g_2 < 0$ (so $\lambda_2 = 0$). Stationarity becomes $\nabla f = \lambda_1 \nabla g_1$:
$$(2(x-3), 2(y-2)) = \lambda_1 (2x, 2y).$$
This gives $x - 3 = \lambda_1 x$ and $y - 2 = \lambda_1 y$, so $x(1-\lambda_1) = 3$ and $y(1-\lambda_1) = 2$. Thus $y/x = 2/3$. Combined with $x^2 + y^2 = 4$: $x^2(1 + 4/9) = 4$, so $x = \frac{6}{\sqrt{13}}$ and $y = \frac{4}{\sqrt{13}}$.
Check $g_2$: $y = 4/\sqrt{13} \approx 1.11 > 1$. This violates $y \leq 1$. So only $g_1$ active does not work either.
Try only $g_2$ active: $y = 1$, and $g_1 < 0$ (inside the disk, so $\lambda_1 = 0$). Minimize $(x-3)^2 + (1-2)^2 = (x-3)^2 + 1$ over $x$ with $x^2 + 1 \leq 4$, i.e., $|x| \leq \sqrt{3}$. The unconstrained minimum is $x = 3$, but $3 > \sqrt{3}$, so the minimum on the constraint is $x = \sqrt{3}$.
Check stationarity: $\nabla f = (2(\sqrt{3}-3), 2(1-2)) = (2\sqrt{3}-6, -2)$ must equal $\lambda_2(0, 1)$. The first component requires $2\sqrt{3} - 6 = 0$, which fails. So this case is also degenerate at $x = \sqrt{3}$ - the $g_1$ constraint is actually active here (since $(\sqrt{3})^2 + 1^2 = 4$). We are back to both constraints being active, which we showed requires $\lambda_1 < 0$.
The resolution: both constraints are active at the optimal point $(\sqrt{3}, 1)$, and $\lambda_1 < 0$ is required. But we said $\lambda_1 \geq 0$ for inequality constraints in a minimization problem. The issue is the constraint $g_1 = x^2 + y^2 - 4 \leq 0$ is an “inside the disk” constraint, and the optimal point wants to be outside the disk but is being forced inside. The multiplier sign reflects this: the disk constraint is actively pulling the optimum away from where it “wants” to be.
This example illustrates that KKT conditions require careful bookkeeping of signs and that degenerate cases (multiple active constraints, multiplier sign violations) require case analysis.
Section 8: Connection to Support Vector Machines
The SVM is one of the most elegant applications of KKT conditions.
The hard-margin SVM finds the maximum-margin separating hyperplane. Given training data $\{(x_i, y_i)\}$ with labels $y_i \in \{-1, +1\}$ and a hyperplane $w^T x + b = 0$, the margin is the distance from the hyperplane to the nearest training point. The SVM maximizes this margin, which is equivalent to minimizing $|w|^2$:
$$\min_{w, b} \frac{1}{2}|w|^2 \quad \text{subject to} \quad y_i(w^T x_i + b) \geq 1 \quad \text{for all } i.$$
Write the constraints as $g_i(w, b) = 1 - y_i(w^T x_i + b) \leq 0$. This is a quadratic program (convex objective, linear inequality constraints), so KKT conditions are both necessary and sufficient.
The KKT stationarity condition gives:
$$w = \sum_i \lambda_i y_i x_i, \qquad \sum_i \lambda_i y_i = 0.$$
Complementary slackness: $\lambda_i(1 - y_i(w^T x_i + b)) = 0$ for all $i$.
The complementary slackness condition divides training points into two groups:
- Support vectors: $y_i(w^T x_i + b) = 1$ (on the margin boundary). These can have $\lambda_i > 0$.
- Non-support vectors: $y_i(w^T x_i + b) > 1$ (correctly classified with margin to spare). Complementary slackness forces $\lambda_i = 0$.
The stationarity condition $w = \sum_i \lambda_i y_i x_i$ says the weight vector is a linear combination of training points, but only the support vectors contribute (since all other $\lambda_i = 0$). The hyperplane is entirely determined by a small subset of the data.
This sparsity is not an approximation or a heuristic - it is an exact consequence of KKT complementary slackness. You can add or remove any non-support vector from the training set without changing the solution.
Section 9: The Shadow Price Interpretation
Lagrange multipliers have a concrete economic meaning that often clarifies their value.
Consider: maximize profit $f(x)$ subject to budget constraint $g(x) \leq B$. The optimal value is $f^\ast(B)$ - the maximum achievable profit with budget $B$. At the optimum, the Lagrange multiplier $\lambda^\ast$ satisfies:
$$\lambda^\ast = \frac{df^\ast}{dB}.$$
The multiplier is the rate of change of the optimal value with respect to the constraint level. If the budget increases by $\delta$, the optimal profit increases by approximately $\lambda^\ast \cdot \delta$.
This is the shadow price of the constraint: the value of one additional unit of the resource being constrained. A constraint with a high shadow price is a bottleneck - relaxing it slightly gives large gains. A constraint with a zero shadow price (an inactive constraint) has no cost: relaxing it does not help because you were not using it to its limit anyway.
In the farthest-point example: the constraint was $x^2 + y^2 = c$. The optimal value of $x + y$ on the circle of radius $\sqrt{c}$ is $\sqrt{2c}$. So $f^\ast(c) = \sqrt{2c}$ and $df^\ast/dc = 1/\sqrt{2c}$. At $c = 1$: $\lambda^\ast = 1/\sqrt{2} = \sqrt{2}/2$, which matches the multiplier we computed. Every unit increase in the “radius-squared” budget increases the achievable northeast distance by $\sqrt{2}/2$.
The shadow price interpretation makes multipliers tangible. In practice, multipliers are used in economics (shadow prices of regulatory constraints, resource limits), engineering (marginal value of relaxing physical constraints), and algorithm design (the dual problem uses multipliers to derive lower bounds).
Section 10: KKT Sufficiency for Convex Problems
The KKT conditions are always necessary for optimality (under a regularity condition). For convex problems, they are also sufficient.
Theorem. Suppose $f$ and $g_1, \ldots, g_m$ are all convex. If $x^\ast$ satisfies the KKT conditions (stationarity, primal feasibility, dual feasibility, complementary slackness), then $x^\ast$ is a global minimizer of $f$ subject to $g_i(x) \leq 0$.
Proof sketch. The stationarity condition gives $\nabla f(x^\ast) = \sum_i \lambda_i \nabla g_i(x^\ast)$ with $\lambda_i \geq 0$. For any feasible $x$:
$$f(x) \geq f(x^\ast) + \nabla f(x^\ast) \cdot (x - x^\ast) \quad \text{(convexity of } f\text{)}$$
$$= f(x^\ast) + \sum_i \lambda_i \nabla g_i(x^\ast) \cdot (x - x^\ast) \quad \text{(stationarity)}$$
$$\geq f(x^\ast) + \sum_i \lambda_i (g_i(x) - g_i(x^\ast)) \quad \text{(convexity of each } g_i\text{)}$$
$$\geq f(x^\ast) + \sum_i \lambda_i (g_i(x) - 0) \quad \text{(complementary slackness: } g_i(x^\ast) = 0 \text{ or } \lambda_i = 0\text{)}$$
$$\geq f(x^\ast) \quad \text{(dual feasibility: } \lambda_i \geq 0\text{, primal feasibility: } g_i(x) \leq 0\text{)}. \quad \square$$
This theorem transforms the computational problem. For convex constrained problems, instead of searching over all feasible points, you solve the KKT system. Any solution to the KKT system is a global optimum. The next post builds the full convex optimization machinery on this foundation.
Read Next: