Lagrange Multipliers & KKT Conditions
Unconstrained optimization asks: find $x$ minimizing $f(x)$ over all of $\mathbb{R}^n$. Constrained optimization adds requirements: the minimizer must lie in some feasible set defined by equations or inequalities. The Lagrange multiplier framework is the systematic way to handle such constraints - it converts a constrained problem into an unconstrained system of equations at the cost of introducing auxiliary variables.
Equality-Constrained Optimization
The equality-constrained problem is
$$\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad g_i(x) = 0, \quad i = 1, \ldots, m.$$
The feasible set is $\{x : g_i(x) = 0 \text{ for all } i\}$, a (generically) $(n - m)$-dimensional surface.
Geometric intuition. At a constrained minimum $x^\ast$, the objective $f$ cannot decrease in any direction that stays on the constraint surface. Formally: if $v$ is a tangent vector to the constraint surface at $x^\ast$ (i.e., $\nabla g_i(x^\ast)^T v = 0$ for all $i$), then $\nabla f(x^\ast)^T v = 0$. Otherwise, moving along $v$ would decrease $f$ while remaining feasible, contradicting optimality.
This means $\nabla f(x^\ast)$ is orthogonal to every tangent vector - it lies in the span of the constraint gradients $\nabla g_1(x^\ast), \ldots, \nabla g_m(x^\ast)$.
            f decreasing
            <---------
        ________________________________
       /                                \
      |   x*  <- constrained minimum     |   constraint surface g(x) = 0
       \________________________________/

At x*, grad f must point perpendicular to the surface (i.e., parallel to grad g);
otherwise we could slide along the surface to decrease f.
Definition (Lagrangian). The Lagrangian for the equality-constrained problem is
$$\mathcal{L}(x, \lambda) = f(x) + \sum_{i=1}^m \lambda_i g_i(x),$$
where $\lambda = (\lambda_1, \ldots, \lambda_m) \in \mathbb{R}^m$ are the Lagrange multipliers.
Theorem (Lagrange Multiplier Theorem). Let $x^\ast$ be a local minimum of $f$ subject to $g_i(x) = 0$, and suppose the constraint qualification holds: the gradients $\nabla g_1(x^\ast), \ldots, \nabla g_m(x^\ast)$ are linearly independent (LICQ - Linear Independence Constraint Qualification). Then there exist unique $\lambda^\ast \in \mathbb{R}^m$ such that
$$\nabla_x \mathcal{L}(x^\ast, \lambda^\ast) = \nabla f(x^\ast) + \sum_{i=1}^m \lambda_i^\ast \nabla g_i(x^\ast) = 0.$$
Together with $g_i(x^\ast) = 0$, this gives $n + m$ equations in $n + m$ unknowns $(x^\ast, \lambda^\ast)$.
The multipliers $\lambda_i^\ast$ have a shadow price interpretation: $\lambda_i^\ast = -\partial f^\ast / \partial b_i$, where $b_i$ shifts the constraint to $g_i(x) = b_i$. Each multiplier measures how much the optimal value changes if constraint $i$ is slightly relaxed.
Example: Constrained Quadratic
Minimize $f(x, y) = x^2 + y^2$ subject to $x + y = 1$.
$\mathcal{L} = x^2 + y^2 + \lambda(x + y - 1)$.
Stationarity: $\partial_x \mathcal{L} = 2x + \lambda = 0$, $\partial_y \mathcal{L} = 2y + \lambda = 0$, so $x = y = -\lambda/2$.
Feasibility: $x + y = 1$ gives $-\lambda = 1$, so $\lambda = -1$ and $x^\ast = y^\ast = 1/2$.
The shadow price $\lambda^\ast = -1$ means relaxing the constraint to $x + y = 1 + \epsilon$ increases the minimum of $1/2$ by approximately $\epsilon$: the minimum of $x^2 + y^2$ on $x + y = c$ is $c^2/2$, so $\partial f^\ast/\partial c\,|_{c=1} = 1$ and $\lambda^\ast = -\partial f^\ast/\partial c = -1$.
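Since stationarity and feasibility are linear here, the $n + m = 3$ equations can be checked numerically. A minimal sketch, assuming numpy is available:

```python
# The stationarity + feasibility equations of the worked example form
# a 3x3 linear system in (x, y, lambda):
#   2x + lambda = 0,  2y + lambda = 0,  x + y = 1.
import numpy as np

A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
b = np.array([0.0, 0.0, 1.0])

x_star, y_star, lam_star = np.linalg.solve(A, b)
print(x_star, y_star, lam_star)  # 0.5 0.5 -1.0
```

Solving the square KKT system directly like this works for any equality-constrained quadratic objective, since the Lagrangian's gradient is affine in $(x, \lambda)$.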
KKT Conditions for Inequality Constraints
The general constrained problem with both equality and inequality constraints is
$$\min_{x} f(x) \quad \text{subject to} \quad h_j(x) \leq 0 \; (j = 1, \ldots, p), \quad g_i(x) = 0 \; (i = 1, \ldots, m).$$
The inequality constraints $h_j(x) \leq 0$ divide into active constraints ($h_j(x^\ast) = 0$, which are binding) and inactive constraints ($h_j(x^\ast) < 0$, which are slack).
The Lagrangian is extended as
$$\mathcal{L}(x, \lambda, \mu) = f(x) + \sum_{i=1}^m \lambda_i g_i(x) + \sum_{j=1}^p \mu_j h_j(x).$$
Theorem (KKT Necessary Conditions). Suppose $x^\ast$ is a local minimum and LICQ holds at $x^\ast$. Then there exist $\lambda^\ast \in \mathbb{R}^m$ and $\mu^\ast \in \mathbb{R}^p$ satisfying:
- Stationarity: $\nabla f(x^\ast) + \sum_i \lambda_i^\ast \nabla g_i(x^\ast) + \sum_j \mu_j^\ast \nabla h_j(x^\ast) = 0$
- Primal feasibility: $h_j(x^\ast) \leq 0$ and $g_i(x^\ast) = 0$
- Dual feasibility: $\mu_j^\ast \geq 0$ for all $j$
- Complementary slackness: $\mu_j^\ast h_j(x^\ast) = 0$ for all $j$
The sign constraint $\mu_j^\ast \geq 0$ encodes direction: moving from $x^\ast$ into the feasible interior (where $h_j < 0$) must not decrease $f$, which forces the multiplier on an active constraint to be non-negative. A negative $\mu_j^\ast$ would mean $f$ could be improved by backing away from the constraint while staying feasible, contradicting optimality.
Complementary slackness says: either the constraint is active ($h_j = 0$) or the multiplier is zero ($\mu_j = 0$). An inactive constraint is irrelevant at the optimum - it imposes no restriction - so its multiplier must vanish.
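The four conditions are mechanical to verify at a candidate point. A minimal sketch for the toy problem $\min x^2$ subject to $1 - x \leq 0$ (the problem, the candidate points, and the tolerance are chosen for illustration; the solution $x^\ast = 1$, $\mu^\ast = 2$ follows from stationarity $2x - \mu = 0$):

```python
# Check all four KKT conditions for  min x^2  subject to  h(x) = 1 - x <= 0.
def kkt_check(x, mu, tol=1e-9):
    grad_f = 2 * x   # gradient of the objective x^2
    grad_h = -1.0    # gradient of h(x) = 1 - x
    h = 1 - x
    return {
        "stationarity": abs(grad_f + mu * grad_h) < tol,
        "primal_feasibility": h <= tol,
        "dual_feasibility": mu >= -tol,
        "complementary_slackness": abs(mu * h) < tol,
    }

print(kkt_check(1.0, 2.0))  # constrained optimum: all four conditions hold
print(kkt_check(0.0, 0.0))  # unconstrained minimum of x^2: fails primal feasibility
```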
Constraint Qualification
Without constraint qualification, KKT conditions may fail to be necessary. Consider $\min x$ subject to $x^2 \leq 0$: the only feasible point is $x^\ast = 0$, but $\nabla h(0) = 0$, so no multiplier $\mu$ can satisfy stationarity $1 + \mu \cdot 0 = 0$.
LICQ requires that the gradients of active constraints at $x^\ast$ are linearly independent. Other qualifications exist (Mangasarian-Fromovitz, Slater’s condition for convex problems) that ensure KKT is necessary.
Duality and Strong Duality
Definition. The dual function is
$$q(\lambda, \mu) = \inf_{x} \mathcal{L}(x, \lambda, \mu).$$
The dual problem is $\max_{\lambda, \mu \geq 0} q(\lambda, \mu)$.
Theorem (Weak Duality). For any primal feasible $x$ and dual feasible $(\lambda, \mu)$ (with $\mu \geq 0$),
$$q(\lambda, \mu) \leq f(x).$$
The dual objective is always a lower bound on the primal objective. The duality gap is $f(x^\ast) - q(\lambda^\ast, \mu^\ast)$.
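Weak duality can be seen numerically on a one-variable toy problem, $\min x^2$ subject to $1 - x \leq 0$, where the dual function has a closed form; the sample grids below are arbitrary choices for illustration:

```python
# For  min x^2  s.t.  1 - x <= 0, the dual function is
#   q(mu) = inf_x [x^2 + mu(1 - x)] = mu - mu^2/4   (infimum at x = mu/2).
def f(x):
    return x * x

def q(mu):
    return mu - mu * mu / 4.0

feasible_xs = [1.0, 1.5, 2.0, 3.0]    # all satisfy x >= 1
dual_mus = [0.0, 0.5, 1.0, 2.0, 3.0]  # all satisfy mu >= 0

# Every dual value lower-bounds every feasible primal value.
assert all(q(mu) <= f(x) for mu in dual_mus for x in feasible_xs)

# Here the bounds meet: q(2) = 1 = f(1), so the duality gap is zero.
gap_free = max(q(mu) for mu in dual_mus) == min(f(x) for x in feasible_xs)
print(gap_free)  # True
```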
Theorem (Strong Duality under Slater’s Condition). Suppose the primal problem is convex (convex objective and constraint functions) and there exists a strictly feasible point $\tilde{x}$ with $h_j(\tilde{x}) < 0$ for all $j$ (Slater’s condition). Then strong duality holds: the duality gap is zero, $f(x^\ast) = q(\lambda^\ast, \mu^\ast)$, and the dual is attained.
Strong duality means: the best lower bound you can get from the dual equals the primal optimal value. This allows solving the dual instead of the primal - often easier when the primal has many variables but few constraints.
Application: Support Vector Machines
The SVM primal problem is: find the maximum-margin hyperplane separating two classes.
For linearly separable data $(x_i, y_i)$ with $y_i \in \{-1, +1\}$:
$$\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w^T x_i + b) \geq 1 \quad \text{for all } i.$$
Rewrite as $h_i(w, b) = 1 - y_i(w^T x_i + b) \leq 0$.
Lagrangian: $\mathcal{L} = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i (y_i(w^T x_i + b) - 1)$.
KKT stationarity:
$$\nabla_w \mathcal{L} = w - \sum_i \alpha_i y_i x_i = 0 \implies w^\ast = \sum_i \alpha_i y_i x_i$$
$$\nabla_b \mathcal{L} = -\sum_i \alpha_i y_i = 0$$
Dual function: Substituting $w^\ast = \sum_i \alpha_i y_i x_i$ back:
$$q(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$$
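The substitution can be spelled out: expanding the Lagrangian and using both stationarity conditions,

```latex
\begin{aligned}
\mathcal{L}
&= \tfrac{1}{2} w^T w - \sum_i \alpha_i y_i\, w^T x_i - b \sum_i \alpha_i y_i + \sum_i \alpha_i \\
&= \tfrac{1}{2} w^T w - w^T w + 0 + \sum_i \alpha_i
   \qquad \Big(w = \textstyle\sum_i \alpha_i y_i x_i,\ \ \textstyle\sum_i \alpha_i y_i = 0\Big) \\
&= \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j.
\end{aligned}
```

The $b$-term vanishes entirely because of $\sum_i \alpha_i y_i = 0$, which is why $b$ does not appear in the dual and must be recovered afterward from a support vector.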
Dual problem:
$$\max_{\alpha \geq 0} \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0.$$
This is a quadratic program in the $\alpha_i$ (dual variables). Slater’s condition holds, so strong duality gives $f^\ast = q^\ast$.
Complementary slackness $\alpha_i(y_i(w^T x_i + b) - 1) = 0$ says: either $\alpha_i = 0$ (point is not a support vector) or $y_i(w^T x_i + b) = 1$ (point is on the margin boundary). The optimal $w^\ast$ depends only on support vectors - the data points on the margin.
The kernel trick enters through the inner products $x_i^T x_j$ in the dual: replace them with $k(x_i, x_j)$ for any positive definite kernel, extending SVM to nonlinear decision boundaries without explicitly computing the feature map.
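The dual QP above is small enough to hand to a generic solver. A minimal sketch using `scipy.optimize.minimize` (SLSQP) on a made-up 2D toy dataset; the data, the support-vector threshold `1e-6`, and the use of a general-purpose solver instead of a dedicated QP solver are all assumptions for illustration:

```python
# Solve the hard-margin SVM dual directly as a small QP.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T  # G[i,j] = y_i y_j x_i^T x_j

def neg_dual(a):
    # Negate the dual objective so maximization becomes minimization.
    return 0.5 * a @ G @ a - a.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(len(y)),
    bounds=[(0.0, None)] * len(y),                         # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
    method="SLSQP",
)
alpha = res.x
w = (alpha * y) @ X             # w* = sum_i alpha_i y_i x_i
sv = alpha > 1e-6               # support vectors have nonzero multipliers
b = np.mean(y[sv] - X[sv] @ w)  # margin condition y_i(w^T x_i + b) = 1 on SVs
print(np.sign(X @ w + b))       # recovered labels for the training points
```

Replacing `G` with `y[:, None] * y[None, :] * K` for a kernel matrix `K` is exactly the kernel trick mentioned above: only inner products of the data enter the dual.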
Examples
Constrained optimization appears throughout CS and ML:
Constrained neural network training: Fairness constraints ($\text{error rate}(\text{group A}) \leq \text{error rate}(\text{group B}) + \epsilon$), energy budgets, latency constraints. These are inequality constraints on the training problem.
Reinforcement learning: Policy optimization subject to trust region constraints ($D_\text{KL}(\pi_\text{old} \,\|\, \pi_\text{new}) \leq \delta$). TRPO and PPO are derived from this constrained formulation.
Maximum entropy distributions: The maximum entropy distribution satisfying moment constraints $\mathbb{E}[f_i(x)] = c_i$ is found by a Lagrangian whose multipliers determine the exponential family distribution.
Resource allocation: Maximize utility subject to budget constraints - textbook LP duality gives primal (allocation) and dual (pricing) interpretations simultaneously.
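The maximum-entropy bullet above can be made concrete on a finite support, where a single moment constraint leaves one multiplier to determine by a scalar root-find. A sketch assuming scipy is available; the support and target mean are made up:

```python
# Maxent distribution on {0,...,5} with E[X] = 3.5: the Lagrangian gives
# an exponential-family form p_i proportional to exp(theta * x_i), and the
# multiplier theta is pinned down by the moment constraint.
import numpy as np
from scipy.optimize import brentq

xs = np.arange(6)     # support {0, ..., 5}
target_mean = 3.5     # moment constraint E[X] = 3.5

def mean_at(theta):
    w = np.exp(theta * xs)
    p = w / w.sum()
    return p @ xs

# E[X] is monotone increasing in theta, so bracket and solve for the root.
theta = brentq(lambda t: mean_at(t) - target_mean, -10.0, 10.0)
p = np.exp(theta * xs)
p /= p.sum()
print(p)  # tilted toward larger x, since target_mean > uniform mean 2.5
```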
The KKT conditions are both a practical recipe (write the Lagrangian, differentiate, solve the system) and a theoretical certificate: under a constraint qualification, any optimal solution of a convex problem must satisfy KKT, and any KKT point of a convex problem is globally optimal.