Advanced Statistical Computing
Seoul National University
November 2023
We do not assume convexity by default here.
\[ \min_{\mathbf{x}\in \mathbb{R}^d} f(\mathbf{x}) \]
\[\begin{eqnarray*} &\text{minimize}& f_0(\mathbf{x}) \\ &\text{subject to}& f_i(\mathbf{x}) \le 0, \quad i = 1,\ldots,m \\ & & h_i(\mathbf{x}) = 0, \quad i = 1,\ldots,p. \end{eqnarray*}\]
Assume all of these functions share some open set \(U \subset \mathbb{R}^d\) as their domain.
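As a concrete illustration (our addition; the objective and constraints below are made up), a problem of this form can be handed directly to an off-the-shelf solver. The sketch uses SciPy's SLSQP method; note that scipy.optimize encodes inequality constraints as \(g(\mathbf{x}) \ge 0\), so each \(f_i(\mathbf{x}) \le 0\) is passed as \(-f_i\).

```python
# A minimal sketch of posing the standard-form problem with SciPy (SLSQP).
# The objective and constraints here are arbitrary choices for illustration.
import numpy as np
from scipy.optimize import minimize

f0 = lambda x: (x[0] - 1.0)**2 + (x[1] - 2.0)**2  # objective f_0
f1 = lambda x: x[0]**2 + x[1]**2 - 4.0            # inequality f_1(x) <= 0
h1 = lambda x: x[0] + x[1] - 1.0                  # equality   h_1(x) = 0

cons = [{"type": "ineq", "fun": lambda x: -f1(x)},  # SciPy wants -f1(x) >= 0
        {"type": "eq",   "fun": h1}]

res = minimize(f0, x0=np.zeros(2), method="SLSQP", constraints=cons)
print(res.x, res.fun)  # a local minimizer and its objective value
```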
Kuhn, H. W.; Tucker, A. W., Nonlinear Programming. Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, 481–492, University of California Press, Berkeley, Calif., 1951.
Karush, W., Minima of Functions of Several Variables with Inequalities as Side Constraints. M.Sc. Dissertation. Dept. of Mathematics, Univ. of Chicago, Chicago, Illinois, 1939.
Define the Lagrangian \[\begin{eqnarray*} \mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}) = f_0(\mathbf{x}) + \sum_{i=1}^m \lambda_i f_i(\mathbf{x}) + \sum_{i=1}^p \nu_i h_i(\mathbf{x}). \end{eqnarray*}\] The vectors \(\boldsymbol{\lambda} = (\lambda_1,\ldots, \lambda_m)^T\) and \(\boldsymbol{\nu} = (\nu_1,\ldots,\nu_p)^T\) are called the Lagrange multiplier vectors or dual variables.
Let \(\mathbf{x}^\star\) be a local minimizer of \(f_0\) in the feasible region \(C\).
Assume \(f_0, f_1, \ldots,f_m,h_1,\ldots,h_p\) are continuously differentiable near \(\mathbf{x}^{\star}\).
Assume the Mangasarian-Fromovitz constraint qualification (MFCQ) holds at \(\mathbf{x}^{\star}\): the gradients \(\nabla h_1(\mathbf{x}^{\star}), \ldots, \nabla h_p(\mathbf{x}^{\star})\) are linearly independent, and there exists a vector \(\mathbf{v}\) such that \(\langle \nabla h_i(\mathbf{x}^{\star}), \mathbf{v} \rangle = 0\) for \(i = 1, \ldots, p\) and \(\langle \nabla f_i(\mathbf{x}^{\star}), \mathbf{v} \rangle < 0\) for all active inequality constraints, i.e., all \(i\) with \(f_i(\mathbf{x}^{\star}) = 0\).
(The vector \(\mathbf{v}\) is a tangent vector in the sense that infinitesimal motion from \(\mathbf{x}^{\star}\) along \(\mathbf{v}\) stays within the feasible region.)
Suppose that the objective function \(f_0\) of the constrained optimization problem above has a local minimum at the feasible point \(\mathbf{x}^{\star}\).
If \(f_0\) and the constraint functions are continuously differentiable near \(\mathbf{x}^{\star}\), and the Mangasarian-Fromovitz constraint qualification holds at \(\mathbf{x}^{\star}\), then there exist Lagrange multipliers \(\lambda_1^{\star}, \dotsc, \lambda_m^{\star}\) and \(\nu_1^{\star}, \dotsc, \nu_p^{\star}\) such that \[ \small \nabla_{\mathbf{x}}\mathcal{L}(\mathbf{x}^{\star},\boldsymbol{\lambda}^{\star}, \boldsymbol{\nu}^{\star})= \nabla f_0(\mathbf{x}^\star) + \sum_{i=1}^m \lambda_i^\star \nabla f_i(\mathbf{x}^\star) + \sum_{i=1}^p \nu_i^\star \nabla h_i(\mathbf{x}^\star) = \mathbf{0}. \]
Moreover, each of the multipliers \(\lambda_i^{\star}\) associated with the inequality constraints is nonnegative, and \(\lambda_i^{\star} = 0\) whenever \(f_i(\mathbf{x}^{\star}) < 0\).
We can summarize these as the KKT conditions \[ \begin{eqnarray*} f_i(\mathbf{x}^\star) &\le& 0, \quad i = 1,\ldots,m \\ h_i(\mathbf{x}^\star) &=& 0, \quad i = 1,\ldots,p \\ \lambda_i^\star &\ge& 0, \quad i = 1,\ldots,m \\ \lambda_i^\star f_i(\mathbf{x}^\star) &=& 0, \quad i=1,\ldots,m \\ \nabla f_0(\mathbf{x}^\star) + \sum_{i=1}^m \lambda_i^\star \nabla f_i(\mathbf{x}^\star) + \sum_{i=1}^p \nu_i^\star \nabla h_i(\mathbf{x}^\star) &=& \mathbf{0}. \end{eqnarray*} \]
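These five conditions translate directly into a numerical residual check. The following helper is our own sketch (the name `kkt_residuals` and its argument layout are hypothetical, not from the notes); small residuals indicate an approximate KKT point.

```python
import numpy as np

def kkt_residuals(x, lam, nu, grad_f0, fs, grad_fs, hs, grad_hs):
    """Residuals of the five KKT conditions at a candidate triple (x, lam, nu).

    fs/grad_fs: inequality constraints f_i and their gradients (callables).
    hs/grad_hs: equality constraints h_i and their gradients (callables).
    """
    primal_ineq = max((fi(x) for fi in fs), default=0.0)       # f_i(x) <= 0
    primal_eq   = max((abs(hi(x)) for hi in hs), default=0.0)  # h_i(x) = 0
    dual_feas   = max((-li for li in lam), default=0.0)        # lam_i >= 0
    comp_slack  = max((abs(li * fi(x)) for li, fi in zip(lam, fs)),
                      default=0.0)                             # lam_i f_i(x) = 0
    grad_L = (grad_f0(x)
              + sum(li * gfi(x) for li, gfi in zip(lam, grad_fs))
              + sum(vi * ghi(x) for vi, ghi in zip(nu, grad_hs)))  # stationarity
    return primal_ineq, primal_eq, dual_feas, comp_slack, np.linalg.norm(grad_L)
```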
The fourth condition \(\lambda_i^\star f_i(\mathbf{x}^\star) = 0\) is called complementary slackness: at the optimum, either the multiplier \(\lambda_i^\star\) is zero or the \(i\)th inequality constraint is active.
Remember that the KKT conditions are collectively a necessary condition for local optimality (under a constraint qualification such as MFCQ).
If the optimization problem is convex, then they become a necessary and sufficient condition, i.e., any triple \((\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu})\) satisfying the KKT conditions certifies that \(\mathbf{x}\) is a global minimizer.
Assume there exists a feasible point \(\mathbf{x}_0\) such that \(f_i(\mathbf{x}_0) < 0\) for active inequality constraints, i.e., for all \(i\) with \(f_i(\mathbf{x}^{\star})=0\).
The likelihood of observation \((n_1, \dotsc, n_r)\) from \(\text{mult}(n, (p_1, \dotsc, p_r))\) is proportional to \(\prod_{i=1}^r p_i^{n_i}\). The MLE problem is
\[\begin{eqnarray*} &\text{minimize}& -\sum_{i=1}^r n_i\log p_i \\ &\text{subject to}& -p_i \le 0, \quad i = 1,\dotsc,r \\ & & \sum_{i=1}^r p_i = 1. \end{eqnarray*}\]
This is a convex optimization problem. The Lagrangian is \[ \mathcal{L}(\mathbf{p}, \boldsymbol{\lambda}, \nu) = -\sum_{i=1}^r n_i\log p_i - \sum_{i=1}^r \lambda_i p_i + \nu\left(\sum_{i=1}^r p_i - 1\right). \] Differentiating with respect to \(p_i\) yields \[ 0 = -\frac{n_i}{p_i} - \lambda_i + \nu, \quad i=1,\dotsc, r. \] Multiplying both sides by \(p_i\) and summing over \(i\) gives \[ 0 = -n - \sum_{i=1}^r \lambda_i p_i + \nu \sum_{i=1}^r p_i = -n - 0 + \nu, \] since complementary slackness forces \(\lambda_i p_i = -\lambda_i f_i(\mathbf{p}) = 0\). Hence \(\nu = n\).
Setting all \(\lambda_i=0\) then gives \(p_i=n_i/n\). The triple \(((n_1/n, \dotsc, n_r/n), \mathbf{0}, n)\) satisfies the KKT conditions and hence is a global solution. (If some \(n_i = 0\), take \(\lambda_i = n\) for those \(i\); then \(p_i = 0\), and the KKT conditions still hold under the convention \(0 \log 0 = 0\).)
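The closed form is easy to sanity-check against a general-purpose solver; the snippet below (made-up counts, including a zero) compares \(n_i/n\) with the numerical solution of the MLE problem above.

```python
import numpy as np
from scipy.optimize import minimize

counts = np.array([5.0, 0.0, 3.0, 2.0])  # made-up multinomial counts
n, r = counts.sum(), len(counts)

def negloglik(p):
    # Clipping avoids log(0); with the convention 0*log(0) = 0 this is harmless.
    return -np.sum(counts * np.log(np.clip(p, 1e-12, None)))

res = minimize(negloglik, x0=np.full(r, 1.0 / r), method="SLSQP",
               bounds=[(0.0, 1.0)] * r,
               constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}])

print(res.x)        # numerical MLE
print(counts / n)   # KKT solution p_i = n_i / n
```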
Suppose the Lagrange multiplier rule is satisfied at a point \(\mathbf{x}^{\star}\) with associated multipliers \(\boldsymbol{\lambda}\) and \(\boldsymbol{\nu}\).
Assume the Lagrangian \(\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu})\) is twice continuously differentiable in \(\mathbf{x}\) near \(\mathbf{x}^{\star}\).
If \(\mathbf{v}^T\nabla^2_{\mathbf{x}}\mathcal{L}(\mathbf{x}^{\star}, \boldsymbol{\lambda}, \boldsymbol{\nu})\mathbf{v} > 0\) for all \(\mathbf{v}\neq\mathbf{0}\) satisfying \(\langle \nabla h_i(\mathbf{x}^{\star}), \mathbf{v} \rangle = 0\) for all \(i\) and \(\langle \nabla f_i(\mathbf{x}^{\star}), \mathbf{v} \rangle \le 0\) for all active constraints, then \(\mathbf{x}^{\star}\) is a local minimum of \(f_0\).
That is, \(\nabla^2_{\mathbf{x}}\mathcal{L}(\mathbf{x}^{\star}, \boldsymbol{\lambda}, \boldsymbol{\nu})\) is positive definite when restricted to the cone of tangent directions at \(\mathbf{x}^{\star}\).
This is a partial converse to the Lagrange multiplier rule.
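When strict complementarity holds (all active inequality multipliers are positive), the tangent directions in the condition above form the null space of the active constraint gradients, and the check reduces to the eigenvalues of a projected Hessian. Below is a sketch under that simplifying assumption; the function name and inputs are ours.

```python
import numpy as np

def second_order_check(H, G, tol=1e-10):
    """H: Hessian of the Lagrangian at x*, (d, d), assumed symmetric.
    G: rows are gradients of equality and active inequality constraints, (k, d).
    Returns True if v' H v > 0 for every nonzero v in the null space of G."""
    if G.size:
        _, s, Vt = np.linalg.svd(G)
        rank = int(np.sum(s > tol * s.max()))
        Z = Vt[rank:].T              # columns form a basis of null(G)
    else:
        Z = np.eye(H.shape[0])       # no constraints: check H itself
    if Z.shape[1] == 0:
        return True                  # trivial tangent space
    reduced = Z.T @ H @ Z            # projected (reduced) Hessian
    return bool(np.all(np.linalg.eigvalsh(reduced) > 0))
```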
The Lagrange dual function \(g(\boldsymbol{\lambda}, \boldsymbol{\nu}) = \inf_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu})\) supplies, for each pair \((\boldsymbol{\lambda}, \boldsymbol{\nu})\) with \(\boldsymbol{\lambda} \geq \mathbf{0}\), a lower bound on the optimal value \(p^\star\): for any feasible \(\mathbf{x}\), \(g(\boldsymbol{\lambda}, \boldsymbol{\nu}) \le \mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}) \le f_0(\mathbf{x})\). It is natural to ask for the best possible lower bound for \(p^{\star}\) the Lagrange dual function can provide. This leads to the Lagrange dual problem \[ \begin{array}{ll} \text{maximize}& g(\boldsymbol{\lambda}, \boldsymbol{\nu}) \\ \text{subject to}& \boldsymbol{\lambda} \geq \mathbf{0}, \end{array} \] which is a convex problem whether or not the primal problem is convex.
Let \(d^\star=\sup_{\boldsymbol{\lambda}\ge\mathbf{0}, \boldsymbol{\nu}}g(\boldsymbol{\lambda},\boldsymbol{\nu})\). Then we have weak duality \[\begin{eqnarray*} d^\star \le p^\star. \end{eqnarray*}\] The difference \(p^\star - d^\star\) is called the duality gap.
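For intuition, here is a toy one-dimensional example of ours: minimize \(x^2\) subject to \(1 - x \le 0\), so \(p^\star = 1\) at \(x = 1\). Minimizing the Lagrangian \(x^2 + \lambda(1 - x)\) over \(x\) gives \(g(\lambda) = \lambda - \lambda^2/4\), and the snippet confirms \(g(\lambda) \le p^\star\) on a grid, with the gap closing at \(\lambda = 2\).

```python
import numpy as np

# Toy problem: minimize x^2 subject to 1 - x <= 0, so p* = 1 at x = 1.
# The dual function, derived by hand, is g(lam) = lam - lam**2 / 4.
p_star = 1.0
lams = np.linspace(0.0, 5.0, 51)
g = lams - lams**2 / 4.0
assert np.all(g <= p_star + 1e-12)     # weak duality: g(lam) <= p*
print(lams[np.argmax(g)], g.max())     # maximized at lam = 2, where g = 1 = p*
```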
Strong duality refers to a zero duality gap, \(p^\star - d^\star = 0\).
Strong duality occurs when (but not only when) the problem is convex and the Lagrange multiplier rule holds at a feasible point \(\hat{\mathbf{x}}\).
Let the corresponding Lagrange multipliers be \((\hat{\boldsymbol{\lambda}}, \hat{\boldsymbol{\nu}})\). In this case, \(\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu})\) is convex in \(\mathbf{x}\). Since the Lagrange multiplier rule states that \(\nabla_{\mathbf{x}}\mathcal{L}(\hat{\mathbf{x}}, \hat{\boldsymbol{\lambda}}, \hat{\boldsymbol{\nu}}) = \mathbf{0}\), the point \(\hat{\mathbf{x}}\) furnishes an unconstrained minimum of the Lagrangian. Hence \[ p^{\star} \le f_0(\hat{\mathbf{x}}) = \mathcal{L}(\hat{\mathbf{x}}, \hat{\boldsymbol{\lambda}}, \hat{\boldsymbol{\nu}}) = \inf_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \hat{\boldsymbol{\lambda}}, \hat{\boldsymbol{\nu}}) = g(\hat{\boldsymbol{\lambda}}, \hat{\boldsymbol{\nu}}) \le d^{\star}. \] (Why the first equality? Recall complementary slackness and \(h_i(\hat{\mathbf{x}}) = 0\).)
Combined with weak duality \(d^{\star} \le p^{\star}\), this yields \(d^{\star}=p^{\star}\).
For convex problems, a condition that guarantees the Lagrange multiplier rule at a feasible point, e.g., Slater's condition, therefore also yields strong duality.
Consider the dual of the equality-constrained QP \[ \begin{array}{ll} \text{minimize}& \frac{1}{2}\mathbf{x}^T\mathbf{P}\mathbf{x} + \mathbf{q}^T\mathbf{x} \\ \text{subject to}& \mathbf{A}\mathbf{x} = \mathbf{b} \end{array} \] with \(\mathbf{P}\succ\mathbf{0}\).
The Lagrangian is \[ \mathcal{L}(\mathbf{x}, \boldsymbol{\nu}) = \frac{1}{2}\mathbf{x}^T\mathbf{P}\mathbf{x} + \mathbf{q}^T\mathbf{x} + \boldsymbol{\nu}^T(\mathbf{A}\mathbf{x}-\mathbf{b}). \]
The dual function is \[ \begin{split} g(\boldsymbol{\nu}) &= \inf_{\mathbf{x}} \frac{1}{2}\mathbf{x}^T\mathbf{P}\mathbf{x} + \mathbf{q}^T\mathbf{x} + \boldsymbol{\nu}^T(\mathbf{A}\mathbf{x}-\mathbf{b}) \\ &= \inf_{\mathbf{x}} \frac{1}{2}\mathbf{x}^T\mathbf{P}\mathbf{x} + (\mathbf{A}^T\boldsymbol{\nu}+\mathbf{q})^T\mathbf{x} - \boldsymbol{\nu}^T\mathbf{b} \\ &= -\frac{1}{2}(\mathbf{A}^T\boldsymbol{\nu}+\mathbf{q})^T\mathbf{P}^{-1}(\mathbf{A}^T\boldsymbol{\nu}+\mathbf{q}) - \boldsymbol{\nu}^T\mathbf{b}. \end{split} \]
The infimum is attained at \[ \mathbf{x} = -\mathbf{P}^{-1}(\mathbf{A}^T\boldsymbol{\nu}+\mathbf{q}). \]
Thus the dual problem is an unconstrained QP, which is easier to solve than the primal.
The maximum of \(g(\boldsymbol{\nu})\) occurs where \[ \mathbf{0} = -\mathbf{b} - \mathbf{A}\mathbf{P}^{-1}(\mathbf{A}^T\boldsymbol{\nu}+\mathbf{q}), \] i.e., \[ \boldsymbol{\nu} = -(\mathbf{A}\mathbf{P}^{-1}\mathbf{A}^T)^{-1}(\mathbf{b}+\mathbf{A}\mathbf{P}^{-1}\mathbf{q}). \]
Therefore the primal optimum is attained at \[ \begin{split} \mathbf{x}^{\star} &= -\mathbf{P}^{-1}(-\mathbf{A}^T(\mathbf{A}\mathbf{P}^{-1}\mathbf{A}^T)^{-1}(\mathbf{b}+\mathbf{A}\mathbf{P}^{-1}\mathbf{q})+\mathbf{q}) \\ &= \mathbf{P}^{-1}\mathbf{A}^T(\mathbf{A}\mathbf{P}^{-1}\mathbf{A}^T)^{-1}(\mathbf{b}+\mathbf{A}\mathbf{P}^{-1}\mathbf{q}) - \mathbf{P}^{-1}\mathbf{q}. \end{split} \]
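These closed forms are easy to verify numerically. The sketch below (random problem data, chosen only for illustration) computes \(\boldsymbol{\nu}\) and \(\mathbf{x}^{\star}\) from the formulas above, cross-checks them against the primal KKT linear system, and confirms the zero duality gap \(g(\boldsymbol{\nu}) = p^{\star}\).

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 5, 2
M = rng.standard_normal((d, d))
P = M @ M.T + d * np.eye(d)                  # P positive definite
A = rng.standard_normal((p, d))
b, q = rng.standard_normal(p), rng.standard_normal(d)

Pinv = np.linalg.inv(P)
S = A @ Pinv @ A.T                           # A P^{-1} A^T
nu = -np.linalg.solve(S, b + A @ Pinv @ q)   # dual optimum
x = -Pinv @ (A.T @ nu + q)                   # primal optimum

# Cross-check against the KKT system [P A^T; A 0][x; nu] = [-q; b].
KKT = np.block([[P, A.T], [A, np.zeros((p, p))]])
assert np.allclose(np.linalg.solve(KKT, np.r_[-q, b]), np.r_[x, nu])

primal = 0.5 * x @ P @ x + q @ x
dual = -0.5 * (A.T @ nu + q) @ Pinv @ (A.T @ nu + q) - nu @ b
assert np.allclose(A @ x, b)                 # primal feasibility
assert np.allclose(primal, dual)             # strong duality: g(nu) = p*
```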
Note \[ \begin{split} \sup_{\boldsymbol{\lambda}\ge\mathbf{0}, \boldsymbol{\nu}}\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}) &= \sup_{\boldsymbol{\lambda}\ge\mathbf{0}, \boldsymbol{\nu}} f_0(\mathbf{x}) + \sum_{i=1}^m \lambda_i f_i(\mathbf{x}) + \sum_{i=1}^p \nu_i h_i(\mathbf{x}) \\ &= \begin{cases} f_0(\mathbf{x}) & \text{if } \mathbf{x} \in C \\ \infty & \text{otherwise} \end{cases} = f_0(\mathbf{x}) + \iota_{C}(\mathbf{x}) \end{split} \]
where \[ C = \{\mathbf{x}: f_i(\mathbf{x})\le 0,~i=1,\dotsc,m,~h_i(\mathbf{x})=0,~i=1,\dotsc,p\} \] is the feasible set and \(\iota_{C}\) is its indicator function, equal to \(0\) on \(C\) and \(+\infty\) outside.
Thus weak duality \[ \begin{split} d^{\star} = \sup_{\boldsymbol{\lambda}\ge\mathbf{0}, \boldsymbol{\nu}} g(\boldsymbol{\lambda}, \boldsymbol{\nu}) &= \sup_{\boldsymbol{\lambda}\ge\mathbf{0}, \boldsymbol{\nu}}\inf_{\mathbf{x}}\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}) \\ &\le \inf_{\mathbf{x}}\sup_{\boldsymbol{\lambda}\ge\mathbf{0}, \boldsymbol{\nu}}\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}) \\ &= \inf_{\mathbf{x}\in C} f_0(\mathbf{x}) = p^{\star} \end{split} \] is just an instance of the min-max inequality.
Strong duality amounts to proving the minimax theorem \[ \sup_{\boldsymbol{\lambda}\ge\mathbf{0}, \boldsymbol{\nu}}\inf_{\mathbf{x}}\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}) = \inf_{\mathbf{x}}\sup_{\boldsymbol{\lambda}\ge\mathbf{0}, \boldsymbol{\nu}}\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}). \]
Boyd, S. and Vandenberghe, L., 2004. Convex Optimization. Cambridge University Press. (Ch. 5)
Lange, K., 2016. MM Optimization Algorithms (Vol. 147). SIAM. (Ch. 3)
Lange, K., 2010. Numerical Analysis for Statisticians. Springer Science & Business Media. (Ch. 11)