\[
\begin{eqnarray*}
&\text{minimize}& \mathbf{f}^T \mathbf{x} \\
&\text{subject to}& \|\mathbf{A}_i \mathbf{x} + \mathbf{b}_i\|_2 \le \mathbf{c}_i^T \mathbf{x} + d_i, \quad i = 1,\ldots,m \\
& & \mathbf{F} \mathbf{x} = \mathbf{g}
\end{eqnarray*}
\] over \(\mathbf{x} \in \mathbb{R}^d\). This says the points \((\mathbf{A}_i \mathbf{x} + \mathbf{b}_i, \mathbf{c}_i^T \mathbf{x} + d_i)\) live in the second order cone (ice cream cone, Lorentz cone, quadratic cone) \[
\begin{eqnarray*}
\mathbb{L}^{d+1} = \{(\mathbf{x}, t): \|\mathbf{x}\|_2 \le t\}
\end{eqnarray*}
\] in \(\mathbb{R}^{d+1}\).
QP is a special case of SOCP. Why?
When \(\mathbf{c}_i = \mathbf{0}\) for \(i=1,\ldots,m\), SOCP is equivalent to a quadratically constrained quadratic program (QCQP)\[\begin{eqnarray*}
&\text{minimize}& (1/2) \mathbf{x}^T \mathbf{P}_0 \mathbf{x} + \mathbf{q}_0^T \mathbf{x} \\
&\text{subject to}& (1/2) \mathbf{x}^T \mathbf{P}_i \mathbf{x} + \mathbf{q}_i^T \mathbf{x} + r_i \le 0, \quad i = 1,\ldots,m \\
& & \mathbf{A} \mathbf{x} = \mathbf{b},
\end{eqnarray*}\] where \(\mathbf{P}_i \in \mathbb{S}_+^n\), \(i=0,1,\ldots,m\). Why?
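One way to see both reductions: a convex quadratic constraint is a single second-order cone constraint. Writing \(\mathbf{P}_i = \mathbf{F}_i^T \mathbf{F}_i\) (possible since \(\mathbf{P}_i \succeq \mathbf{0}\)), \[
\frac 12 \mathbf{x}^T \mathbf{P}_i \mathbf{x} + \mathbf{q}_i^T \mathbf{x} + r_i \le 0
\iff
\left\| \begin{bmatrix} \mathbf{F}_i \mathbf{x} \\ \frac{1}{\sqrt 2} (\mathbf{q}_i^T \mathbf{x} + r_i + 1) \end{bmatrix} \right\|_2
\le \frac{1}{\sqrt 2} (1 - \mathbf{q}_i^T \mathbf{x} - r_i),
\] as squaring both sides recovers \(\mathbf{x}^T \mathbf{P}_i \mathbf{x} \le -2(\mathbf{q}_i^T \mathbf{x} + r_i)\). A quadratic objective is handled the same way: introduce an epigraph variable \(t\), minimize \(t\), and impose \((1/2) \mathbf{x}^T \mathbf{P}_0 \mathbf{x} + \mathbf{q}_0^T \mathbf{x} - t \le 0\) as such a cone constraint. In particular, QP (a quadratic objective with affine constraints) is a special case of SOCP.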
SOC representability
A convex set \(S \subseteq \mathbb{R}^n\) is said to be second-order cone representable (SOCr) if \(S\) is the projection of a set in a higher-dimensional space that can be described by a set of second-order cone inequalities. Specifically, \[
\mathbf{x}\in S \iff \exists\mathbf{u}\in \mathbb{R}^m:
\left\| \mathbf{A}_i\begin{bmatrix} \mathbf{x} \\ \mathbf{u} \end{bmatrix} + \mathbf{b}_i \right\|_2
\le \mathbf{c}_i^T\begin{bmatrix} \mathbf{x} \\ \mathbf{u} \end{bmatrix} + d_i, \quad i=1, \dotsc, N.
\]
A convex function \(f: \mathbb{R}^n \to \mathbb{R}\) is said to be SOCr if and only if \(\text{epi}f\) is SOCr.
A rotated quadratic cone in \(\mathbb{R}^{d+2}\) is \[\begin{eqnarray*}
\mathbb{L}_r^{d+2} = \{(\mathbf{x}, t_1, t_2): \|\mathbf{x}\|_2^2 \le 2 t_1 t_2, \mathbf{x}\in\mathbb{R}^{d}, t_1 \ge 0, t_2 \ge 0\}.
\end{eqnarray*}\]
In fact \(\mathbb{L}_r^{d+2}\) is \(\mathbb{L}^{d+2}\) rotated by 45\(^\circ\) in the \(t_1\)-\(t_2\) plane: the rotation \[
\begin{bmatrix} \mathbf{x} \\ t_1 \\ t_2 \end{bmatrix}
\mapsto
\begin{bmatrix} \mathbf{x} \\ (t_1-t_2)/\sqrt{2} \\ (t_1+t_2)/\sqrt{2} \end{bmatrix}
\] maps \(\mathbb{L}_r^{d+2}\) one-to-one onto \(\mathbb{L}^{d+2}\), since \(\|\mathbf{x}\|_2^2 + (t_1-t_2)^2/2 \le (t_1+t_2)^2/2 \iff \|\mathbf{x}\|_2^2 \le 2 t_1 t_2\).
Gurobi allows users to input second-order cone constraints and quadratic constraints directly.
Mosek allows users to input second-order cone constraints, quadratic constraints, and rotated quadratic cone constraints directly.
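For instance, with the JuMP interface (an alternative to the Convex.jl syntax used later in this lecture), these cones can be passed to Mosek directly; the data, variable names, and objective below are made up purely for illustration.

```julia
# a sketch of direct cone input via JuMP + Mosek; data and objective are illustrative
using JuMP, MosekTools

a = [1.0, -2.0, 3.0]
model = Model(Mosek.Optimizer)
set_silent(model)
@variable(model, x[1:3])
@variable(model, t)   # epigraph variable for ‖x - a‖₂
@variable(model, s)   # epigraph variable for ‖x‖₂²
@constraint(model, [t; x - a] in SecondOrderCone())          # ‖x - a‖₂ ≤ t
@constraint(model, [s; 0.5; x] in RotatedSecondOrderCone())  # ‖x‖₂² ≤ 2 ⋅ s ⋅ (1/2) = s
@objective(model, Min, t + 0.1 * s)
optimize!(model)
value(t), value(s)
```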
SOC-representable sets
Elementary sets
(Epigraphs of affine functions) \(\mathbf{a}^T\mathbf{x} + b \le t \iff (0, t - \mathbf{a}^T\mathbf{x} - b) \in \mathbb{L}^{2}\).
(Euclidean norms) \(\|\mathbf{x}\|_2 \le t \Leftrightarrow (\mathbf{x}, t) \in \mathbb{L}^{d+1}\).
(Absolute values) \(|x| \le t \Leftrightarrow (x, t) \in \mathbb{L}^2\).
(Sum of squares) \(\|\mathbf{x}\|_2^2 \le t \Leftrightarrow (\mathbf{x}, t, 1/2) \in \mathbb{L}_r^{d+2}\).
(Epigraphs of quadratic-fractional functions) Let \[
g(\mathbf{x}, s) = \begin{cases} \frac{\mathbf{x}^T\mathbf{x}}{s}, & s > 0 \\ 0, & s=0, \mathbf{x}=\mathbf{0} \\ \infty, & \text{otherwise} \end{cases}
\] Then \(\text{epi} g = \{(\mathbf{x}, s, t): \mathbf{x}^T\mathbf{x} \le st, \; s, t \ge 0\}\) is SOCr, since \(\mathbf{x}^T\mathbf{x} \le st \iff (\mathbf{x}, s, t/2) \in \mathbb{L}_r^{d+2}\).
(Hyperbola) \(\{(t,s)\in\mathbb{R}^2: ts \ge 1, t > 0\} = \text{epi}\tilde{g}\) with \(\tilde{g}(s) = g(1,s)\) for the \(g\) above.
Operations that preserve SOCr of sets
Intersection.
Cartesian product.
Minkowski sum.
Image \(\mathcal{A}C\) of a SOCr set \(C\) under a linear (affine) map \(\mathcal{A}\).
Inverse image \(\mathcal{A}^{-1}C\) of a SOCr set \(C\) under a linear (affine) map \(\mathcal{A}\).
Operations that preserve SOCr of functions
Nonnegative weighted sum.
Composition with an affine mapping.
Pointwise maximum and supremum.
Separable sum. \(g(\mathbf{x}_1,\dotsc,\mathbf{x}_m) = g_1(\mathbf{x}_1) + \dotsb + g_m(\mathbf{x}_m)\) is SOCr if each summand is so.
Partial minimization.
Perspective. If \(f(\mathbf{x})\) is SOCr, then its perspective \(g(\mathbf{x}, t) = t f(t^{-1}\mathbf{x})\) is SOCr.
More complex (but useful) sets
(Ellipsoids) For \(\mathbf{P} \in \mathbb{S}_+^n\), we can find \(\mathbf{F} \in \mathbb{R}^{k \times n}\) such that \(\mathbf{P} = \mathbf{F}^T \mathbf{F}\). Then the ellipsoid \(\{\mathbf{x}: (\mathbf{x} - \mathbf{x}_0)^T \mathbf{P} (\mathbf{x} - \mathbf{x}_0) \le 1\} = \{\mathbf{x}: \|\mathbf{F}(\mathbf{x} - \mathbf{x}_0)\|_2 \le 1\}\) is SOCr.
(Harmonic mean) The hypograph of the harmonic mean function \(\left( n^{-1} \sum_{i=1}^n x_i^{-1} \right)^{-1}\) is SOCr.
(Rational \(\ell_p\)-norm) Function \(\|\mathbf{x}\|_p = (\sum_{i=1}^d |x_i|^p)^{1/p}\) for \(p \ge 1\) rational is SOCr.
Our catalog of SOCP-representable terms now includes all of the above.
Most of these functions are implemented as built-in functions (atoms) in the convex optimization modeling languages cvx (for Matlab) and Convex.jl (for Julia).
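For example, a few of these atoms can be combined freely in Convex.jl; the data and the toy problem below are made up for illustration, and `Convex`, `MosekTools`, and `MathOptInterface` (as `MOI`) are assumed to be loaded, as in the code later in this lecture.

```julia
# a toy problem combining several SOC-representable atoms in Convex.jl (illustrative only)
a = [1.0, -2.0, 0.5, 3.0, -1.5]
x = Variable(5)
s = Variable()
problem = minimize(
    norm(x - a, 2) +          # Euclidean norm ‖x - a‖₂
    sumsquares(x) +           # sum of squares ‖x‖₂²
    quadoverlin(x - a, s),    # quadratic-fractional (x - a)ᵀ(x - a) / s
    s <= 2.0)
solver = Mosek.Optimizer()
MOI.set(solver, MOI.RawOptimizerAttribute("LOG"), 0)
solve!(problem, solver)
problem.optval
```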
Example 1: group lasso
In many applications, we need to perform variable selection at the group level. For instance, in factorial analysis, we want to select or de-select the group of regression coefficients for a factor simultaneously. Yuan and Lin (2006) propose the group lasso \[\begin{eqnarray*}
&\text{minimize}& \frac 12 \|\mathbf{y} - \beta_0 \mathbf{1} - \mathbf{X} \beta\|_2^2 + \lambda \sum_{g=1}^G w_g \|\beta_g\|_2,
\end{eqnarray*}\] where \(\beta_g\) is the subvector of regression coefficients for group \(g\), and \(w_g\) are fixed group weights. This is equivalent to the SOCP \[\begin{eqnarray*}
&\text{minimize}& \frac 12 \beta^T \mathbf{X}^T \left(\mathbf{I} - \frac{\mathbf{1} \mathbf{1}^T}{n} \right) \mathbf{X} \beta - \mathbf{y}^T \left(\mathbf{I} - \frac{\mathbf{1} \mathbf{1}^T}{n} \right) \mathbf{X} \beta + \lambda \sum_{g=1}^G w_g t_g \\
&\text{subject to}& \|\beta_g\|_2 \le t_g, \quad g = 1,\ldots, G,
\end{eqnarray*}\] in variables \(\beta\) and \(t_1,\ldots,t_G\).
Overlapping groups are allowed here.
```julia
using Statistics
using Random, LinearAlgebra, SparseArrays

Random.seed!(123) # seed

n, p = 100, 10
# Design matrix
X0 = randn(n, p)
# design matrix is standardized and includes intercept
X = [ones(n, 1) (X0 .- mean(X0, dims=1)) ./ std(X0, dims=1)]
# True regression coefficients (first 5 are non-zero)
β = [1.0; randn(5); zeros(p - 5)]
# Responses
y = X * β + randn(n)

# plot the true β
using Plots; gr()
plt = plot(1:p, β[2:end], line=:stem, marker=2, legend=:none)
xlabel!(plt, "coordinate")
ylabel!(plt, "beta_j")
title!(plt, "True beta")
```
Sparse group lasso \[\begin{eqnarray*}
&\text{minimize}& \frac 12 \|\mathbf{y} - \beta_0 \mathbf{1} - \mathbf{X} \beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \sum_{g=1}^G w_g \|\beta_g\|_2
\end{eqnarray*}\] achieves sparsity at both group and individual coefficient level and can be solved by SOCP as well.
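As an illustration, here is a minimal Convex.jl sketch fitting the group lasso to the simulated `X` and `y` above; the grouping of the ten slopes into five pairs, the unit weights \(w_g \equiv 1\), and the value of \(\lambda\) are assumptions made only for this sketch.

```julia
# group lasso via Convex.jl -- a sketch; groups, weights, and λ are illustrative assumptions
groups = [1:2, 3:4, 5:6, 7:8, 9:10]   # hypothetical non-overlapping groups of the 10 slopes
λ = 10.0
β̂grp = Variable(size(X, 2))           # entry 1 is the unpenalized intercept
grouppen = sum(norm(β̂grp[g .+ 1], 2) for g in groups)   # w_g ≡ 1; shift indices past intercept
# for the sparse group lasso, add λ1 * norm(β̂grp[2:end], 1) to the objective
problem = minimize(0.5sumsquares(y - X * β̂grp) + λ * grouppen)
solver = Mosek.Optimizer()
MOI.set(solver, MOI.RawOptimizerAttribute("LOG"), 0)
solve!(problem, solver)
β̂grp.value
```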
We have seen zero-sum group lasso before.
Evidently we can solve any of the previous loss functions (quantile, \(\ell_1\), composite quantile, Huber, multi-response models) combined with a group or sparse group penalty by SOCP.
Belloni, Chernozhukov, and Wang (2011) minimize the square root lasso criterion \[
\|\mathbf{y} - \beta_0 \mathbf{1} - \mathbf{X} \beta\|_2 + \lambda \|\beta\|_1
\] by SOCP. This variant generates a solution path that is in one-to-one correspondence with the lasso path (why?) but simplifies the choice of \(\lambda\), as it does not depend on the variance of the additive noise.
```julia
# solve at a grid of λ
λgrid = 0:10:200
β̂path = zeros(length(λgrid), size(X, 2)) # each row is β̂ at a λ
β̂lasso = Variable(size(X, 2))
@time for i in 1:length(λgrid)
    λ = λgrid[i]
    # objective
    problem = minimize(0.5sumsquares(y - X * β̂lasso) + λ * norm(β̂lasso[2:end], 1))
    solver = Mosek.Optimizer()
    MOI.set(solver, MOI.RawOptimizerAttribute("LOG"), 0)
    solve!(problem, solver)
    β̂path[i, :] = β̂lasso.value
end

plt = plot(collect(λgrid), β̂path, legend=:none)
xlabel!(plt, "lambda")
ylabel!(plt, "beta_hat")
title!(plt, "Lasso")
```
0.194603 seconds (162.87 k allocations: 16.071 MiB, 12.33% compilation time)
```julia
λlist = 0:0.5:10
βpath = zeros(length(λlist), size(X, 2)) # each row is β̂ at a λ
βsqrtlasso = Variable(size(X, 2))
@time for i = 1:length(λlist)
    λ = λlist[i]
    problem = minimize(norm(y - X * βsqrtlasso, 2) + λ * norm(βsqrtlasso[2:end], 1))
    solver = Mosek.Optimizer()
    MOI.set(solver, MOI.RawOptimizerAttribute("LOG"), 0)
    solve!(problem, solver)
    βpath[i, :] = βsqrtlasso.value
end
```
0.361814 seconds (306.92 k allocations: 25.685 MiB, 6.91% gc time, 49.63% compilation time)
T. Sun and C.H. Zhang propose the scaled lasso \[
\min_{\sigma > 0, \beta_0, \beta} ~ \frac{\sigma}{2} + \frac{\|\mathbf{y} - \beta_0 \mathbf{1} - \mathbf{X} \beta\|_2^2}{2\sigma} + \lambda \|\beta\|_1.
\]
By the AM-GM inequality, it is immediate that the scaled lasso is equivalent to the square root lasso: minimizing over \(\sigma\) gives \(\sigma = \|\mathbf{y} - \beta_0 \mathbf{1} - \mathbf{X} \beta\|_2\), at which point the first two terms collapse to \(\|\mathbf{y} - \beta_0 \mathbf{1} - \mathbf{X} \beta\|_2\).
From the SOC representability of the epigraph of the quadratic-fractional function, this alternative formulation is also an SOCP.
```julia
λlist = 0:0.5:10
βpath = zeros(length(λlist), size(X, 2)) # each row is β̂ at a λ
βscaledlasso = Variable(size(X, 2))
t = Variable()
@time for i = 1:length(λlist)
    λ = λlist[i]
    # note the `quadoverlin` function:
    # https://jump.dev/Convex.jl/stable/operations/#Second-Order-Cone-Representable-Functions
    problem = minimize(0.5t + 0.5quadoverlin(y - X * βscaledlasso, t) +
                       λ * norm(βscaledlasso[2:end], 1))
    solver = Mosek.Optimizer()
    MOI.set(solver, MOI.RawOptimizerAttribute("LOG"), 0)
    solve!(problem, solver)
    βpath[i, :] = βscaledlasso.value
end

plt = plot(collect(λlist), βpath, legend=:none)
xlabel!(plt, "lambda")
ylabel!(plt, "beta_hat")
title!(plt, "Scaled Lasso")
```
0.540314 seconds (450.59 k allocations: 35.594 MiB, 4.17% gc time, 67.51% compilation time)
SOCP example: \(\ell_p\)-norm regression
\(\ell_p\) regression with \(p \ge 1\) a rational number \[\begin{eqnarray*}
&\text{minimize}& \|\mathbf{y} - \mathbf{X} \beta\|_p
\end{eqnarray*}\] can be formulated as an SOCP. Why? For instance, \(\ell_{3/2}\) regression combines the advantages of robust \(\ell_1\) regression and least squares.
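A sketch for \(p = 3/2\): with residuals \(r_i = y_i - \mathbf{x}_i^T \beta\) and auxiliary variables \(t_i, u_i, s_i\), each constraint \(|r_i|^{3/2} \le t_i\) is SOC-representable, because \[
|r|^{3/2} \le t \iff \exists\, u, s: \quad -u \le r \le u, \quad u^2 \le s t, \quad s^2 \le u, \quad s, t \ge 0,
\] and, after rescaling by the factor 2, \(u^2 \le st\) and \(s^2 \le u \cdot 1\) are rotated quadratic cone constraints. Minimizing \(\sum_i t_i\) then minimizes \(\|\mathbf{y} - \mathbf{X} \beta\|_{3/2}^{3/2}\), which has the same minimizer as \(\|\mathbf{y} - \mathbf{X} \beta\|_{3/2}\). General rational \(p\) works similarly with more auxiliary variables.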
SOCP example: total variation image denoising
Let \(\mathbf{Y} \in \mathbb{R}^{n\times m}\) represent a noisy image. Isotropic total variation (TV) denoising minimizes \[
\frac{1}{2}\|\mathbf{Y}-\mathbf{X}\|_{\rm F}^2 + \lambda \sum_{i,j}\sqrt{(x_{i+1,j}-x_{ij})^2 + (x_{i,j+1}-x_{ij})^2}, \quad \lambda > 0,
\] where \(\mathbf{X}=(x_{ij})\) is the optimization variable.
SOCP method:
Goldfarb, D. and Yin, W. (2005). Second-order cone programming methods for total variation-based image restoration. SIAM Journal on Scientific Computing, 27(2), 622-645. https://doi.org/10.1137/040608982
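A minimal Convex.jl sketch of the isotropic TV problem above: the image `Yimg`, the value of \(\lambda\), and the choice of Mosek are illustrative assumptions, and building one cone per pixel is only practical for small images.

```julia
# isotropic TV denoising via Convex.jl -- a sketch; Yimg, λ, and the solver are illustrative
Yimg = randn(32, 32)                   # hypothetical noisy image Y
λ = 0.1
n, m = size(Yimg)
Ximg = Variable(n, m)                  # the optimization variable X = (x_ij)
dvert = Ximg[2:n, 1:m-1] - Ximg[1:n-1, 1:m-1]   # x_{i+1,j} - x_{ij}
dhorz = Ximg[1:n-1, 2:m] - Ximg[1:n-1, 1:m-1]   # x_{i,j+1} - x_{ij}
tv = sum(norm([dvert[i, j]; dhorz[i, j]], 2) for i in 1:n-1, j in 1:m-1)
problem = minimize(0.5sumsquares(vec(Ximg - Yimg)) + λ * tv)
solver = Mosek.Optimizer()
MOI.set(solver, MOI.RawOptimizerAttribute("LOG"), 0)
solve!(problem, solver)
```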