This lecture gives an overview of some optimization tools in Julia.
Commercial solvers MOSEK and Gurobi need to be installed for this session.
Category of optimization problems:
Problems with analytical solutions: least squares, principal component analysis, canonical correlation analysis, …
Problems that fit the disciplined convex programming (DCP) framework: linear programming (LP), quadratic programming (QP), second-order cone programming (SOCP), semidefinite programming (SDP), and geometric programming (GP).
Nonlinear programming (NLP): Newton-type algorithms, Fisher scoring algorithm, EM algorithm, MM algorithms.
Large scale optimization: ADMM, SGD, …
(Figure: optimization flowchart)
Modeling tools and solvers
Getting familiar with good optimization software broadens the scope and scale of problems we are able to solve in statistics. The following table lists some of the best optimization software.
|                     | LP | MILP | SOCP | MISOCP | SDP | GP | NLP | MINLP | R | Matlab | Julia | Python | Cost |
|---------------------|:--:|:----:|:----:|:------:|:---:|:--:|:---:|:-----:|:-:|:------:|:-----:|:------:|:----:|
| modeling tools      |    |      |      |        |     |    |     |       |   |        |       |        |      |
| AMPL                | x  | x    | x    | x      |     | x  | x   | x     | x | x      | x     | x      | $    |
| cvx                 | x  | x    | x    | x      | x   | x  |     |       | x | x      |       |        | A    |
| Convex.jl           | x  | x    | x    | x      | x   |    |     |       |   |        | x     |        | O    |
| JuMP.jl             | x  | x    | x    | x      |     |    | x   | x     |   |        | x     |        | O    |
| MathProgBase.jl     | x  | x    | x    | x      |     |    | x   | x     |   |        | x     |        | O    |
| MathOptInterface.jl | x  | x    | x    | x      |     |    | x   | x     |   |        | x     |        | O    |
| convex solvers      |    |      |      |        |     |    |     |       |   |        |       |        |      |
| Mosek               | x  | x    | x    | x      | x   | x  | x   |       | x | x      | x     | x      | A    |
| Gurobi              | x  | x    | x    | x      |     |    |     |       | x | x      | x     | x      | A    |
| CPLEX               | x  | x    | x    | x      |     |    |     |       | x | x      | x     | x      | A    |
| SCS                 | x  |      | x    |        | x   |    |     |       |   | x      | x     | x      | O    |
| NLP solvers         |    |      |      |        |     |    |     |       |   |        |       |        |      |
| NLopt               |    |      |      |        |     |    | x   |       | x | x      | x     | x      | O    |
| Ipopt               |    |      |      |        |     |    | x   |       | x | x      | x     | x      | O    |
| KNITRO              | x  | x    |      |        |     |    | x   | x     | x | x      | x     | x      | $    |
O: open source
A: free academic license
$: commercial
Modeling tools v. solvers
Solvers
Mosek, Gurobi, Cplex, SCS, etc. are concrete software implementations of optimization algorithms.
Mosek/Gurobi/SCS for convex optimization
Ipopt/NLopt for nonlinear programming
Mosek and Gurobi are commercial software but free for academic use. SCS/Ipopt/NLopt are open source.
Users need to implement the problem to be solved using the solver's application programming interface (API); see each solver's documentation for examples.
Modeling tools
AMPL (commercial, https://ampl.com) is an algebraic modeling language that allows one to describe an optimization problem in a form close to its mathematical formulation.
cvx (for Matlab) and Convex.jl (Julia) implement the disciplined convex programming (DCP) paradigm proposed by Grant and Boyd (2008). DCP prescribes a set of simple rules from which users can construct convex optimization problems easily.
Convex.jl interfaces with actual problem solvers via MathOptInterface, an abstraction layer for mathematical optimization solvers in Julia.
For example, MOSEK is an actual interior-point solver for convex and mixed-integer programs. It provides APIs that can be accessed at a low level using Mosek.jl, which is wrapped by MosekTools.jl, which in turn implements MathOptInterface.
Modeling tools can usually dispatch to a variety of solvers. Because the modeling layer is solver agnostic, users do not have to worry about any specific solver's interface, as sketched below.
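As a quick illustration of this solver agnosticism, here is a minimal sketch with a tiny made-up LP (not part of the original notes); any MathOptInterface-compatible solver can be swapped in for SCS without touching the model code.

using Convex, SCS

x = Variable(2)
problem = minimize(sum(x))       # a tiny LP: minimize x₁ + x₂
problem.constraints += x >= 1.0  # subject to x ≥ 1 elementwise
solver = SCS.Optimizer()         # swap in Mosek.Optimizer(), COSMO.Optimizer(), etc.
solve!(problem, solver)
problem.optval                   # ≈ 2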
DCP using Convex.jl
Standard convex problem classes like LP (linear programming), QP (quadratic programming), SOCP (second-order cone programming), SDP (semidefinite programming), and GP (geometric programming), are becoming a technology.
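As a small illustration of the DCP rules (a made-up expression, not part of the original example), Convex.jl can verify the curvature of expressions built by composing its atoms:

using Convex, LinearAlgebra

x = Variable(3)
expr = sumsquares(x) + 2 * norm(x, 1)  # a sum of convex atoms is convex under the DCP rules
vexity(expr)                            # should report ConvexVexity()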
Example: microbiome regression analysis – compositional data
We illustrate optimization tools in Julia using microbiome analysis as an example.
Diversity of the gut microbiome is believed to be an important factor in human health and diseases such as obesity; the respiratory microbiome affects pulmonary function in HIV-infected populations.
16S microbiome sequencing technology generates sequence counts of various organisms (operational taxonomic units, or OTUs) in samples.
The major interest is the composition of the microbiome in the population, hence for statistical analysis, OTU counts are normalized into proportions for each sample.
Thus if there are \(p\) OTUs, each observation \(\mathbf{z}_i\) lies in the positive probability simplex \(\Delta_{p-1,+}=\{(z_1, \dotsc, z_p) : z_j > 0, \sum_{j=1}^p z_j=1\}\). This unit-sum constraint introduces dependency among the \(p\) variables, causing intrinsic difficulties in providing sensible interpretations for the regression parameters. A well-known resolution of this difficulty is the log-ratio transformation due to Aitchison, which constructs the data matrix \(\tilde{\mathbf{X}}=(\log\frac{z_{ij}}{z_{ip}}) \in \mathbb{R}^{n\times (p-1)}\). Thus the regression model is \[
\mathbf{y} = \tilde{\mathbf{X}}\tilde{\boldsymbol\beta} + \boldsymbol{\varepsilon}, \quad \tilde{\boldsymbol\beta} \in \mathbb{R}^{p-1}.
\]
By introducing an extra coefficient \(\beta_p = -\sum_{j=1}^{p-1}\beta_j\), and letting \(\mathbf{X}=(\log z_{ij})\), the above asymmetric model becomes symmetric: \[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad \boldsymbol{\beta} \in \mathbb{R}^{p}.
\]
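A minimal sketch of these two design matrices, using a small made-up compositional matrix Z (Xtilde plays the role of \(\tilde{\mathbf{X}}\)):

# hypothetical compositional data: each row of Z lies in the probability simplex
Z = rand(5, 4)
Z ./= sum(Z, dims=2)
Xtilde = log.(Z[:, 1:end-1] ./ Z[:, end])  # n × (p-1) additive log-ratio design matrix
Xfull  = log.(Z)                           # n × p log matrix used with the zero-sum constraint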
Zero-sum regression
In other words, we need to solve a zero-sum least squares problem \[
\begin{array}{ll}
\text{minimize} & \frac{1}{2} \|\mathbf{y} - \mathbf{X} \boldsymbol{\beta}\|_2^2 \\
\text{subject to} & \mathbf{1}^T\boldsymbol{\beta} = 0.
\end{array}
\] For details, see
Lin, W., Shi, P., Feng, R. and Li, H., 2014. Variable selection in regression with compositional covariates. Biometrika, 101(4), pp.785-797. https://doi.org/10.1093/biomet/asu031
The sum-to-zero constrained least squares is a standard quadratic programming (QP) problem, so it should be solved easily by any QP solver.
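Indeed, expanding the squared loss (and dropping the constant \(\frac{1}{2}\mathbf{y}^T\mathbf{y}\)) shows the standard QP form \[
\begin{array}{ll}
\text{minimize} & \frac{1}{2} \boldsymbol{\beta}^T (\mathbf{X}^T\mathbf{X}) \boldsymbol{\beta} - (\mathbf{X}^T\mathbf{y})^T \boldsymbol{\beta} \\
\text{subject to} & \mathbf{1}^T\boldsymbol{\beta} = 0.
\end{array}
\]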
For simplicity we ignore intercept and non-OTU covariates in this presentation.
Let's first generate an artificial dataset:
using Random, LinearAlgebra, SparseArrays

Random.seed!(123) # seed
n, p = 100, 50
X = rand(n, p)
lmul!(Diagonal(1 ./ vec(sum(X, dims=2))), X)  # normalize each row of X to sum to one
β = sprandn(p, 0.1)  # sparse vector with about 10% non-zero entries
y = X * β + randn(n);
Modeling using Convex.jl
We use the Convex.jl package to model this QP problem. For a complete list of operations supported by Convex.jl, see its documentation. Note how close the code is to the mathematical formulation above.
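A sketch of the zero-sum least squares model in Convex.jl (the choice of COSMO as the solver is an assumption; any QP-capable MathOptInterface solver works):

using Convex, COSMO, MathOptInterface
const MOI = MathOptInterface

β̂ = Variable(size(X, 2))
problem = minimize(0.5sumsquares(y - X * β̂))  # objective
problem.constraints += sum(β̂) == 0            # zero-sum constraint
solver = COSMO.Optimizer()
MOI.set(solver, MOI.RawOptimizerAttribute("verbose"), false)  # keep silent
solve!(problem, solver)
β̂.value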
After installing Gurobi, set up the environment variables. On my machine, I put the following two lines in the ~/.julia/config/startup.jl file:
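For instance (the paths below are hypothetical and machine-specific; Gurobi.jl reads the GUROBI_HOME and GRB_LICENSE_FILE environment variables):

ENV["GUROBI_HOME"]      = "/Library/gurobi1000/macos_universal2"  # hypothetical install path
ENV["GRB_LICENSE_FILE"] = "/Users/username/gurobi.lic"            # hypothetical license path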
Suppose we want to know which organisms (OTUs) are associated with the response. We can answer this question using a zero-sum constrained lasso \[
\begin{array}{ll}
\text{minimize} & \frac 12 \|\mathbf{y} - \mathbf{X} \boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1 \\
\text{subject to} & \mathbf{1}^T\boldsymbol{\beta} = 0.
\end{array}
\]
Varying \(\lambda\) from small to large values will generate a solution path.
# solve at a grid of λ
λgrid = 0:0.01:0.35
β̂path = zeros(length(λgrid), size(X, 2))  # each row is β̂ at a λ
β̂classo = Variable(size(X, 2))
@time for i in 1:length(λgrid)
    λ = λgrid[i]
    # objective
    problem = minimize(0.5sumsquares(y - X * β̂classo) + λ * sum(abs, β̂classo))
    # zero-sum constraint
    problem.constraints += sum(β̂classo) == 0
    # solver = Hypatia.Optimizer()  # try different solvers
    # MOI.set(solver, MOI.RawOptimizerAttribute("verbose"), 0)  # keep silent
    solver = COSMO.Optimizer()
    MOI.set(solver, MOI.RawOptimizerAttribute("verbose"), false)  # keep silent
    solve!(problem, solver)
    β̂path[i, :] = β̂classo.value
end
1.404733 seconds (1.71 M allocations: 310.095 MiB, 48.15% gc time, 0.96% compilation time)
Suppose we want to do variable selection not at the OTU level, but at the Phylum level. OTUs are clustered into various Phyla. We can answer this question using a sum-to-zero constrained group lasso \[
\begin{array}{ll}
\text{minimize} & \frac 12 \|\mathbf{y} - \mathbf{X} \boldsymbol{\beta}\|_2^2 + \lambda \sum_j \|\boldsymbol{\beta}_j\|_2 \\
\text{subject to} & \mathbf{1}^T\boldsymbol{\beta} = 0,
\end{array}
\] where \(\boldsymbol{\beta}_j\) is the \(j\)th partition of the regression coefficients corresponding to the \(j\)th phylum. This is a second-order cone programming (SOCP) problem readily modeled by Convex.jl.
Let's assume every 10 contiguous OTUs belong to one phylum.
# solve at a grid of λ
λgrid = 0.1:0.005:0.5
β̂pathgrp = zeros(length(λgrid), size(X, 2))  # each row is β̂ at a λ
β̂classo = Variable(size(X, 2))
@time for i in 1:length(λgrid)
    λ = λgrid[i]
    # loss
    obj = 0.5sumsquares(y - X * β̂classo)
    # group lasso penalty term
    for j in 1:(size(X, 2) ÷ 10)  # integer division: number of phyla (groups of 10 OTUs)
        βj = β̂classo[(10(j - 1) + 1):10j]
        obj = obj + λ * norm(βj)
    end
    problem = minimize(obj)
    # zero-sum constraint
    problem.constraints += sum(β̂classo) == 0
    solver = Mosek.Optimizer()  # MOSEK this time!
    MOI.set(solver, MOI.RawOptimizerAttribute("LOG"), 0)  # keep silent
    solve!(problem, solver)
    β̂pathgrp[i, :] = β̂classo.value
end
0.846394 seconds (679.84 k allocations: 170.177 MiB, 6.09% gc time)
It took Mosek less than a second to solve this seemingly hard optimization problem at 81 different \(\lambda\) values.
using Plots

p2 = plot(collect(λgrid), β̂pathgrp, legend=:none)
xlabel!(p2, "lambda")
ylabel!(p2, "beta_hat")
title!(p2, "Zero-Sum Group Lasso")
Example: matrix completion
Load the \(128 \times 128\) Lena image with missing pixels.
using ImageIO, FileIO  # detect file formats and dispatch to appropriate readers/writers
lena = load("lena128missing.png")
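The optimization code below works with a numeric matrix Y rather than the image object; a reasonable assumption is that missing pixels are coded as zeros, so the conversion would look like:

Y = Float64.(lena)  # 128 × 128 matrix; entries equal to 0 correspond to missing pixels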
We fill in the missing pixels using the matrix completion technique developed by Candès and Tao \[
\begin{array}{ll}
\text{minimize} & \|\mathbf{X}\|_* \\
\text{subject to} & x_{ij} = y_{ij} \text{ for all observed entries } (i, j).
\end{array}
\]
Here \(\|\mathbf{X}\|_* = \sum_{i=1}^{\min(m,n)} \sigma_i(\mathbf{X})\) is the nuclear norm. It can be shown that \[
\|\mathbf{X}\|_* = \sup_{\|\mathbf{Y}\|_2 \le 1} \langle \mathbf{X}, \mathbf{Y} \rangle,
\] where \(\|\mathbf{Y}\|_2=\sigma_{\max}(\mathbf{Y})\) is the spectral (operator 2-) norm, and \(\langle \mathbf{X}, \mathbf{Y} \rangle = \text{tr}(\mathbf{X}^T\mathbf{Y})\). That is, \(\|\cdot\|_*\) is the dual norm of \(\|\cdot\|_2\).
The nuclear norm can be considered the best convex approximation of \(\text{rank}(\mathbf{X})\), just as \(\|\mathbf{x}\|_1\) is the best convex approximation of \(\|\mathbf{x}\|_0\). Ideally we would seek the matrix of lowest rank that agrees with the observed entries, but instead we minimize the nuclear norm as a convex relaxation.
This is a semidefinite programming (SDP) problem readily modeled by Convex.jl.
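Recall why this is an SDP: the nuclear norm admits the semidefinite representation \[
\|\mathbf{X}\|_* \le t \iff \exists \, \mathbf{W}_1, \mathbf{W}_2 \text{ such that } \begin{pmatrix} \mathbf{W}_1 & \mathbf{X} \\ \mathbf{X}^T & \mathbf{W}_2 \end{pmatrix} \succeq \mathbf{0} \text{ and } \frac{1}{2}(\text{tr}\,\mathbf{W}_1 + \text{tr}\,\mathbf{W}_2) \le t,
\] so minimizing \(\|\mathbf{X}\|_*\) subject to linear equality constraints on the entries is an SDP.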
This example takes a while because of the high dimensionality.
# Use COSMO solver
using COSMO
solver = COSMO.Optimizer()
## Use SCS solver
# using SCS
# solver = SCS.Optimizer()
## Use Mosek solver
# using Mosek
# solver = Mosek.Optimizer()
## Use Hypatia solver
# using Hypatia
# solver = Hypatia.Optimizer()

# Linear indices of obs. entries
obsidx = findall(Y[:] .≠ 0.0)
# Create optimization variables
X = Convex.Variable(size(Y))
# Set up optimization problem
problem = minimize(nuclearnorm(X))
problem.constraints += X[obsidx] == Y[obsidx]
# Solve the problem by calling solve!
@time solve!(problem, solver)
------------------------------------------------------------------
COSMO v0.8.8 - A Quadratic Objective Conic Solver
Michael Garstka
University of Oxford, 2017 - 2022
------------------------------------------------------------------
Problem: x ∈ R^{49153},
constraints: A ∈ R^{73665x49153} (73793 nnz),
matrix size to factor: 122818x122818,
Floating-point precision: Float64
Sets: ZeroSet of dim: 40769
DensePsdConeTriangle of dim: 32896 (256x256)
Settings: ϵ_abs = 1.0e-05, ϵ_rel = 1.0e-05,
ϵ_prim_inf = 1.0e-04, ϵ_dual_inf = 1.0e-04,
ρ = 0.1, σ = 1e-06, α = 1.6,
max_iter = 5000,
scaling iter = 10 (on),
check termination every 25 iter,
check infeasibility every 40 iter,
KKT system solver: QDLDL
Acc: Anderson Type2{QRDecomp},
Memory size = 15, RestartedMemory,
Safeguarded: true, tol: 2.0
Setup Time: 113.52ms
Iter: Objective: Primal Res: Dual Res: Rho:
1 -1.4426e+03 1.1678e+01 5.9856e-01 1.0000e-01
25 1.4495e+02 5.5360e-02 1.1033e-03 1.0000e-01
50 1.4754e+02 1.2369e-02 1.6744e-03 7.4179e-01
75 1.4797e+02 4.9490e-04 5.5696e-05 7.4179e-01
100 1.4797e+02 1.4115e-05 1.3438e-06 7.4179e-01
------------------------------------------------------------------
>>> Results
Status: Solved
Iterations: 100
Optimal objective: 148
Runtime: 2.193s (2193.48ms)
7.279706 seconds (6.41 M allocations: 966.528 MiB, 5.33% gc time, 73.62% compilation time: 14% of which was recompilation)
using Images
colorview(Gray, X.value)
Nonlinear programming (NLP)
We use MLE of Gamma distribution to illustrate some rudiments of nonlinear programming (NLP) in Julia.
Let \(x_1,\ldots,x_m\) be a random sample from the gamma density with shape parameter \(\alpha\) and rate parameter \(\beta\): \[
\small
f(x) = \Gamma(\alpha)^{-1} \beta^{\alpha} x^{\alpha-1} e^{-\beta x}
\] on \((0,\infty)\). The log likelihood function is \[
\small
L(\alpha, \beta) = m [- \ln \Gamma(\alpha) + \alpha \ln \beta + (\alpha - 1)\overline{\ln x} - \beta \bar x],
\] where \(\overline{x} = \frac{1}{m} \sum_{i=1}^m x_i\) and \(\overline{\ln x} = \frac{1}{m} \sum_{i=1}^m \ln x_i\).
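As a quick check of this formula before handing it to an NLP solver, here is a sketch with made-up data (loggamma comes from SpecialFunctions; the data x below are purely illustrative):

using Random, SpecialFunctions, Statistics

Random.seed!(123)
x = rand(100) .+ 0.5  # hypothetical positive data

# log-likelihood L(α, β) = m[-ln Γ(α) + α ln β + (α - 1) mean(ln x) - β mean(x)]
function gamma_loglik(α, β, x)
    m = length(x)
    return m * (-loggamma(α) + α * log(β) + (α - 1) * mean(log, x) - β * mean(x))
end

gamma_loglik(2.0, 1.0, x)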