Optimization in Julia

Advanced Statistical Computing

Joong-Ho Won

Seoul National University

November 2023

versioninfo()
Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 8 × Intel(R) Core(TM) i5-8279U CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 2 on 8 virtual cores
using Pkg
Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()
Status `~/Dropbox/class/M1399.000200/2023/M1300_000200-2023fall/lectures/15-juliaopt/Project.toml`
  [1e616198] COSMO v0.8.8
  [f65535da] Convex v0.15.3
  [31c24e10] Distributions v0.25.103
  [5789e2e9] FileIO v1.16.1
  [f6369f11] ForwardDiff v0.10.36
  [b99e6be6] Hypatia v0.7.3
  [82e4d734] ImageIO v0.6.7
  [916415d5] Images v0.26.0
  [b6b21f68] Ipopt v1.5.1
  [4076af6c] JuMP v1.16.0
  [b8f27783] MathOptInterface v1.22.0
  [6405355b] Mosek v10.1.3
  [1ec41992] MosekTools v0.15.1
  [76087f3c] NLopt v1.0.0
  [91a5bcdd] Plots v1.39.0
  [c946c3f1] SCS v1.3.1
  [276daf66] SpecialFunctions v2.3.1

Overview

This lecture gives an overview of some optimization tools in Julia.

Commercial solvers MOSEK and Gurobi need to be installed for this session.

Categories of optimization problems:

  1. Problems with analytical solutions: least squares, principal component analysis, canonical correlation analysis, …
  2. Problems subject to Disciplined Convex Programming (DCP): linear programming (LP), quadratic programming (QP), second-order cone programming (SOCP), semidefinite programming (SDP), and geometric programming (GP).
  3. Nonlinear programming (NLP): Newton type algorithms, Fisher scoring algorithm, EM algorithm, MM algorithms.
  4. Large scale optimization: ADMM, SGD, …

[Figure: optimization flowchart]

Modeling tools and solvers

Getting familiar with good optimization software broadens the scope and scale of problems we are able to solve in statistics. The following table lists some of the most widely used optimization software.

LP MILP SOCP MISOCP SDP GP NLP MINLP R Matlab Julia Python Cost
modeling tools
AMPL x x x x x x x x x x x $
cvx x x x x x x x x A
Convex.jl x x x x x x O
JuMP.jl x x x x x x x O
MathProgBase.jl x x x x x x x O
MathOptInterface.jl x x x x x x x O
convex solvers
Mosek x x x x x x x x x x x A
Gurobi x x x x x x x x A
CPLEX x x x x x x x x A
SCS x x x x x x O
NLP solvers
NLopt x x x x x O
Ipopt x x x x x O
KNITRO x x x x x x x x $
  • O: open source
  • A: free academic license
  • $: commercial

Modeling tools v. solvers

Solvers

  • Mosek, Gurobi, CPLEX, SCS, etc. are concrete software implementations of optimization algorithms.
    • Mosek/Gurobi/SCS for convex optimization
    • Ipopt/NLopt for nonlinear programming
    • Mosek and Gurobi are commercial software but free for academic use. SCS/Ipopt/NLopt are open source.
  • Users specify the problem to be solved through each solver’s application programming interface (API). Example

Modeling tools

  • AMPL (commercial, https://ampl.com) is an algebraic modeling language that allows one to describe an optimization problem in a form close to its mathematical formulation. Sample model

  • cvx (for Matlab) and Convex.jl (Julia) implement the disciplined convex programming (DCP) paradigm proposed by Grant and Boyd (2008). DCP prescribes a set of simple rules from which users can construct convex optimization problems easily.

  • Convex.jl interfaces with actual problem solvers via MathOptInterface, an abstraction layer for mathematical optimization solvers in Julia.

    • For example, MOSEK is an actual interior-point solver for convex and mixed-integer programs. It provides APIs that can be accessed at a low level using Mosek.jl, which is abstracted by MosekTools.jl, which in turn implements MathOptInterface.
  • Modeling tools can typically employ a variety of solvers. Because modeling tools are solver agnostic, users do not have to worry about solver-specific interfaces, as the sketch below illustrates.
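With Convex.jl, for instance, the same problem object can be handed to different solvers without modifying the model. A toy sketch (this tiny projection problem is made up for illustration):

using Convex, SCS, COSMO

# Toy problem: project the point (1, 2, 3) onto the zero-sum hyperplane
x = Variable(3)
prob = minimize(sumsquares(x - [1.0, 2.0, 3.0]), [sum(x) == 0])

solve!(prob, SCS.Optimizer)    # solve with SCS ...
solve!(prob, COSMO.Optimizer)  # ... re-solve with COSMO; the model is untouched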

DCP Using Convex.jl

Standard convex problem classes like LP (linear programming), QP (quadratic programming), SOCP (second-order cone programming), SDP (semidefinite programming), and GP (geometric programming) are becoming a technology.

Example: microbiome regression analysis – compositional data

[Figure: gut microbiome]

  • We illustrate optimization tools in Julia using microbiome analysis as an example.

  • Diversity of the gut microbiome is believed to be an important factor for human health and diseases such as obesity; the respiratory microbiome affects pulmonary function in HIV-infected populations.

  • 16S microbiome sequencing technology generates sequence counts of various organisms (operational taxonomic units, or OTUs) in samples.

  • The major interest is the composition of the microbiome in the population; hence, for statistical analysis, OTU counts are normalized into proportions within each sample.

  • Thus if there are \(p\) OTUs, each observation \(\mathbf{z}_i\) lies in the positive probability simplex \(\Delta_{p-1,+}=\{(z_1, \dotsc, z_p) : z_j > 0, \sum_{j=1}^p z_j=1\}\). This unit-sum constraint introduces dependency among the \(p\) variables, causing intrinsic difficulties in providing sensible interpretations of the regression parameters. A well-known resolution of this difficulty is the log-ratio transformation due to Aitchison, which constructs the data matrix \(\tilde{\mathbf{X}}=(\log\frac{z_{ij}}{z_{ip}}) \in \mathbb{R}^{n\times (p-1)}\). Thus the regression model is \[ \mathbf{y} = \tilde{\mathbf{X}}\tilde{\boldsymbol\beta} + \boldsymbol{\varepsilon}, \quad \tilde{\boldsymbol\beta} \in \mathbb{R}^{p-1}. \]

  • By introducing an extra coefficient \(\beta_p = -\sum_{j=1}^{p-1}\beta_j\) and letting \(\mathbf{X}=(\log z_{ij})\), the above asymmetric model becomes symmetric: \[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad \boldsymbol{\beta} \in \mathbb{R}^{p}. \] (See the numerical sketch below.)
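To see the equivalence of the two parameterizations numerically, here is a minimal sketch (the toy compositions Z and coefficients β̃ below are made up, not the lecture’s data):

using LinearAlgebra, Random

Random.seed!(123)
n, p = 5, 4
Z = rand(n, p)
Z ./= sum(Z, dims=2)               # each row is a composition in the simplex

X̃ = log.(Z[:, 1:p-1] ./ Z[:, p])   # Aitchison log-ratio design matrix, n × (p-1)
X = log.(Z)                        # symmetric design matrix, n × p

β̃ = randn(p - 1)
β = [β̃; -sum(β̃)]                   # append β_p so that 1'β = 0
X̃ * β̃ ≈ X * β                      # true: both models give identical fits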

Zero-sum regression

  • In other words, we need to solve a zero-sum least squares problem
    \[ \begin{array}{ll} \text{minimize} & \frac{1}{2} \|\mathbf{y} - \mathbf{X} \boldsymbol{\beta}\|_2^2 \\ \text{subject to} & \mathbf{1}^T\boldsymbol{\beta} = 0. \end{array} \] For details, see

Lin, W., Shi, P., Feng, R. and Li, H., 2014. Variable selection in regression with compositional covariates. Biometrika, 101(4), pp.785-797. https://doi.org/10.1093/biomet/asu031

  • The sum-to-zero constrained least squares is a standard quadratic programming (QP) problem and can be solved easily by any QP solver.

  • For simplicity, we ignore the intercept and non-OTU covariates in this presentation.

  • Let’s first generate an artificial dataset:

using Random, LinearAlgebra, SparseArrays

Random.seed!(123) # seed

n, p = 100, 50
X = rand(n, p)
lmul!(Diagonal(1 ./ vec(sum(X, dims=2))), X) # normalize rows of X to sum to one (compositions)
β = sprandn(p, 0.1) # sparse vector with about 10% non-zero entries
y = X * β + randn(n);

Modeling using Convex.jl

We use the Convex.jl package to model this QP. Note how close the code is to the mathematical formulation above. For a complete list of operations supported by Convex.jl, see the Convex.jl documentation.

using Convex

β̂cls = Variable(size(X, 2))
problem = minimize(0.5sumsquares(y - X * β̂cls)) # objective
problem.constraints += sum(β̂cls) == 0; # sum-to-zero constraint
Mosek

We first use the Mosek solver to solve this QP.

using MosekTools, MathOptInterface  
const MOI = MathOptInterface

solver = Mosek.Optimizer()
MOI.set(solver, MOI.RawOptimizerAttribute("LOG"), 1)

@time solve!(problem, solver)
Problem
  Name                   :                 
  Objective sense        : minimize        
  Type                   : CONIC (conic optimization problem)
  Constraints            : 3               
  Affine conic cons.     : 2 (104 rows)
  Disjunctive cons.      : 0               
  Cones                  : 0               
  Scalar variables       : 53              
  Matrix variables       : 0               
  Integer variables      : 0               

Optimizer started.
Presolve started.
Linear dependency checker started.
Linear dependency checker terminated.
Eliminator started.
Freed constraints in eliminator : 0
Eliminator terminated.
Eliminator - tries                  : 1                 time                   : 0.00            
Lin. dep.  - tries                  : 1                 time                   : 0.00            
Lin. dep.  - primal attempts        : 1                 successes              : 1               
Lin. dep.  - dual attempts          : 0                 successes              : 0               
Lin. dep.  - primal deps.           : 0                 dual deps.             : 0               
Presolve terminated. Time: 0.00    
Optimizer  - threads                : 8               
Optimizer  - solved problem         : the dual        
Optimizer  - Constraints            : 52              
Optimizer  - Cones                  : 3               
Optimizer  - Scalar variables       : 106               conic                  : 106             
Optimizer  - Semi-definite variables: 0                 scalarized             : 0               
Factor     - setup time             : 0.00            
Factor     - dense det. time        : 0.00              GP order time          : 0.00            
Factor     - nonzeros before factor : 1328              after factor           : 1378            
Factor     - dense dim.             : 0                 flops                  : 3.11e+05        
ITE PFEAS    DFEAS    GFEAS    PRSTATUS   POBJ              DOBJ              MU       TIME  
0   2.5e+00  5.0e-01  2.0e+00  0.00e+00   -0.000000000e+00  -1.000000000e+00  1.0e+00  0.00  
1   2.5e-01  5.0e-02  5.6e-01  -9.04e-01  2.761646440e+00   8.939118846e+00   9.9e-02  0.01  
2   4.1e-02  8.2e-03  1.2e-01  -6.30e-01  1.477236633e+01   2.736777887e+01   1.6e-02  0.01  
3   2.0e-03  3.9e-04  6.4e-04  8.73e-01   2.173581519e+01   2.186559731e+01   7.9e-04  0.01  
4   1.2e-06  2.3e-07  1.1e-08  1.00e+00   2.178246695e+01   2.178258168e+01   4.7e-07  0.01  
5   1.2e-10  2.5e-11  1.6e-14  1.00e+00   2.178268480e+01   2.178268481e+01   4.8e-11  0.01  
Optimizer terminated. Time: 0.01    

  0.091715 seconds (6.57 k allocations: 1.959 MiB)
# Check the status, optimal value, and minimizer of the problem
problem.status, problem.optval, β̂cls.value
(MathOptInterface.OPTIMAL, 21.78268479990576, [8.821527219381636; 33.70209377680733; … ; -22.361993206260447; 12.468500909230388;;])
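Since this is an equality-constrained least squares problem, it also admits a closed-form solution via its KKT conditions, which provides a handy sanity check on the solver output. A sketch (βν stacks the coefficients with the multiplier ν of the zero-sum constraint):

# KKT system for min ½‖y - Xβ‖² s.t. 1'β = 0:
#   [ X'X  1 ] [β]   [X'y]
#   [ 1'   0 ] [ν] = [ 0 ]
A = [X'X ones(p); ones(1, p) 0]
b = [X'y; 0]
βν = A \ b
βkkt = βν[1:p]
norm(βkkt - vec(β̂cls.value))  # should be near zero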
Gurobi

After installing Gurobi, set up the environment variables. On my machine, I put the following two lines in the ~/.julia/config/startup.jl file:

ENV["GUROBI_HOME"] = "/Library/gurobi1003/macos_universal2/"
ENV["GRB_LICENSE_FILE"] = "/Users/jhwon/gurobi/gurobi.lic"

Switch to Gurobi solver:

using Gurobi

solver = Gurobi.Optimizer()
MOI.set(solver, MOI.RawOptimizerAttribute("OutputFlag"), 1)

@time solve!(problem, solver)
Set parameter Username
Academic license - for non-commercial use only - expires 2024-11-14
Gurobi Optimizer version 10.0.3 build v10.0.3rc0 (mac64[x86])

CPU model: Intel(R) Core(TM) i5-8279U CPU @ 2.40GHz
Thread count: 4 physical cores, 8 logical processors, using up to 8 threads

Optimize a model with 107 rows, 157 columns and 5160 nonzeros
Model fingerprint: 0x2f6aa477
Model has 2 quadratic constraints
Coefficient statistics:
  Matrix range     [1e-05, 2e+00]
  QMatrix range    [1e+00, 1e+00]
  Objective range  [1e+00, 1e+00]
  Bounds range     [0e+00, 0e+00]
  RHS range        [1e-02, 2e+00]
Presolve removed 2 rows and 1 columns
Presolve time: 0.00s
Presolved: 105 rows, 156 columns, 5158 nonzeros
Presolved model has 2 second-order cone constraints
Ordering time: 0.00s

Barrier statistics:
 Free vars  : 50
 AA' NZ     : 5.154e+03
 Factor NZ  : 5.262e+03
 Factor Ops : 3.590e+05 (less than 1 second per iteration)
 Threads    : 1

                  Objective                Residual
Iter       Primal          Dual         Primal    Dual     Compl     Time
   0   1.01905077e+01 -5.01000000e-01  2.07e+01 1.00e-01  1.81e-01     0s
   1   2.85279977e+00 -2.62345360e-01  3.98e+00 2.92e-05  3.52e-02     0s
   2   2.65921342e+00  4.72089921e+00  3.43e+00 3.59e-05  1.39e-02     0s
   3   9.46149510e+00  1.32723341e+01  1.36e+00 4.73e-06  1.54e-02     0s
   4   2.27842360e+01  7.96792805e+00  1.03e-02 7.55e-07  1.40e-01     0s
   5   2.29865080e+01  2.06571541e+01  1.13e-08 4.94e-07  2.20e-02     0s
   6   2.19990895e+01  2.17086651e+01  1.55e-10 1.92e-11  2.74e-03     0s
   7   2.17829284e+01  2.17825422e+01  4.19e-11 5.93e-09  3.63e-06     0s
   8   2.17826852e+01  2.17826847e+01  1.15e-09 1.12e-10  5.20e-09     0s

Barrier solved model in 8 iterations and 0.01 seconds (0.01 work units)
Optimal objective 2.17826852e+01


User-callback calls 72, time in user-callback 0.00 sec
  0.011574 seconds (7.64 k allocations: 2.404 MiB)
# Check the status, optimal value, and minimizer of the problem
problem.status, problem.optval, β̂cls.value
(MathOptInterface.OPTIMAL, 21.7826851961337, [8.821572026387795; 33.70227455280494; … ; -22.362109412726067; 12.468559726470138;;])
SCS

Switch to the open source SCS solver:

# Use SCS solver
using SCS

solver = SCS.Optimizer()
MOI.set(solver, MOI.RawOptimizerAttribute("verbose"), 1)

@time solve!(problem, solver)
------------------------------------------------------------------
           SCS v3.2.3 - Splitting Conic Solver
    (c) Brendan O'Donoghue, Stanford University, 2012
------------------------------------------------------------------
problem:  variables n: 53, constraints m: 107
cones:    z: primal zero / dual free vars: 2
      l: linear vars: 1
      q: soc vars: 104, qsize: 2
settings: eps_abs: 1.0e-04, eps_rel: 1.0e-04, eps_infeas: 1.0e-07
      alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1
      max_iters: 100000, normalize: 1, rho_x: 1.00e-06
      acceleration_lookback: 10, acceleration_interval: 10
lin-sys:  sparse-direct-amd-qdldl
      nnz(A): 5056, nnz(P): 0
------------------------------------------------------------------
 iter | pri res | dua res |   gap   |   obj   |  scale  | time (s)
------------------------------------------------------------------
     0| 1.10e+01  5.52e+00  4.15e+01  1.57e+01  1.00e-01  1.32e-03 
   150| 3.46e-04  3.18e-05  4.03e-05  2.18e+01  1.00e-01  3.49e-03 
------------------------------------------------------------------
status:  solved
timings: total: 3.49e-03s = setup: 1.26e-03s + solve: 2.23e-03s
     lin-sys: 1.97e-03s, cones: 4.26e-05s, accel: 4.61e-05s
------------------------------------------------------------------
objective = 21.783059
------------------------------------------------------------------
  0.005533 seconds (5.04 k allocations: 1.438 MiB)
# Check the status, optimal value, and minimizer of the problem
problem.status, problem.optval, β̂cls.value
(MathOptInterface.OPTIMAL, 21.7830788756544, [8.82152723106786; 33.70209380959717; … ; -22.361993226157978; 12.468500924793684;;])
COSMO

Switch to the open source COSMO solver:

# Use COSMO solver
using COSMO

solver = COSMO.Optimizer()
MOI.set(solver, MOI.RawOptimizerAttribute("max_iter"), 5000)

@time solve!(problem, solver)
------------------------------------------------------------------
          COSMO v0.8.8 - A Quadratic Objective Conic Solver
                         Michael Garstka
                University of Oxford, 2017 - 2022
------------------------------------------------------------------

Problem:  x ∈ R^{53},
          constraints: A ∈ R^{107x53} (5056 nnz),
          matrix size to factor: 160x160,
          Floating-point precision: Float64
Sets:     SecondOrderCone of dim: 101
          SecondOrderCone of dim: 3
          ZeroSet of dim: 2
          Nonnegatives of dim: 1
Settings: ϵ_abs = 1.0e-05, ϵ_rel = 1.0e-05,
          ϵ_prim_inf = 1.0e-04, ϵ_dual_inf = 1.0e-04,
          ρ = 0.1, σ = 1e-06, α = 1.6,
          max_iter = 5000,
          scaling iter = 10 (on),
          check termination every 25 iter,
          check infeasibility every 40 iter,
          KKT system solver: QDLDL
Acc:      Anderson Type2{QRDecomp},
          Memory size = 15, RestartedMemory,    
          Safeguarded: true, tol: 2.0
Setup Time: 1.0ms

Iter:   Objective:  Primal Res: Dual Res:   Rho:
1   -2.4722e+00 5.5123e+00  3.3167e+02  1.0000e-01
25   2.3312e+01 3.9932e-01  5.2608e+00  1.0000e-01
50   1.8682e+01 1.6042e-01  2.6250e-03  1.1189e-02
75   1.9742e+01 1.0033e-01  2.6586e-03  1.1189e-02
100  2.0476e+01 6.2798e-02  2.1036e-03  1.1189e-02
125  2.1783e+01 2.4121e-05  4.5987e-08  1.1189e-02

------------------------------------------------------------------
>>> Results
Status: Solved
Iterations: 180 (incl. 55 safeguarding iter)
Optimal objective: 21.78
Runtime: 0.005s (5.5ms)

  0.008572 seconds (9.01 k allocations: 4.325 MiB)
# Check the status, optimal value, and minimizer of the problem
problem.status, problem.optval, β̂cls.value
(MathOptInterface.OPTIMAL, 21.783219128578512, [8.821527236654916; 33.70209380229668; … ; -22.36199323318564; 12.46850092930839;;])

Hypatia

Switch to the open source Hypatia (pure Julia) solver:

# Use Hypatia solver
using Hypatia

solver = Hypatia.Optimizer()
MOI.set(solver, MOI.RawOptimizerAttribute("verbose"), 1)

@time solve!(problem, solver)

 iter        p_obj        d_obj |  abs_gap    x_feas    z_feas |      tau       kap        mu | dir_res     prox  step     alpha
    0   1.0000e+00  -2.0000e+00 | 1.05e+02  1.42e+00  6.10e-01 | 1.00e+00  1.00e+00  1.00e+00 |
    1   5.2739e+00   6.7502e+00 | 1.57e+01  1.10e+00  4.73e-01 | 1.93e-01  8.86e-01  1.50e-01 | 1.6e-15  7.6e-01  co-a  8.50e-01
    2   1.4674e+01   2.1404e+01 | 4.70e+00  8.72e-01  3.76e-01 | 7.31e-02  6.72e-01  4.48e-02 | 1.0e-14  7.7e-01  co-a  7.00e-01
    3   2.6469e+01   3.3048e+01 | 1.87e+00  5.18e-01  2.23e-01 | 4.92e-02  3.96e-01  1.78e-02 | 4.7e-15  8.5e-01  co-a  6.00e-01
    4   3.4927e+01   3.6432e+01 | 5.59e-01  1.62e-01  7.00e-02 | 4.71e-02  9.25e-02  5.32e-03 | 3.4e-15  6.9e-01  co-a  7.00e-01
    5   3.1820e+01   3.2115e+01 | 2.21e-01  5.26e-02  2.26e-02 | 5.82e-02  2.58e-02  2.10e-03 | 4.0e-15  9.1e-01  co-a  6.00e-01
    6   2.9466e+01   2.9625e+01 | 1.52e-01  2.88e-02  1.24e-02 | 7.43e-02  1.79e-02  1.45e-03 | 1.6e-13  8.7e-01  co-a  3.00e-01
    7   2.8219e+01   2.8318e+01 | 1.05e-01  1.83e-02  7.90e-03 | 8.18e-02  1.23e-02  1.00e-03 | 5.3e-14  7.1e-01  co-a  3.00e-01
    8   2.7285e+01   2.7333e+01 | 7.33e-02  1.12e-02  4.83e-03 | 9.35e-02  7.46e-03  6.98e-04 | 8.1e-14  4.5e-01  co-a  3.00e-01
    9   2.6206e+01   2.6220e+01 | 3.64e-02  4.73e-03  2.04e-03 | 1.11e-01  3.11e-03  3.47e-04 | 5.4e-13  9.2e-01  co-a  5.00e-01
   10   2.5648e+01   2.5653e+01 | 1.80e-02  2.09e-03  8.98e-04 | 1.26e-01  1.40e-03  1.72e-04 | 5.1e-13  8.7e-01  co-a  5.00e-01
   11   2.5499e+01   2.5502e+01 | 1.26e-02  1.43e-03  6.16e-04 | 1.28e-01  9.41e-04  1.20e-04 | 1.4e-12  5.8e-01  co-a  3.00e-01
   12   2.5293e+01   2.5294e+01 | 5.00e-03  5.44e-04  2.34e-04 | 1.35e-01  3.57e-04  4.77e-05 | 1.3e-12  6.5e-01  co-a  6.00e-01
   13   2.5216e+01   2.5216e+01 | 2.00e-03  2.13e-04  9.19e-05 | 1.38e-01  1.39e-04  1.90e-05 | 2.5e-12  9.5e-01  co-a  6.00e-01
   14   2.5191e+01   2.5191e+01 | 9.93e-04  1.06e-04  4.58e-05 | 1.38e-01  6.89e-05  9.45e-06 | 2.3e-12  7.2e-01  co-a  5.00e-01
   15   2.5176e+01   2.5176e+01 | 3.95e-04  4.26e-05  1.83e-05 | 1.38e-01  2.75e-05  3.76e-06 | 1.0e-12  8.9e-01  co-a  6.00e-01
   16   2.5170e+01   2.5170e+01 | 1.57e-04  1.71e-05  7.34e-06 | 1.38e-01  1.09e-05  1.50e-06 | 2.5e-12  7.5e-01  co-a  6.00e-01
   17   2.5167e+01   2.5167e+01 | 6.26e-05  6.85e-06  2.95e-06 | 1.37e-01  4.37e-06  5.96e-07 | 2.6e-12  7.2e-01  co-a  6.00e-01
   18   2.5166e+01   2.5166e+01 | 1.87e-05  2.06e-06  8.87e-07 | 1.37e-01  1.31e-06  1.78e-07 | 5.4e-12  6.6e-01  co-a  7.00e-01
   19   2.5166e+01   2.5166e+01 | 5.59e-06  6.20e-07  2.67e-07 | 1.37e-01  3.93e-07  5.33e-08 | 9.9e-11  6.7e-01  co-a  7.00e-01
   20   2.5166e+01   2.5166e+01 | 1.12e-06  1.24e-07  5.35e-08 | 1.36e-01  7.85e-08  1.06e-08 | 1.5e-10  7.6e-01  co-a  8.00e-01
   21   2.5166e+01   2.5166e+01 | 2.22e-07  2.50e-08  1.08e-08 | 1.36e-01  1.57e-08  2.11e-09 | 6.5e-10  7.7e-01  co-a  8.00e-01
   22   2.5166e+01   2.5166e+01 | 3.32e-08  3.74e-09  1.61e-09 | 1.36e-01  2.34e-09  3.16e-10 | 5.6e-10  7.2e-01  co-a  8.50e-01
optimal solution found; terminating

status is Optimal after 22 iterations and 0.282 seconds

  0.295836 seconds (61.44 k allocations: 7.147 MiB)
# Check the status, optimal value, and minimizer of the problem
problem.status, problem.optval, β̂cls.value
(MathOptInterface.OPTIMAL, 21.782684662699275, [8.821527227307735; 33.702093762570016; … ; -22.361993192180634; 12.468500936073355;;])

Zero-sum lasso

Suppose we want to know which organisms (OTUs) are associated with the response. We can answer this question using a zero-sum constrained lasso \[ \begin{array}{ll} \text{minimize} & \frac 12 \|\mathbf{y} - \mathbf{X} \boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1 \\ \text{subject to} & \mathbf{1}^T\boldsymbol{\beta} = 0. \end{array} \]

Varying \(\lambda\) from small to large values will generate a solution path.

# solve at a grid of λ
λgrid = 0:0.01:0.35
β̂path = zeros(length(λgrid), size(X, 2)) # each row is β̂ at a λ
β̂classo = Variable(size(X, 2))
@time for i in 1:length(λgrid)
    λ = λgrid[i]
    # objective
    problem = minimize(0.5sumsquares(y - X * β̂classo) + λ * sum(abs, β̂classo))
    # constraint
    problem.constraints += sum(β̂classo) == 0 # constraint
    # solver = Hypatia.Optimizer() # try different solvers
    # MOI.set(solver, MOI.RawOptimizerAttribute("verbose"), 0) # keep silent
    solver = COSMO.Optimizer() 
    MOI.set(solver, MOI.RawOptimizerAttribute("verbose"), false) # keep silent    
    solve!(problem, solver)
    β̂path[i, :] = β̂classo.value
end
  1.404733 seconds (1.71 M allocations: 310.095 MiB, 48.15% gc time, 0.96% compilation time)
using Plots; gr()

p = plot(collect(λgrid), β̂path, legend=:none)
xlabel!(p, "lambda")
ylabel!(p, "beta_hat")
title!(p, "Zero-sum Lasso")

Zero-sum group lasso

  • Suppose we want to do variable selection not at the OTU level, but at the phylum level. OTUs are clustered into various phyla. We can answer this question using a sum-to-zero constrained group lasso \[ \begin{array}{ll} \text{minimize} & \frac 12 \|\mathbf{y} - \mathbf{X} \boldsymbol{\beta}\|_2^2 + \lambda \sum_j \|\boldsymbol{\beta}_j\|_2 \\ \text{subject to} & \mathbf{1}^T\boldsymbol{\beta} = 0, \end{array} \] where \(\boldsymbol{\beta}_j\) is the \(j\)th partition of the regression coefficients corresponding to the \(j\)th phylum. This is a second-order cone programming (SOCP) problem readily modeled by Convex.jl.

Let’s assume each block of 10 contiguous OTUs belongs to one phylum.

# solve at a grid of λ
λgrid = 0.1:0.005:0.5
β̂pathgrp = zeros(length(λgrid), size(X, 2)) # each row is β̂ at a λ
β̂classo = Variable(size(X, 2))
@time for i in 1:length(λgrid)
    λ = λgrid[i]
    # loss
    obj = 0.5sumsquares(y - X * β̂classo)
    # group lasso penalty term
    for j in 1:(size(X, 2) ÷ 10)
        βj = β̂classo[(10(j-1)+1):10j]
        obj = obj + λ * norm(βj)
    end
    problem = minimize(obj)
    # constraint
    problem.constraints += sum(β̂classo) == 0 # constraint
    solver = Mosek.Optimizer()   # MOSEK this time!
    MOI.set(solver, MOI.RawOptimizerAttribute("LOG"), 0)  # keep silent
    solve!(problem, solver)
    β̂pathgrp[i, :] = β̂classo.value
end
  0.846394 seconds (679.84 k allocations: 170.177 MiB, 6.09% gc time)

It took Mosek less than a second to solve this seemingly hard optimization problem at 81 different \(\lambda\) values.

p2 = plot(collect(λgrid), β̂pathgrp, legend=:none)
xlabel!(p2, "lambda")
ylabel!(p2, "beta_hat")
title!(p2, "Zero-Sum Group Lasso")

Example: matrix completion

Load the \(128 \times 128\) Lena image with missing pixels.

using ImageIO, FileIO # detect file formats and dispatch to appropriate readers/writers

lena = load("lena128missing.png")
# convert to real matrices
Y = Float64.(lena)
128×128 Matrix{Float64}:
 0.0       0.0       0.635294  0.0       …  0.0       0.0       0.627451
 0.627451  0.623529  0.0       0.611765     0.0       0.0       0.388235
 0.611765  0.611765  0.0       0.0          0.403922  0.219608  0.0
 0.0       0.0       0.611765  0.0          0.223529  0.176471  0.192157
 0.611765  0.0       0.615686  0.615686     0.0       0.0       0.0
 0.0       0.0       0.0       0.619608  …  0.0       0.0       0.2
 0.607843  0.0       0.623529  0.0          0.176471  0.192157  0.0
 0.0       0.0       0.623529  0.0          0.0       0.0       0.215686
 0.619608  0.619608  0.0       0.0          0.2       0.0       0.207843
 0.0       0.0       0.635294  0.635294     0.2       0.192157  0.188235
 0.635294  0.0       0.0       0.0       …  0.192157  0.180392  0.0
 0.631373  0.0       0.0       0.0          0.0       0.0       0.0
 0.0       0.627451  0.635294  0.666667     0.172549  0.0       0.184314
 ⋮                                       ⋱  ⋮                   
 0.0       0.129412  0.0       0.541176     0.0       0.286275  0.0
 0.14902   0.129412  0.196078  0.537255     0.345098  0.0       0.0
 0.215686  0.0       0.262745  0.0          0.301961  0.0       0.207843
 0.345098  0.341176  0.356863  0.513725     0.0       0.0       0.231373
 0.0       0.0       0.0       0.0       …  0.0       0.243137  0.258824
 0.298039  0.415686  0.458824  0.0          0.0       0.0       0.258824
 0.0       0.368627  0.4       0.0          0.0       0.0       0.235294
 0.0       0.0       0.34902   0.0          0.0       0.239216  0.207843
 0.219608  0.0       0.0       0.0          0.0       0.0       0.2
 0.0       0.219608  0.235294  0.356863  …  0.0       0.0       0.0
 0.196078  0.207843  0.211765  0.0          0.0       0.270588  0.345098
 0.192157  0.0       0.196078  0.309804     0.266667  0.356863  0.0

We fill in the missing pixels using a matrix completion technique developed by Candès and Tao \[ \begin{array}{ll} \text{minimize} & \|\mathbf{X}\|_* \\ \text{subject to} & x_{ij} = y_{ij} \text{ for all observed entries } (i, j). \end{array} \]

Here \(\|\mathbf{X}\|_* = \sum_{i=1}^{\min(m,n)} \sigma_i(\mathbf{X})\) is the nuclear norm. It can be shown that \[ \|\mathbf{X}\|_* = \sup_{\|\mathbf{Y}\|_2 \le 1} \langle \mathbf{X}, \mathbf{Y} \rangle, \] where \(\|\mathbf{Y}\|_2=\sigma_{\max}(\mathbf{Y})\) is the spectral (operator 2-) norm, and \(\langle \mathbf{X}, \mathbf{Y} \rangle = \text{tr}(\mathbf{X}^T\mathbf{Y})\). That is, \(\|\cdot\|_*\) is the dual norm of \(\|\cdot\|_2\).
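These identities are easy to check numerically on a random matrix; a quick sketch using only LinearAlgebra (the matrix M below is arbitrary):

using LinearAlgebra

M = randn(5, 3)
sum(svdvals(M))        # nuclear norm ‖M‖_*
opnorm(M, 2)           # spectral norm ‖M‖₂ = σ_max(M)

# the supremum in the dual-norm formula is attained at Y = UVᵀ
U, σ, V = svd(M)
W = U * V'             # ‖W‖₂ = 1
tr(M' * W) ≈ sum(σ)    # true: ⟨M, W⟩ attains ‖M‖_*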

  • The nuclear norm can be considered the best convex approximation to \(\text{rank}(\mathbf{X})\), just like \(\|\mathbf{x}\|_1\) is the best convex approximation to \(\|\mathbf{x}\|_0\). We want the matrix with the lowest rank that agrees with the observed entries, but instead seek one with the minimal nuclear norm as a convex relaxation.

  • This is a semidefinite programming (SDP) problem readily modeled by Convex.jl.

  • This example takes a while to solve because of its high dimensionality.

# Use COSMO solver
using COSMO
solver = COSMO.Optimizer()
## Use SCS solver
# using SCS
# solver = SCS.Optimizer()
## Use Mosek solver
# using Mosek
# solver = Mosek.Optimizer()
## Use Hypatia solver
# using Hypatia
# solver = Hypatia.Optimizer()

# Linear indices of obs. entries
obsidx = findall(Y[:] .≠ 0.0)
# Create optimization variables
X = Convex.Variable(size(Y))
# Set up optimization problem
problem = minimize(nuclearnorm(X))
problem.constraints += X[obsidx] == Y[obsidx]
# Solve the problem by calling solve
@time solve!(problem, solver)
------------------------------------------------------------------
          COSMO v0.8.8 - A Quadratic Objective Conic Solver
                         Michael Garstka
                University of Oxford, 2017 - 2022
------------------------------------------------------------------

Problem:  x ∈ R^{49153},
          constraints: A ∈ R^{73665x49153} (73793 nnz),
          matrix size to factor: 122818x122818,
          Floating-point precision: Float64
Sets:     ZeroSet of dim: 40769
          DensePsdConeTriangle of dim: 32896 (256x256)
Settings: ϵ_abs = 1.0e-05, ϵ_rel = 1.0e-05,
          ϵ_prim_inf = 1.0e-04, ϵ_dual_inf = 1.0e-04,
          ρ = 0.1, σ = 1e-06, α = 1.6,
          max_iter = 5000,
          scaling iter = 10 (on),
          check termination every 25 iter,
          check infeasibility every 40 iter,
          KKT system solver: QDLDL
Acc:      Anderson Type2{QRDecomp},
          Memory size = 15, RestartedMemory,    
          Safeguarded: true, tol: 2.0
Setup Time: 113.52ms

Iter:   Objective:  Primal Res: Dual Res:   Rho:
1   -1.4426e+03 1.1678e+01  5.9856e-01  1.0000e-01
25   1.4495e+02 5.5360e-02  1.1033e-03  1.0000e-01
50   1.4754e+02 1.2369e-02  1.6744e-03  7.4179e-01
75   1.4797e+02 4.9490e-04  5.5696e-05  7.4179e-01
100  1.4797e+02 1.4115e-05  1.3438e-06  7.4179e-01

------------------------------------------------------------------
>>> Results
Status: Solved
Iterations: 100
Optimal objective: 148
Runtime: 2.193s (2193.48ms)

  7.279706 seconds (6.41 M allocations: 966.528 MiB, 5.33% gc time, 73.62% compilation time: 14% of which was recompilation)
using Images

colorview(Gray, X.value)

Nonlinear programming (NLP)

  • We use MLE of Gamma distribution to illustrate some rudiments of nonlinear programming (NLP) in Julia.

  • Let \(x_1,\ldots,x_m\) be a random sample from the gamma density with shape parameter \(\alpha\) and rate parameter \(\beta\): \[ \small f(x) = \Gamma(\alpha)^{-1} \beta^{\alpha} x^{\alpha-1} e^{-\beta x} \] on \((0,\infty)\). The log likelihood function is \[ \small L(\alpha, \beta) = m [- \ln \Gamma(\alpha) + \alpha \ln \beta + (\alpha - 1)\overline{\ln x} - \beta \bar x], \] where \(\overline{x} = \frac{1}{m} \sum_{i=1}^m x_i\) and \(\overline{\ln x} = \frac{1}{m} \sum_{i=1}^m \ln x_i\).

using Random, Statistics, SpecialFunctions
Random.seed!(280)

function gamma_logpdf(x::Vector, α::Real, β::Real)
    m = length(x)
    avg = mean(x)
    logavg = sum(log, x) / m
    m * (- loggamma(α) + α * log(β) + (α - 1) * logavg - β * avg)
end

x = rand(5)
gamma_logpdf(x, 1.0, 1.0)
-2.718862752561273
  • Many optimization algorithms involve taking derivatives of the objective function. The ForwardDiff.jl package implements automatic differentiation.

  • For example, we compute the gradient and Hessian of the log-likelihood with data x at α = 1.0 and β = 1.0:

using ForwardDiff
ForwardDiff.gradient(θ -> gamma_logpdf(x, θ...), [1.0; 1.0])
2-element Vector{Float64}:
 -2.147973668615522
  2.281137247438727
ForwardDiff.hessian(θ -> gamma_logpdf(x, θ...), [1.0; 1.0])
2×2 Matrix{Float64}:
 -8.22467   5.0
  5.0      -5.0
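As a cross-check, this gradient is also available in closed form: \(\partial L/\partial \alpha = m[-\psi(\alpha) + \ln \beta + \overline{\ln x}]\) and \(\partial L/\partial \beta = m[\alpha/\beta - \bar{x}]\), where \(\psi\) is the digamma function. A minimal sketch comparing the two (gamma_score is a name introduced here for illustration):

# analytical score (gradient) of the gamma log-likelihood;
# digamma comes from SpecialFunctions, loaded above
function gamma_score(x::Vector, α::Real, β::Real)
    m = length(x)
    m * [-digamma(α) + log(β) + sum(log, x) / m, α / β - mean(x)]
end

gamma_score(x, 1.0, 1.0)  # matches ForwardDiff.gradient above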

Generate data:

using Distributions, Random

Random.seed!(280)
(n, p) = (1000, 2)
(α, β) = 5.0 * rand(p)
x = rand(Gamma(α, β), n)
println("True parameter values:")
println("α = ", α, ", β = ", β)
True parameter values:
α = 2.4692583262857157, β = 4.487606332976687

We use JuMP.jl to model and solve our NLP problem.

Recall that we want to solve the optimization problem: \[ \small \begin{array}{ll} \text{maximize} & L(\alpha, \beta)= m [- \ln \Gamma(\alpha) + \alpha \ln \beta + (\alpha - 1)\overline{\ln x} - \beta \bar x] \\ \text{subject to} & \alpha \ge 0 \\ ~ & \beta \ge 0 \end{array} \]

Observe the similarities and differences compared with modeling in Convex.jl:

using JuMP

using NLopt # https://nlopt.readthedocs.io/en/latest/
m = Model(NLopt.Optimizer)
set_optimizer_attribute(m, "algorithm", :LD_MMA)

# using Ipopt  # https://github.com/coin-or/Ipopt
# m = Model(Ipopt.Optimizer)
# set_optimizer_attribute(m, "print_level", 3)

myf(a, b) = gamma_logpdf(x, a, b)
JuMP.register(m, :myf, 2, myf, autodiff=true)
@variable(m, α >= 1e-8)
@variable(m, β >= 1e-8)
@NLobjective(m, Max, myf(α, β))

print(m)
status = JuMP.optimize!(m)

println("MLE (JuMP):")
println("α = ", JuMP.value(α), ", β = ", JuMP.value(β))
println("Objective value: ", JuMP.objective_value(m))

\[ \begin{aligned} \max\quad & myf(α, β)\\ \text{Subject to} \quad & α \geq 1.0e-8\\ & β \geq 1.0e-8\\ \end{aligned} \]

MLE (JuMP):
α = 2.482264846433373, β = 0.22245088505669489
Objective value: -3229.1399326327714

Then convert the rate parameter to the scale parameter to compare the result with fit_mle() in the Distributions package:

println("α = ", JuMP.value(α), ", θ = ", 1 / JuMP.value(β))
println("MLE (Distribution package):")
println(fit_mle(Gamma, x))
α = 2.477061702598836, θ = 4.5057945965951935
MLE (Distribution package):
Gamma{Float64}(α=2.4773249047555272, θ=4.505157427525276)
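
As an alternative to an off-the-shelf NLP solver, we can also run Newton’s method by hand, reusing ForwardDiff for the derivatives. A minimal sketch taking full Newton steps with no safeguards, so it assumes a reasonable starting point (newton_gamma is a name introduced here for illustration):

using ForwardDiff, LinearAlgebra

function newton_gamma(x::Vector; θ0 = [1.0, 1.0], maxiter = 100, tol = 1e-8)
    L(θ) = gamma_logpdf(x, θ...)   # objective to maximize
    θ = copy(θ0)
    for _ in 1:maxiter
        g = ForwardDiff.gradient(L, θ)
        H = ForwardDiff.hessian(L, θ)
        Δ = -(H \ g)               # Newton step; ascends since H is negative definite
        θ += Δ
        norm(Δ) < tol && break
    end
    θ
end

newton_gamma(x)  # should be close to the JuMP/NLopt estimate above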