M1399.000200 Homework 4

Due Dec 13, 2021 @ 11:59PM

Instructions

  1. Only submit a README.md file that identifies you, this notebook file (hw4.ipynb) with your answers, and the HTML export of the notebook (hw4.html), along with all code files (*.jl) necessary to reproduce the results. If you write a Julia function, it must be in a separate file, not in the notebook. No other files will be graded.

  2. Before the due date, commit your master branch. The teaching assistant and the instructor will check out your committed master branch for grading. The commit time will be used as your submission time; if you commit your Homework 4 submission after the deadline, penalty points will be deducted for late submission according to the syllabus.

  3. To save storage, do not put data files on Git.

  4. Start early. The time to be spent on coding is always underestimated.

Q1. SVM and Lagrange duality

An alternative formulation of the support vector machine (SVM) is $$ \begin{array}{ll}\tag{P} \text{minimize} & \frac{1}{2}\|\boldsymbol\beta\|_2^2 + t\sum_{i=1}^n\xi_i \\ \text{subject to} & y_i(\mathbf{x}_i^T\boldsymbol{\beta}+\beta_0) \ge 1 - \xi_i, ~ i=1,\dotsc,n\\ & \xi_i \ge 0, ~ i=1, \dotsc, n \end{array} $$ for a tuning parameter $t \ge 0$. The optimization variables are $\beta_0, \boldsymbol{\beta}, \xi_1, \dotsc, \xi_n$.

1) Show that the above formulation is equivalent to the hinge loss formulation $$ \text{minimize}~ \sum_{i=1}^n \left[ 1 - y_i \left( \beta_0 + \mathbf{x}_i^T\boldsymbol{\beta} \right) \right]_+ + \lambda\|\boldsymbol\beta\|_2^2. $$

2) Using the Lagrange duality, show that the dual of the SVM problem (P) is given by

$$ \begin{array}{ll}\tag{D} \text{maximize} & \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n\alpha_i\alpha_jy_iy_j\mathbf{x}_i^T\mathbf{x}_j \\ \text{subject to} & 0 \le \alpha_i \le t, ~ i=1,\dotsc,n\\ & \sum_{i=1}^n\alpha_iy_i = 0 \end{array} $$

for variables $\alpha_1, \dotsc, \alpha_n$.

Is (D) convex? If so, to which class of convex optimization problems does it belong? Discuss the computational advantages of solving (D) over (P).

3) Download the South African heart disease dataset from https://web.stanford.edu/~hastie/ElemStatLearn/datasets/SAheart.data. The response variable is the presence or absence of coronary heart disease (CHD) at the time of surgery, and the predictors are sbp, tobacco, ldl, famhist, obesity, alcohol, and age. More information can be found in https://web.stanford.edu/~hastie/ElemStatLearn/datasets/SAheart.info.txt; see also Section 4.4.2 of Elements of Statistical Learning by Hastie et al. (2009).

Standardize the predictors to have mean 0 and variance 1, and add an intercept. Then split the data into training and test sets in an 8:2 ratio. Fit the SVM at $\lambda = 0, 1, 2, \dotsc, 100$ on the training data and plot the solution path. Also plot the misclassification rate on the test data as a function of $\lambda$.
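
For orientation, one possible (not required) way to fit the hinge-loss formulation at a single $\lambda$ is via Convex.jl; the sketch below assumes Convex.jl and the ECOS solver are available, and that Xtrain is the standardized training design matrix with an intercept column prepended and ytrain holds the $\pm 1$ labels (all hypothetical names).

using Convex, ECOS   # assumption: Convex.jl and ECOS.jl are available

# Hinge-loss SVM at a single λ (sketch). Xtrain: n×(p+1) design matrix with
# the intercept as its first column; ytrain: vector of ±1 labels.
function svm_hinge(Xtrain, ytrain, λ)
    n, p1 = size(Xtrain)
    β = Variable(p1)                      # β[1] plays the role of β₀
    margins = (ytrain .* Xtrain) * β      # yᵢ(β₀ + xᵢᵀβ) as a vector expression
    obj = sum(pos(ones(n) - margins)) + λ * sumsquares(β[2:p1])
    problem = minimize(obj)
    solve!(problem, ECOS.Optimizer)
    return evaluate(β)
end

Calling such a routine for each $\lambda \in \{0, 1, \dotsc, 100\}$ and stacking the returned coefficient vectors is one way to trace the solution path.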

4) Repeat part 3 for the dual (D). Does (D) have the same optimal value as (P)?
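
Similarly, a minimal Convex.jl sketch of (D) at a given value of the tuning parameter $t$ uses the identity $\sum_i\sum_j \alpha_i\alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j = \|\mathbf{X}^T\operatorname{diag}(\mathbf{y})\boldsymbol\alpha\|_2^2$ to keep the quadratic term in a form Convex.jl accepts (variable names are again hypothetical; X here excludes the intercept column).

using Convex, ECOS   # assumption: same packages as above

# Dual SVM (D) at a given t (sketch). X: n×p predictor matrix (no intercept
# column), y: vector of ±1 labels.
function svm_dual(X, y, t)
    n = size(X, 1)
    A = X' .* y'                          # constant p×n matrix Xᵀ diag(y)
    α = Variable(n)
    obj = sum(α) - 0.5 * sumsquares(A * α)
    problem = maximize(obj, [α >= 0, α <= t, y' * α == 0])
    solve!(problem, ECOS.Optimizer)
    return evaluate(α), problem.optval
end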

Q2. SOCP

A corrupted image is given:

Let $\mathbf{Y}\in\mathbb{R}^{128\times 128}$ represent the image. The isotropic total variation (TV) denoising minimizes $$ \frac{1}{2}\|\mathbf{Y}-\mathbf{X}\|_{\rm F}^2 + \lambda \sum_{i,j}\sqrt{(x_{i+1,j}-x_{ij})^2 + (x_{i,j+1}-x_{ij})^2}, \quad \lambda > 0, $$ where $\mathbf{X}=(x_{ij})$ is the optimization variable.

1) Show that this problem is an SOCP.

2) Solve the isotropic TV denoising for the given image and show the restored image.
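
One direct (if slow, since it builds one second-order-cone atom per pixel) Convex.jl encoding is sketched below; it assumes the SCS solver is available, that Y holds the 128×128 corrupted image, and that the boundary is handled by summing only over indices for which both forward differences exist.

using Convex, SCS   # assumption: Convex.jl and SCS.jl are available

# Isotropic TV denoising (sketch). Y: 128×128 corrupted image, λ > 0.
function tv_iso(Y, λ)
    m, n = size(Y)
    X = Variable(m, n)
    # one ‖·‖₂ term per pixel with both forward differences defined
    tv = sum(norm(vcat(X[i+1, j] - X[i, j], X[i, j+1] - X[i, j]))
             for i in 1:m-1, j in 1:n-1)
    problem = minimize(0.5 * sumsquares(Y - X) + λ * tv)
    solve!(problem, SCS.Optimizer)
    return evaluate(X)
end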

An alternative approach is the anisotropic TV denoising that minimizes $$ \frac{1}{2}\|\mathbf{Y}-\mathbf{X}\|_{\rm F}^2 + \lambda \sum_{i,j}(|x_{i+1,j}-x_{ij}| + |x_{i,j+1}-x_{ij}|), \quad \lambda > 0. $$

3) Show that this problem is a QP.

4) Solve the anisotropic TV denoising for the given image and show the restored image. Compare the timing with the isotropic TV.
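
A matching Convex.jl sketch for the anisotropic problem is below (same assumptions as the isotropic sketch; a dedicated QP solver could be swapped in for the timing comparison).

using Convex, SCS   # assumption: same packages as the isotropic sketch

# Anisotropic TV denoising (sketch). Y: corrupted image, λ > 0.
function tv_aniso(Y, λ)
    m, n = size(Y)
    X = Variable(m, n)
    tv = sum(abs(X[2:m, 1:n] - X[1:m-1, 1:n])) +   # vertical differences
         sum(abs(X[1:m, 2:n] - X[1:m, 1:n-1]))     # horizontal differences
    problem = minimize(0.5 * sumsquares(Y - X) + λ * tv)
    solve!(problem, SCS.Optimizer)
    return evaluate(X)
end

Timing both routines with @time on the same image and the same λ gives the requested comparison.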

Q3. SDP

1) For $\mathbf{X}\in\mathbb{R}^{m\times n}$ and $t\in\mathbb{R}_+$, show that $\|\mathbf{X}\|_* \le t$ if and only if there exist $\mathbf{Y}\in\mathbb{R}^{m\times m}$ and $\mathbf{Z}\in\mathbb{R}^{n\times n}$ such that $$ \begin{bmatrix} \mathbf{Y} & \mathbf{X} \\ \mathbf{X}^T & \mathbf{Z} \end{bmatrix} \succeq \mathbf{0}, ~\text{and}~ \frac{1}{2}(\text{tr}\mathbf{Y} + \text{tr}\mathbf{Z}) \le t. $$

(Hint: Let the SVD of $\mathbf{X}$ be $\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ with $\boldsymbol{\Sigma}\in\mathbb{R}^{r\times r}$, where $r=\text{rank}(\mathbf{X})$. Note $$ \text{tr}\left( \begin{bmatrix} \mathbf{U}^T & -\mathbf{V}^T \end{bmatrix} \begin{bmatrix} \mathbf{Y} & \mathbf{X} \\ \mathbf{X}^T & \mathbf{Z} \end{bmatrix} \begin{bmatrix} \mathbf{U} \\ -\mathbf{V} \end{bmatrix} \right) \ge 0 $$ if $\begin{bmatrix} \mathbf{Y} & \mathbf{X} \\ \mathbf{X}^T & \mathbf{Z} \end{bmatrix} \succeq \mathbf{0}$.)

2) Formulate the matrix completion problem directly as an SDP.

3) Solve the matrix completion problem covered in class using the direct SDP formulation above.
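
A possible Convex.jl encoding of the direct SDP is sketched below (a sketch only, based on the characterization in part 1; Yobs and Ω are hypothetical names for the partially observed matrix used in class and a BitMatrix flagging the observed entries).

using Convex, SCS, LinearAlgebra   # assumption: SCS as the SDP solver

# Matrix completion as an explicit SDP (sketch).
# Yobs: m×n data matrix; Ω: BitMatrix with Ω[i, j] = true iff entry (i, j) is observed.
function matrix_completion_sdp(Yobs, Ω)
    m, n = size(Yobs)
    X = Variable(m, n)
    Y = Variable(m, m)
    Z = Variable(n, n)
    constraints = [[Y X; X' Z] ⪰ 0, Y == Y', Z == Z']
    for (i, j) in Tuple.(findall(Ω))
        push!(constraints, X[i, j] == Yobs[i, j])    # match observed entries
    end
    problem = minimize(0.5 * (tr(Y) + tr(Z)), constraints)
    solve!(problem, SCS.Optimizer)
    return evaluate(X), problem.optval
end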

4) Using the Lagrange duality, show that the dual problem of the SDP in part 2 is equivalent to the SDP $$ \begin{array}{ll} \text{minimize} & \langle \mathbf{W}, \mathbf{C} \rangle \\ \text{subject to} & \begin{bmatrix} \mathbf{I}_m & \mathbf{W} \\ \mathbf{W}^T & \mathbf{I}_n \end{bmatrix} \succeq \mathbf{0} \\ & \mathbf{W}_{ij} = 0, \quad \text{if}~c_{ij}~\text{is unobserved}, \end{array} $$ where $\mathbf{W} \in \mathbb{R}^{m\times n}$ is the optimization variable, $\mathbf{C}=(c_{ij})$ is the data matrix with $c_{ij} = 0$ if the $(i,j)$ entry is unobserved, and $\langle \mathbf{A}, \mathbf{B} \rangle = \text{trace}(\mathbf{A}^T\mathbf{B}) = \sum_{i,j} \mathbf{A}_{ij}\mathbf{B}_{ij}$. (Hint. The Lagrange multiplier for a linear matrix inequality $\mathcal{A}(x) \preceq \mathbf{0}$ is required to be symmetric and positive semidefinite.)

5) Solve the matrix completion problem covered in class using the dual SDP formulation directly. Do the optimal objective values coincide?

Q4. First-order lasso

Redo Q6 of Homework 3 for the lasso penalized linear regression on the prostate cancer dataset, but this time using

  1. proximal gradient method;
  2. FISTA; and
  3. coordinate descent.

Choose a fixed value of the penalty parameter $\lambda$, and plot the progress of the algorithm (objective value vs. iteration count) to compare the convergence behavior of the algorithms. For a fair comparison, treat a whole sweep of coordinate descent steps (updating all coordinates) as a single iteration.
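
For reference, a minimal sketch of the proximal gradient method is given below, assuming the objective is written as $\frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|_2^2 + \lambda\|\boldsymbol\beta\|_1$ (if your Homework 3 formulation scales the loss by $1/n$ or $1/(2n)$, the step size and the threshold change accordingly). FISTA adds a Nesterov momentum step on top of the same proximal update, and one coordinate descent sweep applies the same soft-thresholding coordinate by coordinate.

using LinearAlgebra

# Soft-thresholding, the proximal operator of γ‖·‖₁.
soft_threshold(z, γ) = sign(z) * max(abs(z) - γ, zero(z))

# Proximal gradient method for (1/2)‖y − Xβ‖₂² + λ‖β‖₁ (sketch; this scaling
# of the objective is an assumption).
function prox_grad_lasso(X, y, λ; maxiter = 1000)
    β = zeros(size(X, 2))
    s = 1 / opnorm(X)^2                   # step size 1/L with L = ‖X‖₂²
    obj = zeros(maxiter)
    for t in 1:maxiter
        g = X' * (X * β - y)              # gradient of the smooth part
        β = soft_threshold.(β .- s .* g, s * λ)
        obj[t] = 0.5 * sum(abs2, y - X * β) + λ * sum(abs, β)
    end
    return β, obj                         # obj can be plotted against 1:maxiter
end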

Grading will be based mostly on the correctness and efficiency of your implementation.

Q5. MNIST

In Homework 3, we tried automatic recognition of handwritten digits using principal component analysis (PCA). A more interpretable representation is obtained by non-negative matrix factorization (NMF). Use again the MNIST dataset. Let the data matrix be $\mathbf{X} = [\mathbf{x}_1 | \dotsb | \mathbf{x}_n]^T$, where $\mathbf{x}_i \in \mathbb{R}^d$ with $d=28\times 28 = 784$ and $n=60,000$.

NMF approximates $\mathbf{X} \in \mathbb{R}^{n \times d}$ with nonnegative entries $x_{ij}$ by a product of two low-rank matrices $\mathbf{V} \in \mathbb{R}^{n \times r}$ and $\mathbf{W} \in \mathbb{R}^{r \times d}$ with nonnegative entries $v_{ik}$ and $w_{kj}$. Consider minimization of the squared Frobenius norm $$ L(\mathbf{V}, \mathbf{W}) = \|\mathbf{X} - \mathbf{V} \mathbf{W}\|_{\text{F}}^2 = \sum_i \sum_j \left(x_{ij} - \sum_k v_{ik} w_{kj} \right)^2, \quad v_{ik} \ge 0, w_{kj} \ge 0, $$ which should lead to a good factorization.

A majorization-minimization (MM) algorithm gives multiplicative updates \begin{align*} v_{ik}^{(t+1)} &= v_{ik}^{(t)} \frac{\sum_j x_{ij} w_{kj}^{(t)}}{\sum_j b_{ij}^{(t)} w_{kj}^{(t)}}, \quad b_{ij}^{(t)} = \sum_k v_{ik}^{(t)} w_{kj}^{(t)}, \\ w_{kj}^{(t+1)} &= w_{kj}^{(t)} \frac{\sum_i x_{ij} v_{ik}^{(t+1)}}{\sum_i b_{ij}^{(t+1/2)} v_{ik}^{(t+1)}}, \quad b_{ij}^{(t+1/2)} = \sum_k v_{ik}^{(t+1)} w_{kj}^{(t)} \end{align*} that drive the objective $L^{(t)} = L(\mathbf{V}^{(t)}, \mathbf{W}^{(t)})$ downhill.
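
In matrix notation, one MM sweep multiplies the current factors elementwise by ratios of matrix products; a sketch (not necessarily the most memory-efficient implementation) is:

# One MM sweep of the multiplicative updates, in place (sketch).
# X: n×d data, V: n×r, W: r×d, all with nonnegative entries.
function mm_update!(V, W, X)
    V .*= (X * W') ./ (V * (W * W'))      # vᵢₖ ← vᵢₖ (XWᵀ)ᵢₖ / (VWWᵀ)ᵢₖ
    W .*= (V' * X) ./ ((V' * V) * W)      # wₖⱼ ← wₖⱼ (VᵀX)ₖⱼ / (VᵀVW)ₖⱼ, using the updated V
    return V, W
end

An efficient nnmf implementation would preallocate the intermediate products (XWᵀ, WWᵀ, VᵀX, VᵀV) and reuse the buffers across iterations rather than allocating new arrays each sweep.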

1) Derive the above update rule from the MM principle.

2) Implement the algorithm with the following interface:

function nnmf(
    X::Matrix{T},
    r::Integer;
    maxiter::Integer=1000, 
    tol::Number=1e-4,
    V::Matrix{T}=rand(T, size(X, 1), r),
    W::Matrix{T}=rand(T, r, size(X, 2))
    ) where T <: AbstractFloat
    # implementation
    # Output
    return V, W
end

Start your algorithm from $v_{ik}^{(0)} = w_{kj}^{(0)} = 1$ for all $i,j,k$ and use the stopping criterion $$ \frac{|L^{(t+1)} - L^{(t)}|}{|L^{(t)}| + 1} \le \text{tol}. $$

3) Apply the algorithm with $r = 3, 5, 10$ and report the run times using @time.

Efficiency (both speed and memory) as well as correctness of the implementation will be the most important criterion when grading this problem.

4) Consider the case $r=10$. Pick the image $\mathbf{x}_i$ used in Homework 3, Q1-2, and plot its NMF coefficients, i.e., the $i$th row of matrix $\mathbf{V}$. Compare the result with the PCA loadings of Homework 3, Q2-2.
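
One possible way to display the coefficients, assuming Plots.jl is available and that i is the index of the image chosen in Homework 3 (both assumptions), with V taken from the $r = 10$ fit above:

using Plots   # assumption: Plots.jl is available

# Bar plot of the r = 10 NMF coefficients of image i.
bar(V[i, :]; xlabel = "NMF component", ylabel = "coefficient", legend = false)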