Only submit a `README.md` file that identifies you, this notebook file (`hw3.ipynb`) with your answers, and the HTML export of the notebook (`hw3.html`), along with all code (`*.jl`) necessary to reproduce the results. If you write a Julia function, it must be in a separate file, not in the notebook. No other files will be graded.
Before the due date, commit your master branch. The teaching assistant and the instructor will check out your committed master branch for grading. The commit time will be used as your submission time. That means if you commit your Homework 3 submission after the deadline, penalty points will be deducted for late submission according to the syllabus.
To save storage, do not put data files in the Git repository.
Start early. The time to be spent on coding is always underestimated.
This problem is about automatic recognition of handwritten digits. The dataset to be used is the famous MNIST dataset available at
https://www.kaggle.com/oddrationale/mnist-in-csv
The dataset `mnist_train.csv` contains $n = 60{,}000$ images for training. An image is a $28 \times 28$ gray-level array. Each image is stored as a line of the file in the format

`label, pix(1, 1), pix(1, 2), pix(1, 3), ..., pix(28, 28)`

where `label` is the digit represented by the image (0, 1, ..., 9), and `pix(r, c)` is the pixel intensity at row `r`, column `c` (an integer between 0 and 255, inclusive).
The $i$-th image in a dataset is considered a vector $\mathbf{x}_i \in \mathbb{R}^p$, $p = 28\times 28 = 784$.
A widely used approach to digit recognition is principal component analysis. Compute the sample covariance matrix $$ \hat{\Sigma} = \frac{1}{n}\sum_{i=1}^n\mathbf{x}_i\mathbf{x}_i^T $$ after standardization. Let $\mathbf{V} \in \mathbb{R}^{p\times r}$ be the matrix whose columns are the eigenvectors of $\hat{\Sigma}$ associated with the $r = 10$ largest eigenvalues.
Plot the $r$ principal components as $28 \times 28$ pixel images. Scale them so that they are clearly visible.
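A minimal Julia sketch of this step (the `CSV`, `DataFrames`, and `Plots` packages and the file location are assumptions, not requirements):

```julia
using CSV, DataFrames, LinearAlgebra, Statistics, Plots   # assumed package choices

train = CSV.read("mnist_train.csv", DataFrame)
X = Matrix{Float64}(train[:, 2:end])                 # n × 784; first column is the label
μ, σ = mean(X, dims=1), max.(std(X, dims=1), eps())  # guard against all-zero pixels
X = (X .- μ) ./ σ                                    # standardize

n, p = size(X)
Σhat = X' * X / n                                    # sample covariance after standardization
E = eigen(Symmetric(Σhat))                           # eigenvalues in ascending order
V = E.vectors[:, end:-1:end-9]                       # eigenvectors of the r = 10 largest eigenvalues

# each principal component reshaped to a 28 × 28 image; heatmap rescales values for visibility
pcs = [heatmap(reshape(V[:, k], 28, 28)', yflip=true, color=:grays,
               title="PC $k", axis=false, colorbar=false) for k in 1:10]
plot(pcs..., layout=(2, 5))
```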
Consider the case $r = 10$ and pick a random image $\mathbf{x}_i$ for each digit 0, 1, ..., 9 from `mnist_test.csv`, which contains $n = 10{,}000$ images for testing. Plot
$$
\mathbf{y}_i = \mathbf{V}^T\mathbf{x}_i,
$$
the loadings of each image $\mathbf{x}_i$ in terms of the principal components $\mathbf{v}_1,\dotsc,\mathbf{v}_r$. Discuss the possibility of using the loadings for digit recognition.
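Continuing the sketch above (`V`, `μ`, and `σ` are carried over from the training step; the grouped bar plots are just one way to display the loadings):

```julia
test = CSV.read("mnist_test.csv", DataFrame)
Xtest = (Matrix{Float64}(test[:, 2:end]) .- μ) ./ σ   # standardize with the training statistics
labels = test[:, 1]

# one random test image per digit; its loadings are y = Vᵀx
loadplots = map(0:9) do d
    i = rand(findall(==(d), labels))
    y = V' * Xtest[i, :]
    bar(1:10, y, title="digit $d", legend=false)
end
plot(loadplots..., layout=(2, 5))
```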
Prove the following:
(Nonnegative weighted sums) If $f$ and $g$ are convex and $\alpha$ and $\beta$ are nonnegative constants, then $\alpha f + \beta g$ is convex, with $\text{dom}(\alpha f + \beta g) = \text{dom}f \cap \text{dom}g$. (A sketch of the expected style of argument for this item appears after the list.)
(Composition with affine mapping) If $f$ is convex, then the composition $g(\mathbf{x}) = f(\mathbf{A}\mathbf{x} + \mathbf{b})$ of $f$ with the affine map $\mathbf{x}\mapsto \mathbf{A}\mathbf{x} + \mathbf{b}$ is convex, with $\text{dom}g=\{\mathbf{x}:\mathbf{Ax}+\mathbf{b}\in\text{dom}f\}$.
(Pointwise maximum and supremum) If $f_i$ is convex for each fixed $i=1,\dotsc,m$, then $g(\mathbf{x}) = \max_{i=1,\dotsc,m} f_i(\mathbf{x})$ is convex, with $\text{dom}g = \cap_{i=1}^m\text{dom}f_i$.
(Composition) For $h:\mathbb{R}\to\mathbb{R}\cup\{\infty\}$ and $g:V\to\mathbb{R}\cup\{\infty\}$, if $h$ is convex and nondecreasing, and $g$ is convex, then $f = h \circ g$ is convex, with $\text{dom}f = \{\mathbf{x}\in\text{dom}g: g(\mathbf{x})\in\text{dom}h\}$.
(Partial minimization) If $f(\mathbf{x},\mathbf{y})$ is jointly convex in $(\mathbf{x},\mathbf{y})$, then $g(\mathbf{x}) = \inf_{\mathbf{y}\in C} f(\mathbf{x},\mathbf{y})$ is convex provided it is proper and $C$ is nonempty convex, with $\text{dom}g = \{\mathbf{x}: (\exists\mathbf{y}\in C)(\mathbf{x},\mathbf{y})\in\text{dom}f\}$ (projection).
(Perspective) If $f(\mathbf{x})$ is convex and finite, then its perspective $g(\mathbf{x}, t)=t f(t^{-1}\mathbf{x})$ is convex, with $\text{dom}g=\{(\mathbf{x},t): t^{-1}\mathbf{x}\in\text{dom}f,~ t>0\}$.
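As an illustration of the kind of argument expected, here is a sketch for the first item only (the remaining items admit similar arguments). For $\mathbf{x}, \mathbf{y} \in \text{dom}f\cap\text{dom}g$ and $\theta\in[0,1]$,
$$
\begin{aligned}
(\alpha f + \beta g)(\theta\mathbf{x} + (1-\theta)\mathbf{y})
&= \alpha f(\theta\mathbf{x} + (1-\theta)\mathbf{y}) + \beta g(\theta\mathbf{x} + (1-\theta)\mathbf{y}) \\
&\le \alpha\left[\theta f(\mathbf{x}) + (1-\theta)f(\mathbf{y})\right] + \beta\left[\theta g(\mathbf{x}) + (1-\theta)g(\mathbf{y})\right] \\
&= \theta(\alpha f + \beta g)(\mathbf{x}) + (1-\theta)(\alpha f + \beta g)(\mathbf{y}),
\end{aligned}
$$
using the convexity of $f$ and $g$ and $\alpha, \beta \ge 0$; the right-hand side is finite exactly when $\mathbf{x}, \mathbf{y} \in \text{dom}f \cap \text{dom}g$, which gives the stated domain.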
We want to solve an LP $$ \begin{array}{ll} \text{minimize} & \mathbf{c}^T\mathbf{x} \\ \text{subject to} & \mathbf{A}\mathbf{x}=\mathbf{b} \\ & \mathbf{x} \ge \mathbf{0} \end{array} $$ with $$ \mathbf{A} = \begin{bmatrix} 2 & 0 & 0 & 1 & 0 & 0 \\ 0 & 2 & 0 & 0 & 1 & 0 \\ 0 & 0 & 2 & 0 & 0 & 1 \end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \quad \mathbf{c} = (-1, -1, -1, 0, 0, 0)^T. $$
Find the solution to the above LP by hand.
Find the solution to the above LP using the R package `linprog`.
Find the solution to the above LP using `Convex.jl` with your favorite solver.
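A minimal `Convex.jl` sketch for this LP (GLPK is just one possible choice of solver):

```julia
using Convex, GLPK

A = [2 0 0 1 0 0;
     0 2 0 0 1 0;
     0 0 2 0 0 1]
b = [1, 1, 1]
c = [-1, -1, -1, 0, 0, 0]

x = Variable(6)
problem = minimize(c' * x, [A * x == b, x >= 0])
solve!(problem, GLPK.Optimizer)

problem.optval        # optimal objective value
evaluate(x)           # optimal x
```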
Formulate the Dantzig selector problem as an LP.
Characterize $t_{\max}$ such that, for $t \ge t_{\max}$, the solution to the Dantzig selector problem does not vary.
Download the prostate cancer dataset from https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data.
The response variable is the log of prostate-specific antigen (`lpsa`). The predictors are log cancer volume (`lcavol`), log prostate weight (`lweight`), `age`, log of the amount of benign prostatic hyperplasia (`lbph`), seminal vesicle invasion (`svi`), log of capsular penetration (`lcp`), Gleason score (`gleason`), and percent of Gleason scores 4 or 5 (`pgg45`).
More information can be found in https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.info.txt; see also Section 3.2.1 of Elements of Statistical Learning by Hastie et al. (2009).
Standardize the predictors to have mean 0 and variance 1 and add an intercept. Then split the data into training and test sets in an 8 to 2 ratio. Fit the Dantzig selector for $t=0, 0.05t_{\max}, 0.1 t_{\max}, \dotsc, t_{\max}$ on the training data and plot the solution path. Also plot the prediction error on the test data over $t$.
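A `Convex.jl` sketch of the fitting loop, assuming the constrained form $\min_{\boldsymbol\beta} \|\mathbf{X}^T(\mathbf{y}-\mathbf{X}\boldsymbol\beta)\|_\infty$ subject to $\|\boldsymbol\beta\|_1 \le t$; if a different parameterization was covered in class, swap the objective and constraint accordingly. The names `Xtrain`, `ytrain`, and `tmax` are assumptions.

```julia
using Convex, GLPK

p = size(Xtrain, 2)
ts = range(0, tmax, length=21)            # t = 0, 0.05 t_max, …, t_max
path = zeros(p, length(ts))               # solution path, one column per t

for (k, t) in enumerate(ts)
    β = Variable(p)
    problem = minimize(norm(Xtrain' * (ytrain - Xtrain * β), Inf),
                       [norm(β, 1) <= t])  # consider excluding the intercept from the ℓ1 budget
    solve!(problem, GLPK.Optimizer)
    path[:, k] = vec(evaluate(β))
end
# plot the rows of `path` against ts; compute test-set MSE from Xtest * path[:, k] for each k
```

The same loop structure works for the lasso in the next part with the objective and constraint replaced by the corresponding lasso formulation.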
Repeat 3 for the lasso.
An alternative formulation of the support vector machine (SVM) is $$ \begin{array}{ll}\tag{P} \text{minimize} & \frac{1}{2}\|\boldsymbol\beta\|_2^2 + t\sum_{i=1}^n\xi_i \\ \text{subject to} & y_i(\mathbf{x}_i^T\boldsymbol{\beta}+\beta_0) \ge 1 - \xi_i, ~ i=1,\dotsc,n\\ & \xi_i \ge 0, ~ i=1, \dotsc, n \end{array} $$ for a tuning parameter $t \ge 0$. The optimization variables are $\beta_0, \boldsymbol{\beta}, \xi_1, \dotsc, \xi_n$.
Show that the above formulation is equivalent to the hinge loss formulation $$ \text{minimize}~ \sum_{i=1}^n \left[ 1 - y_i \left( \beta_0 + \mathbf{x}_i^T\boldsymbol{\beta} \right) \right]_+ + \lambda\|\boldsymbol\beta\|_2^2. $$
The dual of the SVM problem (P) is given by
$$
\begin{array}{ll}\tag{D}
\text{maximize} & \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_iy_j\mathbf{x}_i^T\mathbf{x}_j \\
\text{subject to} & \sum_{i=1}^n \alpha_i y_i = 0 \\
& 0 \le \alpha_i \le t, ~ i=1,\dotsc,n
\end{array}
$$
for variables $\alpha_1, \dotsc, \alpha_n$. Dual (D) has the same optimal value as (P). Furthermore, if the optimal variables for (D) are $\alpha_1^\star, \dotsc, \alpha_n^\star$, then the optimal variables for (P) are $\boldsymbol{\beta}^\star = \sum_{i=1}^n \alpha_i^\star y_i\mathbf{x}_i$ and $\beta_0^\star= y_i - \mathbf{x}_i^T\boldsymbol{\beta}^\star$ for any $i$ with $0 < \alpha_i^\star < t$.
Is (D) a convex optimization problem? If so, to which class of convex programs does it belong? Discuss the computational advantage of solving (D) over (P).
Download the South African heart disease dataset from https://web.stanford.edu/~hastie/ElemStatLearn/datasets/SAheart.data.
The response variable is the presence or absence of coronary heart disease (CHD) at the time of surgery, and the predictors are `sbp`, `tobacco`, `ldl`, `famhist`, `obesity`, `alcohol`, and `age`. More information can be found in https://web.stanford.edu/~hastie/ElemStatLearn/datasets/SAheart.info.txt; see also Section 4.4.2 of Elements of Statistical Learning by Hastie et al. (2009).
Standardize the predictors to have mean 0 and variance 1 and add an intercept. Then split the data into training and test sets in an 8 to 2 ratio. Fit the SVM at $\lambda = 0, 1, 2, \dotsc, 100$ on the training data and plot the solution path. Also plot the misclassification rates on the test data over $\lambda$.
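A `Convex.jl` sketch of the hinge-loss fit at a single $\lambda$ (the names `Xtrain` and `ytrain`, with labels coded as $\pm 1$, and the SCS solver are assumptions):

```julia
using Convex, SCS

# hinge-loss SVM at one value of λ; Xtrain is n × p, ytrain has entries in {-1, +1}
function fit_svm(Xtrain, ytrain, λ)
    n, p = size(Xtrain)
    β0 = Variable()
    β  = Variable(p)
    margins = ytrain .* (Xtrain * β) + ytrain * β0        # y_i (x_iᵀβ + β0)
    problem = minimize(sum(pos(ones(n) - margins)) + λ * sumsquares(β))
    solve!(problem, SCS.Optimizer)
    return evaluate(β0), vec(evaluate(β))
end
```

Looping this over $\lambda = 0, 1, \dotsc, 100$ and stacking the returned coefficients gives the solution path; the misclassification rate on the test set is the fraction of observations with $\operatorname{sign}(\hat\beta_0 + \mathbf{x}^T\hat{\boldsymbol\beta}) \ne y$.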
Repeat 3 for the dual (D).
A corrupted image is given:
Let $\mathbf{Y}\in\mathbb{R}^{128\times 128}$ represent the image. The isotropic total variation (TV) denoising minimizes $$ \frac{1}{2}\|\mathbf{Y}-\mathbf{X}\|_{\rm F}^2 + \lambda \sum_{i,j}\sqrt{(x_{i+1,j}-x_{ij})^2 + (x_{i,j+1}-x_{ij})^2}, \quad \lambda > 0, $$ where $\mathbf{X}=(x_{ij})$ is the optimization variable.
Show that this problem is an SOCP.
Solve the isotropic TV denoising for the given image and show the restored image.
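A `Convex.jl` sketch of the isotropic problem, assuming the corrupted image has been loaded into a matrix `Y` and using SCS as an example conic solver; the value of $\lambda$ is arbitrary here. (Indexing the expression pixel by pixel is simple but slow for a $128\times 128$ image; a vectorized reformulation is one possible speed-up.)

```julia
using Convex, SCS

m, n = size(Y)                  # 128 × 128 here
λ = 0.5                         # example value; tune as needed
X = Variable(m, n)

Dv = X[2:m, 1:n-1] - X[1:m-1, 1:n-1]       # vertical differences  x_{i+1,j} - x_{ij}
Dh = X[1:m-1, 2:n] - X[1:m-1, 1:n-1]       # horizontal differences x_{i,j+1} - x_{ij}

# isotropic TV: a Euclidean norm of the discrete gradient at each pixel (one SOC each)
tv_iso = sum(norm(vcat(Dv[i, j], Dh[i, j])) for i in 1:m-1, j in 1:n-1)

problem = minimize(0.5 * sumsquares(Y - X) + λ * tv_iso)
t_iso = @elapsed solve!(problem, SCS.Optimizer)
Xhat = evaluate(X)              # restored image
```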
An alternative approach is the anisotropic TV denoising that minimizes $$ \frac{1}{2}\|\mathbf{Y}-\mathbf{X}\|_{\rm F}^2 + \lambda \sum_{i,j}(|x_{i+1,j}-x_{ij}| + |x_{i,j+1}-x_{ij}|), \quad \lambda > 0, $$
Show that this problem is a QP.
Solve the anisotropic TV denoising for the given image and show the restored image. Compare the timing with the isotropic TV.
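Under the same assumptions, only the penalty changes for the anisotropic version (the names `X`, `Dv`, `Dh`, `λ`, and `t_iso` are carried over from the sketch above), and `@elapsed` gives the timing comparison:

```julia
# anisotropic TV: sum of absolute vertical and horizontal differences
tv_aniso = sum(abs(Dv)) + sum(abs(Dh))

problem_aniso = minimize(0.5 * sumsquares(Y - X) + λ * tv_aniso)
t_aniso = @elapsed solve!(problem_aniso, SCS.Optimizer)
Xhat_aniso = evaluate(X)

t_iso, t_aniso                  # compare solve times
```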
For $\mathbf{X}\in\mathbb{R}^{m\times n}$ and $t\in\mathbb{R}_+$, show that $\|\mathbf{X}\|_* \le t$ if and only if there exist $\mathbf{Y}\in\mathbb{R}^{m\times m}$ and $\mathbf{Z}\in\mathbb{R}^{n\times n}$ such that $$ \begin{bmatrix} \mathbf{Y} & \mathbf{X} \\ \mathbf{X}^T & \mathbf{Z} \end{bmatrix} \succeq \mathbf{0}, ~\text{and}~ \frac{1}{2}(\text{tr}\mathbf{Y} + \text{tr}\mathbf{Z}) \le t. $$
(Hint: Let the SVD of $\mathbf{X}$ be $\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ with $\boldsymbol{\Sigma}\in\mathbb{R}^{r\times r}$, where $r=\text{rank}(\mathbf{X})$. Note $$ \text{tr}\left( \begin{bmatrix} \mathbf{U}^T & -\mathbf{V}^T \end{bmatrix} \begin{bmatrix} \mathbf{Y} & \mathbf{X} \\ \mathbf{X}^T & \mathbf{Z} \end{bmatrix} \begin{bmatrix} \mathbf{U} \\ -\mathbf{V} \end{bmatrix} \right) \ge 0 $$ if $\begin{bmatrix} \mathbf{Y} & \mathbf{X} \\ \mathbf{X}^T & \mathbf{Z} \end{bmatrix} \succeq \mathbf{0}$.)
Formulate the matrix completion problem directly as an SDP.
Solve the matrix completion problem covered in class using the direct SDP formulation above.
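A `Convex.jl` sketch of the direct SDP formulation, assuming the observed entries are given as an index set `Ω` (a vector of `(i, j)` tuples) with observed values stored in `Aobs`; these names and the SCS solver are assumptions:

```julia
using Convex, LinearAlgebra, SCS

m, n = size(Aobs)                       # Aobs: matrix holding the observed entries
W = Semidefinite(m + n)                 # W = [Y X; Xᵀ Z] ⪰ 0, built as a single PSD variable
Y = W[1:m, 1:m]
X = W[1:m, m+1:m+n]                     # the upper-right block plays the role of X
Z = W[m+1:m+n, m+1:m+n]

# agree with the data on the observed entries
constraints = [X[i, j] == Aobs[i, j] for (i, j) in Ω]

problem = minimize(0.5 * (tr(Y) + tr(Z)), constraints)
solve!(problem, SCS.Optimizer)
Xhat = evaluate(X)                      # completed matrix
```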