M1399.000200 Homework 4

Due Dec 14, 2020 @ 11:59PM

Instructions

  1. Only submit a README.md file that identifies you, this notebook file (hw4.ipynb) with your answers, and the html export of the notebook (hw4.html), along with all code files (*.jl) necessary to reproduce the results. If you write a Julia function, it must be in a separate file, not in the notebook. No other files will be graded.

  2. Before the due date, commit your master branch. The teaching assistant and the instructor will check out your committed master branch for grading. The commit time will be used as your submission time; if you commit your Homework 4 submission after the deadline, penalty points will be deducted for late submission according to the syllabus.

  3. To save storage, do not put data files on Git.

  4. Start early. The time to be spent on coding is always underestimated.

Q1. Lagrange duality

The Huber loss function is given by \begin{eqnarray*} \phi(r) = \begin{cases} r^2, & |r| \le M, \\ M(2|r| - M), & |r| > M. \end{cases} \end{eqnarray*} The robust regression problem using the Huber loss function solves \begin{eqnarray*} &\text{minimize}& \sum_{i=1}^n \phi(y_i - \mathbf{x}_i^T \beta) \end{eqnarray*} where $x_{i1} = 1$.

  1. Show that $$ \phi(r) = 2M \inf_{u+v=r} \{|v| + u^2/(2M) \}. $$

  2. Show that the robust regression problem above can be formulated as a QP \begin{eqnarray*} &\text{minimize}& \mathbf{u}^T \mathbf{u} + 2 M \mathbf{1}^T \mathbf{t} \\ &\text{subject to}& \mathbf{u} - \mathbf{t} \leq \mathbf{y} - \mathbf{X} \beta \leq \mathbf{u} + \mathbf{t} \end{eqnarray*} where the optimization variables are $\mathbf{u}, \mathbf{t} \in \mathbb{R}^n$ and $\beta \in \mathbb{R}^p$. Compare this formulation with that of the lecture notes. (A modeling-language sketch of this QP appears after this list.)

  3. Find the dual optimization problem of the above QP.
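
For a quick numerical sanity check of parts 2 and 3, the QP can be prototyped with a modeling language. Below is a minimal sketch assuming Convex.jl with the SCS solver (the exact solve! call may need adjusting for your Convex.jl version); the data X, y and the threshold M are synthetic placeholders, not part of the assignment.

    # Hedged sketch: prototype the Huber-loss QP of part 2 on synthetic data.
    # Assumes Convex.jl + SCS; all names (n, p, X, y, M) are illustrative.
    using Convex, SCS, Random

    Random.seed!(123)
    n, p = 100, 5
    X = [ones(n) randn(n, p - 1)]      # first column is the intercept (x_{i1} = 1)
    y = X * randn(p) + randn(n)
    M = 1.0                            # Huber threshold

    β = Variable(p)
    u = Variable(n)
    t = Variable(n)
    problem = minimize(sumsquares(u) + 2 * M * sum(t),
                       u - t <= y - X * β,
                       y - X * β <= u + t)
    solve!(problem, SCS.Optimizer)     # adjust this call to your Convex.jl version
    problem.optval                     # compare with the optimal value of the dual in part 3

The optimal value reported here can be compared against the optimal value of the dual you derive in part 3 as a check of strong duality.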

Q2. Conic programming

A convex cone $K$ in $\mathbb{R}^d$ is called proper if it is

  • closed;
  • pointed: $K \cap (-K) = \{\mathbf{0}\}$;
  • and has nonempty interior.

For a proper cone $K$, we define generalized inequality such that \begin{align*} \mathbf{x} &\succeq_K \mathbf{y} \iff \mathbf{x} - \mathbf{y} \in K \\ \mathbf{x} &\succ_K \mathbf{y} \iff \mathbf{x} - \mathbf{y} \in \text{int}K \\ \mathbf{x} &\preceq_K \mathbf{y} \iff \mathbf{y} - \mathbf{x} \in K \\ \mathbf{x} &\prec_K \mathbf{y} \iff \mathbf{y} - \mathbf{x} \in \text{int}K \\ \end{align*}

  1. Show that $\mathbb{R}_{+}^d = \{\mathbf{x}\in\mathbb{R}^d: x_i \ge 0,~i=1,\dotsc,d\}$ is a proper cone and that the generalized inequality for $\mathbb{R}_{+}^d$ is the usual componentwise inequality.

  2. Show that the generalized inequality for any proper cone $K$ defines a partial order. That is,

    • $\mathbf{x} \preceq_K \mathbf{x}$
    • $\mathbf{x} \preceq_K \mathbf{y} \preceq_K \mathbf{z} \implies \mathbf{x} \preceq_K \mathbf{z}$
    • $\mathbf{x} \preceq_K \mathbf{y}$ and $\mathbf{x} \succeq_K \mathbf{y} \implies \mathbf{x} = \mathbf{y}$.
  3. Conic programming refers to an optimization problem $$ \begin{array}{ll} \text{minimize} & \mathbf{c}^T\mathbf{x} \\ \text{subject to} & \mathbf{A}\mathbf{x} \preceq_K \mathbf{b} \end{array} $$ for a given proper cone $K \subset \mathbb{R}^d$. Discuss how a convex optimization problem $$ \begin{array}{ll} \text{minimize} & \mathbf{c}^T\mathbf{x} \\ \text{subject to} & \mathbf{A}\mathbf{x} - \mathbf{b} \in C, \end{array} $$ where $C$ is a closed convex set in $\mathbb{R}^{d'}$ that has nonempty interior and does not contain a straight line, can be converted to a conic programming problem. Specify the proper cone $K$ in terms of $C$. (Hint. Perspective transform.)

  4. Discuss how to construct a Lagrangian for the conic programming problem above. Show that the dual of the above conic programming problem is
    $$ \begin{array}{ll} \text{maximize} & -\mathbf{b}^T\mathbf{y} \\ \text{subject to} & \mathbf{A}^T\mathbf{y} + \mathbf{c} = \mathbf{0} \\ ~ & \mathbf{y} \preceq_{K^{\circ}} \mathbf{0}, \end{array} $$ where $$ K^{\circ}=\{\mathbf{y} \in \mathbb{R}^d: \mathbf{x}^T\mathbf{y} \le 0 ~\text{for all}~ \mathbf{x} \in K\} $$ is the polar cone of $K$.

  5. Show that the set $\mathbb{S}_+^d$ of $d\times d$ symmetric positive semidefinite matrices is a proper cone. What is its polar cone?

  6. In Homework 3, we showed that the matrix completion problem $$ \begin{array}{ll} \text{minimize} & \|\mathbf{X}\|_* \\ \text{subject to} & x_{ij} = y_{ij} ~\text{for all observed entries } (i, j) \end{array} $$ can be formulated as an SDP $$ \begin{array}{ll} \text{minimize} & \frac{1}{2}(\text{tr}\mathbf{Y} + \text{tr}\mathbf{Z}) \\ \text{subject to} & \begin{bmatrix} \mathbf{Y} & \mathbf{X} \\ \mathbf{X}^T & \mathbf{Z} \end{bmatrix} \succeq \mathbf{0}. \end{array} $$ Derive the dual problem of this SDP and show that it is also an SDP. Solve the matrix completion problem covered in class using the dual SDP formulation directly. Do the optimal objective values coincide?
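
To check the dual derivation in part 6 numerically, one can solve the primal SDP on a small instance and compare optimal values. Below is a minimal sketch assuming Convex.jl + SCS; the observation pattern Ω and matrix A are synthetic placeholders, not the dataset from class.

    # Hedged sketch: primal matrix-completion SDP on a toy instance, for comparing
    # its optimal value with the dual SDP you derive. All names are illustrative.
    using Convex, SCS, LinearAlgebra, Random

    Random.seed!(456)
    m, n = 6, 5
    A = randn(m, 2) * randn(2, n)              # low-rank "truth"
    Ω = findall(rand(m, n) .< 0.6)             # indices of observed entries

    W = Semidefinite(m + n)                    # W = [Y X; Xᵀ Z] ⪰ 0 as one PSD block
    Y = W[1:m, 1:m]
    X = W[1:m, (m + 1):(m + n)]
    Z = W[(m + 1):(m + n), (m + 1):(m + n)]
    constraints = [X[idx[1], idx[2]] == A[idx] for idx in Ω]
    problem = minimize(0.5 * (tr(Y) + tr(Z)), constraints)
    solve!(problem, SCS.Optimizer)             # adjust this call to your Convex.jl version
    problem.optval                             # compare with the dual optimal value

Modeling the whole block as a single semidefinite variable and slicing out Y, X, Z avoids writing the block PSD constraint explicitly; either formulation is equivalent.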

Q3. GP

Suppose a diagnostic test uses a random variable $X$ to diagnose a certain disease; if the value of $X$ is larger than a critical value $c$, the subject of the diagnosis is classified into the disease (positive) class; otherwise into the non-disease (negative) class. Each critical value corresponds to a classifier. Then, we can define the class-conditional survival functions for the classifier $$ S_{+}(c) = P_{+}(X > c), \quad S_{-}(c) = P_{-}(X > c) $$ where $P_{+}$ (resp. $P_{-}$) is the conditional distribution of $X$ given that the measurement is from the positive (resp. negative) class.

Then the true positive rate (TPR) and false positive rate (FPR) of the classifier are defined as $$ TPR(c) = S_{+}(c), \quad FPR(c) = S_{-}(c). $$ Finally, the receiver operating characteristic (ROC) curve is the parametric curve $\{(FPR(c), TPR(c)): c \in \mathbb{R}\}$.

The objective of this problem is the estimation of the ROC curve. The dataset consists of observed realizations of $X$ for each class: $\mathcal{D}=\cup_{i=+,-}\{x_{i1}, \dotsc, x_{in_i}\}$.

The empirical ROC curve is the plot of empirical FPR vs. TPR, i.e., $\{(\hat{S}_{-}(c), \hat{S}_{+}(c)): c \in \mathbb{R}\}$, where $\hat{S}_{+}$ and $\hat{S}_{-}$ are the empirical class-conditional survival functions. Assuming that the observations are independent, it can be shown that $(\hat{S}_{+}, \hat{S}_{-})$ maximizes the nonparametric likelihood of the data $$ \mathcal{L}(S_{+}, S_{-}) = \prod_{i=+, -}\prod_{j=1}^{m+1}[S_i(x_{j-1}) - S_i(x_j)]^{d_{ij}}, $$ where $x_1 < x_2 < \dotsb < x_m$ are ordered unique data points in $\mathcal{D}$, and $d_{ij}$ is the number of observations of class $i$ at $x_j$; we interpret $x_0 = -\infty$ and $x_{m+1}=\infty$.

  1. Show that the nonparametric maximum likelihood estimation (NPMLE) problem above can be cast as a GP. (Hint. Let $p_{ij} = S_i(x_j)/S_i(x_{j-1})$ and introduce auxiliary variables $q_{ij} = 1 - p_{ij} \ge 0$. How would you handle the equality constraints $p_{ij} + q_{ij} = 1$ involving posynomials?)

  2. Suppose now we want to add a shape constraint to the estimated ROC curve. Specifically, we want the curve to be concave. This constraint can be satisfied by adding inequality constraints to the likelihood function above: $$ \frac{S_{+}(x_j)-S_{+}(x_{j-1})}{S_{-}(x_j)-S_{-}(x_{j-1})} \le \frac{S_{+}(x_{j+1})-S_{+}(x_{j})}{S_{-}(x_{j+1})-S_{-}(x_{j})}, \quad j=1, \dotsc, m. $$ Show that the resulting concavity-constrained ROC curve estimation problem can be cast as a GP.

  3. Find 1) the empirical ROC curve and 2) the concavity-constrained ROC curve for the data

score class
0.9 +
0.8 +
0.7 -
0.6 +
0.55 +
0.5 +
0.45 -
0.4 +
0.35 +
0.3 -
0.27 +
0.2 -
0.18 -
0.1 +
0.02 -

using GP, and plot them on the (FPR, TPR) plane.
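
Part 1) does not require optimization; the empirical curve can be read off the class-conditional empirical survival functions directly. Below is a minimal sketch assuming Plots.jl; the concavity-constrained curve of part 2) still requires the GP.

    # Hedged sketch: empirical ROC curve for the data above (part 1 only).
    using Plots

    score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.45, 0.4, 0.35, 0.3, 0.27, 0.2, 0.18, 0.1, 0.02]
    class = ['+', '+', '-', '+', '+', '+', '-', '+', '+', '-', '+', '-', '-', '+', '-']

    pos = score[class .== '+']
    neg = score[class .== '-']
    cs  = [Inf; sort(unique(score); rev = true); -Inf]   # sweep of critical values c
    TPR = [count(pos .> c) / length(pos) for c in cs]    # empirical S₊(c)
    FPR = [count(neg .> c) / length(neg) for c in cs]    # empirical S₋(c)

    plot(FPR, TPR, marker = :circle, xlabel = "FPR", ylabel = "TPR",
         label = "empirical ROC", legend = :bottomright)

The concavity-constrained estimate from part 2) can be overlaid on the same axes for comparison.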

Q4. First-order lasso

Redo Q4 of Homework 3 for the lasso penalized linear regression on the prostate cancer dataset, but this time using

  1. proximal gradient method;
  2. FISTA; and
  3. coordinate descent.

Choose a fixed value of the penalty parameter $\lambda$, and plot the progress of each algorithm (objective value vs. iteration count) to compare their convergence behavior. For a fair comparison, treat a whole sweep of coordinate descent updates (updating all coordinates once) as a single iteration.
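
For concreteness, a minimal sketch of the proximal gradient (ISTA) iteration for a standard lasso objective $\frac{1}{2}\|\mathbf{y} - \mathbf{X}\beta\|_2^2 + \lambda \|\beta\|_1$ is given below; FISTA adds the usual momentum extrapolation, and coordinate descent replaces the full gradient step by cyclic single-coordinate updates. This is a sketch under those assumptions, not a reference solution; names (X, y, λ) are illustrative.

    # Hedged sketch: one ISTA run recording the objective value per iteration.
    using LinearAlgebra

    soft_threshold(z, τ) = sign(z) * max(abs(z) - τ, zero(z))   # prox of τ|·|

    function ista(X, y, λ; maxiter = 1000)
        β   = zeros(size(X, 2))
        s   = 1 / opnorm(X)^2                # step size 1/L with L = ‖X‖₂²
        obj = zeros(maxiter)
        for t in 1:maxiter
            g = X' * (X * β - y)             # gradient of the smooth part
            β = soft_threshold.(β - s * g, s * λ)
            obj[t] = 0.5 * sum(abs2, y - X * β) + λ * sum(abs, β)
        end
        return β, obj                        # plot obj vs. iteration count
    end

Recording the objective per iteration in each of the three algorithms gives directly comparable progress curves.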

Grading will be mostly based on the correctness and efficiency of your implementation.

Q5. MNIST

In Homework 3, we tried automatic recognition of handwritten digits using principal component analysis (PCA). A more interpretable representation is obtained by non-negative matrix factorization (NMF). Use again the MNIST dataset. Let the data matrix be $\mathbf{X} = [\mathbf{x}_1 | \dotsb | \mathbf{x}_n]^T$, where $\mathbf{x}_i \in \mathbb{R}^d$ with $d=28\times 28 = 784$ and $n=60,000$.

NMF approximates $\mathbf{X} \in \mathbb{R}^{n \times d}$ with nonnegative entries $x_{ij}$ by a product of two low-rank matrices $\mathbf{V} \in \mathbb{R}^{n \times r}$ and $\mathbf{W} \in \mathbb{R}^{r \times d}$ with nonnegative entries $v_{ik}$ and $w_{kj}$. Consider minimization of the squared Frobenius norm $$ L(\mathbf{V}, \mathbf{W}) = \|\mathbf{X} - \mathbf{V} \mathbf{W}\|_{\text{F}}^2 = \sum_i \sum_j \left(x_{ij} - \sum_k v_{ik} w_{kj} \right)^2, \quad v_{ik} \ge 0, w_{kj} \ge 0, $$ which should lead to a good factorization.

A majorization-minimization (MM) algorithm gives multiplicative updates \begin{align*} v_{ik}^{(t+1)} &= v_{ik}^{(t)} \frac{\sum_j x_{ij} w_{kj}^{(t)}}{\sum_j b_{ij}^{(t)} w_{kj}^{(t)}}, \quad b_{ij}^{(t)} = \sum_k v_{ik}^{(t)} w_{kj}^{(t)}, \\ w_{kj}^{(t+1)} &= w_{kj}^{(t)} \frac{\sum_i x_{ij} v_{ik}^{(t+1)}}{\sum_i b_{ij}^{(t+1/2)} v_{ik}^{(t+1)}}, \quad b_{ij}^{(t+1/2)} = \sum_k v_{ik}^{(t+1)} w_{kj}^{(t)} \end{align*} that drive the objective $L^{(t)} = L(\mathbf{V}^{(t)}, \mathbf{W}^{(t)})$ downhill.
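
In matrix notation the two updates reduce to elementwise multiply-and-divide operations. The sketch below shows one MM sweep under that reading; it is not the required nnmf interface, and the eps() guard against 0/0 is an added assumption.

    # Hedged sketch: one sweep of the multiplicative MM updates in matrix form.
    # V is n×r, W is r×d, X is n×d; .* and ./ are elementwise operations.
    function mm_step!(V, W, X)
        V .*= (X * W') ./ (V * (W * W') .+ eps())   # v_ik update with W fixed
        W .*= (V' * X) ./ ((V' * V) * W .+ eps())   # w_kj update with the new V
        return V, W
    end

The products X * W', W * W', V' * X, and V' * V dominate the cost, so forming them once per sweep (and avoiding the explicit n×d matrix B where possible) is what the efficiency criterion below rewards.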

  1. Derive the above update rule from the MM principle.

  2. Implement the algorithm with the following interface:

    function nnmf(
        X::Matrix{T},                          # n-by-d data matrix
        r::Integer;                            # target rank
        maxiter::Integer=1000,                 # maximum number of iterations
        tol::Number=1e-4,                      # relative convergence tolerance
        V::Matrix{T}=rand(T, size(X, 1), r),   # initial V (n-by-r)
        W::Matrix{T}=rand(T, r, size(X, 2))    # initial W (r-by-d)
        ) where T <: AbstractFloat
        # implementation
        # Output
        return V, W
    end
    

    Start your algorithm from $v_{ik}^{(0)} = w_{kj}^{(0)} = 1$ for all $i,j,k$ and use the stopping criterion $$ \frac{|L^{(t+1)} - L^{(t)}|}{|L^{(t)}| + 1} \le \text{tol}. $$

    Apply the algorithm with r=3, 5, 10 and report the run times, using @time.

    Efficiency (both speed and memory) as well as correctness of the implementation will be the most important criterion when grading this problem.

  3. Plot the $r$ rows of the resulting matrix $\mathbf{W}$ as $28\times 28$ images for the tried values of $r$ above. You may need to scale them to be visible. (A plotting sketch is given at the end of this problem.)

  4. Consider the case $r=10$. Pick the image $\mathbf{x}_i$ used in Homework 3, Q1-2, and plot its NMF coefficients, i.e., the $i$th row of matrix $\mathbf{V}$. Compare the result with the PCA loadings of Homework 3, Q1-2.
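
A minimal plotting sketch for part 3, assuming Plots.jl; the per-row rescaling and the transpose/flip for image orientation are illustrative choices, not part of the assignment.

    # Hedged sketch: display the r rows of W (each of length 784) as 28×28 images.
    using Plots

    function plot_factors(W)                     # W is r × 784
        r = size(W, 1)
        panels = [heatmap(reshape(W[k, :] ./ maximum(W[k, :]), 28, 28)',
                          yflip = true, color = :grays, framestyle = :none,
                          ticks = false, colorbar = false, title = "factor $k")
                  for k in 1:r]
        plot(panels...; layout = r)
    end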