M1399.000200 Homework 2

Due Oct 21, 2021 @ 11:59PM

Instructions

  1. Only submit a README.md file that identifies you, this notebook file (hw2.ipynb) with your answers, and the HTML export of the notebook (hw2.html), along with all code (*.jl) necessary to reproduce the results. No other files will be graded.

  2. Before the due date, commit your master branch. The teaching assistant and the instructor will check out your committed master branch for grading. The commit time will be used as your submission time, so if you commit your homework after the deadline, penalty points will be deducted for late submission according to the syllabus.

  3. To save storage, do not put data files in Git.

  4. Start early. The time needed for coding is always underestimated.

Q1. Triangular systems

Show the following facts about triangular matrices. A unit triangular matrix is a triangular matrix with all diagonal entries equal to 1.

  1. The product of two upper (lower) triangular matrices is upper (lower) triangular.

  2. The product of two unit upper (lower) triangular matrices is unit upper (lower) triangular.

  3. The inverse of a unit upper (lower) triangular matrix is unit upper (lower) triangular.

  4. An orthogonal upper (lower) triangular matrix is diagonal.
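
These are pencil-and-paper proofs, but a quick numerical sanity check in Julia (illustrative only, not a substitute for a proof) can help you convince yourself of the claims first:

    using LinearAlgebra

    # Sanity check (not a proof): unit upper triangular matrices are closed
    # under multiplication and inversion.
    U1 = UnitUpperTriangular(randn(4, 4))
    U2 = UnitUpperTriangular(randn(4, 4))
    P = Matrix(U1) * Matrix(U2)
    @show istriu(P)                        # product is upper triangular
    @show diag(P) ≈ ones(4)                # ... with unit diagonal
    @show diag(inv(Matrix(U1))) ≈ ones(4)  # so is the inverse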

Q2. LU and Cholesky

  1. Let $A\in\mathbb{R}^{n\times n}$ be nonsingular. Show that $A$ has an LU decomposition if and only if for each $k$ with $1\le k \le n$, the upper left $k\times k$ block A[1:k,1:k] is nonsingular. Prove that this LU decomposition is unique.

  2. Complete the proof of the Cholesky decomposition from the lecture note by showing that

    • $\mathbf{A}_{22}$ is positive definite, and
    • $\mathbf{A}_{22} - \mathbf{b} \mathbf{b}^T = \mathbf{A}_{22} - a_{11}^{-1} \mathbf{a} \mathbf{a}^T$ is positive definite of size $(n-1)\times(n-1)$.
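
    For reference, with the notation assumed from the lecture note, the proof partitions

    $$\mathbf{A} = \begin{pmatrix} a_{11} & \mathbf{a}^T \\ \mathbf{a} & \mathbf{A}_{22} \end{pmatrix}, \qquad \mathbf{b} = a_{11}^{-1/2} \mathbf{a},$$

    so that $\mathbf{b}\mathbf{b}^T = a_{11}^{-1}\mathbf{a}\mathbf{a}^T$, which is why the two expressions in the second bullet coincide.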

Q3. LU decomposition

Implement the LU decomposition in Julia.

  1. Write function LUdecomp! with interface

    LUdecomp!(A::Matrix{T}; tol::T=convert(T, 1e-8)) where T <: AbstractFloat
    

    The decomposition must be done in place. That is, if $A=LU \in \mathbb{R}^{n\times n}$, then $U$ should overwrite the upper triangular part of the input matrix A, and the strictly lower triangular part of A should be overwritten by the same part of the computed $L$ matrix. (Where does the diagonal part of $L$ go?) You must also implement partial pivoting: function LUdecomp! must return

    Tuple{Array{Int64,1},Int64}
    

    where the first element of the tuple is the array of permutation indices of the rows, and the second element is an indicator of success: if A is (numerically) singular, the function must return as its second value the row index at which singularity occurs (where may a singularity occur in the LU decomposition?); otherwise it must return 0. Use tol to determine singularity.

  2. Using the LUdecomp! function written above, write function LUsolve! that solves the linear equation $Ax = b$ with interface

    LUsolve!(A::Matrix{T}, b::Vector{T}) where T <: AbstractFloat
    

    again in place. That is, in addition to A being overwritten by LUdecomp!, vector b should be overwritten by the solution $A^{-1}b$. Then, write a wrapper function

    LUsolve(A::Matrix{T}, b::Vector{T}) where T <: AbstractFloat
    

    which alters neither A nor b but solves $Ax=b$ by calling LUsolve!. Compare your results with the Julia expression A \ b.

Write your functions in a separate .jl file within your branch and have your IJulia notebook include this file, as in the usage sketch below.
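
A minimal usage sketch of the intended workflow (the file name lu.jl and the test size are placeholders; LUdecomp! and LUsolve refer to your implementations with the interfaces above):

    include("lu.jl")                # placeholder name for your .jl file

    n = 100
    A = randn(n, n)
    b = randn(n)

    piv, info = LUdecomp!(copy(A))  # info == 0 signals success;
    @assert info == 0               # otherwise it is the offending row index

    x = LUsolve(A, b)               # non-mutating wrapper: A and b unchanged
    @assert x ≈ A \ b               # agrees with Julia's built-in solver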

Q4. QR decomposition

  1. Show that for every $\mathbf{X}\in\mathbb{R}^{n\times p}$, its full QR decomposition exists. Also show that the reduced QR decomposition of $\mathbf{X}$ is unique if $\mathbf{X}$ is of full column rank and the diagonal entries of $\mathbf{R}$ are required to be positive.

  2. From the lecture note on QR decomposition, explain why classical Gram-Schmidt (cgs()) fails with the given matrix A.

  3. From the same lecture note, explain why both classical and modified Gram-Schmidt (cgs() and mgs()) fail with the given matrix B.

  4. Implement the Householder QR decomposition in Julia. The algorithm should be in place: let the $\mathbf{R}$ matrix occupy the upper triangular part of the input $\mathbf{X}\in\mathbb{R}^{n\times p}$. Below the diagonal, place the vectors $\mathbf{u}_k$ that define the Householder transformation matrices $\mathbf{H}_k=\mathbf{I}-2\mathbf{u}_k\mathbf{u}_k^T/\mathbf{u}_k^T\mathbf{u}_k$. By setting the first element of each $\mathbf{u}_k$ to 1, you can fit these vectors into $\mathbf{X}$. The algorithm should return an additional vector storing the values $\mathbf{u}_k^T\mathbf{u}_k$. This is how the LAPACK routine geqrf is designed. Note that $\mathbf{Q}$ can be recovered from $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_p$. The function interface should be

    function householder!(X::Matrix{T}) where T<:AbstractFloat
    

    returning

    Tuple{Array{T,2},Array{T,1}}
    

    Write a separate routine

    function recoverQ(X::Matrix{T}) where T<:AbstractFloat
    

    that recovers $Q$. (Hint. For $\mathbf{P}_k=\mathbf{I}-2\mathbf{v}_k\mathbf{v}_k^T$ with $\|\mathbf{v}_k\|=1$, $\mathbf{P}_1\mathbf{P}_2 \cdots \mathbf{P}_{n} = \mathbf{I}- \mathbf{V}\mathbf{T}\mathbf{V}^T$, where $\mathbf{V}=[\mathbf{v}_1 | \mathbf{v}_2 | \dotsb | \mathbf{v}_n]$, for some upper triangular matrix $\mathbf{T}$.)
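
    One possible sketch of recoverQ, applying the reflections directly rather than building the matrix $\mathbf{T}$ from the hint. It assumes householder! stored each $\mathbf{u}_k$ below the diagonal of column $k$, with its leading entry 1 left implicit as described above:

    using LinearAlgebra

    function recoverQ(X::Matrix{T}) where T <: AbstractFloat
        n, p = size(X)
        Q = Matrix{T}(I, n, p)               # first p columns of I_n
        for k in p:-1:1                      # accumulate H_1 H_2 ⋯ H_p I[:, 1:p]
            u = [one(T); X[(k + 1):n, k]]    # rebuild u_k; leading 1 is implicit
            Qk = view(Q, k:n, :)             # H_k only touches rows k:n
            Qk .-= (2 / dot(u, u)) * u * (u' * Qk)  # apply H_k = I - 2uu'/u'u
        end
        return Q
    end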

Q5. Big regression

For a big data problem, you must choose the right method for the analysis to be possible at all. Visit https://doi.org/10.7910/DVN/HG7NV7. The website hosts data on 123 million flights over 22 years.

  1. The data for each year contain 29 variables. Assuming that all variables are double-precision floating point numbers, estimate the size of the design matrix ($\mathbf{X}$) needed to analyze the entire 22 years of flights. Will it fit into your computer's memory?

  2. We are interested in how the time gain of a flight, defined as Gain = DepDelay - ArrDelay, depends on the distance traveled (Distance), departure delay (DepDelay), and carrier (UniqueCarrier).

    We want to fit a linear regression Gain ~ 1 + Distance + DepDelay + UniqueCarrier using data from 2001-2008. Note UniqueCarrier is a factor with 23 levels: "9E", "AA", "AQ", "AS", "B6", "CO", "DH", "DL", "EV", "F9", "FL", "HA", "HP", "MQ", "NW", "OH", "OO", "TZ", "UA", "US", "WN", "XE", and "YV". We use dummy coding with "9E" as the base level.

    Estimate the size of the design matrix $\mathbf{X}$ in double precision.

  3. Review the lecture note on Linear Regression and choose one method in the table to solve the linear regression. The goal is to report the estimated regression coefficients $\widehat \beta$, estimated variance $\widehat \sigma^2 = \sum_i (y_i - \widehat y_i)^2 / (n - p)$, and standard errors of $\widehat \beta$.

  4. Download the data for years 2001-2008. Each year's data set is compressed in bz2 format (e.g., 2008.csv.bz2). Do not uncompress the files into your directory. Instead, load each compressed file into Julia's memory and uncompress it on the fly to read it line by line. Use TranscodingStreams.jl in conjunction with CodecBzip2.jl to accomplish this. Your code should look like

    using CodecBzip2

    for year in 2001:2008
        fn = string(year) * ".csv.bz2"
        stream = Bzip2DecompressorStream(open(fn))
        for line in eachline(stream)
            # do something...
        end
        close(stream)
    end
    # do something else...
    

    Please do not put data files in Git. For grading purposes (reproducibility), we will assume that the data files are in the same directory as the Jupyter notebook file.

    Find out how many data points are in each year; a counting sketch follows below.
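
    A minimal counting sketch building on the loop above (it assumes the files sit next to the notebook and that each file has exactly one header line):

    using CodecBzip2

    for year in 2001:2008
        fn = string(year) * ".csv.bz2"
        stream = Bzip2DecompressorStream(open(fn))
        n = countlines(stream) - 1   # subtract the single header line
        close(stream)
        println(year, ": ", n, " data points")
    end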

Hint: It took my laptop less than 10 minutes to import the data and fit the linear regression.
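
For orientation only, here is a heavily hedged sketch of one possible method: streaming accumulation of the normal equations, so the full design matrix $\mathbf{X}$ is never formed. Whether this is an appropriate choice from the table is for you to justify; the 1-based column positions (UniqueCarrier = 9, ArrDelay = 15, DepDelay = 16, Distance = 19) and the helper name accumulate_normal_eqs are assumptions/placeholders, not part of the assignment, and standard errors are left to you.

    using CodecBzip2, LinearAlgebra

    # Assumed carrier levels, in the order given in the problem statement.
    const CARRIERS = ["9E","AA","AQ","AS","B6","CO","DH","DL","EV","F9","FL",
                      "HA","HP","MQ","NW","OH","OO","TZ","UA","US","WN","XE","YV"]

    # Sketch: accumulate G = X'X and X'y line by line; G is only p × p.
    function accumulate_normal_eqs(years)
        p = 3 + length(CARRIERS) - 1   # intercept, Distance, DepDelay, 22 dummies
        G, xty, n = zeros(p, p), zeros(p), 0
        x = zeros(p)
        for year in years
            stream = Bzip2DecompressorStream(open(string(year) * ".csv.bz2"))
            for (i, line) in enumerate(eachline(stream))
                i == 1 && continue                   # skip the header line
                f = split(line, ',')
                any(f[k] == "NA" for k in (15, 16, 19)) && continue  # missing
                fill!(x, 0.0)
                x[1] = 1.0                           # intercept
                x[2] = parse(Float64, f[19])         # Distance (assumed column)
                x[3] = parse(Float64, f[16])         # DepDelay (assumed column)
                j = findfirst(==(f[9]), CARRIERS)
                j === nothing && continue            # unexpected carrier code
                j > 1 && (x[2 + j] = 1.0)            # dummy coding, "9E" base
                y = x[3] - parse(Float64, f[15])     # Gain = DepDelay - ArrDelay
                G .+= x .* x'                        # rank-1 update of X'X
                xty .+= y .* x
                n += 1
            end
            close(stream)
        end
        return G, xty, n
    end

    G, xty, n = accumulate_normal_eqs(2001:2008)
    β̂ = cholesky(Symmetric(G)) \ xty                 # solve the normal equations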