\(\Lambda = \text{diag}(\lambda_1,\ldots,\lambda_n)\) collects the eigenvalues of \(\mathbf{A}\).
Columns of \(\mathbf{X}\) are the eigenvectors.
Not all matrices have EVD: \[
\mathbf{A} = \begin{bmatrix} 2 & 1 & 0 \\ 0 & 2 & 1 \\ 0 & 0 & 2 \end{bmatrix}
\] has only a single independent eigenvector \(\mathbf{e}_1 = (1, 0, 0)^T\) associated with the multiple eigenvalue 2.
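A quick numerical check in Julia (a sketch using the LinearAlgebra standard library; the rank computation is meaningful only up to roundoff):
```julia
using LinearAlgebra

# The 3×3 Jordan block above: eigenvalue 2 with algebraic multiplicity 3,
# but only one independent eigenvector.
A = [2.0 1.0 0.0; 0.0 2.0 1.0; 0.0 0.0 2.0]
F = eigen(A)
@show F.values            # three copies of the eigenvalue 2
@show rank(F.vectors)     # expected to be 1 (up to roundoff): A is defective
```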
In most statistical applications, we deal with eigenvalues/eigenvectors of symmetric matrices. The eigenvalues and eigenvectors of a real symmetric matrix are real.
Eigenvalue decomposition of a symmetric matrix: \(\mathbf{A} = \mathbf{U} \Lambda \mathbf{U}^T\), where
columns of \(\mathbf{U}\) are the eigenvectors, which are (or can be chosen to be) mutually orthonormal. Thus \(\mathbf{U}\) is an orthogonal matrix.
A real symmetric matrix is positive semidefinite (positive definite) if and only if all eigenvalues are nonnegative (positive).
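A short Julia sketch illustrating these facts on a random symmetric matrix (the shift constant below is arbitrary):
```julia
using LinearAlgebra, Random

Random.seed!(123)
A = Symmetric(randn(5, 5))
λ, U = eigen(A)                       # real eigenvalues, orthonormal eigenvectors
@show U' * U ≈ Matrix(I, 5, 5)        # U is orthogonal
@show U * Diagonal(λ) * U' ≈ A        # A = U Λ Uᵀ
# shifting A so that all eigenvalues become positive makes it positive definite
@show isposdef(A + (abs(minimum(λ)) + 1) * I)
```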
Schur decomposition: every square matrix \(\mathbf{A}\) can be decomposed \(\mathbf{A} = \mathbf{Q}\mathbf{T}\mathbf{Q}^*\)
\(\mathbf{T}\) is upper triangular, with eigenvalues on the diagonal.
\(\mathbf{Q}\) is unitary: \(\mathbf{Q}^*\mathbf{Q} = \mathbf{I}\).
Columns of \(\mathbf{Q}\) form a sequence of nested invariant subspaces with respect to \(\mathbf{A}\): let \(\mathbf{Q} = [\mathbf{q}_1 | \dotsb | \mathbf{q}_n ]\). Then \(\text{span}(\mathbf{A}\mathbf{q}_1, \dotsc, \mathbf{A}\mathbf{q}_k) \subseteq \text{span}(\mathbf{q}_1, \dotsc, \mathbf{q}_k)\) for each \(k\).
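In Julia, schur computes this factorization; for a real input it returns the real Schur form, in which \(\mathbf{T}\) is quasi-upper triangular (2×2 blocks for complex conjugate eigenvalue pairs). A small sketch:
```julia
using LinearAlgebra, Random

Random.seed!(1)
A = randn(4, 4)
F = schur(A)                    # real Schur form: A = Z * T * Z'
@show F.Z * F.T * F.Z' ≈ A
@show F.Z' * F.Z ≈ Matrix(I, 4, 4)
@show istriu(F.T, -1)           # T is quasi-upper triangular (zero below the subdiagonal)
```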
If \(\mathbf{A} \in \mathbb{R}^{n \times n}\) a square matrix (not necessarily symmetric), then \(\text{tr}(\mathbf{A}) = \sum_i \lambda_i\) and \(\det(\mathbf{A}) = \prod_i \lambda_i\).
Review: Singular Value Decomposition
For a rectangular matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\), let \(p = \min\{m,n\}\), then we have the SVD \[
\mathbf{A} = \mathbf{U} \Sigma \mathbf{V}^T,
\] where
\(\mathbf{U} = (\mathbf{u}_1,\ldots,\mathbf{u}_m) \in \mathbb{R}^{m \times m}\) is orthogonal, i.e. \(\mathbf{U}^T\mathbf{U} = \mathbf{U}\mathbf{U}^T = \mathbf{I}_m\).
\(\mathbf{V} = (\mathbf{v}_1,\ldots,\mathbf{v}_n) \in \mathbb{R}^{n \times n}\) is orthogonal, i.e. \(\mathbf{V}^T\mathbf{V} = \mathbf{V}\mathbf{V}^T = \mathbf{I}_n\).
\(\Sigma = [\text{diag}(\sigma_1, \ldots, \sigma_p)~\mathbf{0}]~\text{or}~[\text{diag}(\sigma_1, \ldots, \sigma_p); \mathbf{0}] \in \mathbb{R}^{m \times n}\), \(\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p \ge 0\). \(\sigma_i\) are called the singular values, \(\mathbf{u}_i\) are the left singular vectors, and \(\mathbf{v}_i\) are the right singular vectors.
Thin/skinny SVD. Assume \(m \ge n\). \(\mathbf{A}\) can be factored as \[
\mathbf{A} = \mathbf{U}_n \Sigma_n \mathbf{V}^T = \sum_{i=1}^n \sigma_i \mathbf{u}_i \mathbf{v}_i^T,
\] where \(\mathbf{U}_n = (\mathbf{u}_1,\ldots,\mathbf{u}_n) \in \mathbb{R}^{m \times n}\) has orthonormal columns and \(\Sigma_n = \text{diag}(\sigma_1, \ldots, \sigma_n)\).
Using thin SVD, \(\mathbf{A}^T \mathbf{A} = \mathbf{V} \Sigma^2 \mathbf{V}^T\) and \(\mathbf{A} \mathbf{A}^T = \mathbf{U} \Sigma^2 \mathbf{U}^T\).
In principle we can obtain SVD of \(\mathbf{A}\) by doing two EVDs.
In fact, using thin SVD, \[
\small
\begin{bmatrix} \mathbf{0}_{n \times n} & \mathbf{A}^T \\ \mathbf{A} & \mathbf{0}_{m \times m} \end{bmatrix}
= \frac{1}{\sqrt 2}
\begin{bmatrix} \mathbf{V} & \mathbf{V} \\ \mathbf{U} & -\mathbf{U} \end{bmatrix}
\begin{bmatrix} \Sigma & \mathbf{0}_{n \times n} \\ \mathbf{0}_{n \times n} & - \Sigma \end{bmatrix}
\frac{1}{\sqrt 2}
\begin{bmatrix} \mathbf{V}^T & \mathbf{U}^T \\ \mathbf{V}^T & - \mathbf{U}^T \end{bmatrix}.
\] Hence any symmetric EVD solver can produce the SVD of a matrix \(\mathbf{A}\) without forming \(\mathbf{A} \mathbf{A}^T\) or \(\mathbf{A}^T \mathbf{A}\).
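A quick Julia check of this identity (arbitrary small dimensions):
```julia
using LinearAlgebra, Random

Random.seed!(2)
m, n = 5, 3
A = randn(m, n)
M = [zeros(n, n) A'; A zeros(m, m)]           # (n+m) × (n+m), symmetric
λ = sort(eigvals(Symmetric(M)), rev = true)
@show λ[1:n] ≈ svdvals(A)                     # top n eigenvalues are the singular values
@show λ[end-n+1:end] ≈ -reverse(svdvals(A))   # bottom n eigenvalues are their negatives
```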
Applications of EVD and SVD
1. Principal Components Analysis
\(\mathbf{X} \in \mathbb{R}^{n \times p}\) is a centered data matrix.
Perform SVD \(\mathbf{X} = \mathbf{U} \Sigma \mathbf{V}^T\) or equivalently eigendecomposition \(\mathbf{X}^T \mathbf{X} = \mathbf{V} \Sigma^2 \mathbf{V}^T\). The linear combinations \(\tilde{\mathbf{x}}_i = \mathbf{X} \mathbf{v}_i\) are the principal components (PC) and have variance \(\sigma_i^2\).
Dimension reduction: reduce dimensionality \(p\) to \(q \ll p\). Use top PCs \(\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_q\) in visualization and downstream analysis.
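A minimal PCA sketch in Julia on simulated data (dimensions arbitrary):
```julia
using LinearAlgebra, Random, Statistics

Random.seed!(2024)
n, p, q = 100, 10, 2
X = randn(n, p)
Xc = X .- mean(X, dims = 1)        # center each column
F = svd(Xc)                         # thin SVD of the centered data matrix
PC = Xc * F.V[:, 1:q]               # scores of the top q principal components
@show F.S[1:q] .^ 2                 # squared singular values (PC "variances" up to scaling)
@show PC ≈ F.U[:, 1:q] * Diagonal(F.S[1:q])   # equivalent form of the scores
```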
2. Low rank approximation
Goal: Find a low rank approximation of the data matrix \(\mathbf{X}\), e.g., for image/data compression.
Eckart-Young theorem: \[
\min_{\text{rank}(\mathbf{Y})=r} \|\mathbf{X} - \mathbf{Y} \|_{\text{F}}^2
\] is achieved by \(\mathbf{Y} = \sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i^T\) with optimal value \(\sum_{i=r+1}^{p} \sigma_i^2\), where \((\sigma_i, \mathbf{u}_i, \mathbf{v}_i)\) are the singular values and vectors of \(\mathbf{X}\).
3. Moore-Penrose pseudoinverse
Using the thin SVD, \[
\mathbf{A}^{\dagger} = \mathbf{V} \Sigma^{\dagger} \mathbf{U}^T,
\] where \(\Sigma^{\dagger} = \text{diag}(\sigma_1^{-1}, \ldots, \sigma_r^{-1}, 0, \ldots, 0)\), \(r= \text{rank}(\mathbf{A})\). This is how the pinv function is implemented in Julia.
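A sketch of this construction in Julia; the truncation tolerance below is one common choice and an assumption on my part, not necessarily the exact rule used by pinv:
```julia
using LinearAlgebra, Random

Random.seed!(11)
A = randn(5, 3) * randn(3, 4)                  # 5×4 matrix of rank (at most) 3
F = svd(A)
tol = maximum(size(A)) * eps(Float64) * F.S[1] # truncation tolerance (an assumed choice)
Σdagger = [σ > tol ? 1 / σ : 0.0 for σ in F.S]
Adagger = F.V * Diagonal(Σdagger) * F.U'
@show A * Adagger * A ≈ A                      # a Moore-Penrose property
@show Adagger ≈ pinv(A; rtol = maximum(size(A)) * eps())  # same truncation rule, same result
```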
4. Least squares
When \(\mathbf{X}\) does not have full column rank, the Moore-Penrose pseudoinverse gives the minimum \(\ell_2\) norm solution to OLS (recall the Lecture Notes on Linear Regression).
To see this, let \(\mathbf{X} = \sum_{i=1}^{\min(n,p)}\sigma_i\mathbf{u}_i\mathbf{v}_i^T\) (full SVD), \(r=\text{rank}(\mathbf{X})\), and write \(\boldsymbol{\beta} = \sum_{j=1}^p \alpha_j\mathbf{v}_j\).
Then \(\mathbf{X}\boldsymbol{\beta} = \sum_{j=1}^{r} \sigma_j \alpha_j \mathbf{u}_j\), so \(\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2\) is minimized by \(\hat{\alpha}_j = \frac{1}{\sigma_j}\mathbf{u}_j^T\mathbf{y}\) for \(j=1,\dotsc, r\), while \(\hat{\alpha}_j\) is arbitrary for \(j>r\). The minimum \(\ell_2\) norm solution is obtained by setting \(\hat{\alpha}_j=0\) for \(j>r\), resulting in \[
\hat{\boldsymbol{\beta}} = \sum_{j=1}^r \frac{\mathbf{u}_j^T\mathbf{y}}{\sigma_j}\mathbf{v}_j = \mathbf{X}^{\dagger}\mathbf{y}.
\]
In general, the SVD is more expensive than other approaches (Cholesky, QR). In some applications, the SVD is computed for other purposes anyway; then we get the least squares solution for free.
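A small numerical check on a simulated rank-deficient design; here the backslash solve returns some least squares solution (via pivoted QR), generally not the minimum-norm one:
```julia
using LinearAlgebra, Random

Random.seed!(456)
n, p, r = 30, 10, 5
X = randn(n, r) * randn(r, p)        # design matrix with rank r < p
y = randn(n)
βmin = pinv(X) * y                   # minimum ℓ2-norm least squares solution
βqr  = X \ y                         # another least squares solution (pivoted QR)
@show norm(X' * (y - X * βmin))      # ≈ 0: normal equations hold
@show norm(X' * (y - X * βqr))       # ≈ 0: also a least squares solution
@show norm(βmin) ≤ norm(βqr)         # the pseudoinverse gives the shortest solution
```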
5. Ridge regression
In ridge regression, we minimize \[
\small
\|\mathbf{y} - \mathbf{X} \beta\|_2^2 + \lambda \|\beta\|_2^2,
\] where \(\lambda\) is a tuning parameter.
Ridge regression by augmented linear regression: Ridge regression problem is equivalent to \[
\small
\left\| \begin{pmatrix} \mathbf{y} \\ \mathbf{0}_p \end{pmatrix} - \begin{pmatrix}
\mathbf{X} \\ \sqrt \lambda \mathbf{I}_p
\end{pmatrix} \beta \right\|_2^2.
\] Therefore any methods for linear regression (e.g., LSQR) can be applied.
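A Julia sketch of the augmented formulation on simulated data, using the dense backslash solver as a stand-in for any least squares method:
```julia
using LinearAlgebra, Random

Random.seed!(789)
n, p, λ = 50, 5, 1.0
X, y = randn(n, p), randn(n)
Xaug = [X; sqrt(λ) * Matrix(I, p, p)]      # stack √λ·I_p below X
yaug = [y; zeros(p)]
βaug = Xaug \ yaug                          # ordinary least squares on augmented data
# agrees with the ridge normal equations (X'X + λI) β = X'y
@show βaug ≈ (X' * X + λ * I) \ (X' * y)
```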
Ridge regression by method of normal equation: The normal equation for the ridge problem is \[
\small
(\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I}_p) \beta = \mathbf{X}^T \mathbf{y}.
\] Therefore Cholesky decomposition can be used.
Ridge regression by SVD: If we obtain the (thin) SVD of \(\mathbf{X}\)\[
\small
\mathbf{X} = \mathbf{U} \Sigma_{p} \mathbf{V}^T,
\] then the normal equation reads \[
\small
\mathbf{V}(\Sigma^2 + \lambda \mathbf{I}_p) \mathbf{V}^T \beta = \mathbf{V}\Sigma \mathbf{U}^T \mathbf{y}.
\]
Solving gives \(\widehat \beta (\lambda) = \sum_{i=1}^p \frac{\sigma_i \mathbf{u}_i^T \mathbf{y}}{\sigma_i^2 + \lambda} \mathbf{v}_i\). It is then clear that \[
\small
\begin{eqnarray*}
\lim_{\lambda \to 0} \widehat \beta (\lambda) = \mathbf{X}^{\dagger}\mathbf{y},
\end{eqnarray*}
\] the minimum \(\ell_2\) norm solution, and \(\|\widehat \beta (\lambda)\|_2\) is monotone decreasing as \(\lambda\) increases.
Only one SVD is needed for all \(\lambda\), in contrast to the method of augmented linear regression or Cholesky.
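A sketch of the SVD approach on simulated data: one thin SVD of \(\mathbf{X}\) is reused for every \(\lambda\) on an arbitrary grid.
```julia
using LinearAlgebra, Random

Random.seed!(100)
n, p = 50, 5
X, y = randn(n, p), randn(n)
F = svd(X)                                          # one thin SVD, reused for all λ
Uty = F.U' * y
ridge(λ) = F.V * (F.S ./ (F.S .^ 2 .+ λ) .* Uty)    # β̂(λ) = Σ_i σ_i (u_i'y)/(σ_i²+λ) v_i
for λ in (0.0, 0.1, 1.0, 10.0)
    β = ridge(λ)
    println("λ = $λ:  norm of β̂(λ) = ", norm(β))    # decreasing in λ; λ = 0 gives X⁺y
end
```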
Algorithms for EVD
Any EVD solver must be iterative.
Abel (1824) showed that the roots of a general polynomial of degree 5 or higher with rational coefficients cannot be written as an expression involving only rational numbers, addition, subtraction, multiplication, division, and \(k\)th roots.
This means that no algorithm can produce the exact roots of the characteristic polynomial of a matrix, i.e., eigenvalues, in a finite number of steps.
A “direct” EVD solver refers to a solver that reduces a general matrix to a special form in a finite number of flops, and then applies an iterative method that converges to the EVD.
For ease of exposition, we restrict attention to real symmetric matrices, i.e., we assume \(\mathbf{A}=\mathbf{A}^T \in \mathbb{R}^{n \times n}\).
One eigen-pair: power iteration
Iterates according to \[
\small
\begin{eqnarray*}
\mathbf{x}^{(t)} &\gets& \frac{1}{\|\mathbf{A} \mathbf{x}^{(t-1)}\|} \mathbf{A} \mathbf{x}^{(t-1)}
\end{eqnarray*}
\] from an initial guess \(\mathbf{x}^{(0)}\) of unit norm.
Obviously, \(\mathbf{x}^{(t)} \propto \mathbf{A}^t \mathbf{x}^{(0)}\), hence the name. (We normalize so that \(\Vert \mathbf{x}^{(t)} \Vert = 1\)).
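A bare-bones power iteration in Julia (a sketch: synthetic test matrix with a dominant eigenvalue, simple relative-change stopping rule):
```julia
using LinearAlgebra, Random

function power_iteration(A; maxiter = 1000, tol = 1e-12)
    x = normalize!(randn(size(A, 1)))        # unit-norm initial guess
    λ = zero(eltype(A))
    for _ in 1:maxiter
        y = A * x
        λnew = dot(x, y)                     # Rayleigh quotient (since x has unit norm)
        x = normalize!(y)
        abs(λnew - λ) ≤ tol * abs(λnew) && return λnew, x
        λ = λnew
    end
    return λ, x
end

Random.seed!(5)
n = 100
Q = Matrix(qr(randn(n, n)).Q)
A = Symmetric(Q * Diagonal([10.0; randn(n - 1)]) * Q')   # dominant eigenvalue 10
λ1, u1 = power_iteration(A)
@show λ1                                     # ≈ 10
@show norm(A * u1 - λ1 * u1)                 # residual ‖Au − λu‖ (small when converged)
```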
Inverse iteration: Apply power iteration to \(\mathbf{A}^{-1}\) to find the eigen-pair of smallest absolute value. (May pre-compute LU or Cholesky of \(\mathbf{A}\)).
Shifted inverse iteration: Apply power iteration to \((\mathbf{A}-\mu\mathbf{I})^{-1}\) to find the eigen-pair closest to \(\mu\).
Converges linearly to an eigenvalue close to the pre-specified \(\mu\).
Rayleigh quotient iteration: Substitute the shift \(\mu\) in shifted inverse iteration with Rayleigh quotient \(\mathbf{v}^{(t) T}\mathbf{A}\mathbf{v}^{(t)}/\mathbf{v}^{(t) T}\mathbf{v}^{(t)}\).
Converges cubically to the eigen-pair closest to \(\mathbf{v}^{(0)}\).
Example: PageRank problem seeks top left eigenvector of transition matrix \(\mathbf{P}\) and costs \(O(n)\) per iteration.
Top \(r\) eigen-pairs: orthogonal iteration
Also known as simultaneous iteration or block power iteration.
Motivation
Rewrite the power method: \[
\begin{eqnarray*}
\mathbf{z}^{(t)} &\gets& \mathbf{A} \mathbf{x}^{(t-1)} \\
(\mathbf{x}^{(t)}, \Vert \mathbf{z}^{(t)}\Vert) &\gets& \texttt{qr}(\mathbf{z}^{(t)}) \quad (\|\mathbf{z}^{(t)}\|\mathbf{x}^{(t)} = \mathbf{z}^{(t)})
\end{eqnarray*}
\] The second step is the first iteration of the Gram-Schmidt QR.
Algorithm
Initialize \(\tilde{\mathbf{Q}}^{(0)} \in \mathbb{R}^{n \times r}\) with orthonormal columns. For \(t=1,2,\ldots\): \[
\small
\begin{aligned}
\mathbf{Z}^{(t)} &\gets \mathbf{A} \tilde{\mathbf{Q}}^{(t-1)} \\
\tilde{\mathbf{Q}}^{(t)} \tilde{\mathbf{R}}^{(t)} &\gets \mathbf{Z}^{(t)} \quad \text{(reduced QR)}
\end{aligned}
\]
Orthogonal iteration is a generalization of the power method to a higher dimensional invariant subspace.
It can be shown \(\tilde{\mathbf{Q}}^{(t)}\) converges to the eigenspace of the largest \(r\) eigenvalues if they are real and separated from remaining spectrum. The convergence rate is \(\max_{k=1,\dotsc,r}|\lambda_{k+1}|/|\lambda_k|\).
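A direct translation into Julia (a sketch on a synthetic matrix with known, well-separated spectrum; the Ritz values of \(\tilde{\mathbf{Q}}^{(t)T} \mathbf{A} \tilde{\mathbf{Q}}^{(t)}\) recover the top \(r\) eigenvalues):
```julia
using LinearAlgebra, Random

# orthogonal (simultaneous) iteration: multiply by A, then re-orthonormalize by QR
function orthogonal_iteration(A, r; maxiter = 1000)
    Q = Matrix(qr(randn(size(A, 1), r)).Q)      # random n×r orthonormal start
    for _ in 1:maxiter
        Q = Matrix(qr(A * Q).Q)                 # thin QR of A * Q
    end
    return Q
end

Random.seed!(123)
n, r = 50, 3
U = Matrix(qr(randn(n, n)).Q)
A = Symmetric(U * Diagonal(n:-1.0:1.0) * U')    # known spectrum n, n-1, …, 1
Q = orthogonal_iteration(A, r)
ritz = sort(eigvals(Symmetric(Q' * A * Q)), rev = true)
@show ritz                                      # ≈ [50.0, 49.0, 48.0]
```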
To see that orthogonal iteration is a generalization of the power method, observe that \[
\small
\begin{aligned}
\mathbf{A}\tilde{\mathbf{Q}}^{(0)} &= \mathbf{Z}^{(1)} = \tilde{\mathbf{Q}}^{(1)}\hat{\mathbf{R}}^{(1)}, \quad \hat{\mathbf{R}}^{(1)} = \tilde{\mathbf{R}}^{(1)} \\
\mathbf{A}^2\tilde{\mathbf{Q}}^{(0)} &= (\mathbf{A}\tilde{\mathbf{Q}}^{(1)})\hat{\mathbf{R}}^{(1)} \\
&= \mathbf{Z}^{(2)}\hat{\mathbf{R}}^{(1)} = \tilde{\mathbf{Q}}^{(2)}\tilde{\mathbf{R}}^{(2)}\hat{\mathbf{R}}^{(1)} = \tilde{\mathbf{Q}}^{(2)}\hat{\mathbf{R}}^{(2)} \\
\mathbf{A}^3\tilde{\mathbf{Q}}^{(0)} &= (\mathbf{A}\tilde{\mathbf{Q}}^{(2)})\hat{\mathbf{R}}^{(2)} \\
&= \mathbf{Z}^{(3)}\hat{\mathbf{R}}^{(2)} = \tilde{\mathbf{Q}}^{(3)}\tilde{\mathbf{R}}^{(3)}\hat{\mathbf{R}}^{(2)} = \tilde{\mathbf{Q}}^{(3)}\hat{\mathbf{R}}^{(3)} \\
& \vdots
\end{aligned}
\] So, \[
\small
\mathbf{A}^t\tilde{\mathbf{Q}}^{(0)} = \mathbf{Z}^{(t)}\hat{\mathbf{R}}^{(t-1)} = \tilde{\mathbf{Q}}^{(t)}\hat{\mathbf{R}}^{(t)},
\quad
\hat{\mathbf{R}}^{(t)} = \tilde{\mathbf{R}}^{(t)} \tilde{\mathbf{R}}^{(t-1)} \dotsb \tilde{\mathbf{R}}^{(1)}
.
\]
Since \(\hat{\mathbf{R}}^{(t)}\) is upper triangular (why?), \(\tilde{\mathbf{Q}}^{(t)}\hat{\mathbf{R}}^{(t)}\) is the reduced QR decomposition of \(\mathbf{A}^t\tilde{\mathbf{Q}}^{(0)}\), and the columns of \(\tilde{\mathbf{Q}}^{(t)}\) form an orthonormal basis of \(\text{col}(\mathbf{A}^t\tilde{\mathbf{Q}}^{(0)})\).
Recall the (unshifted) QR iteration: starting from \(\mathbf{A}^{(0)} = \mathbf{A}\), factor \(\mathbf{A}^{(t-1)} = \mathbf{Q}^{(t)} \mathbf{R}^{(t)}\) (QR decomposition) and set \(\mathbf{A}^{(t)} \gets \mathbf{R}^{(t)} \mathbf{Q}^{(t)}\). It can be shown that QR iteration is equivalent to the orthogonal iteration on \(\mathbf{A}\) starting with \(\tilde{\mathbf{Q}}^{(0)}=\mathbf{I}_n\). Specifically, \[
\small
\begin{split}
\tilde{\mathbf{R}}^{(t)} &= \mathbf{R}^{(t)} \\
\tilde{\mathbf{Q}}^{(t)} &= \mathbf{Q}^{(1)} \mathbf{Q}^{(2)} \dotsb \mathbf{Q}^{(t)} \\
\mathbf{A}^{(t)} &= (\tilde{\mathbf{Q}}^{(t)})^T\mathbf{A}\tilde{\mathbf{Q}}^{(t)}
.
\end{split}
\]
Also note that \[
\small
\mathbf{A}^t = \tilde{\mathbf{Q}}^{(t)}\hat{\mathbf{R}}^{(t)}
.
\]
Take \(r=n\) in the orthogonal iteration. Then \(\tilde{\mathbf{Q}}^{(t)}\) converges to \(\mathbf{U}\) up to sign changes. This implies that \[
\small
\mathbf{A}^{(t)} = (\tilde{\mathbf{Q}}^{(t)})^T \mathbf{A} \tilde{\mathbf{Q}}^{(t)}
\] converges to a diagonal form \(\boldsymbol{\Lambda} = \text{diag}(\lambda_1, \ldots, \lambda_n)\).
Linear convergence rate.
QR iteration with shifts
Algorithm
\(\mathbf{A}^{(0)} \gets \mathbf{A}\)
For \(t=1,2,\ldots\),
Pick a shift \(\mu_t\) (e.g., \(\mu_t = \mathbf{A}_{nn}^{(t-1)}\)), factor \(\mathbf{A}^{(t-1)} - \mu_t \mathbf{I} = \mathbf{Q}^{(t)} \mathbf{R}^{(t)}\), and set \(\mathbf{A}^{(t)} \gets \mathbf{R}^{(t)} \mathbf{Q}^{(t)} + \mu_t \mathbf{I}\)
Can be shown to be equivalent to simultaneous shifted inverse iteration starting with \[
\mathbf{Q}^{(0)} = \mathbf{P} = \begin{bmatrix} & & & 1 \\ & & 1 & \\ & \dotsc & & \\ 1 & & & \end{bmatrix}
\]
\(\mu_t = \mathbf{A}_{nn}^{(t-1)}\) is a good choice for the shift (and free to compute!), with cubic convergence (cf. the Rayleigh quotient shift).
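A toy illustration in Julia (a sketch only: practical implementations first tridiagonalize and deflate; the test matrix and iteration count are arbitrary), watching the last off-diagonal entry collapse:
```julia
using LinearAlgebra, Random

function shifted_qr_demo(A; iters = 5)
    T = Matrix(A)
    n = size(T, 1)
    for t in 1:iters
        μ = T[n, n]                           # the "free" shift discussed above
        F = qr(T - μ * I)
        T = F.R * Matrix(F.Q) + μ * I         # similarity transform; eigenvalues preserved
        println("iter $t: |T[n, n-1]| = ", abs(T[n, n-1]))
    end
    return T
end

Random.seed!(42)
A = Symmetric(randn(6, 6))
T = shifted_qr_demo(A)
@show minimum(abs, eigvals(A) .- T[end, end])  # T[n,n] is (nearly) an eigenvalue of A
```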
Practical QR iteration
Implemented in LAPACK: used by Julia, Matlab, R.
General QR iteration is expensive: \(O(n^3)\) per iteration.
Tridiagonalization (by Householder) + implicit shifted QR iteration on the tridiagonal matrix.
Direct phase: Householder tridiagonalization: \(4n^3/3\) for eigenvalues only, \(8n^3/3\) for both eigenvalues and eigenvectors. (Why can’t we apply Householder to make it diagonal directly?)
Iterative phase: Built on QR iteration with shifts. The tridiagonal structure makes each QR decomposition cheap (“implicit Q theorem”). On average 1.3-1.6 QR iterations per eigenvalue and \(\sim 20n\) flops per iteration, so the total operation count is about \(30n^2\). Eigenvectors require an extra \(\sim 6n^3\) flops.
| Phase | Eigenvalue | Eigenvector |
|---|---|---|
| Tridiagonal reduction | \(4n^3/3\) | + \(4n^3/3\) |
| Implicit shifted QR | \(\sim 30n^2\) | + \(\sim 6n^3\) |
Take-home message: Don’t request eigenvectors unless necessary. Use eigvals in Julia to request only eigenvalues.
Unsymmetric EVD
Reduction to the upper Hessenberg form (upper triangular + subdiagonal)
The unsymmetric QR algorithm obtains the real Schur decomposition of the Hessenberg matrix.
Implicit Q theorem still applies, but more expensive.
```julia
# requesting eigenvalues only is cheaper
@benchmark eigvals($A)
```
BenchmarkTools.Trial: 102 samples with 1 evaluation.
Range (min … max): 41.617 ms … 70.308 ms┊ GC (min … max): 0.00% … 6.26%
Time (median): 48.271 ms ┊ GC (median): 0.00%
Time (mean ± σ): 49.115 ms ± 5.018 ms┊ GC (mean ± σ): 1.41% ± 2.83%
▃▁ ▃ ▁ ▁ ▃█▆ ▁
▄▁██▆█▇█▆█▇▆▆██▇▇█▄▇▇▄▆▇█▆▆▇▆▇▁▆▁▆▄▄▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▄ ▄
41.6 ms Histogram: frequency by time 67.1 ms <
Memory estimate: 7.99 MiB, allocs estimate: 11.
```julia
# requesting eigenvectors requires extra work
@benchmark eigen($A)
```
BenchmarkTools.Trial: 27 samples with 1 evaluation.
Range (min … max): 176.221 ms … 213.026 ms┊ GC (min … max): 0.00% … 1.51%
Time (median): 186.644 ms ┊ GC (median): 0.00%
Time (mean ± σ): 190.981 ms ± 13.316 ms┊ GC (mean ± σ): 0.74% ± 0.80%
█ █ ▃
█▇▇▇▇█▁▁▁▁▁▁▁▁▇▇▇▇▁▁▇▇▇▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▇▁▇▁▁▁▁▇▇▁▁▁▁▇▁▇▁▁▇█ ▁
176 ms Histogram: frequency by time 213 ms <
Memory estimate: 23.25 MiB, allocs estimate: 13.
```julia
@benchmark eigvecs($A)  # same
```
BenchmarkTools.Trial: 27 samples with 1 evaluation.
Range (min … max): 174.911 ms … 339.120 ms┊ GC (min … max): 0.00% … 1.17%
Time (median): 179.500 ms ┊ GC (median): 0.00%
Time (mean ± σ): 188.660 ms ± 32.140 ms┊ GC (mean ± σ): 0.76% ± 0.81%
█▆▁
███▆▁▇▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
175 ms Histogram: frequency by time 339 ms <
Memory estimate: 23.25 MiB, allocs estimate: 13.
Algorithm for SVD
Assume \(\mathbf{A} \in \mathbb{R}^{m \times n}\) and we seek the SVD \(\mathbf{A} = \mathbf{U} \mathbf{D} \mathbf{V}^T\).
Golub-Kahan-Reinsch algorithm
Stage 1: Transform \(\mathbf{A}\) to an upper bidiagonal form \(\mathbf{B}\) (by Householder).
(Figures: bidiagonalization by Householder reflections.)
Stage 2: Apply implicit shifted QR iteration to the tridiagonal matrix \(\mathbf{B}^T\mathbf{B}\) without explicitly forming it. That is, \(\mathbf{B}^{(t)}\) is directly formed from \(\mathbf{B}^{(t-1)}\).
\(4m^2 n + 8mn^2 + 9n^3\) flops for a tall \((m \ge n)\) matrix.
```julia
using BenchmarkTools, LinearAlgebra, Random

Random.seed!(280)
n, p = 1000, 500
A = randn(n, p)
@benchmark svdvals(A)
```
BenchmarkTools.Trial: 127 samples with 1 evaluation.
Range (min … max): 35.415 ms … 49.818 ms┊ GC (min … max): 0.00% … 0.00%
Time (median): 38.556 ms ┊ GC (median): 0.00%
Time (mean ± σ): 39.411 ms ± 3.157 ms┊ GC (mean ± σ): 0.97% ± 2.85%
▁ ▃▄▃▃▃ ▁▄ █▃▃▃▁ ▁ ▁ ▁
█▇▄█████▆▆▆██▇█████▆▆█▆▆▆▆▄▁▁▁▁▁▆▆▁▁▄▁▄█▄▆▄█▄▄▁▁▄▁▆▁▄▆▄▄▁▁▄ ▄
35.4 ms Histogram: frequency by time 47.2 ms <
Memory estimate: 4.11 MiB, allocs estimate: 9.
```julia
@benchmark svd(A)
```
BenchmarkTools.Trial: 59 samples with 1 evaluation.
Range (min … max): 72.871 ms … 103.514 ms┊ GC (min … max): 0.00% … 3.23%
Time (median): 82.671 ms ┊ GC (median): 0.00%
Time (mean ± σ): 86.051 ms ± 9.452 ms┊ GC (mean ± σ): 1.44% ± 1.89%
▁ ▁ ▁ ▄▄ ▄▁▁▁ ██ ▁ ▁ ▁ ▁▁▁
█▁█▁▆▁▆▆█▁▆██▆████▆█▆█▆▆▆█▁▁▁▁▁▁▁▁▁▁▁▁▁▆▁▁▁▆▁▁▆▆▁█▆▁█▁▁▆▆███ ▁
72.9 ms Histogram: frequency by time 103 ms <
Memory estimate: 17.23 MiB, allocs estimate: 12.
Jacobi’s method for symmetric EVD
One of the oldest ideas for computing eigenvalues, by Jacobi in 1845.
Assume \(\mathbf{A} \in \mathbb{R}^{n \times n}\) is symmetric and we seek the EVD \(\mathbf{A} = \mathbf{U} \Lambda \mathbf{U}^T\).
Idea: diagonalize a 2x2 matrix by similarity transform using an orthogonal matrix:
\[
\small
\begin{bmatrix} c & -s \\ s & c \end{bmatrix}
\begin{bmatrix} a & d \\ d & b \end{bmatrix}
\begin{bmatrix} c & s \\ -s & c \end{bmatrix}
=
\begin{bmatrix} * & 0 \\ 0 & * \end{bmatrix}
\]
It can be easily seen that \(c=\cos\theta\) and \(s=\sin\theta\) satisfy \[
\small
\tan(2\theta) = \frac{2d}{b-a}
\] if \(a\neq b\), otherwise \(c=s=1/\sqrt{2}\).
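A quick check of the 2×2 rotation in Julia (the numbers a, b, d are arbitrary):
```julia
using LinearAlgebra

a, b, d = 1.0, 3.0, 2.0
θ = a == b ? π / 4 : atan(2 * d / (b - a)) / 2   # from tan(2θ) = 2d / (b − a)
c, s = cos(θ), sin(θ)
G = [c -s; s c]
@show G * [a d; d b] * G'                        # off-diagonal entries ≈ 0
```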
Classical Jacobi: search for the largest \(|a_{ij}|\) at each iteration, which costs \(O(n^2)\) effort per rotation!
Let \(\text{off}(\mathbf{A}) = \sum_{i \neq j} a_{ij}^2\) denote the off-diagonal sum of squares. If \(a_{ij}\) is the largest off-diagonal entry in magnitude, then \(\text{off}(\mathbf{A}) \le n(n-1) a_{ij}^2\); combined with \(\text{off}(\mathbf{B}) = \text{off}(\mathbf{A}) - 2 a_{ij}^2\) for the rotated matrix \(\mathbf{B}\), this implies \[
\text{off}(\mathbf{B}) \le \left( 1 - \frac{2}{n(n-1)} \right) \text{off}(\mathbf{A}).
\] Thus Jacobi’s method converges at least linearly.
In practice, a cyclic-by-row implementation is used to avoid the costly \(O(n^2)\) search of classical Jacobi; one sweep applies a rotation \(J(p, q, \theta)\) to each index pair in row order:
```julia
# one cyclic-by-row sweep visits the index pairs in this order
[J(p, q, theta) for p in 1:n-1 for q in p+1:n]
```
The Jacobi method has attracted a lot of recent attention because of its rich inherent parallelism (and it does not need a tridiagonal reduction).
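For concreteness, a compact and deliberately unoptimized cyclic-by-row Jacobi sweep in Julia; a serious implementation would update only rows/columns \(p\) and \(q\) in \(O(n)\) work per rotation instead of forming the full rotation matrix:
```julia
using LinearAlgebra, Random

function jacobi_eigvals(A; sweeps = 10)
    B = Matrix{Float64}(A)
    n = size(B, 1)
    for _ in 1:sweeps, p in 1:n-1, q in p+1:n       # cyclic-by-row sweeps
        B[p, q] == 0 && continue
        θ = B[p, p] == B[q, q] ? π / 4 : atan(2 * B[p, q] / (B[q, q] - B[p, p])) / 2
        c, s = cos(θ), sin(θ)
        G = Matrix{Float64}(I, n, n)                 # rotation in the (p, q) plane
        G[p, p] = c; G[q, q] = c; G[p, q] = -s; G[q, p] = s
        B = G * B * G'                               # full multiply here; O(n) if done in place
    end
    return sort(diag(B))                             # diagonal converges to the eigenvalues
end

Random.seed!(3)
A = Symmetric(randn(10, 10))
@show jacobi_eigvals(A) ≈ sort(eigvals(A))
```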
Krylov subspace methods for top eigen-pairs
State-of-art iterative methods for obtaining the top eigenvalues/vectors or singular values/vectors of large sparse or structured matrices.
Recall the PageRank problem: we want the top left eigenvector of a transition matrix \(\mathbf{P}\) of size \(10^9 \times 10^9\). Direct methods such as the (unsymmetric) QR algorithm or SVD take forever. Iterative methods such as the power method are feasible, but the power method may take a large number of iterations.
Motivation
Recall the Rayleigh quotient \[
r(\mathbf{x}) = \frac{\mathbf{x}^T\mathbf{A}\mathbf{x}}{\mathbf{x}^T\mathbf{x}}
\]
Note \(r(\mathbf{u}_i) = \lambda_i\).
Suppose \(\mathbf{q}_1, \dotsc, \mathbf{q}_n\) form an orthonormal basis of \(\mathbb{R}^n\). Define \[
M_k \triangleq \max_{\mathbf{u}\in\text{span}(\mathbf{q}_1,\dotsc,\mathbf{q}_k)} r(\mathbf{u}),
\quad
\text{and}
\quad
m_k \triangleq \min_{\mathbf{v}\in\text{span}(\mathbf{q}_1,\dotsc,\mathbf{q}_k)} r(\mathbf{v}).
\]
The goal is to choose \(\mathbf{q}_1, \mathbf{q}_2, \dotsc\) so that \(M_k \approx \lambda_1\).
Consider the gradient of the Rayleigh quotient: \[
\nabla r(\mathbf{x}) = \frac{2}{\mathbf{x}^T\mathbf{x}}(\mathbf{A}\mathbf{x} - r(\mathbf{x})\mathbf{x}).
\]
Suppose \(M_k = r(\mathbf{v}_k)\) for some \(\mathbf{v}_k \in \text{span}(\mathbf{q}_1,\dotsc,\mathbf{q}_k)\). If \(\nabla r(\mathbf{v}_k) = 0\), we are done. Otherwise, it is reasonable to choose the next vector \(\mathbf{q}_{k+1}\) so that \[
\nabla r(\mathbf{v}_k) \in \text{span}(\mathbf{q}_1,\dotsc,\mathbf{q}_k, \mathbf{q}_{k+1})
\] because \(r\) increases most rapidly in the direction of its gradient.
Since \(\nabla r(\mathbf{v}_k) \in \text{span}(\mathbf{v}_k, \mathbf{A}\mathbf{v}_k)\), the above proposal is satisfied if \[
\text{span}(\mathbf{q}_1,\dotsc,\mathbf{q}_k) = \text{span}(\mathbf{q}_1, \mathbf{A}\mathbf{q}_1, \dotsc, \mathbf{A}^{k-1}\mathbf{q}_1) = \mathcal{K}^k(\mathbf{A}, \mathbf{q}_1).
\]
Finding \(\mathbf{q}_1, \mathbf{q}_2, \dotsc\) amounts to computing orthonormal bases for \(\mathcal{K}^k(\mathbf{A}, \mathbf{q}_1)\) (Gram-Schmidt).
The eigenvalues of the \(k\times k\) matrix \(\mathbf{Q}_k^T\mathbf{A}\mathbf{Q}_k\) (either tridiagonal or upper Hessenberg), called the Ritz values, approximate the top \(k\) eigenvalues of \(\mathbf{A}\).
Instances
Lanczos method: top eigen-pairs of a large symmetric matrix. (Use tridiagonal reduction \(\mathbf{Q}^T\mathbf{A}\mathbf{Q} = \mathbf{T}\).)
Arnoldi method: top eigen-pairs of a large asymmetric matrix. (Use Hessenberg reduction \(\mathbf{Q}^T\mathbf{A}\mathbf{Q} = \mathbf{H}\).)
Both methods are also adapted to obtain top singular values/vectors of large sparse or structured matrices.
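A bare-bones Lanczos sketch in Julia (no reorthogonalization or restarting; the sparse test matrix and the choice of \(k = 50\) steps are arbitrary):
```julia
using LinearAlgebra, Random, SparseArrays

function lanczos(A, k)
    n = size(A, 1)
    α, β = zeros(k), zeros(k - 1)
    Q = zeros(n, k)
    Q[:, 1] = normalize!(randn(n))
    for j in 1:k
        w = A * Q[:, j]                     # the only use of A: matrix-vector products
        j > 1 && (w -= β[j-1] * Q[:, j-1])
        α[j] = dot(Q[:, j], w)
        w -= α[j] * Q[:, j]
        if j < k
            β[j] = norm(w)                  # (a real code would guard against breakdown)
            Q[:, j+1] = w / β[j]
        end
    end
    return SymTridiagonal(α, β)             # T_k = Q_k' A Q_k
end

Random.seed!(7)
n = 2000
A = sprandn(n, n, 2 / n); A = (A + A') / 2       # sparse symmetric test matrix
T = lanczos(A, 50)
@show maximum(eigvals(T))                        # Ritz estimate of the largest eigenvalue
@show maximum(eigvals(Symmetric(Matrix(A))))     # dense check, still feasible at this size
```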
eigs and svds in the Julia Arpack.jl package and in Matlab are wrappers of the ARPACK library, which implements the Lanczos and Arnoldi methods.
```julia
using MatrixDepot, SparseArrays

# Download the Boeing/bcsstk34 matrix (sparse, pd, 588-by-588) from SuiteSparse collection
# https://sparse.tamu.edu/Boeing
A = matrixdepot("Boeing/bcsstk34")
# Change type of A from Symmetric{Float64,SparseMatrixCSC{Float64,Int64}} to SparseMatrixCSC
A = sparse(A)
@show typeof(A)
Afull = Matrix(A)
@show typeof(Afull)
# actual sparsity level
count(!iszero, A) / length(A)
```
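With the sparse matrix in hand, a typical call (assuming the Arpack.jl package is installed) to extract a few extreme eigenpairs might look like:
```julia
using Arpack

# top 5 eigenvalues (largest magnitude) and eigenvectors via implicitly restarted Lanczos
λ, ϕ = eigs(A; nev = 5, which = :LM)
λ
```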