LU decomposition (Gaussian elimination) is not used very often in statistics, because most of the time statisticians deal with positive (semi)definite matrices.
Recall that a matrix \(\mathbf{M}\) is positive (semi)definite if \[
\mathbf{x}^T\mathbf{M}\mathbf{x} > 0 ~(\ge 0), \quad \forall \mathbf{x}\neq\mathbf{0}.
\]
Example: in the normal equation \[
\mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T \mathbf{y}
\] of linear regression, the coefficient matrix \(\mathbf{X}^T \mathbf{X}\) is symmetric and positive semidefinite. How can we exploit this structure?
Cholesky decomposition
Theorem: Let \(\mathbf{A} \in \mathbb{R}^{n \times n}\) be symmetric and positive definite. Then \(\mathbf{A} = \mathbf{L} \mathbf{L}^T\), where \(\mathbf{L}\) is lower triangular with positive diagonal entries and is unique.
Proof (by induction):
If \(n=1\), then \(a > 0\) and \(a = \sqrt{a}\sqrt{a}\), with \(\sqrt{a}\) the unique positive choice. For \(n>1\), the block equation \[
\begin{bmatrix}
a_{11} & \mathbf{a}^T \\ \mathbf{a} & \mathbf{A}_{22}
\end{bmatrix} =
\begin{bmatrix}
\ell_{11} & \mathbf{0}_{n-1}^T \\ \mathbf{b} & \mathbf{L}_{22}
\end{bmatrix}
\begin{bmatrix}
\ell_{11} & \mathbf{b}^T \\ \mathbf{0}_{n-1} & \mathbf{L}_{22}^T
\end{bmatrix}
\] entails
\[
\begin{eqnarray*}
a_{11} &=& \ell_{11}^2 \\
\mathbf{a} &=& \ell_{11} \mathbf{b} \\
\mathbf{A}_{22} &=& \mathbf{b} \mathbf{b}^T + \mathbf{L}_{22} \mathbf{L}_{22}^T
.
\end{eqnarray*}
\]
Since \(a_{11}>0\) (why?), \(\ell_{11}=\sqrt{a_{11}}\) and \(\mathbf{b}=a_{11}^{-1/2}\mathbf{a}\) are uniquely determined. \(\mathbf{A}_{22} - \mathbf{b} \mathbf{b}^T = \mathbf{A}_{22} - a_{11}^{-1} \mathbf{a} \mathbf{a}^T\) is positive definite of size \((n-1)\times(n-1)\) because it is the Schur complement of \(a_{11}\) in the positive definite \(\mathbf{A}\) (why? homework). By the induction hypothesis, a lower triangular \(\mathbf{L}_{22}\) with positive diagonal such that \(\mathbf{A}_{22} - \mathbf{b} \mathbf{b}^T = \mathbf{L}_{22}\mathbf{L}_{22}^T\) exists and is unique.
This proof is constructive and completely specifies the algorithm.
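A minimal sketch of the resulting recursion in Julia (an unblocked, in-place variant; the function name chol! is illustrative, and production code would call LAPACK's potrf! instead):

using LinearAlgebra

# overwrite the lower triangle of A with its Cholesky factor L
function chol!(A::Matrix{Float64})
    n = size(A, 1)
    for k in 1:n
        A[k, k] > 0 || error("matrix is not positive definite")
        A[k, k] = sqrt(A[k, k])                  # ℓ_kk = √a_kk
        A[k+1:n, k] ./= A[k, k]                  # b = a / ℓ_kk
        for j in k+1:n                           # A_22 ← A_22 - b b^T (lower triangle only)
            A[j:n, j] .-= A[j:n, k] .* A[j, k]
        end
    end
    LowerTriangular(A)
end

A = [4.0 12 -16; 12 37 -43; -16 -43 98]   # a positive definite test matrix
L = chol!(copy(A))
norm(L * L' - A)                           # ≈ 0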
In general, Cholesky decomposition is very stable. Failure of the decomposition simply means that \(\mathbf{A}\) is not positive definite, so attempting it is an efficient way to test positive definiteness.
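For example, in Julia one can probe positive definiteness by attempting the factorization with check = false and querying issuccess (the matrices here are illustrative):

using LinearAlgebra

B = [2.0 1.0; 1.0 2.0]   # positive definite (eigenvalues 3 and 1)
C = [1.0 2.0; 2.0 1.0]   # indefinite (eigenvalues 3 and -1)

issuccess(cholesky(Symmetric(B), check = false))   # true:  Cholesky succeeds
issuccess(cholesky(Symmetric(C), check = false))   # false: C is not positive definite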
Pivoting
Pivoting is only needed if you want the Cholesky factor of a positive semidefinite matrix.
When \(\mathbf{A}\) does not have full rank, e.g., \(\mathbf{X}^T \mathbf{X}\) with \(\mathbf{X}\) of less than full column rank, we encounter \(a_{kk} = 0\) during the procedure.
Symmetric pivoting. At each stage \(k\), we permute both the rows and the columns so that \(\max_{k \le i \le n} a_{ii}\) becomes the pivot. If we encounter \(\max_{k \le i \le n} a_{ii} = 0\), then \(\mathbf{A}[k:n,k:n] = \mathbf{0}\) (why?) and the algorithm terminates.
With symmetric pivoting: \[
\mathbf{P} \mathbf{A} \mathbf{P}^T = \mathbf{L} \mathbf{L}^T,
\] where \(\mathbf{P}\) is a permutation matrix and \(\mathbf{L} \in \mathbb{R}^{n \times r}\), \(r = \text{rank}(\mathbf{A})\).
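The definitions of A and Achol used in the check below are not shown in this excerpt; a hypothetical setup could be a rank-deficient positive semidefinite matrix factored with pivoting turned on:

using LinearAlgebra

# hypothetical example: the third column of X is the sum of the first two,
# so A = X'X is positive semidefinite with rank 2
X = [1.0 2.0 3.0; 4.0 5.0 9.0; 7.0 8.0 15.0; 1.0 1.0 2.0]
A = X'X
Achol = cholesky(A, RowMaximum(); check = false)   # pivoted Cholesky
# (on older Julia versions: cholesky(A, Val(true); check = false))
rank(Achol)                                        # numerical rank, here 2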
# P A P' = L U
norm(transpose(Achol.P) * A * Achol.P - Achol.L * Achol.U)
5.438959822042073e-16
Applications
No-inversion mentality: whenever we see a matrix inverse, we should think in terms of solving linear equations instead. If the matrix is positive (semi)definite, use Cholesky decomposition, which costs about half as much as LU decomposition (\(n^3/3\) versus \(2n^3/3\) flops).
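A small illustration of the principle, using a positive definite matrix A and right-hand side b chosen here for demonstration:

using LinearAlgebra

A = [2.0 1.0 0.0; 1.0 2.0 1.0; 0.0 1.0 2.0]   # a positive definite matrix
b = [1.0, 2.0, 3.0]

x1 = inv(A) * b                    # forms the explicit inverse: wasteful and less accurate
x2 = cholesky(Symmetric(A)) \ b    # solves A x = b via Cholesky: cheaper and more stable
norm(x1 - x2)                      # the two solutions agree to roundoff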
Multivariate normal density
The multivariate normal density \(\mathcal{N}(\mathbf{0}, \Sigma)\), where \(\Sigma\) is \(n\times n\) positive definite, is \[
\small
\frac{1}{(2\pi)^{n/2}\det(\Sigma)^{1/2}}\exp\left(-\frac{1}{2}\mathbf{y}^T\Sigma^{-1}\mathbf{y}\right).
\]
The direct approach computes the inverse \(\Sigma^{-1}\) and the determinant \(\det(\Sigma)\) explicitly. The Cholesky approach instead proceeds as follows:
Compute the Cholesky factor \(\Sigma = \mathbf{L} \mathbf{L}^T\) (\(n^3/3\) flops).
Solve \(\mathbf{L} \mathbf{x} = \mathbf{y}\) by forward substitution (\(n^2\) flops).
Compute the quadratic form \(\mathbf{x}^T \mathbf{x} = \mathbf{y}^T \Sigma^{-1} \mathbf{y}\) (\(2n\) flops).
Compute the determinant from the Cholesky factor: \(\log\det(\Sigma) = 2 \sum_{i=1}^n \log \ell_{ii}\) (\(n\) flops).
Which method is better?
using LinearAlgebra

# word-by-word transcription of the mathematical expression
function logpdf_mvn_1(y::Vector, Σ::Matrix)
    n = length(y)
    - (n//2) * log(2π) - (1//2) * logdet(Σ) - (1//2) * transpose(y) * inv(Σ) * y
end

# you learnt why you should avoid inversion
function logpdf_mvn_2(y::Vector, Σ::Matrix)
    n = length(y)
    Σchol = cholesky(Symmetric(Σ))
    - (n//2) * log(2π) - (1//2) * logdet(Σchol) - (1//2) * dot(y, Σchol \ y)
end

# this is for the efficiency-concerned
function logpdf_mvn_3(y::Vector, Σ::Matrix)
    n = length(y)
    Σchol = cholesky(Symmetric(Σ))
    - (n//2) * log(2π) - sum(log.(diag(Σchol.L))) - (1//2) * sum(abs2, Σchol.L \ y)
end
logpdf_mvn_3 (generic function with 1 method)
using BenchmarkTools, Distributions, Random

Random.seed!(123) # seed
n = 1000
# a pd matrix
Σ = convert(Matrix{Float64}, Symmetric([i * (n - j + 1) for i in 1:n, j in 1:n]))
y = rand(MvNormal(Σ)) # one random sample from N(0, Σ)

# at least they give the same answer
@show logpdf_mvn_1(y, Σ)
@show logpdf_mvn_2(y, Σ)
@show logpdf_mvn_3(y, Σ);

@benchmark logpdf_mvn_1($y, $Σ)
BenchmarkTools.Trial: 118 samples with 1 evaluation.
 Range (min … max):  33.045 ms … 74.940 ms  ┊ GC (min … max): 0.00% … 4.58%
 Time (median):      37.421 ms              ┊ GC (median):    0.00%
 Time (mean ± σ):    42.534 ms ± 9.247 ms   ┊ GC (mean ± σ):  2.39% ± 3.48%
 Memory estimate: 15.77 MiB, allocs estimate: 9.
@benchmark logpdf_mvn_2($y, $Σ)
BenchmarkTools.Trial: 359 samples with 1 evaluation.
 Range (min … max):  10.062 ms … 37.556 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time (median):      11.814 ms              ┊ GC (median):    0.00%
 Time (mean ± σ):    14.179 ms ± 4.300 ms   ┊ GC (mean ± σ):  3.44% ± 7.04%
 Memory estimate: 7.64 MiB, allocs estimate: 3.
@benchmark logpdf_mvn_3($y, $Σ)
BenchmarkTools.Trial: 314 samples with 1 evaluation.
 Range (min … max):  13.647 ms … 98.125 ms  ┊ GC (min … max): 0.00% … 3.41%
 Time (median):      15.424 ms              ┊ GC (median):    8.38%
 Time (mean ± σ):    15.925 ms ± 6.336 ms   ┊ GC (mean ± σ):  8.70% ± 8.56%
 Memory estimate: 22.91 MiB, allocs estimate: 11.
To evaluate the same multivariate normal density at many observations \(\mathbf{y}_1, \mathbf{y}_2, \ldots\), we pre-compute the Cholesky decomposition (\(n^3/3\) flops) once; each subsequent evaluation then costs only \(n^2\) flops.
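A sketch of this strategy, reusing the structure of logpdf_mvn_3 above (the name logpdf_mvn_precomp is illustrative):

using LinearAlgebra

# pay the n^3/3 factorization cost once
Σchol = cholesky(Symmetric(Σ))

# each evaluation is one triangular solve (n^2 flops) plus O(n) work
function logpdf_mvn_precomp(y::Vector, Σchol::Cholesky)
    n = length(y)
    - (n//2) * log(2π) - sum(log.(diag(Σchol.L))) - (1//2) * sum(abs2, Σchol.L \ y)
end

# logpdf_mvn_precomp(y1, Σchol), logpdf_mvn_precomp(y2, Σchol), ...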
Linear regression by Cholesky
Cholesky decomposition is one approach to solving linear regression. Assume \(\mathbf{X} \in \mathbb{R}^{n \times p}\) and \(\mathbf{y} \in \mathbb{R}^n\).
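A minimal sketch of this approach via the normal equations, assuming \(\mathbf{X}\) has full column rank (the function name linreg_chol and the test data are illustrative):

using LinearAlgebra

# solve the normal equations X'X β = X'y with a Cholesky factorization
function linreg_chol(y::Vector, X::Matrix)
    XtX = X'X                           # form the Gram matrix (about n p^2 flops)
    XtXchol = cholesky(Symmetric(XtX))  # p^3/3 flops
    XtXchol \ (X'y)                     # two triangular solves (about 2 p^2 flops)
end

# quick sanity check against the QR-based backslash solver
X = randn(100, 5); y = X * ones(5) + 0.1 * randn(100)
norm(linreg_chol(y, X) - X \ y)         # should be ≈ 0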