LU decomposition (Gaussian elimination) is not used very often in statistics, because most of the time statisticians deal with positive (semi)definite matrices.
Recall that a matrix \(\mathbf{M}\) is positive (semi)definite if \[
\mathbf{x}^T\mathbf{M}\mathbf{x} > 0 ~(\ge 0), \quad \forall \mathbf{x}\neq\mathbf{0}.
\]
Example: in the normal equation \[
\mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T \mathbf{y}
\] of linear regression, the coefficient matrix \(\mathbf{X}^T \mathbf{X}\) is symmetric and positive semidefinite. How can we exploit this structure?
Cholesky decomposition
Theorem: Let \(\mathbf{A} \in \mathbb{R}^{n \times n}\) be symmetric and positive definite. Then \(\mathbf{A} = \mathbf{L} \mathbf{L}^T\), where \(\mathbf{L}\) is lower triangular with positive diagonal entries and is unique.
Proof (by induction):
If \(n=1\), then \(a > 0\) and \(a = \sqrt{a}\sqrt{a}\), with \(\sqrt{a}\) the unique positive choice. For \(n>1\), the block equation \[
\begin{bmatrix}
a_{11} & \mathbf{a}^T \\ \mathbf{a} & \mathbf{A}_{22}
\end{bmatrix} =
\begin{bmatrix}
\ell_{11} & \mathbf{0}_{n-1}^T \\ \mathbf{b} & \mathbf{L}_{22}
\end{bmatrix}
\begin{bmatrix}
\ell_{11} & \mathbf{b}^T \\ \mathbf{0}_{n-1} & \mathbf{L}_{22}^T
\end{bmatrix}
\] entails
\[
\begin{eqnarray*}
a_{11} &=& \ell_{11}^2 \\
\mathbf{a} &=& \ell_{11} \mathbf{b} \\
\mathbf{A}_{22} &=& \mathbf{b} \mathbf{b}^T + \mathbf{L}_{22} \mathbf{L}_{22}^T
.
\end{eqnarray*}
\]
Since \(a_{11}>0\) (why?), \(\ell_{11}=\sqrt{a_{11}}\) and \(\mathbf{b}=a_{11}^{-1/2}\mathbf{a}\) are uniquely determined. \(\mathbf{A}_{22} - \mathbf{b} \mathbf{b}^T = \mathbf{A}_{22} - a_{11}^{-1} \mathbf{a} \mathbf{a}^T\) is positive definite of size \((n-1)\times(n-1)\) because it is the Schur complement of \(a_{11}\) in the positive definite \(\mathbf{A}\) (why? homework). By the induction hypothesis, a lower triangular \(\mathbf{L}_{22}\) with positive diagonal such that \(\mathbf{A}_{22} - \mathbf{b} \mathbf{b}^T = \mathbf{L}_{22}\mathbf{L}_{22}^T\) exists and is unique.
This proof is constructive and completely specifies the algorithm.
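A minimal sketch of the resulting recursion in Julia (an unblocked, in-place variant; the function name chol! is illustrative, and production code would call LAPACK's potrf! instead):

using LinearAlgebra

# overwrite the lower triangle of A with its Cholesky factor L
function chol!(A::Matrix{Float64})
    n = size(A, 1)
    for k in 1:n
        A[k, k] > 0 || error("matrix is not positive definite")
        A[k, k] = sqrt(A[k, k])                  # ℓ_kk = √a_kk
        A[k+1:n, k] ./= A[k, k]                  # b = a / ℓ_kk
        for j in k+1:n                           # A_22 ← A_22 - b b^T (lower triangle only)
            A[j:n, j] .-= A[j:n, k] .* A[j, k]
        end
    end
    LowerTriangular(A)
end

A = [4.0 12 -16; 12 37 -43; -16 -43 98]   # a positive definite test matrix
L = chol!(copy(A))
norm(L * L' - A)                           # ≈ 0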
In general, Cholesky decomposition is very stable. Failure of the decomposition simply means that \(\mathbf{A}\) is not positive definite, so attempting it is an efficient way to test positive definiteness.
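For example, in Julia one can probe positive definiteness by attempting the factorization with check = false and querying issuccess (the matrices here are illustrative):

using LinearAlgebra

B = [2.0 1.0; 1.0 2.0]   # positive definite (eigenvalues 3 and 1)
C = [1.0 2.0; 2.0 1.0]   # indefinite (eigenvalues 3 and -1)

issuccess(cholesky(Symmetric(B), check = false))   # true:  Cholesky succeeds
issuccess(cholesky(Symmetric(C), check = false))   # false: C is not positive definite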
Pivoting
Pivoting is only needed if you want the Cholesky factor of a positive semidefinite matrix.
When \(\mathbf{A}\) does not have full rank, e.g., \(\mathbf{X}^T \mathbf{X}\) with \(\mathbf{X}\) of less than full column rank, we encounter \(a_{kk} = 0\) during the procedure.
Symmetric pivoting. At each stage \(k\), we permute both the rows and the columns so that \(\max_{k \le i \le n} a_{ii}\) becomes the pivot. If we encounter \(\max_{k \le i \le n} a_{ii} = 0\), then \(\mathbf{A}[k:n,k:n] = \mathbf{0}\) (why?) and the algorithm terminates.
With symmetric pivoting: \[
\mathbf{P} \mathbf{A} \mathbf{P}^T = \mathbf{L} \mathbf{L}^T,
\] where \(\mathbf{P}\) is a permutation matrix and \(\mathbf{L} \in \mathbb{R}^{n \times r}\), \(r = \text{rank}(\mathbf{A})\).
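The definitions of A and Achol used in the check below are not shown in this excerpt; a hypothetical setup could be a rank-deficient positive semidefinite matrix factored with pivoting turned on:

using LinearAlgebra

# hypothetical example: the third column of X is the sum of the first two,
# so A = X'X is positive semidefinite with rank 2
X = [1.0 2.0 3.0; 4.0 5.0 9.0; 7.0 8.0 15.0; 1.0 1.0 2.0]
A = X'X
Achol = cholesky(A, RowMaximum(); check = false)   # pivoted Cholesky
# (on older Julia versions: cholesky(A, Val(true); check = false))
rank(Achol)                                        # numerical rank, here 2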
# P A P' = L U
norm(transpose(Achol.P) * A * Achol.P - Achol.L * Achol.U)
5.438959822042073e-16
Applications
No-inversion mentality: whenever we see a matrix inverse, we should think in terms of solving linear equations instead. If the matrix is positive (semi)definite, use Cholesky decomposition, which costs about half as much as LU decomposition (\(n^3/3\) versus \(2n^3/3\) flops).
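A small illustration of the principle, using a positive definite matrix A and right-hand side b chosen here for demonstration:

using LinearAlgebra

A = [2.0 1.0 0.0; 1.0 2.0 1.0; 0.0 1.0 2.0]   # a positive definite matrix
b = [1.0, 2.0, 3.0]

x1 = inv(A) * b                    # forms the explicit inverse: wasteful and less accurate
x2 = cholesky(Symmetric(A)) \ b    # solves A x = b via Cholesky: cheaper and more stable
norm(x1 - x2)                      # the two solutions agree to roundoff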
Multivariate normal density
The multivariate normal density \(\mathcal{N}(\mathbf{0}, \Sigma)\), where \(\Sigma\) is \(n\times n\) positive definite, is \[
\small
\frac{1}{(2\pi)^{n/2}\det(\Sigma)^{1/2}}\exp\left(-\frac{1}{2}\mathbf{y}^T\Sigma^{-1}\mathbf{y}\right).
\]
The direct approach computes the inverse \(\Sigma^{-1}\) and the determinant \(\det(\Sigma)\) explicitly. The Cholesky approach instead proceeds as follows:
Compute the Cholesky factor \(\Sigma = \mathbf{L} \mathbf{L}^T\) (\(n^3/3\) flops).
Solve \(\mathbf{L} \mathbf{x} = \mathbf{y}\) by forward substitution (\(n^2\) flops).
Compute the quadratic form \(\mathbf{x}^T \mathbf{x} = \mathbf{y}^T \Sigma^{-1} \mathbf{y}\) (\(2n\) flops).
Compute the determinant from the Cholesky factor: \(\log\det(\Sigma) = 2 \sum_{i=1}^n \log \ell_{ii}\) (\(n\) flops).
Which method is better?
using LinearAlgebra

# word-by-word transcription of the mathematical expression
function logpdf_mvn_1(y::Vector, Σ::Matrix)
    n = length(y)
    - (n//2) * log(2π) - (1//2) * logdet(Σ) - (1//2) * transpose(y) * inv(Σ) * y
end

# you learnt why you should avoid inversion
function logpdf_mvn_2(y::Vector, Σ::Matrix)
    n = length(y)
    Σchol = cholesky(Symmetric(Σ))
    - (n//2) * log(2π) - (1//2) * logdet(Σchol) - (1//2) * dot(y, Σchol \ y)
end

# this is for the efficiency-concerned
function logpdf_mvn_3(y::Vector, Σ::Matrix)
    n = length(y)
    Σchol = cholesky(Symmetric(Σ))
    - (n//2) * log(2π) - sum(log.(diag(Σchol.L))) - (1//2) * sum(abs2, Σchol.L \ y)
end
logpdf_mvn_3 (generic function with 1 method)
using BenchmarkTools, Distributions, Random

Random.seed!(123) # seed
n = 1000
# a pd matrix
Σ = convert(Matrix{Float64}, Symmetric([i * (n - j + 1) for i in 1:n, j in 1:n]))
y = rand(MvNormal(Σ)) # one random sample from N(0, Σ)

# at least they give the same answer
@show logpdf_mvn_1(y, Σ)
@show logpdf_mvn_2(y, Σ)
@show logpdf_mvn_3(y, Σ);

@benchmark logpdf_mvn_1($y, $Σ)
BenchmarkTools.Trial: 118 samples with 1 evaluation.
 Range (min … max):  33.045 ms … 74.940 ms  ┊ GC (min … max): 0.00% … 4.58%
 Time (median):      37.421 ms              ┊ GC (median):    0.00%
 Time (mean ± σ):    42.534 ms ± 9.247 ms   ┊ GC (mean ± σ):  2.39% ± 3.48%
 Memory estimate: 15.77 MiB, allocs estimate: 9.
@benchmark logpdf_mvn_2($y, $Σ)
BenchmarkTools.Trial: 359 samples with 1 evaluation.
 Range (min … max):  10.062 ms … 37.556 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time (median):      11.814 ms              ┊ GC (median):    0.00%
 Time (mean ± σ):    14.179 ms ± 4.300 ms   ┊ GC (mean ± σ):  3.44% ± 7.04%
 Memory estimate: 7.64 MiB, allocs estimate: 3.
@benchmark logpdf_mvn_3($y, $Σ)
BenchmarkTools.Trial: 314 samples with 1 evaluation.
 Range (min … max):  13.647 ms … 98.125 ms  ┊ GC (min … max): 0.00% … 3.41%
 Time (median):      15.424 ms              ┊ GC (median):    8.38%
 Time (mean ± σ):    15.925 ms ± 6.336 ms   ┊ GC (mean ± σ):  8.70% ± 8.56%
 Memory estimate: 22.91 MiB, allocs estimate: 11.
To evaluate the same multivariate normal density at many observations \(\mathbf{y}_1, \mathbf{y}_2, \ldots\), we pre-compute the Cholesky decomposition (\(n^3/3\) flops) once; each subsequent evaluation then costs only \(n^2\) flops.
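A sketch of this strategy, reusing the structure of logpdf_mvn_3 above (the name logpdf_mvn_precomp is illustrative):

using LinearAlgebra

# pay the n^3/3 factorization cost once
Σchol = cholesky(Symmetric(Σ))

# each evaluation is one triangular solve (n^2 flops) plus O(n) work
function logpdf_mvn_precomp(y::Vector, Σchol::Cholesky)
    n = length(y)
    - (n//2) * log(2π) - sum(log.(diag(Σchol.L))) - (1//2) * sum(abs2, Σchol.L \ y)
end

# logpdf_mvn_precomp(y1, Σchol), logpdf_mvn_precomp(y2, Σchol), ...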
Linear regression by Cholesky
Cholesky decomposition is one approach to solving linear regression. Assume \(\mathbf{X} \in \mathbb{R}^{n \times p}\) and \(\mathbf{y} \in \mathbb{R}^n\).
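A minimal sketch of this approach via the normal equations, assuming \(\mathbf{X}\) has full column rank (the function name linreg_chol and the test data are illustrative):

using LinearAlgebra

# solve the normal equations X'X β = X'y with a Cholesky factorization
function linreg_chol(y::Vector, X::Matrix)
    XtX = X'X                           # form the Gram matrix (about n p^2 flops)
    XtXchol = cholesky(Symmetric(XtX))  # p^3/3 flops
    XtXchol \ (X'y)                     # two triangular solves (about 2 p^2 flops)
end

# quick sanity check against the QR-based backslash solver
X = randn(100, 5); y = X * ones(5) + 0.1 * randn(100)
norm(linreg_chol(y, X) - X \ y)         # should be ≈ 0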