Cholesky Decomposition

Advanced Statistical Computing

Joong-Ho Won

Seoul National University

October 2023

versioninfo()
Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 8 × Intel(R) Core(TM) i5-8279U CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 2 on 8 virtual cores
using Pkg
Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()
Status `~/Dropbox/class/M1399.000200/2023/M1300_000200-2023fall/lectures/08-chol/Project.toml`
  [6e4b80f9] BenchmarkTools v1.3.2
  [31c24e10] Distributions v0.25.102

Cholesky Decomposition

André-Louis Cholesky, 1875-1918

Source https://en.wikipedia.org/wiki/André-Louis_Cholesky

  • A basic tenet in numerical analysis:

The structure should be exploited whenever possible in solving a problem.

Common structures include: symmetry, positive (semi)definiteness, sparsity, Kronecker product, low rank, …

  • LU decomposition (Gaussian elimination) is not used very often in statistics because, most of the time, statisticians deal with positive (semi)definite matrices.
  • Recall that a matrix \(M\) is positive (semi)definite if \[ \mathbf{x}^T\mathbf{M}\mathbf{x} > 0 ~(\ge 0), \quad \forall \mathbf{x}\neq\mathbf{0}. \]

  • Example: in the normal equation \[ \mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T \mathbf{y} \] for linear regression, the coefficient matrix \(\mathbf{X}^T \mathbf{X}\) is symmetric and positive semidefinite. How can we exploit this structure?

  • Most of the time, statisticians deal with positive (semi)definite matrices.

Cholesky decomposition

Theorem: Let \(\mathbf{A} \in \mathbb{R}^{n \times n}\) be symmetric and positive definite. Then \(\mathbf{A} = \mathbf{L} \mathbf{L}^T\), where \(\mathbf{L}\) is lower triangular with positive diagonal entries and is unique.

Proof (by induction):
If \(n=1\), then \(0 < a = \sqrt{a}\sqrt{a}\). For \(n>1\), the block equation \[ \begin{bmatrix} a_{11} & \mathbf{a}^T \\ \mathbf{a} & \mathbf{A}_{22} \end{bmatrix} = \begin{bmatrix} \ell_{11} & \mathbf{0}_{n-1}^T \\ \mathbf{b} & \mathbf{L}_{22} \end{bmatrix} \begin{bmatrix} \ell_{11} & \mathbf{b}^T \\ \mathbf{0}_{n-1} & \mathbf{L}_{22}^T \end{bmatrix} \] entails

\[ \begin{eqnarray*} a_{11} &=& \ell_{11}^2 \\ \mathbf{a} &=& \ell_{11} \mathbf{b} \\ \mathbf{A}_{22} &=& \mathbf{b} \mathbf{b}^T + \mathbf{L}_{22} \mathbf{L}_{22}^T . \end{eqnarray*} \]
Since \(a_{11}>0\) (why?), \(\ell_{11}=\sqrt{a_{11}}\) and \(\mathbf{b}=a_{11}^{-1/2}\mathbf{a}\) are uniquely determined. \(\mathbf{A}_{22} - \mathbf{b} \mathbf{b}^T = \mathbf{A}_{22} - a_{11}^{-1} \mathbf{a} \mathbf{a}^T\) is positive definite of size \((n-1)\times(n-1)\) because \(\mathbf{A}\) is positive definite (why? homework). By the induction hypothesis, a lower triangular \(\mathbf{L}_{22}\) such that \(\mathbf{A}_{22} - \mathbf{b} \mathbf{b}^T = \mathbf{L}_{22}\mathbf{L}_{22}^T\) exists and is unique.

  • This proof is constructive and completely specifies the algorithm (a runnable version is sketched after the cost discussion below):
for k=1:n
    for j=k+1:n
        a_jk_div_a_kk = A[j, k] / A[k, k] 
        for i=j:n
            A[i, j] -= A[i, k] * a_jk_div_a_kk    # L_22
        end
    end
    sqrt_akk = sqrt(A[k, k])
    for i=k:n
        A[i, k] /= sqrt_akk    # l_11 and b
    end
end
  • Computational cost: \(\frac{1}{2} [2(n-1)^2 + 2(n-2)^2 + \cdots + 2 \cdot 1^2] \approx \frac{1}{3} n^3\) flops + \(n\) square roots.
    • Half the cost of LU decomposition.
  • In general, the Cholesky decomposition is very stable. Failure of the decomposition simply means that \(\mathbf{A}\) is not positive definite, which makes it an efficient way to test for positive definiteness.
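  • As a sanity check, the loop above can be wrapped in a function and compared against LinearAlgebra.cholesky. A minimal runnable sketch (the function name chol_unblocked! and the test at the end are illustrative, not part of the original notes):

using LinearAlgebra

function chol_unblocked!(A::Matrix{Float64})
    # overwrite the lower triangle of A with its Cholesky factor L
    n = size(A, 1)
    for k in 1:n
        for j in k+1:n
            a_jk_div_a_kk = A[j, k] / A[k, k]
            for i in j:n
                A[i, j] -= A[i, k] * a_jk_div_a_kk    # Schur complement update (L_22)
            end
        end
        sqrt_akk = sqrt(A[k, k])
        for i in k:n
            A[i, k] /= sqrt_akk    # scale column k (l_11 and b)
        end
    end
    return LowerTriangular(A)
end

A = Float64.([4 12 -16; 12 37 -43; -16 -43 98])
norm(chol_unblocked!(copy(A)) - cholesky(A).L)    # expect ≈ 0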

Pivoting

  • Pivoting is only needed if you want the Cholesky factor of a positive semidefinite matrix.

  • When \(\mathbf{A}\) does not have full rank, e.g., \(\mathbf{X}^T \mathbf{X}\) with a non-full column rank \(\mathbf{X}\), we encounter \(a_{kk} = 0\) during the procedure.

  • Symmetric pivoting. At each stage \(k\), we permute both the row and the column such that \(\max_{k \le i \le n} a_{ii}\) becomes the pivot (see the sketch after this list). If we encounter \(\max_{k \le i \le n} a_{ii} = 0\), then \(\mathbf{A}[k:n,k:n] = \mathbf{0}\) (why?) and the algorithm terminates.

  • With symmetric pivoting: \[ \mathbf{P} \mathbf{A} \mathbf{P}^T = \mathbf{L} \mathbf{L}^T, \] where \(\mathbf{P}\) is a permutation matrix and \(\mathbf{L} \in \mathbb{R}^{n \times r}\), \(r = \text{rank}(\mathbf{A})\).
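  • A minimal sketch of this symmetric pivoting scheme, assuming a dense positive semidefinite Matrix{Float64} (the function name pivoted_chol and the absolute tolerance tol are illustrative assumptions, not from these notes):

using LinearAlgebra

function pivoted_chol(A::Matrix{Float64}; tol = 1e-12)
    n = size(A, 1)
    A = copy(A)                      # working copy; holds the shrinking Schur complement
    L = zeros(n, n)
    p = collect(1:n)
    r = 0
    for k in 1:n
        # bring the largest remaining diagonal entry into the pivot position
        j = k - 1 + argmax(diag(A)[k:n])
        A[j, j] ≤ tol && break       # remaining block ≈ 0, so the rank is k - 1
        A[:, [k, j]] = A[:, [j, k]]  # symmetric row/column swap of A
        A[[k, j], :] = A[[j, k], :]
        L[[k, j], :] = L[[j, k], :]  # swap the already-computed rows of L
        p[k], p[j] = p[j], p[k]
        # one standard Cholesky step on the pivoted matrix
        L[k, k] = sqrt(A[k, k])
        L[k+1:n, k] = A[k+1:n, k] / L[k, k]
        A[k+1:n, k+1:n] -= L[k+1:n, k] * L[k+1:n, k]'
        r = k
    end
    return L[:, 1:r], p
end

X = randn(5, 3); A = X * X'          # a rank-3 positive semidefinite matrix
L, p = pivoted_chol(A)
size(L, 2), norm(A[p, p] - L * L')   # expect (3, ≈ 0)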

Implementation

Example: positive definite matrix

using LinearAlgebra

A = Float64.([4 12 -16; 12 37 -43; -16 -43 98])  # check that this matrix is pd!
3×3 Matrix{Float64}:
   4.0   12.0  -16.0
  12.0   37.0  -43.0
 -16.0  -43.0   98.0
# Cholesky without pivoting
Achol = cholesky(Symmetric(A))
Cholesky{Float64, Matrix{Float64}}
U factor:
3×3 UpperTriangular{Float64, Matrix{Float64}}:
 2.0  6.0  -8.0
  ⋅   1.0   5.0
  ⋅    ⋅    3.0
typeof(Achol)
Cholesky{Float64, Matrix{Float64}}
fieldnames(typeof(Achol))
(:factors, :uplo, :info)
# retrieve the lower triangular Cholesky factor
Achol.L
3×3 LowerTriangular{Float64, Matrix{Float64}}:
  2.0   ⋅    ⋅ 
  6.0  1.0   ⋅ 
 -8.0  5.0  3.0
# retrieve the upper triangular Cholesky factor
Achol.U
3×3 UpperTriangular{Float64, Matrix{Float64}}:
 2.0  6.0  -8.0
  ⋅   1.0   5.0
  ⋅    ⋅    3.0
b = [1.0; 2.0; 3.0]
A \ b # this does LU; wasteful!; 2/3 n^3 + 2n^2
3-element Vector{Float64}:
 28.58333333333338
 -7.666666666666679
  1.3333333333333353
Achol \ b # two triangular solves; only 2n^2 flops
3-element Vector{Float64}:
 28.583333333333332
 -7.666666666666666
  1.3333333333333333
det(A) # this actually does LU; wasteful!
35.99999999999994
det(Achol) # cheap
36.0
inv(A) # this does LU!
3×3 Matrix{Float64}:
  49.3611   -13.5556     2.11111
 -13.5556     3.77778   -0.555556
   2.11111   -0.555556   0.111111
inv(Achol) # 2n triangular solves
3×3 Matrix{Float64}:
  49.3611   -13.5556     2.11111
 -13.5556     3.77778   -0.555556
   2.11111   -0.555556   0.111111
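
The earlier remark that a (failed) Cholesky factorization is an efficient positive definiteness test is exactly what LinearAlgebra.isposdef does. A quick illustration with the same A (the shifted matrix A - 10I is an illustrative example, not from the original notes):

isposdef(A)          # true: the Cholesky factorization succeeds
isposdef(A - 10I)    # false: the factorization fails, so A - 10I is not positive definite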

Example: positive semidefinite matrix

using Random

Random.seed!(123) # seed
A = randn(5, 3)
A = A * transpose(A) # A has rank 3
5×5 Matrix{Float64}:
  0.726688  -0.982622  -1.0597   -0.58302    0.41867
 -0.982622   1.45895    1.96103   0.931141  -0.594176
 -1.0597     1.96103    4.07717   1.96485   -1.09925
 -0.58302    0.931141   1.96485   1.35893   -0.880852
  0.41867   -0.594176  -1.09925  -0.880852   0.607154
Achol = cholesky(A, Val(true)) # 2nd argument requests pivoting
LoadError: RankDeficientException(1)
Achol = cholesky(A, Val(true), check=false) # turn off checking pd
CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}
U factor with rank 4:
5×5 UpperTriangular{Float64, Matrix{Float64}}:
 2.0192  0.971191   0.973082   -0.544399    -0.524812
  ⋅      0.718149  -0.0193672  -0.0911515   -0.658539
  ⋅       ⋅         0.641612   -0.549978    -0.132617
  ⋅       ⋅          ⋅          1.05367e-8   1.05367e-8
  ⋅       ⋅          ⋅           ⋅          -4.44089e-16
permutation:
5-element Vector{Int64}:
 3
 2
 4
 5
 1
rank(Achol) # determine rank from Cholesky factor
4
rank(A) # determine rank from SVD, which is more numerically stable
3
Achol.L
5×5 LowerTriangular{Float64, Matrix{Float64}}:
  2.0192      ⋅           ⋅         ⋅            ⋅ 
  0.971191   0.718149     ⋅         ⋅            ⋅ 
  0.973082  -0.0193672   0.641612   ⋅            ⋅ 
 -0.544399  -0.0911515  -0.549978  1.05367e-8    ⋅ 
 -0.524812  -0.658539   -0.132617  1.05367e-8  -4.44089e-16
Achol.U
5×5 UpperTriangular{Float64, Matrix{Float64}}:
 2.0192  0.971191   0.973082   -0.544399    -0.524812
  ⋅      0.718149  -0.0193672  -0.0911515   -0.658539
  ⋅       ⋅         0.641612   -0.549978    -0.132617
  ⋅       ⋅          ⋅          1.05367e-8   1.05367e-8
  ⋅       ⋅          ⋅           ⋅          -4.44089e-16
Achol.p
5-element Vector{Int64}:
 3
 2
 4
 5
 1
Achol.P   # this returns P'
5×5 Matrix{Float64}:
 0.0  0.0  0.0  0.0  1.0
 0.0  1.0  0.0  0.0  0.0
 1.0  0.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  1.0  0.0
# P A P' = L U
norm(transpose(Achol.P) * A * Achol.P - Achol.L * Achol.U)
5.438959822042073e-16

Applications

  • No inversion mentality: whenever we see a matrix inverse, we should think in terms of solving linear equations. If the matrix is positive (semi)definite, use the Cholesky decomposition, which costs half as much as the LU decomposition.

Multivariate normal density

The multivariate normal density \(\mathcal{N}(\mathbf{0}, \Sigma)\), where \(\Sigma\) is an \(n\times n\) positive definite matrix, is \[ \small \frac{1}{(2\pi)^{n/2}\det(\Sigma)^{1/2}}\exp\left(-\frac{1}{2}\mathbf{y}^T\Sigma^{-1}\mathbf{y}\right). \]

Hence the log likelihood is \[ \small - \frac{n}{2} \log (2\pi) - \frac{1}{2} \log \det \Sigma - \frac{1}{2} \mathbf{y}^T \Sigma^{-1} \mathbf{y}. \]

  • Method 1:
    1. compute explicit inverse \(\Sigma^{-1}\) (\(2n^3\) flops),
    2. compute quadratic form (\(2n^2 + 2n\) flops),
    3. compute determinant (\(2n^3/3\) flops).
  • Method 2:
    1. Cholesky decomposition \(\Sigma = \mathbf{L} \mathbf{L}^T\) (\(n^3/3\) flops),
    2. Solve \(\mathbf{L} \mathbf{x} = \mathbf{y}\) by forward substitutions (\(n^2\) flops),
    3. compute quadratic form \(\mathbf{x}^T \mathbf{x}\) (\(2n\) flops)
    4. compute determinant from Cholesky factor (\(n\) flops).

Which method is better?

# word-by-word transcription of mathematical expression
function logpdf_mvn_1(y::Vector, Σ::Matrix)
    n = length(y)
    - (n//2) * log(2π) - (1//2) * logdet(Σ) - (1//2) * transpose(y) * inv(Σ) * y
end

# you learnt why you should avoid inversion
function logpdf_mvn_2(y::Vector, Σ::Matrix)
    n = length(y)
    Σchol = cholesky(Symmetric(Σ))
    - (n//2) * log(2π) - (1//2) * logdet(Σchol) - (1//2) * dot(y, Σchol \ y)
end

# this is for the efficiency-concerned  
function logpdf_mvn_3(y::Vector, Σ::Matrix)
    n = length(y)
    Σchol = cholesky(Symmetric(Σ))
    - (n//2) * log(2π) - sum(log.(diag(Σchol.L))) - (1//2) * sum(abs2, Σchol.L \ y)
end
logpdf_mvn_3 (generic function with 1 method)
using BenchmarkTools, Distributions, Random

Random.seed!(123) # seed

n = 1000
# a pd matrix
Σ = convert(Matrix{Float64}, Symmetric([i * (n - j + 1) for i in 1:n, j in 1:n]))
y = rand(MvNormal(Σ)) # one random sample from N(0, Σ)

# at least they give same answer
@show logpdf_mvn_1(y, Σ)
@show logpdf_mvn_2(y, Σ)
@show logpdf_mvn_3(y, Σ);
logpdf_mvn_1(y, Σ) = -4856.961087775249
logpdf_mvn_2(y, Σ) = -4856.961087775317
logpdf_mvn_3(y, Σ) = -4856.961087775318
@benchmark logpdf_mvn_1($y, $Σ)
BenchmarkTools.Trial: 118 samples with 1 evaluation.
 Range (min … max):  33.045 ms … 74.940 ms   GC (min … max): 0.00% … 4.58%
 Time  (median):     37.421 ms               GC (median):    0.00%
 Time  (mean ± σ):   42.534 ms ±  9.247 ms   GC (mean ± σ):  2.39% ± 3.48%
  ▆▁ ▁█▂▃▁                      ▂                             
  ████████▃▆▃▁▁▃▁▃▁▁▁▁▁▃▁▁▁▁▃▄▆█▄██▃▄▇▃▆▃▃▁▁▁▁▁▁▁▃▃▁▁▁▁▁▃▁▃ ▃
  33 ms           Histogram: frequency by time        66.1 ms <
 Memory estimate: 15.77 MiB, allocs estimate: 9.
@benchmark logpdf_mvn_2($y, $Σ)
BenchmarkTools.Trial: 359 samples with 1 evaluation.
 Range (min … max):  10.062 ms … 37.556 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     11.814 ms               GC (median):    0.00%
 Time  (mean ± σ):   14.179 ms ±  4.300 ms   GC (mean ± σ):  3.44% ± 7.04%
   ▂▃█▂▁                                                       
  ▃█████▃▂▃▄▄▄▄▅▃▃▂▃▄▁▃▃▃▄▄▃▃▃▁▅▃▄▃▃▃▃▃▁▂▁▁▃▂▃▃▃▂▁▂▁▂▂▂▃▂▃▃ ▃
  10.1 ms         Histogram: frequency by time        26.1 ms <
 Memory estimate: 7.64 MiB, allocs estimate: 3.
@benchmark logpdf_mvn_3($y, $Σ)
BenchmarkTools.Trial: 314 samples with 1 evaluation.
 Range (min … max):  13.647 ms … 98.125 ms   GC (min … max): 0.00% … 3.41%
 Time  (median):     15.424 ms               GC (median):    8.38%
 Time  (mean ± σ):   15.925 ms ±  6.336 ms   GC (mean ± σ):  8.70% ± 8.56%
  ▇▄▆▃    ▃▇█                                                 
  ████▆▃▄▃███▄▄▃▁▂▂▃▄▂▂▄▃▁▁▂▂▁▂▁▁▂▁▁▂▁▁▁▁▁▃▁▁▂▁▂▁▁▁▁▁▁▁▁▁▁▂ ▃
  13.6 ms         Histogram: frequency by time        25.3 ms <
 Memory estimate: 22.91 MiB, allocs estimate: 11.
  • To evaluate the same multivariate normal density at many observations \(y_1, y_2, \ldots\), we pre-compute the Cholesky decomposition (\(n^3/3\) flops); each evaluation then costs only \(n^2\) flops, as in the sketch below.
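
A minimal sketch of this amortization, reusing Σ and logpdf_mvn_3 from the benchmark above (the function name logpdf_mvn_precomp is an illustrative assumption, not part of the original notes):

using LinearAlgebra, Distributions

function logpdf_mvn_precomp(Y::Matrix, Σchol::Cholesky)
    n = size(Y, 1)                    # each column of Y is one observation
    Z = Σchol.L \ Y                   # one forward solve per column: n^2 flops each
    -(n / 2) * log(2π) .- (1 / 2) * logdet(Σchol) .- (1 / 2) * vec(sum(abs2, Z, dims = 1))
end

Σchol = cholesky(Symmetric(Σ))        # n^3/3 flops, paid once
Y = rand(MvNormal(Σ), 10)             # 10 observations, one per column
logpdf_mvn_precomp(Y, Σchol) ≈ [logpdf_mvn_3(Y[:, j], Σ) for j in 1:10]   # expect true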

Linear regression by Cholesky

  • Cholesky decomposition is one approach to solve linear regression. Assume \(\mathbf{X} \in \mathbb{R}^{n \times p}\) and \(\mathbf{y} \in \mathbb{R}^n\).
    • Compute \(\mathbf{X}^T \mathbf{X}\): \(np^2\) flops
    • Compute \(\mathbf{X}^T \mathbf{y}\): \(2np\) flops
    • Cholesky decomposition of \(\mathbf{X}^T \mathbf{X}\): \(\frac{1}{3} p^3\) flops
    • Solve normal equation \(\mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T \mathbf{y}\): \(2p^2\) flops
    • If standard errors are needed, another \((4/3)p^3\) flops
  • Total computational cost is \(np^2 + (1/3) p^3\) (without s.e.) or \(np^2 + (5/3) p^3\) (with s.e.) flops; a sketch is given below.
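
A minimal sketch of this route under the assumptions above (the function name linreg_chol and the simulated data are illustrative, not part of the original notes):

using LinearAlgebra

function linreg_chol(X::Matrix, y::Vector)
    G = cholesky(Symmetric(X' * X))   # form X'X (np^2 flops), then factor it (p^3/3 flops)
    return G \ (X' * y)               # X'y (2np flops) plus two triangular solves (2p^2 flops)
end

X = [ones(100) randn(100, 2)]         # simulated design matrix, n = 100, p = 3
y = X * [1.0, 2.0, 3.0] + 0.1 * randn(100)
norm(linreg_chol(X, y) - X \ y)       # compare with the QR-based X \ y; expect ≈ 0

If standard errors are needed, \((\mathbf{X}^T\mathbf{X})^{-1}\) is available as inv(G), just like inv(Achol) above.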

Further reading