- singular value decompositions \(\mathbf{A} = \mathbf{U} \Sigma \mathbf{V}^T\)
- iterative methods for numerical linear algebra
Except for the iterative methods, most of these numerical linear algebra tasks are implemented in the BLAS and LAPACK libraries. They form the building blocks of most statistical computing tasks (optimization, MCMC).
Our major goals (or learning objectives) are to

- know the complexity (flop count) of each task
- be familiar with the BLAS and LAPACK functions (what they do)
- not re-invent the wheel by implementing these dense linear algebra subroutines ourselves
- understand the need for iterative methods
- apply appropriate numerical linear algebra tools to various statistical problems
All high-level languages (R, Matlab, Julia) call BLAS and LAPACK for numerical linear algebra.
Julia offers more flexibility by exposing interfaces to many BLAS/LAPACK subroutines directly. See documentation.
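For instance, we can call a BLAS routine such as `axpy!` directly (a minimal sketch; `axpy!(a, x, y)` overwrites `y` with `a*x + y`):

```julia
using LinearAlgebra

x, y = rand(5), rand(5)
# dispatches to the BLAS daxpy routine for Float64 vectors
LinearAlgebra.BLAS.axpy!(2.0, x, y)
```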
BLAS
BLAS stands for Basic Linear Algebra Subprograms.
See netlib for a complete list of standardized BLAS functions.
Typical BLAS functions support single precision (S), double precision (D), complex (C), and double complex (Z).
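For example, `dgemm` is the double-precision general matrix-matrix multiply and `sgemm` its single-precision counterpart. Julia's wrappers pick the routine from the element type (a small sketch):

```julia
using LinearAlgebra

A32, B32 = rand(Float32, 3, 3), rand(Float32, 3, 3)
# Float32 inputs route to sgemm; Float64 inputs would route to dgemm
BLAS.gemm('N', 'N', 1.0f0, A32, B32)
```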
Examples
The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.
– James Gentle, Matrix Algebra, Springer, New York (2007).
Recall that level-1 BLAS subroutines perform vector operations (\(O(n)\) flops), level-2 subroutines matrix-vector operations (\(O(n^2)\) flops), and level-3 subroutines matrix-matrix operations (\(O(n^3)\) flops). Some operations appear to be level-3 but are in fact level-2.
Example 1
A common operation in statistics is column scaling or row scaling \[
\begin{eqnarray*}
\mathbf{A} &=& \mathbf{A} \mathbf{D} \quad \text{(column scaling)} \\
\mathbf{A} &=& \mathbf{D} \mathbf{A} \quad \text{(row scaling)},
\end{eqnarray*}
\] where \(\mathbf{D}\) is diagonal. For example, in generalized linear models (GLMs), the Fisher information matrix takes the form \[
\mathbf{X}^T \mathbf{W} \mathbf{X},
\] where \(\mathbf{W}\) is a diagonal matrix with observation weights on diagonal.
Column and row scalings are essentially level-2 operations!
```julia
using BenchmarkTools, LinearAlgebra, Random

Random.seed!(123) # seed
n = 2000
A = rand(n, n) # n-by-n matrix
d = rand(n)    # n vector
D = Diagonal(d) # diagonal matrix with d as diagonal
```
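Continuing this setup, multiplying by a materialized dense \(\mathbf{D}\) costs \(O(n^3)\) flops, whereas `Diagonal` dispatch or broadcasting scales columns in \(O(n^2)\) flops (a minimal sketch, not benchmarked here):

```julia
A * Matrix(D)     # slow: materializes D as a dense n-by-n matrix, O(n^3) flops
A * D             # fast: Diagonal dispatches to an O(n^2) column-scaling method
A .* transpose(d) # equivalent level-2 broadcast: entry (i, j) is scaled by d[j]
```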
Example 2

The matrix inner product \(\langle \mathbf{A}, \mathbf{B} \rangle = \text{tr}(\mathbf{A}^T \mathbf{B})\) appears to be a level-3 operation (matrix multiplication with \(O(m^2 n)\) or \(O(m n^2)\) flops).
```julia
Random.seed!(123)
n = 2000
A, B = randn(n, n), randn(n, n)

# slow way to evaluate this thing
@benchmark tr(transpose($A) * $B)
```
```
BenchmarkTools.Trial: 42 samples with 1 evaluation.
 Range (min … max): 104.649 ms … 145.683 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median):     118.936 ms ┊ GC (median): 0.00%
 Time (mean ± σ):   121.451 ms ± 9.916 ms ┊ GC (mean ± σ): 2.33% ± 3.18%

  ▃▃█ ▃▃ ▃ ▃ ▃
  ▇▁▁▁▇▁▁▇▇▇▁███▇██▁█▇▇▇▁▇▇▇▇▁█▁▇▁▇█▁▇▁▁▇▁▇▇▇▁▁▁▁▁▇▇▇▁▁▁▁▁▇▁▁▁▇ ▁
  105 ms          Histogram: frequency by time          146 ms <

 Memory estimate: 30.52 MiB, allocs estimate: 2.
```
But \(\langle \mathbf{A}, \mathbf{B} \rangle = \langle \text{vec}(\mathbf{A}), \text{vec}(\mathbf{B}) \rangle\). The latter is a level-1 operation (a dot product) with \(O(mn)\) flops.
```julia
@benchmark dot($A, $B)
```
```
BenchmarkTools.Trial: 3834 samples with 1 evaluation.
 Range (min … max): 1.215 ms … 10.489 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median):     1.248 ms ┊ GC (median): 0.00%
 Time (mean ± σ):   1.292 ms ± 374.393 μs ┊ GC (mean ± σ): 0.00% ± 0.00%

  ▅██▇▆▆▆▆▅▅▄▄▄▄▃▂▂▂▂ ▂
  ██████████████████████▇█▇▇▇▆▅▇▇▄▆▅▆▅▇▆▄▆▇▆███▆▆▆▄▄▄▆▄▆▆▄▄▃▅ █
  1.21 ms       Histogram: log(frequency) by time       1.57 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
Example 3
Similarly, \(\text{diag}(\mathbf{A}^T \mathbf{B})\), whose \(j\)-th diagonal entry is \(\sum_{i} \mathbf{A}_{ij} \mathbf{B}_{ij}\), can be calculated in \(O(mn)\) flops.
```julia
# slow way to evaluate this thing
@benchmark diag(transpose($A) * $B)
```
```
BenchmarkTools.Trial: 38 samples with 1 evaluation.
 Range (min … max): 105.460 ms … 176.225 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median):     131.428 ms ┊ GC (median): 0.00%
 Time (mean ± σ):   135.400 ms ± 20.099 ms ┊ GC (mean ± σ): 2.05% ± 2.95%

  █ ▄ ▁ ▁ ▁ ▁ ▁ ▁
  ▆▁▁▁▆█▆▁▆█▁▁▆█▆▁█▆▁▆▁▁▁▁▁▆▁▁▁▁▁▆▁▆▆█▁█▁▆█▆▁▁▆▁█▁▆▁▁▁▁▆▆▁▁▁▁▁▆ ▁
  105 ms          Histogram: frequency by time          176 ms <

 Memory estimate: 30.53 MiB, allocs estimate: 3.
```
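A smarter way forms the elementwise product and takes column sums. This costs \(O(mn)\) flops but still allocates one intermediate \(m \times n\) array (a sketch consistent with the benchmark output that follows):

```julia
# smarter way: O(mn) flops, but A .* B allocates an intermediate array
@benchmark Diagonal(vec(sum($A .* $B, dims = 1)))
```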
```
BenchmarkTools.Trial: 426 samples with 1 evaluation.
 Range (min … max): 7.351 ms … 27.917 ms ┊ GC (min … max): 0.00% … 30.06%
 Time (median):     8.809 ms ┊ GC (median): 0.00%
 Time (mean ± σ):   11.720 ms ± 4.593 ms ┊ GC (mean ± σ): 23.23% ± 21.90%

  ▂▅▆█▆▄▁ ▃▁ ▄▄▃▂
  ██████████▇▅▄▄▁▄▁▁▁▁▁▁▁▁▁▄▁▁▆███████▇█▇▅▅▁▄▁▁▄▁▁▁▁▅▁▁▁▁▁▄▄▄ ▇
  7.35 ms       Histogram: log(frequency) by time       24.9 ms <

 Memory estimate: 30.53 MiB, allocs estimate: 5.
```
To get rid of the intermediate array allocation altogether, we can write a double loop or apply the `dot` function to column views.
```julia
function diag_matmul(A, B)
    m, n = size(A)
    @assert size(B) == (m, n) "A and B should have same size"
    # @views converts every array-slicing operation in the expression to a view,
    # avoiding memory allocation for each column slice
    @views d = [dot(A[:, j], B[:, j]) for j in 1:n]
    # equivalent explicit double loop:
    # d = zeros(eltype(A), n)
    # for j in 1:n, i in 1:m
    #     d[j] += A[i, j] * B[i, j]
    # end
    Diagonal(d)
end

@benchmark diag_matmul($A, $B)
```
```
BenchmarkTools.Trial: 1972 samples with 1 evaluation.
 Range (min … max): 2.363 ms … 11.718 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median):     2.442 ms ┊ GC (median): 0.00%
 Time (mean ± σ):   2.526 ms ± 604.918 μs ┊ GC (mean ± σ): 0.00% ± 0.00%

  ▆██▇▆▄▃▂ ▁ ▁▁ ▁
  ███████████████▇▆▆▅▆▆▁▁▄▅▁▁▁▁▄▅▁▁▄▄▁▁▄▅▄▁▁▄▁▁▁▁▁▄▁▁▁▁▁▁▁▄▁▄ █
  2.36 ms       Histogram: log(frequency) by time       4.01 ms <

 Memory estimate: 15.75 KiB, allocs estimate: 1.
```
Consider the triple loop for computing \(\mathbf{C} = \mathbf{A} \mathbf{B}\) with \(\mathbf{A} \in \mathbb{R}^{m \times p}\), \(\mathbf{B} \in \mathbb{R}^{p \times n}\), and \(\mathbf{C} \in \mathbb{R}^{m \times n}\). With the loop order ijk, for example, the innermost loop is

```julia
# innermost loop
for k = 1:p
    C[i, j] = C[i, j] + A[i, k] * B[k, j]
end
```
We pay attention to the innermost loop, where the vector calculation occurs. The associated stride when accessing the three matrices in memory (assuming column-major storage) is
| Variant    | A Stride | B Stride | C Stride |
|:-----------|:---------|:---------|:---------|
| jki or kji | Unit     | 0        | Unit     |
| ikj or kij | 0        | Non-Unit | Non-Unit |
| ijk or jik | Non-Unit | Unit     | 0        |
Apparently the variants jki or kji are preferred: every innermost-loop memory access is either unit-stride or constant.
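The reason is memory layout: Julia (like Fortran, Matlab, and R) stores matrices in column-major order, so consecutive elements of a column are adjacent in memory (a quick check):

```julia
M = [1 3; 2 4]
vec(M)  # returns [1, 2, 3, 4]: each column's elements are contiguous in memory
```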
Let’s test.
""" matmul_by_loop!(A, B, C, order)Overwrite `C` by `A * B`. `order` indicates the looping order for triple loop."""functionmatmul_by_loop!(A::Matrix, B::Matrix, C::Matrix, order::String) m =size(A, 1) p =size(A, 2) n =size(B, 2)fill!(C, 0)if order =="jki"for j =1:n, k =1:p, i =1:m C[i, j] += A[i, k] * B[k, j]endendif order =="kji"for k =1:p, j =1:n, i =1:m C[i, j] += A[i, k] * B[k, j]endendif order =="ikj"for i =1:m, k =1:p, j =1:n C[i, j] += A[i, k] * B[k, j]endendif order =="kij"for k =1:p, i =1:m, j =1:n C[i, j] += A[i, k] * B[k, j]endendif order =="ijk"for i =1:m, j =1:n, k =1:p C[i, j] += A[i, k] * B[k, j]endendif order =="jik"for j =1:n, i =1:m, k =1:p C[i, j] += A[i, k] * B[k, j]endendendusingRandomRandom.seed!(123)m, p, n =2000, 100, 2000A =rand(m, p)B =rand(p, n)C =zeros(m, n);
```
BenchmarkTools.Trial: 19 samples with 1 evaluation.
 Range (min … max): 264.654 ms … 290.569 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median):     275.690 ms ┊ GC (median): 0.00%
 Time (mean ± σ):   276.962 ms ± 7.307 ms ┊ GC (mean ± σ): 0.00% ± 0.00%

  ▁ ▁ ▁ ▁▁█▁▁ ▁ ▁▁ ▁ ▁ █ ▁ ▁ ▁
  █▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁█████▁▁█▁▁▁▁▁▁██▁█▁█▁▁▁█▁▁▁▁▁▁▁▁█▁▁▁▁▁▁█▁▁█ ▁
  265 ms          Histogram: frequency by time          291 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
```julia
@benchmark matmul_by_loop!($A, $B, $C, "kji")
```
```
BenchmarkTools.Trial: 16 samples with 1 evaluation.
 Range (min … max): 315.432 ms … 340.858 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median):     324.886 ms ┊ GC (median): 0.00%
 Time (mean ± σ):   325.659 ms ± 5.261 ms ┊ GC (mean ± σ): 0.00% ± 0.00%

  ▁ ▁ ▁ ▁ █▁▁▁▁ ▁▁█ ▁ ▁
  █▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█▁█▁█████▁▁███▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  315 ms          Histogram: frequency by time          341 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
ikj and kij looping:
```julia
@benchmark matmul_by_loop!($A, $B, $C, "ikj")
```
```
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max): 2.321 s … 2.350 s ┊ GC (min … max): 0.00% … 0.00%
 Time (median):     2.342 s ┊ GC (median): 0.00%
 Time (mean ± σ):   2.338 s ± 14.481 ms ┊ GC (mean ± σ): 0.00% ± 0.00%

  █ █ █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.32 s          Histogram: frequency by time          2.35 s <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
```julia
@benchmark matmul_by_loop!($A, $B, $C, "kij")
```
```
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max): 2.369 s … 2.393 s ┊ GC (min … max): 0.00% … 0.00%
 Time (median):     2.372 s ┊ GC (median): 0.00%
 Time (mean ± σ):   2.378 s ± 13.230 ms ┊ GC (mean ± σ): 0.00% ± 0.00%

  █ █ █
  █▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.37 s          Histogram: frequency by time          2.39 s <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
ijk and jik looping:
```julia
@benchmark matmul_by_loop!($A, $B, $C, "ijk")
```
```
BenchmarkTools.Trial: 11 samples with 1 evaluation.
 Range (min … max): 459.863 ms … 488.570 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median):     468.752 ms ┊ GC (median): 0.00%
 Time (mean ± σ):   470.201 ms ± 9.044 ms ┊ GC (mean ± σ): 0.00% ± 0.00%

  ▁ ▁ █ ▁ ▁ ▁ ▁ ▁ ▁ ▁
  █▁█▁▁▁▁▁▁▁█▁█▁▁▁▁▁█▁█▁▁█▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁█ ▁
  460 ms          Histogram: frequency by time          489 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
```julia
@benchmark matmul_by_loop!($A, $B, $C, "jik")
```
```
BenchmarkTools.Trial: 11 samples with 1 evaluation.
 Range (min … max): 458.816 ms … 490.466 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median):     473.290 ms ┊ GC (median): 0.00%
 Time (mean ± σ):   474.217 ms ± 10.033 ms ┊ GC (mean ± σ): 0.00% ± 0.00%

  ▁ ▁ ▁ ▁ █ ▁ ▁ ▁▁ ▁
  █▁▁▁▁▁█▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█▁▁▁██▁▁▁▁▁▁▁▁▁▁▁█ ▁
  459 ms          Histogram: frequency by time          490 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
Julia wraps the BLAS library for matrix multiplication. We see that BLAS beats even the best jki and kji loopings by a large margin (multi-threading, optimized cache use, etc.).
```julia
@benchmark mul!($C, $A, $B)
```
```
BenchmarkTools.Trial: 746 samples with 1 evaluation.
 Range (min … max): 5.530 ms … 18.178 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median):     6.243 ms ┊ GC (median): 0.00%
 Time (mean ± σ):   6.685 ms ± 1.465 ms ┊ GC (mean ± σ): 0.00% ± 0.00%

  ▂▂▇██
  █████▇▇▆▆▇▆▆▅▄▃▃▃▃▃▃▁▂▃▂▃▂▁▂▁▂▁▁▁▁▂▂▁▁▁▁▁▁▁▁▂▁▂▁▁▁▁▁▁▁▂▁▁▂ ▃
  5.53 ms          Histogram: frequency by time          14.7 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
```julia
# direct call of BLAS wrapper function
@benchmark LinearAlgebra.BLAS.gemm!('N', 'N', 1.0, $A, $B, 0.0, $C)
```
```
BenchmarkTools.Trial: 795 samples with 1 evaluation.
 Range (min … max): 5.369 ms … 18.630 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median):     6.008 ms ┊ GC (median): 0.00%
 Time (mean ± σ):   6.275 ms ± 1.352 ms ┊ GC (mean ± σ): 0.00% ± 0.00%

  █▂
  ▇▅▇██▅▄▃▃▂▃▃▃▂▂▂▂▂▁▂▂▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂ ▂
  5.37 ms          Histogram: frequency by time          15.3 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
BLAS in R
Tip for R users: the standard R distribution from CRAN links against a very outdated reference BLAS/LAPACK library. Swapping in an optimized BLAS such as OpenBLAS or MKL can speed up linear algebra dramatically; a sketch for checking which library your R links against appears after the output below.
```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max): 75.228 μs … 6.008 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median):     93.243 μs ┊ GC (median): 0.00%
 Time (mean ± σ):   99.234 μs ± 85.271 μs ┊ GC (mean ± σ): 0.00% ± 0.00%

  ▃▄▆▆▇█▆▅▂▁
  ▁▁▂▂▃▅▆▇██████████▇▇▆▅▄▄▄▄▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁ ▃
  75.2 μs          Histogram: frequency by time          149 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
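As promised above, one way to check which BLAS/LAPACK a given R installation links against is to query `sessionInfo()` (a sketch via RCall; the `BLAS` and `LAPACK` fields require R ≥ 3.4):

```julia
using RCall

# print the paths of the BLAS and LAPACK shared libraries R was linked against
R"sessionInfo()[c('BLAS', 'LAPACK')]"
```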
Broadcasting in Julia achieves vectorized code without creating intermediate arrays.
Suppose we want to calculate the elementwise maximum of the absolute values of two large arrays. In R, the command

```r
max(abs(X), abs(Y))
```

creates two intermediate arrays before producing the result. (Strictly speaking, R's `max` reduces everything to a single scalar; the elementwise analogue is `pmax`. Either way, `abs(X)` and `abs(Y)` each allocate an intermediate array.)
```julia
using RCall

Random.seed!(123)
X, Y = rand(1000, 1000), rand(1000, 1000)

R"""
library(microbenchmark)
microbenchmark(max(abs($X), abs($Y)))
"""
```
```
RObject{VecSxp}
Unit: milliseconds
                            expr      min       lq    mean   median       uq      max neval
 max(abs(`#JL`$X), abs(`#JL`$Y)) 5.246855 5.861161 10.09725 7.250268 8.687335 91.96797   100
```
In Julia, dot operations are fused so no intermediate arrays are created.
```julia
# no intermediate arrays created, only the result array
@benchmark max.(abs.($X), abs.($Y))
```
```
BenchmarkTools.Trial: 2988 samples with 1 evaluation.
 Range (min … max): 769.503 μs … 17.001 ms ┊ GC (min … max): 0.00% … 88.54%
 Time (median):     1.190 ms ┊ GC (median): 0.00%
 Time (mean ± σ):   1.666 ms ± 1.394 ms ┊ GC (mean ± σ): 27.93% ± 24.88%

  ▄▂▁▂█▆▅▃▂▁ ▁▂▂▁▁ ▁
  ███████████▇▆▅▅▄▆▃▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▁▁▄▆██████▇▆▇▆▆▆▆ █
  770 μs       Histogram: log(frequency) by time       5.94 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2.
```
```julia
# Pre-allocating the result array gets rid of memory allocation altogether.
Z = zeros(size(X)) # zero matrix of the same size as X
@benchmark $Z .= max.(abs.($X), abs.($Y)) # .= (vs =) is important!
```
```
BenchmarkTools.Trial: 6270 samples with 1 evaluation.
 Range (min … max): 696.965 μs … 7.058 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median):     748.761 μs ┊ GC (median): 0.00%
 Time (mean ± σ):   789.471 μs ± 239.083 μs ┊ GC (mean ± σ): 0.00% ± 0.00%

  █▅▃▁ ▂
  ████████▆▅▅▅▄▅▄▄▄▄▃▃▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂ ▃
  697 μs          Histogram: frequency by time          1.19 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
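The same fused, allocation-free update can be written more compactly with the `@.` macro, which adds a dot to every call and to the assignment (a small sketch):

```julia
using Random

Random.seed!(123)
X, Y = rand(1000, 1000), rand(1000, 1000)
Z = zeros(size(X))
# equivalent to Z .= max.(abs.(X), abs.(Y)): one fused loop, zero allocations
@. Z = max(abs(X), abs(Y))
```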