Numerical Algorithms

Advanced Statistical Computing

Joong-Ho Won

Seoul National University

October 2023

versioninfo()
Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 8 × Intel(R) Core(TM) i5-8279U CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 2 on 8 virtual cores
using Pkg
Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()
Status `~/Dropbox/class/M1399.000200/2023/M1300_000200-2023fall/lectures/04-algo/Project.toml`
  [6e4b80f9] BenchmarkTools v1.3.2

Algorithms

  • An algorithm is loosely defined as a set of instructions for doing something that terminates in finite time. An algorithm requires input and output.

  • Donald Knuth's five properties of an algorithm: (1) finiteness, (2) definiteness, (3) input, (4) output, (5) effectiveness.

Measure of efficiency

  • A basic unit for measuring algorithmic efficiency is flop.
    > A flop (floating point operation) consists of a floating point addition, subtraction, multiplication, division, or comparison, usually accompanying fetch and store.

Some books count a multiplication followed by an addition (fused multiply-add, FMA) as one flop. This results in a factor of up to two difference in flop counts.
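
As an aside, Julia exposes fused multiply-add through the built-in fma function, which evaluates x*y + z with a single rounding; a minimal illustration (the particular numbers are chosen only to make the effect visible):

a, b, c = 1.0 + 2.0^-27, 1.0 - 2.0^-27, -1.0
fma(a, b, c)   # ≈ -5.55e-17: the exact value of a*b + c, rounded once
a * b + c      # 0.0: the intermediate product a*b rounds up to 1.0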

  • How to measure efficiency of an algorithm? Big O notation. If \(n\) is the size of a problem, an algorithm has order \(O(f(n))\), where the leading term in the number of flops is \(c \cdot f(n)\). For example,
    • matrix-vector multiplication A * b, where A is \(m \times n\) and b is \(n \times 1\), takes \(2mn\) or \(O(mn)\) flops
    • matrix-matrix multiplication A * B, where A is \(m \times n\) and B is \(n \times p\), takes \(2mnp\) or \(O(mnp)\) flops
  • A hierarchy of computational complexity:
    Let \(n\) be the problem size.
    • Exponential order: \(O(b^n)\) (“horrible”)
    • Polynomial order: \(O(n^q)\) (doable)
    • \(O(n \log n )\) (fast)
    • Linear order \(O(n)\) (fast)
    • Logarithmic order \(O(\log n)\) (super fast)
  • Classification of data sets by Huber (1994).
Data Size    Bytes                      Storage Mode
Tiny         \(10^2\)                   Piece of paper
Small        \(10^4\)                   A few pieces of paper
Medium       \(10^6\) (megabytes)       A floppy disk
Large        \(10^8\)                   Hard disk
Huge         \(10^9\) (gigabytes)       Hard disk(s)
Massive      \(10^{12}\) (terabytes)    RAID storage
  • Difference of \(O(n^2)\) and \(O(n\log n)\) on massive data. Suppose we have a teraflops supercomputer capable of doing \(10^{12}\) flops per second. For a problem of size \(n=10^{12}\), an \(O(n \log n)\) algorithm takes about \[10^{12} \log (10^{12}) / 10^{12} \approx 27 \text{ seconds}\] (with the natural logarithm), whereas an \(O(n^2)\) algorithm takes about \(10^{24}/10^{12} = 10^{12}\) seconds, which is approximately 31,710 years! (A quick numerical check of these figures follows the next bullet.)
  • QuickSort and the Fast Fourier Transform (the latter popularized by James Cooley and John Tukey) are celebrated algorithms that turn \(O(n^2)\) operations into \(O(n \log n)\). Another example is Strassen’s method for matrix multiplication, which turns \(O(n^3)\) matrix multiplication into \(O(n^{\log_2 7})\).
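
A back-of-the-envelope check of these figures (a sketch assuming the hypothetical teraflops machine above; log denotes the natural logarithm):

# time estimates on a machine doing 1e12 flops per second
rate = 1e12                                # flops per second
n_massive = 1e12                           # problem size
n_massive * log(n_massive) / rate          # ≈ 27.6 seconds for the O(n log n) algorithm
(n_massive^2 / rate) / (365 * 24 * 3600)   # ≈ 31,710 years for the O(n^2) algorithm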

  • One goal of this course is to get familiar with the flop counts for some common numerical tasks in statistics.
    > The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.
    – James Gentle, Matrix Algebra, Springer, New York (2007).

  • For example, compare flops of the two mathematically equivalent expressions: (A * B) * x and A * (B * x) where A and B are matrices and x is a vector.
using BenchmarkTools, Random

Random.seed!(123) # seed
n = 1000
A = randn(n, n)
B = randn(n, n)
x = randn(n)

# flops: about 2n^3 (matrix-matrix) + 2n^2 (matrix-vector) = O(n^3)
@benchmark ($A * $B) * $x
BenchmarkTools.Trial: 264 samples with 1 evaluation.
 Range (min … max):  14.006 ms … 27.009 ms   GC (min … max): 0.00% … 14.02%
 Time  (median):     19.228 ms               GC (median):    0.00%
 Time  (mean ± σ):   18.943 ms ±  2.760 ms   GC (mean ± σ):  3.12% ±  6.24%
        ▃ █▂▆ ▂          ▂▂ ▄▄▄▄ ▃ ▄  ▂   ▂                   
  ▃▄▅▆▄██▇██████▇▅▃▅▅▆▇▃▇██████▅█▆█▇▆█▇▃▆█▄▄▆▇▃▄▃▃▅▄▁▄▁▁▁▁▃ ▄
  14 ms           Histogram: frequency by time        25.7 ms <
 Memory estimate: 7.64 MiB, allocs estimate: 3.
# flops: about 2n^2 + 2n^2 (two matrix-vector products) = O(n^2)
@benchmark $A * ($B * $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  282.345 μs … 4.624 ms   GC (min … max): 0.00% … 91.03%
 Time  (median):     336.286 μs               GC (median):    0.00%
 Time  (mean ± σ):   355.533 μs ± 77.099 μs   GC (mean ± σ):  0.12% ±  0.91%
   ▆█▂                                                          
  ▃███▆▇▆▅▄▄▃▃▄▃▃▃▃▃▃▃▃▃▃▃▃▃▃▄▄▄▄▄▄▄▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  282 μs          Histogram: frequency by time          515 μs <
 Memory estimate: 15.88 KiB, allocs estimate: 2.

Why is there such a difference?
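
One way to quantify the gap is to translate the timings into achieved flop rates (a rough sketch, reusing A, B, x, and n defined above and the flop counts from the comments):

t1 = @belapsed ($A * $B) * $x    # dominated by the 2n^3-flop matrix-matrix product
t2 = @belapsed $A * ($B * $x)    # two matrix-vector products, about 4n^2 flops
(2n^3 + 2n^2) / t1 / 1e9         # achieved GFLOPS for (A * B) * x
4n^2 / t2 / 1e9                  # achieved GFLOPS for A * (B * x)

The flop counts differ by a factor of about \(n/2 = 500\), while the observed times differ by a smaller factor, because matrix-matrix multiplication (BLAS level 3) attains a much higher flop rate through cache reuse than matrix-vector multiplication (BLAS level 2).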

Performance of computer systems

  • FLOPS (floating point operations per second) is the standard measure of a computer's floating point performance.
  • For example, my laptop has the Intel i5-8279U (Coffee Lake) CPU with 4 cores running at 2.4 GHz (cycles per second).
    • Intel Coffee Lake CPUs can do 16 double-precision flops per cycle and 32 single-precision flops per cycle. Then the theoretical throughput of my laptop is \[ 4 \times 2.4 \times 10^9 \times 16 = 153.6 \text{ GFLOPS} \] in double precision and \[4 \times 2.4 \times 10^9 \times 32 = 307.2 \text{ GFLOPS} \] in single precision.
  • In Julia, LinearAlgebra.peakflops() computes the peak flop rate of the computer by using gemm! (double precision matrix-matrix multiplication).
using LinearAlgebra

LinearAlgebra.peakflops(2^14) # matrix size 2^14
1.0529135405714091e11

which is about 105 GFLOPS in double precision.
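
To make the comparison explicit, here is a minimal sketch computing the theoretical peak from the figures assumed above (4 physical cores, 2.4 GHz, 16 double-precision flops per cycle):

ncores = 4                  # physical cores (from the CPU spec above)
clock_hz = 2.4e9            # clock rate: cycles per second
flops_per_cycle = 16        # double-precision flops per cycle on Coffee Lake
peak_dp = ncores * clock_hz * flops_per_cycle    # 1.536e11, i.e. 153.6 GFLOPS
LinearAlgebra.peakflops(2^14) / peak_dp          # fraction of theoretical peak attained

On the machine above this ratio is roughly 0.7, i.e., gemm! attains about 70% of the theoretical peak.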

Stability of numerical algorithms

  • Recall that abstractly, a problem can be viewed as a function \(f: \mathcal{X} \to \mathcal{Y}\) where \(\mathcal{X}\) is a normed vector space of data and \(\mathcal{Y}\) is a normed vector space of solutions.

  • For given data \(x \in \mathcal{X}\), the true solution of the problem \(f\) is \(y = f(x) \in \mathcal{Y}\).

    • The problem of solving \(Ax=b\) for fixed \(b\) is \(f: A \mapsto A^{-1}b\) with \(\mathcal{X}=\{M\in\mathbb{R}^{n\times n}: M \text{ is invertible} \}\) and \(\mathcal{Y} = \mathbb{R}^n\).
  • An algorithm can be viewed as another map \(\tilde{f}: \mathcal{X} \to \mathcal{Y}\).

  • For given data \(x \in \mathcal{X}\), the solution computed by algorithm \(\tilde{f}\) is \(\hat{y} = \tilde{f}(x) \in \mathcal{Y}\).

    • Example 1: solve \(Ax=b\) by Gaussian elimination with partial pivoting (GEPP) followed by forward and backward substitutions on a digital computer.
    • Example 2: solve \(Ax=b\) by Gauss-Seidel (an iterative method to come) on a digital computer.
    • In both cases, the solutions (in \(\mathcal{Y}\)) are not the same as \(A^{-1}b\)!
    • At a minimum, algorithms are affected by rounding errors.
  • It is not necessarily true that \(\hat{y} = y\), or \(\tilde{f}(x) = f(x)\). The forward error of a computed solution is the relative error \[ {\Vert \tilde{f}(x) - f(x) \Vert}/{\Vert f(x) \Vert} . \]

  • Algorithm \(\tilde{f}\) is said to be (forward) stable if for each \(x \in \mathcal{X}\) there exists \(\tilde{x} \in \mathcal{X}\) with \[ \frac{\|\tilde{x} - x\|}{\|x\|} = O(\epsilon) \] such that \[ \frac{\|\tilde{f}(x) - f(\tilde{x})\|}{\|f(\tilde{x})\|} = O(\epsilon) , \quad \text{as}~ \epsilon \to 0 . \] In words, a stable algorithm gives “nearly the right” answer to a “slightly wrong” question.

  • Backward stability: algorithm \(\tilde{f}\) is said to be backward stable if for each \(x \in \mathcal{X}\) there exists \(\tilde{x} \in \mathcal{X}\) with \[ \frac{\|\tilde{x} - x\|}{\|x\|} = O(\epsilon) \] such that \[ \tilde{f}(x) = f(\tilde{x}) , \quad \text{as}~ \epsilon \to 0 . \] In words, a backward stable algorithm gives “exactly the right” answer to a “slightly wrong” question.
    • Backward stability implies stability, but not the other way around.
  • If a backward stable algorithm \(\tilde{f}\) is applied to solve a problem \(f\), the forward error of \(\tilde{f}\) is bounded by the condition number of the problem \(f\); a small numerical illustration appears after this list.

  • To see this, recall the definition of the condition number \[ \kappa = \lim_{\delta\to 0}\sup_{\|\tilde{x} - x\|\le \delta \Vert x \Vert}\frac{\|f(\tilde{x}) - f(x)\|/\|f(x)\|}{\|\tilde{x} - x\|/\|x\|} . \] Thus for \(\tilde{x} \in \mathcal{X}\) such that \(\frac{\Vert\tilde{x} - x\Vert}{\Vert x \Vert} = O(\epsilon)\) and \(\tilde{f}(x) = f(\tilde{x})\) as \(\epsilon \to 0\), we have \[ \frac{\|\tilde{f}(x) - f(x)\|}{\|f(x)\|} \le ( \kappa + o(1) )\frac{\|\tilde{x} - x\|}{\|x\|} = O(\kappa \epsilon) . \]

  • Examples
    • Computing the inner product \(x^Ty\) of vectors \(x\) and \(y\) by \(\left[\sum_{i=1}^n [[x_i][y_i]]\right]\), where \([\cdot]\) denotes rounding to the nearest floating point number (IEEE 754), is backward stable.
    • Computing the outer product \(A=xy^T\) of vectors \(x\) and \(y\) by \(A_{ij}=[[x_i][y_j]]\) (IEEE 754) is stable but not backward stable.
  • (Backward) Stability is a property of an algorithm, whereas conditioning is a property of a problem.
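
To illustrate the last two points numerically, here is a minimal sketch (the Hilbert matrix is used only as a convenient ill-conditioned example; the variable names xtrue, xhat are illustrative): the backslash solve A \ b in Julia uses GEPP, a backward stable algorithm, so its forward error should stay within roughly \(\kappa(A)\,\epsilon\).

using LinearAlgebra, Random

Random.seed!(123)
n = 10
A = [1 / (i + j - 1) for i in 1:n, j in 1:n]  # Hilbert matrix: notoriously ill-conditioned
xtrue = randn(n)
b = A * xtrue
xhat = A \ b                                  # backward stable solve (GEPP)

forward_error = norm(xhat - xtrue) / norm(xtrue)
bound = cond(A) * eps(Float64)                # rule-of-thumb bound κ(A)⋅ϵ
(forward_error, bound)                        # forward error stays within O(κ(A)⋅ϵ)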

Reading

What is Numerical Stability? by Nick Higham