A basic unit for measuring algorithmic efficiency is the flop.
> A flop (floating point operation) consists of a floating point addition, subtraction, multiplication, division, or comparison, and the usually accompanying fetch and store.
Some books count a multiplication followed by an addition (fused multiply-add, FMA) as one flop. This results in a difference of up to a factor of 2 in flop counts.
How do we measure the efficiency of an algorithm? With big O notation: if \(n\) is the size of a problem, an algorithm has order \(O(f(n))\) when the leading term in its flop count is \(c \cdot f(n)\) for some constant \(c\). For example,
- matrix-vector multiplication A * b, where A is \(m \times n\) and b is \(n \times 1\), takes \(2mn\), i.e., \(O(mn)\), flops;
- matrix-matrix multiplication A * B, where A is \(m \times n\) and B is \(n \times p\), takes \(2mnp\), i.e., \(O(mnp)\), flops.
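As a rough sanity check of these counts, one can time a matrix-vector multiplication and convert the elapsed time into an empirical flop rate. The sketch below uses arbitrary sizes for illustration; absolute timings depend on the machine.

```julia
# Rough empirical check of the 2mn flop count for matrix-vector multiplication.
# Matrix sizes are arbitrary choices; absolute timings depend on the machine.
using BenchmarkTools, Random

Random.seed!(123)
m, n = 2_000, 2_000
A = randn(m, n)
b = randn(n)

t = @belapsed $A * $b    # seconds per multiplication (best over many samples)
2 * m * n / t / 1e9      # empirical GFLOPS implied by the 2mn flop count
```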
A hierarchy of computational complexity:
Let \(n\) be the problem size.
The difference between \(O(n^2)\) and \(O(n\log n)\) matters on massive data. Suppose we have a teraflops supercomputer capable of doing \(10^{12}\) flops per second. For a problem of size \(n=10^{12}\), an \(O(n \log n)\) algorithm takes about \[10^{12} \log (10^{12}) / 10^{12} \approx 27 \text{ seconds},\] while an \(O(n^2)\) algorithm takes about \(10^{24}/10^{12} = 10^{12}\) seconds, which is approximately 31,710 years!
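The arithmetic behind these figures can be reproduced directly (here \(\log\) is the natural logarithm):

```julia
# Back-of-the-envelope arithmetic for the teraflops example above.
n, rate = 1e12, 1e12               # problem size and flops per second
n * log(n) / rate                  # ≈ 27.6 seconds for the O(n log n) algorithm
n^2 / rate / (365 * 24 * 3600)     # ≈ 31,710 years for the O(n^2) algorithm
```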
QuickSort and the fast Fourier transform (co-invented by John Tukey) are celebrated algorithms that turn \(O(n^2)\) operations into \(O(n \log n)\). Another example is Strassen’s method, which turns \(O(n^3)\) matrix multiplication into \(O(n^{\log_2 7}) \approx O(n^{2.807})\).
One goal of this course is to get familiar with the flop counts for some common numerical tasks in statistics.
> The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.
– James Gentle, Matrix Algebra, Springer, New York (2007).
For example, compare the flop counts of the two mathematically equivalent expressions (A * B) * x and A * (B * x), where A and B are matrices and x is a vector.
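A minimal benchmarking sketch for this comparison is below; the matrix dimensions are arbitrary choices for illustration. One of the resulting trial reports is shown after the code.

```julia
# Benchmark two mathematically equivalent evaluation orders.
# Matrix dimensions are arbitrary choices for illustration.
using BenchmarkTools, Random

Random.seed!(280)
n = 1_000
A, B, x = randn(n, n), randn(n, n), randn(n)

@benchmark ($A * $B) * $x    # forms the n-by-n product A * B first: O(n^3) flops
@benchmark $A * ($B * $x)    # two matrix-vector products: O(n^2) flops
```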
```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  282.345 μs … 4.624 ms   ┊ GC (min … max): 0.00% … 91.03%
 Time  (median):     336.286 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   355.533 μs ± 77.099 μs  ┊ GC (mean ± σ):  0.12% ± 0.91%

 Memory estimate: 15.88 KiB, allocs estimate: 2.
```
Why the difference? With A \(m \times n\), B \(n \times p\), and x \(p \times 1\), evaluating (A * B) * x takes \(2mnp + 2mp\) flops, whereas A * (B * x) takes only \(2np + 2mn\) flops.
Performance of computer systems
FLOPS (floating point operations per second) measures the performance of a computer.
For example, my laptop has an Intel i5-8279U (Coffee Lake) CPU with 4 cores running at 2.4 GHz (cycles per second).
Intel Coffee Lake CPUs can do 16 double-precision flops per cycle and 32 single-precision flops per cycle. Then the theoretical throughput of my laptop is \[
4 \times 2.4 \times 10^9 \times 16 = 153.6 \text{ GFLOPS}
\] in double precision and \[4 \times 2.4 \times 10^9 \times 32 = 307.2 \text{ GFLOPS}
\] in single precision.
In Julia, LinearAlgebra.peakflops() computes the peak flop rate of the computer by using gemm! (double precision matrix-matrix multiplication).
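A quick measurement might look like the following sketch; the matrix size 4096 is an arbitrary choice, and the result varies from run to run and machine to machine.

```julia
# Empirical peak flop rate via double-precision matrix-matrix multiplication.
# The problem size 4096 is an arbitrary choice.
using LinearAlgebra

peak = LinearAlgebra.peakflops(4_096)   # flops per second
peak / 1e9                              # in GFLOPS, to compare with the theoretical value
```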
Recall that abstractly, a problem can be viewed as a function \(f: \mathcal{X} \to \mathcal{Y}\), where \(\mathcal{X}\) is a normed vector space of data and \(\mathcal{Y}\) is a normed vector space of solutions.
For given data \(x \in \mathcal{X}\), the true solution of the problem \(f\) is \(y = f(x) \in \mathcal{Y}\).
The problem of solving \(Ax=b\) for fixed \(b\) is \(f: A \mapsto A^{-1}b\) with \(\mathcal{X}=\{M\in\mathbb{R}^{n\times n}: M \text{ is invertible} \}\) and \(\mathcal{Y} = \mathbb{R}^n\).
An algorithm can be viewed as another map \(\tilde{f}: \mathcal{X} \to \mathcal{Y}\).
For given data \(x \in \mathcal{X}\), the solution computed by algorithm \(\tilde{f}\) is \(\hat{y} = \tilde{f}(x) \in \mathcal{Y}\).
Example 1: solve \(Ax=b\) by GEPP (Gaussian elimination with partial pivoting) followed by forward and backward substitutions on a digital computer.
Example 2: solve \(Ax=b\) by Gauss-Seidel (an iterative method to come) on a digital computer.
In both cases, the solutions (in \(\mathcal{Y}\)) are not the same as \(A^{-1}b\)!
Any algorithm run on a digital computer is affected at least by rounding errors.
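A small illustration of this point (dimensions and seed below are arbitrary choices): two mathematically equivalent ways of computing the solution agree only up to rounding error.

```julia
# Two mathematically equivalent ways of solving Ax = b agree only up to rounding.
# Dimensions and seed are arbitrary choices for illustration.
using LinearAlgebra, Random

Random.seed!(280)
n = 100
A, b = randn(n, n), randn(n)

x1 = A \ b                   # GEPP followed by triangular solves
x2 = inv(A) * b              # explicit inverse (wasteful; for comparison only)
norm(x1 - x2) / norm(x1)     # small but nonzero relative difference
```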
It is not necessarily true that \(\hat{y} = y\), or \(\tilde{f}(x) = f(x)\). The forward error of a computed solution is the relative error \[
\frac{\Vert \tilde{f}(x) - f(x) \Vert}{\Vert f(x) \Vert}
.
\]
Algorithm \(\tilde{f}\) is said to be (forward) stable if, for every \(x \in \mathcal{X}\), there exists \(\tilde{x} \in \mathcal{X}\) with \[
\frac{\|\tilde{x} - x\|}{\|x\|} = O(\epsilon)
\] such that \[
\frac{\|\tilde{f}(x) - f(\tilde{x})\|}{\|f(\tilde{x})\|} = O(\epsilon)
,
\quad
\text{as}~ \epsilon \to 0
.
\] In words, a stable algorithm gives “nearly the right” answer to a “slightly wrong” question.
Backward stability: algorithm \(\tilde{f}\) is said to be backward stable if, for every \(x \in \mathcal{X}\), there exists \(\tilde{x} \in \mathcal{X}\) with \[
\frac{\|\tilde{x} - x\|}{\|x\|} = O(\epsilon)
\] such that \[
\tilde{f}(x) = f(\tilde{x})
,
\quad
\text{as}~ \epsilon \to 0
.
\] In words, a backward stable algorithm gives “exactly the right” answer to a “slightly wrong” question.
Backward stability implies stability, but not the other way around.
If a backward stable algorithm \(\tilde{f}\) is applied to solve a problem \(f\), the forward error of \(\tilde{f}\) is bounded by the condition number of problem \(f\).
To see this, recall the definition of the condition number \[
\kappa = \lim_{\delta\to 0}\sup_{\|\tilde{x} - x\|\le \delta \Vert x \Vert}\frac{\|f(\tilde{x}) - f(x)\|/\|f(x)\|}{\|\tilde{x} - x\|/\|x\|}
.
\] Thus for \(\tilde{x} \in \mathcal{X}\) such that \(\frac{\Vert\tilde{x} - x\Vert}{\Vert x \Vert} = O(\epsilon)\) and \(\tilde{f}(x) = f(\tilde{x})\) as \(\epsilon \to 0\), we have \[
\frac{\|\tilde{f}(x) - f(x)\|}{\|f(x)\|} \le ( \kappa + o(1) )\frac{\|\tilde{x} - x\|}{\|x\|}
= O(\kappa \epsilon)
.
\]
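As a numerical illustration of this bound (the Hilbert matrix below is just a convenient ill-conditioned example), the forward error of a backward stable solve stays within a modest multiple of \(\kappa \epsilon\):

```julia
# Forward error of a backward stable solve versus the bound κ(A)·ϵ.
# The Hilbert matrix is used only as a convenient ill-conditioned example.
using LinearAlgebra

n = 10
A = [1 / (i + j - 1) for i in 1:n, j in 1:n]   # Hilbert matrix
xtrue = ones(n)
b = A * xtrue

xhat = A \ b                                   # GEPP-based solve
norm(xhat - xtrue) / norm(xtrue)               # forward error
cond(A) * eps(Float64)                         # κ(A)·ϵ, the bound up to a constant
```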
Examples
Computing the inner product \(x^Ty\) of vectors \(x\) and \(y\) by \([\sum_{i=1}^n [[x_i][y_i]]]\) (in IEEE 754 arithmetic) is backward stable.
Computing the outer product \(A=xy^T\) of vectors \(x\) and \(y\) by \(A_{ij}=[[x_i][y_j]]\) (in IEEE 754 arithmetic) is stable but not backward stable: the computed matrix is contaminated entrywise by rounding and in general is no longer exactly rank one, so it cannot be written as \(\tilde{x}\tilde{y}^T\) for any perturbed inputs.
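To see the flavor of the backward stability argument for the inner product, consider a single term, writing \([\cdot]\) for rounding to the nearest floating point number; the general case additionally folds the rounding errors of the running sum into the perturbations. \[
[[x_1][y_1]] = \big(x_1(1+\delta_1)\big)\big(y_1(1+\delta_2)\big)(1+\delta_3)
= \underbrace{x_1(1+\delta_1)(1+\delta_3)}_{\tilde{x}_1}\,\underbrace{y_1(1+\delta_2)}_{\tilde{y}_1},
\qquad |\delta_k| \le \epsilon,
\] so the computed value is the exact inner product of inputs perturbed by relative amounts \(O(\epsilon)\).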
(Backward) stability is a property of an algorithm, whereas conditioning is a property of a problem.