First bit (“most significant bit” or MSB) is the sign bit.
0 for nonnegative numbers
1 for negative numbers
Two’s complement representation for negative numbers
Set the sign bit to 1
Negate (0->1, 1->0) the remaining bits
Add 1 to the result
Two’s complement representation of a negative integer \(-x\) (with \(x > 0\)) has the same bit pattern as the unsigned integer \(2^M - x\), where \(M\) is the number of bits.
```julia
@show typeof(5)
@show bitstring(5)
@show bitstring(-5)
@show bitstring(UInt64(Int128(2)^64 - 5)) == bitstring(-5)
@show bitstring(2 * 5)   # shift bits of 5 to the left
@show bitstring(2 * -5); # shift bits of -5 to the left
```
Two’s complement representation respects modular arithmetic nicely: addition of any two signed integers is just binary addition of their bit patterns, modulo \(2^M\).
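For instance, a minimal sketch with 8-bit integers (\(M = 8\)), where adding 1 to 127 wraps around to \(-128\):

```julia
# Sketch of two's complement modular arithmetic with 8-bit integers (M = 8).
@show bitstring(Int8(127))            # "01111111"
@show bitstring(Int8(-128))           # "10000000"
@show Int8(127) + Int8(1)             # wraps around to -128 (addition modulo 2^8)
@show bitstring(Int8(127) + Int8(1)); # same bit pattern as -128
```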
In the scientific notation, a real number is represented as \[
\pm d_1.d_2d_3 \cdots d_p \times b^e, \quad 0 \le d_i < b.
\] Humans use the base \(b=10\) and digits \(d_i = 0, 1, \dotsc, 9\).
In computers, the base is \(b=2\) and the digits \(d_i\) are 0 or 1. The exponent \(e\) lies between \(e_{\min}\) and \(e_{\max}\).
Normalized vs denormalized numbers. For example, the decimal number 18 is \[ +1.0010 \times 2^4 \quad (\text{normalized})\] or, equivalently, \[ +0.1001 \times 2^5 \quad (\text{denormalized}).\]
In the floating-point number system, the computer stores
sign bit
the fraction (or mantissa, or significand) of the normalized representation (except some special cases)
the actual exponent plus a bias
Because \(e_{\min}\) is likely negative and \(e_{\max}\) is positive, the exponent requires a sign. IEEE 754 handles the sign of the exponent by subtracting the bias from the value of the exponent evaluated as an unsigned integer.
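As a sketch, the bias can be read off a Float64 bit pattern directly; the layout assumed below is the IEEE 754 double precision format (1 sign bit, 11 exponent bits with bias 1023, 52 fraction bits):

```julia
# Decode the sign, exponent, and fraction fields of a Float64 by hand.
x = 18.0                          # 18 = +1.0010_2 x 2^4 (normalized)
bits = bitstring(x)
sign_bit  = bits[1]               # '0' for nonnegative
expo_bits = bits[2:12]            # 11-bit exponent field, stored with bias 1023
frac_bits = bits[13:end]          # 52-bit fraction (mantissa)
expo = parse(Int, expo_bits; base = 2) - 1023  # subtract the bias to recover e
println((sign_bit, expo, frac_bits[1:8]))      # ('0', 4, "00100000")
```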
Julia provides Float16 (half precision, implemented in software), Float32 (single precision), Float64 (double precision), and BigFloat (arbitrary precision).
Exponent \(e_{\min}-1\) with a nonzero mantissa is reserved for numbers smaller than \(b^{e_{\min}}\) (subnormal numbers).
Numbers are denormalized in the range \((0,b^{e_{\min}})\) – gradual underflow.
For example, in half-precision, \(e_{\min}=-14\) but \(2^{-24}\) is represented by \(0.0000000001_2 \times 2^{-14}\).
If floating point numbers are all normalized, the spacing between 0 and \(b^{e_{\min}}\) is just \(b^{e_{\min}}\), whereas the spacing between \(b^{e_{\min}}\) and \(b^{e_{\min}+1}\) is \(b^{e_{\min}-p+1}\). With subnormal numbers, the spacing between 0 and \(b^{e_{\min}}\) can be \(b^{e_{\min}-p+1}\), which is consistent with the spacing just above \(b^{e_{\min}}\).
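A quick sketch of gradual underflow in half precision, using Julia's floatmin, nextfloat, and issubnormal:

```julia
# Subnormal (denormalized) numbers in Float16: p = 11, e_min = -14.
@show floatmin(Float16)                     # smallest positive normalized number, 2^-14
@show nextfloat(Float16(0.0))               # smallest positive subnormal number, 2^-24
@show issubnormal(nextfloat(Float16(0.0)))  # true
@show nextfloat(Float16(0.0)) / 2;          # 2^-25 rounds to 0.0: below even the subnormal range
```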
Rounding is necessary whenever a number has more than \(p\) significand bits. Most computer systems use the default IEEE 754 round to nearest mode (also called ties to even mode). Julia offers several rounding modes, the default being RoundNearest.
“Round to nearest, ties to even” rule: rounds to the nearest value; if the number falls midway, it is rounded to the nearest value with an even least significant digit (i.e., zero; default for IEEE 754 binary floating point numbers)
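A small illustration of the ties-to-even rule with Julia's default round (decimal halves are used only for readability):

```julia
# RoundNearest (the default): ties go to the neighbor with an even last digit.
@show round(0.5)   # 0.0, not 1.0
@show round(1.5)   # 2.0
@show round(2.5)   # 2.0, not 3.0
@show round(2.6);  # 3.0: not a tie, so simply round to nearest
```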
For example, the number 1/10 cannot be represented accurately as a (binary) floating point number: \[ 0.1 = 1.10011001\dotsc_2 \times 2^{-4} \]
```julia
@show bitstring(0.1);          # double precision Float64
@show bitstring(0.1f0);        # single precision Float32, 1001 gets rounded to 101(0)
@show bitstring(Float16(0.1)); # half precision Float16, 1001 gets rounded to 101(0)
```
Rounding (more fundamentally, finite precision) incurs errors in floating point computation. If a real number \(x\) is represented by a floating point number \([x]\), then the relative error \(|x - [x]|/|x|\) is bounded by the machine epsilon \(\epsilon_{\max}\), as shown below.
There are the same number of representable numbers in \((2^i, 2^{i+1}]\) as in \((2^{i+1}, 2^{i+2}]\). Within each such interval, they are uniformly spaced.
Machine epsilons are the spacings of numbers around 1:
\(\epsilon_{\max}\) = (smallest positive floating point number that added to 1 will give a result different from 1) = \(\frac{1}{2^p} + \frac{1}{2^{2p-1}}\)
\(\epsilon_{\min}\) = (smallest positive floating point number that subtracted from 1 will give a result different from 1) = \(\frac{1}{2^{p+1}} + \frac{1}{2^{2p}}\).
That is, \(\epsilon_{\max}\) is just large enough that \(1 + \epsilon_{\max}\) lies past the midpoint between 1 and the floating point number right next to it, and \(\epsilon_{\min}\) plays the analogous role below 1.
Source: Computational Statistics, James Gentle, Springer, New York, 2009.
Caution: the definitions of \(\epsilon_{\max}\) and \(\epsilon_{\min}\) in this book differ from those in this lecture note.
Any real number in the interval \(\left[1 - \frac{1}{2^{p+1}}, 1 + \frac{1}{2^p}\right]=[1.111\dotsb1|1_2 \times 2^{-1}, 1.000\dotsb 0|1_2 \times 2^0]\) is represented by the floating point number \(1 = 1.00\dotsc 0_2 \times 2^0\) (assuming the “ties to even” rule: consider the \(p=2\) case).
Adding \(\frac{1}{2^p}\) to 1 won’t reach the next representable floating point number \(1.00\dotsc 01_2 \times 2^0 = 1 + \frac{1}{2^{p-1}}\). Hence \(\epsilon_{\max} > \frac{1}{2^p} = 1.00\dotsc 0_2 \times 2^{-p}\).
Adding the floating point number next to \(\frac{1}{2^p} = 1.00\dotsc 0_2 \times 2^{-p}\) to 1 WILL result in \(1.00\dotsc 01_2 \times 2^0 = 1 + \frac{1}{2^{p-1}}\), hence \(\epsilon_{\max} = 1.00\dotsb 01_2 \times 2^{-p} = \frac{1}{2^p} + \frac{1}{2^{2p-1}}\).
Subtracting \(\frac{1}{2^{p+1}}\) from 1 results in \(1-\frac{1}{2^{p+1}} = \frac{1}{2} + \frac{1}{2^2} + \dotsb + \frac{1}{2^p} + \frac{1}{2^{p+1}}\), which is represented by the floating point number \(1.00\dotsb 0_2 \times 2^{0} = 1\) by the “ties to even” rule. Hence \(\epsilon_{\min} > \frac{1}{2^{p+1}}\).
The smallest positive floating point number larger than \(\frac{1}{2^{p+1}}\) is \(\frac{1}{2^{p+1}} + \frac{1}{2^{2p}}=1.00\dotsc 1_2 \times 2^{-p-1}\). Thus \(\epsilon_{\min} = \frac{1}{2^{p+1}} + \frac{1}{2^{2p}}\).
Machine precision
Machine epsilon is often called the machine precision.
If a positive real number \(x \in \mathbb{R}\) is represented by \([x]\) in the floating point arithmetic, then \[
[x] = \left( 1 + \sum_{i=1}^{p-1}\frac{d_{i+1}}{2^i}\right) \times 2^e.
\]
Thus \(x - \frac{2^e}{2^p} \le [x] \le x + \frac{2^e}{2^p}\), and \[
\begin{split}
\frac{| x - [x] |}{|x|} &\le \frac{2^e}{2^p|x|} \le \frac{2^e}{2^p}\frac{1}{[x]-2^e/2^p} \\
&\le \frac{2^e}{2^p}\frac{1}{2^e(1-1/2^p)} \quad (\because [x] \ge 2^e) \\
&\le \frac{2^e}{2^p}\frac{1}{2^e}(1 + \frac{1}{2^{p-1}}) \\
&= \frac{1}{2^p} + \frac{1}{2^{2p-1}} = \epsilon_{\max}.
\end{split}
\] That is, the relative error of the floating point representation \([x]\) of real number \(x\) is bounded by \(\epsilon_{\max}\).
```julia
@show 2^(-53) + 2^(-105);            # epsilon_max for Float64
@show 1.0 + 2^(-53);
@show 1.0 + (2^(-53) + 2^(-105));
@show 1.0 + 2^(-53) + 2^(-105);      # why this result? See "Catastrophic cancellation"
@show Float32(2^(-24) + 2^(-47));    # epsilon_max for Float32
@show 1.0f0 + Float32(2^(-24));
@show 1.0f0 + Float32(2^(-24) + 2^(-47));
```
In Julia, eps(x) gives the distance between consecutive representable floating point values at x, i.e.,
```julia
eps(x) == max(x - prevfloat(x), nextfloat(x) - x)
```
which is roughly twice the \(\epsilon_{\max}\).
```julia
@show eps(Float32)  # machine epsilon for a floating point type, roughly twice our epsilon_max
@show eps(Float64)  # same as eps()
# eps(x) is the spacing after x
@show eps(100.0)
@show eps(0.0)
# nextfloat(x) and prevfloat(x) give the neighbors of x
@show x = 1.25f0
@show prevfloat(x), x, nextfloat(x)
@show bitstring(prevfloat(x)), bitstring(x), bitstring(nextfloat(x));
```
In R, the variable .Machine contains numerical characteristics of the machine. double.eps and double.neg.eps are roughly twice our \(\epsilon_{\max}\) and \(\epsilon_{\min}\), respectively.
For double precision, the range is \(\pm 10^{\pm 308}\). In most situations, underflow (magnitude of result is less than \(10^{-308}\)) is preferred over overflow (magnitude of result is larger than \(10^{+308}\)). Overflow produces \(\pm\infty\). Underflow yields zeros or subnormal numbers.
Example: the logit link function is \[p = \frac{\exp (x^T \beta)}{1 + \exp (x^T \beta)} = \frac{1}{1+\exp(- x^T \beta)}.\] The former expression can easily lead to Inf / Inf = NaN, while the latter expression leads to gradual underflow.
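A minimal sketch contrasting the two expressions; the value of the linear predictor \(\eta = x^T \beta = 800\) below is just an illustrative choice:

```julia
# Two mathematically equivalent ways to evaluate the inverse logit at a large η.
η = 800.0
p1 = exp(η) / (1 + exp(η))  # exp(800) overflows to Inf, so Inf / Inf = NaN
p2 = 1 / (1 + exp(-η))      # exp(-800) underflows gradually to 0, so p2 = 1.0
@show p1, p2;
```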
The floatmin and floatmax functions give the smallest and largest positive non-subnormal numbers representable by a given floating point type.
```julia
for T in [Float16, Float32, Float64]
    println(T, '\t', floatmin(T), '\t', floatmax(T), '\t', typemin(T), '\t', typemax(T), '\t', eps(T))
end
```
The result of a computation is rounded to the digits that can be represented.
Scenario 1 (benign cancellation): Addition or subtraction of two numbers of widely different magnitudes: \(a+b\) or \(a-b\) where \(a \gg b\) or \(a \ll b\). We lose the precision in the number of smaller magnitude. Consider \[
a = x.xxx ... \times 2^{30}, \quad
b = y.yyy... \times 2^{-30}
\] What happens when computer calculates \(a+b\)? We get \(a+b=a\)!
```julia
@show a = 2.0^30
@show b = 2.0^-30
@show a + b == a
@show bitstring(a)
@show bitstring(a + b);
```
```
a = 2.0 ^ 30 = 1.073741824e9
b = 2.0 ^ -30 = 9.313225746154785e-10
a + b == a = true
bitstring(a) = "0100000111010000000000000000000000000000000000000000000000000000"
bitstring(a + b) = "0100000111010000000000000000000000000000000000000000000000000000"
```
Analysis: suppose we want to compute \(x + y\), \(x, y > 0\). Let the relative error of \(x\) and \(y\) be \[
\delta_x = \frac{[x] - x}{x},
\quad
\delta_y = \frac{[y] - y}{y}
.
\] What the computer actually calculates is \([x] + [y]\), which in turn is represented by \([ [x] + [y] ]\). The relative error of this representation is \[
\delta_{\text{sum}} = \frac{[[x]+[y]] - ([x]+[y])}{[x]+[y]}
.
\] Recall that \(|\delta_x|, |\delta_y|, |\delta_{\text{sum}}| \le \epsilon_{\max}\).
We want to find a bound of the relative error of \([[x]+[y]]\) with respect to \(x+y\). Since \(|[x]+[y]| = |x(1+\delta_x) + y(1+\delta_y)| \le |x+y|(1+\epsilon_{\max})\), we have \[
\small
\begin{split}
| [[x]+[y]]-(x+y) | &= | [[x]+[y]] - [x] - [y] + [x] - x + [y] - y | \\
&\le | [[x]+[y]] - [x] - [y] | + | [x] - x | + | [y] - y | \\
&= |\delta_{\text{sum}}([x]+[y])| + |\delta_x x| + |\delta_y y| \\
&\le \epsilon_{\max}(x+y)(1+\epsilon_{\max}) + \epsilon_{\max}x + \epsilon_{\max}y \\
&\approx 2\epsilon_{\max}(x+y)
\end{split}
\] because \(\epsilon_{\max}^2 \approx 0\).
Scenario 2 (catastrophic cancellation): Subtraction of two nearly equal numbers eliminates significant digits. \(a-b\) where \(a \approx b\). Consider \[
a = x.xxxxxxxxxx1ssss \quad
b = x.xxxxxxxxxx0tttt
\] The result is \(1.vvvvu\dots u \times 2^{-11}\) (relative to the common exponent of \(a\) and \(b\)), where the trailing \(u\)'s are unassigned digits: no information is available to fill them.
```julia
a = 1.2345678f0         # rounding
@show bitstring(a)      # rounding
b = 1.2345677f0
@show bitstring(b)
@show a - b             # correct result should be 1f-7
@show bitstring(a - b)  # must be 1.0000...0 x 2^(-23)
@show Float32(1 / 2^23);
```
```
bitstring(a) = "00111111100111100000011001010001"
bitstring(b) = "00111111100111100000011001010000"
a - b = 1.1920929f-7
bitstring(a - b) = "00110100000000000000000000000000"
Float32(1 / 2 ^ 23) = 1.1920929f-7
```
The true difference \(x-y\) may lie anywhere in \((0, \frac{1}{2^{p-2}}]\).
Relative error \[
\frac{|x-y -[[x]-[y]]|}{|x-y|}
\] is unbounded – no guarantee of any significant digit!
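For the Float32 example above, a rough check of this relative error, treating the double precision difference as (nearly) exact:

```julia
# Relative error of the computed Float32 difference against the double precision value.
a, b = 1.2345678f0, 1.2345677f0
exact = 1.2345678 - 1.2345677            # ≈ 1e-7, computed in Float64
@show a - b                              # 1.1920929f-7
@show (Float64(a - b) - exact) / exact;  # ≈ 0.19, i.e. about 19% relative error
```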
Implications for numerical computation
Rule 1: add small numbers together before adding larger ones
Rule 2: add numbers of like magnitude together (pairing). When all numbers are of the same sign and similar magnitude, add them in pairs so that at each stage the summands are of similar magnitude (see the sketch after this list)
Rule 3: avoid subtraction of two numbers that are nearly equal
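A minimal sketch of Rules 1 and 2 in single precision; the magnitudes below are chosen only for illustration, and Julia's built-in sum uses pairwise summation, which is an instance of Rule 2:

```julia
# Order of summation matters: each tiny term added directly to the large running
# total is rounded away, while summing the small terms first preserves them.
small = fill(1.0f-8, 10_000_000)    # 10^7 terms, each below eps(1.0f0) / 2

s1 = foldl(+, small; init = 1.0f0)  # start from the large value: every term is lost
s2 = 1.0f0 + sum(small)             # small terms summed (pairwise) first

@show s1, s2;                       # s1 == 1.0f0, s2 ≈ 1.1f0
```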
Example: in solving the quadratic equation \(ax^2 + bx + c = 0\) with \(b > 0\) and \(b^2 - 4ac > 0\), the roots are \[
x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
.
\] If \(b \approx \sqrt{b^2 - 4ac}\), then Scenario 2 occurs. A resolution is to compute only one root \(x_1 = \frac{-b - \sqrt{b^2 - 4ac}}{2a}\) by the formula, and find the other root \(x_2\) using the relation \(x_1x_2 = c/a\).
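A sketch of this resolution, using a hypothetical helper roots_stable and an illustrative case where \(b \approx \sqrt{b^2 - 4ac}\):

```julia
# Stable quadratic roots: compute the root without cancellation, then use x1 * x2 = c / a.
function roots_stable(a, b, c)      # assumes b > 0 and b^2 - 4ac > 0
    x1 = (-b - sqrt(b^2 - 4a * c)) / (2a)  # -b and -sqrt(...) have the same sign: no cancellation
    x2 = c / (a * x1)                      # from the relation x1 * x2 = c / a
    return x1, x2
end

a, b, c = 1.0, 1e8, 1.0                        # here b ≈ sqrt(b^2 - 4ac)
naive_x2 = (-b + sqrt(b^2 - 4a * c)) / (2a)    # catastrophic cancellation
@show roots_stable(a, b, c)                    # second root ≈ -1.0e-8
@show naive_x2;                                # noticeably off from -1.0e-8
```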
Algebraic laws
Floating-point numbers may violate many algebraic laws we are familiar with, such as the associative and distributive laws. See the example in the Machine Epsilon section and HW1.
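For instance, a minimal sketch of the failure of associativity in floating point addition:

```julia
# (a + b) + c and a + (b + c) need not agree in floating point arithmetic.
a, b, c = 1.0, 2.0^-53, 2.0^-53
@show (a + b) + c   # 1.0: b and c are each absorbed in turn (ties to even)
@show a + (b + c);  # 1.0000000000000002: b + c = 2^-52 is large enough to register
```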
In other words, an input perturbation of order \(10^{-3}\) yields an output perturbation of order \(10^0\). That's 3 orders of magnitude of relative change!
The floating point representation \([x]\) of a real number \(x\) can easily introduce such an input perturbation. The sensitivity of the output of a problem to perturbations in its input is called conditioning.
Abstractly, a problem can be viewed as a function \(f: \mathcal{X} \to \mathcal{Y}\), where \(\mathcal{X}\) is a normed vector space of data and \(\mathcal{Y}\) is a normed vector space of solutions.
The problem of solving \(Ax=b\) for fixed \(b\) is \(f: A \mapsto A^{-1}b\) with \(\mathcal{X}=\{M\in\mathbb{R}^{n\times n}: M \text{ is invertible} \}\) and \(\mathcal{Y} = \mathbb{R}^n\).
The combination of a problem \(f\) with given data \(x\) is called a problem instance, or simply a problem if no confusion arises.
A well-conditioned problem (instance) is one such that all small perturbations of \(x\) lead to only small changes in \(f(x)\).
An ill-conditioned problem is one such that some small perturbation of \(x\) leads to a large change in \(f(x)\).
The (relative) condition number \(\kappa=\kappa(\theta)\) of a problem is defined by \[
\kappa = \lim_{\delta\to 0}\sup_{\|\delta \theta\|\le \delta}\frac{\|\delta f\|/\|f(\theta)\|}{\|\delta \theta\|/\|\theta\|}
,
\] where \(\delta f = f(\theta + \delta \theta) - f(\theta)\).
For the problem of solving \(Ax=b\) for fixed \(b\), \(f: A \mapsto A^{-1}b\), it can be shown that the condition number of \(f\) is \[
\kappa = \|A\|\|A^{-1}\| =: \kappa(A)
,
\] where \(\|A\|\) is the matrix norm induced by vector norm \(\|\cdot\|\), i.e., \[
\|A\| = \sup_{x\neq 0} \frac{\|Ax\|}{\|x\|}.
\]
To see this, recall \(f(A) = A^{-1}b = x\). Since \(b = Ax\) and the LHS is fixed, if a perturbation of the input \(A \to A + \delta A\) yields a perturbation of the output \(x \to x + \delta x\) or \(f \to f + \delta f\), then we have \[
0 = (\delta A) x + A \delta x + (\delta A) \delta x
\] or \[
\delta x = -A^{-1}(\delta A)x + o(\Vert \delta A \Vert)
.
\] That is, \(\delta f = f(A + \delta A) - f(A) = - A^{-1}(\delta A)A^{-1}b + o(\Vert \delta A \Vert)\).
Then \[
\begin{aligned}
\kappa &= \lim_{\delta\to 0}\sup_{\|\delta A\|\le \delta}\frac{(\Vert A^{-1}(\delta A)A^{-1}b\Vert + o(\Vert \delta A \Vert)) / \Vert A^{-1}b \Vert}{\Vert \delta A \Vert / \Vert A \Vert}
\\
&=
\lim_{\delta\to 0}\sup_{\|\delta A\|\le \delta} \frac{\Vert A^{-1}(\delta A)A^{-1}b\Vert / \Vert A^{-1}b \Vert}{\Vert \delta A \Vert / \Vert A \Vert} + o(1)
\\
&=
\lim_{\delta\to 0}\sup_{\|\delta A\|\le \delta} \frac{\Vert A^{-1}(\delta A)A^{-1}b\Vert / \Vert A^{-1}b \Vert}{\Vert \delta A \Vert / \Vert A \Vert}
.
\end{aligned}
\]
From the property of a matrix norm \(\|AB\| \leq \|A\|\|B\|\), \[
\frac{\Vert A^{-1}(\delta A)A^{-1}b\Vert / \Vert A^{-1}b \Vert}{\Vert \delta A \Vert / \Vert A \Vert}
\le
\frac{\Vert A^{-1} \Vert \Vert \delta A \Vert \Vert A^{-1}b\Vert / \Vert A^{-1}b \Vert}{\Vert \delta A \Vert / \Vert A \Vert}
% = \Vert A^{-1} \Vert \Vert A \Vert
,
\] hence \(\kappa \le \Vert A^{-1} \Vert \Vert A \Vert\).
It can be shown that for any invertible \(A\) and \(b\), there exists a perturbation \(\delta A\) such that \[
\frac{\Vert A^{-1}(\delta A)A^{-1}b\Vert / \Vert A^{-1}b \Vert}{\Vert \delta A \Vert / \Vert A \Vert}
= \Vert A^{-1} \Vert \Vert A \Vert
.
\]
It follows that \(\kappa \ge \Vert A^{-1} \Vert \Vert A \Vert\).
If the Euclidean norm is used, then \[
\kappa(A) = \sigma_{\max}(A)/\sigma_{\min}(A),
\] the ratio of the maximum and minimum singular values of \(A\).
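A quick numerical sketch of this identity, using a 4×4 Hilbert matrix chosen only as an example of an ill-conditioned \(A\):

```julia
using LinearAlgebra

A = [1 / (i + j - 1) for i in 1:4, j in 1:4]   # 4x4 Hilbert matrix, ill-conditioned
σ = svdvals(A)
@show cond(A)                                  # 2-norm condition number
@show maximum(σ) / minimum(σ)                  # same ratio of singular values
@show cond(A) ≈ opnorm(A) * opnorm(inv(A));    # ‖A‖ ‖A⁻¹‖ in the induced 2-norm
```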
In the above problem, the condition number of the matrix \(A\) (w.r.t. the Euclidean norm) is