Tutorial 02 - Error, accuracy, stability

Floating point representation of numbers, roundoff error, truncation error, numerical stability and condition number.

import numpy as np
import matplotlib.pyplot as plt

Main sources of numerical errors:

  • rounding and cancellation (due to the use of finite-precision arithmetic)

  • truncation or approximation errors (due to approximation of infinite sequences or continuous functions by a finite number of samples)

Basic definitions

Absolute and relative error:

If \( \tilde{x} \) is an approximation for \( x \), then:

  • absolute error \( A(x) = | \tilde{x} - x | \)

  • relative error \( R(x) = | \tilde{x} - x | \, / \, | x | \)

Decimal precision:

Given a relative error \( R \), the decimal precision \( p \) is the largest integer such that \( R \leq 5 \times 10^{-p} \)
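As a quick illustration (a small sketch, not one of the exercises), the decimal precision can be computed directly from a given relative error:

# decimal precision p: largest integer with R <= 5 * 10**(-p)
R = 3.0e-4                             # example relative error
p = int(np.floor(np.log10(5.0 / R)))  # largest p satisfying R <= 5e-p
print(p)                               # 4, since 3e-4 <= 5e-4 but 3e-4 > 5e-5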

Big-O notation:

The error term in an approximation to a mathematical function can be described by the big-O notation:

\[ f(x) = O(g(x)) \quad \text{as} \quad x \rightarrow a \]

if and only if

\[ |f(x)| \leq M |g(x)| \quad \text{whenever} \quad 0 < |x - a| < \delta, \quad \text{where} \quad M, \delta > 0. \]
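For example, \( \sin{x} = x + O(x^3) \) as \( x \rightarrow 0 \), since \( |\sin{x} - x| \leq \frac{1}{6} |x|^3 \) near zero.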

Roundoff error

Representation of real numbers in computers

The most widely used representation of real numbers in computers is the floating-point representation. A floating-point representation has a base \( \beta \), an exponent \( E \) and a precision \( p \). In general, a floating-point number is represented as

\[ f = \pm \, d_1.d_2d_3 \dots d_p \times \beta^E, \]

where \( d_1.d_2d_3 \dots d_p \) is called the significand (also the mantissa).

Properties of floating-point systems:

  • Smallest positive number (underflow if below)

  • Largest number (overflow if above)

  • Machine epsilon, \( \varepsilon \), defined as the difference between 1 and the next larger floating point number (upper bound on the relative error due to rounding in floating point arithmetic)

  • Special values: zero (0), infinities (+Inf, -Inf), not a number (NaN)
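These properties can be inspected in NumPy; a small sketch for double precision (np.finfo is used again below for the individual formats):

eps = np.finfo(np.float64).eps   # machine epsilon for double precision
print(eps)                       # ~2.22e-16
print(1.0 + eps > 1.0)           # True: 1 + eps is the next representable number after 1
print(1.0 + eps / 2 == 1.0)      # True: smaller increments are rounded away
print(np.finfo(np.float64).tiny, np.finfo(np.float64).max)  # underflow and overflow thresholds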

Exercise 02.1: Consider a decimal system of floating-point numbers defined as \( f = \pm \, d_1.d_2 \cdot \beta^E, \) where \( \beta = 10 \) and \( E \in \{-2, -1, 0 \} \).

  1. How many numbers does such a system represent?

  2. What is the smallest positive number and the largest number?

  3. What is the machine epsilon?

  4. What is the distribution of numbers on the real line?

# add your code here

Exercise 02.2: Consider a binary system of floating-point numbers defined as \( f = \pm \, d_1.d_2 \cdot \beta^E, \) where \( \beta = 2 \) and \( E \in \{-1, 0, 1 \} \).

  1. How many numbers does such a system represent?

  2. What is the smallest positive number and the largest number?

  3. What is the machine epsilon?

  4. What is the distribution of numbers on the real line?

# add your code here

Note that unlike the real number system (which is continuous), a floating-point system always has gaps between numbers. If a number is not exactly representable, it must be approximated by one of the nearest representable values. Furthermore, the distribution of numbers on the real line is not uniform.
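The gaps can be inspected directly with np.spacing, which returns the distance from a given double to the next representable one (a small illustration, not one of the exercises):

print(np.spacing(1.0))     # equals machine epsilon: gap between 1 and the next double
print(np.spacing(1.0e6))   # gaps grow with magnitude
print(np.spacing(1.0e-6))  # and shrink towards zero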

Implementation of floating-point systems

Over the years, a variety of floating-point representations have been used in computers. In 1985, the IEEE 754 standard for floating-point arithmetic was established. Since then, most implementations of floating-point systems in computers have conformed to the rules defined by IEEE.

Half precision (IEEE 754-2008):

Total storage allocated is 16 bits (10 bits for mantissa, 5 bits for exponent, 1 bit for sign)

print(np.finfo(np.float16))

Single precision:

Total storage allocated is 32 bits (23 bits for mantissa, 8 bits for exponent, 1 bit for sign)

print(np.finfo(np.float32))

Double precision:

Total storage allocated is 64 bits (52 bits for mantissa, 11 bits for exponent, 1 bit for sign)

print(np.finfo(np.float64))

The IEEE 754 standard also defines algorithms for addition, subtraction, multiplication, division and square root, and requires that implementations produce the same result as those algorithms. Thus, when a program is moved from one machine to another, the results of the basic operations will be the same in every bit if both machines support the IEEE 754 standard.

Example: Using the double precision floating-point format, check whether \( 0.1 + 0.2 = 0.3 \).

print((0.1 + 0.2) == 0.3)

The result above is not a Python bug. Representing infinitely many real numbers by a finite number of bits requires an approximate representation. Given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be represented exactly using that many bits. None of the numbers 0.1, 0.2, and 0.3 (in fact, most decimal fractions) has an exact representation as a binary floating-point number, no matter how many base-2 digits you are willing to use.

Note: Decimal numbers could be represented exactly by a base-10 floating-point system. However, base-10 implementations are rare because base-2 (binary) arithmetic is much faster on computers.
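Python's standard decimal module provides exactly such base-10 arithmetic; a short illustration (not one of the exercises):

from decimal import Decimal

# base-10 arithmetic represents decimal fractions exactly
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True
print(Decimal("0.1") + Decimal("0.2"))                    # 0.3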

By default, Python displays a rounded value (to keep the number of digits manageable):

print(0.1)

but the actual stored value is the nearest representable binary fraction:

print(format(0.1, ".55f"))

Exercise 02.3: Using the double precision floating-point format, calculate \( x = 0.1 + 0.2 - 0.3 \). Afterwards, perform the operation \( x := x + x \) a total of \( 100 \) times and display the value of \( x \).

# add your code here

Exercise 02.4: Consider the following function,

\[ f(x) = \frac{1 - \cos{x}}{x^2}. \]

The behaviour of this function as \( x \) approaches zero can be determined by evaluating the limit

\[ \lim_{x \to 0} f(x) = \frac{1}{2}. \]

Can the same result be found using the floating-point representation?

# add your code here

Exercise 02.5: Calculate the sum of the following series,

\[ 0.9^0, \ 0.9^1, \ \dots, \ 0.9^n, \]

for \( n = 400 \) in the forward and backward directions. Compare the results.

# add your code here

Note: When summing a series of numbers using floating-point systems, always sum the smaller numbers first. To improve accuracy when summing numbers, consider the Kahan summation algorithm.
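A minimal sketch of the Kahan (compensated) summation algorithm in plain Python, for illustration only (the helper name kahan_sum and the test data are chosen here, not prescribed by the tutorial):

import math

def kahan_sum(values):
    total = 0.0
    c = 0.0                   # running compensation for lost low-order bits
    for v in values:
        y = v - c             # apply the correction from the previous step
        t = total + y         # low-order bits of y may be lost in this addition
        c = (t - total) - y   # recover the lost part
        total = t
    return total

terms = [0.1] * 1_000_000
exact = math.fsum(terms)      # correctly rounded reference sum
naive = 0.0
for t in terms:
    naive += t                # plain left-to-right accumulation
print(abs(naive - exact), abs(kahan_sum(terms) - exact))  # the compensated sum is typically far closer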

Summary

  • Floating point arithmetic is commutative, but it is not associative and not necessarily distributive:

valid operations:

\[\begin{split} \begin{align} & 1 \cdot x = x, \\ & x \cdot y = y \cdot x, \\ & x + x = 2 \cdot x \end{align} \end{split}\]

not necessarily valid operations:

\[\begin{split} \begin{align} & x \cdot x^{-1} = 1, \\ & (1 + x) - 1 = x, \\ & (x + y) + z = x + (y + z) \end{align} \end{split}\]
  • Errors propagate between calculations. A small error in the input may result in a large error in the output.

Which operations should I avoid in order to minimize the roundoff error?

  • subtractions of numbers that are nearly equal

  • additions and subtractions of numbers that differ greatly in magnitude
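Both effects are easy to demonstrate in double precision (a small illustration, not one of the exercises):

x = 1.0e-12
print((1.0 + x) - 1.0)        # cancellation: only a few significant digits of x survive
print(1.0e16 + 1.0 - 1.0e16)  # 0.0: the 1.0 is absorbed by the much larger number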

Representation of integers in computers

Integer arithmetic is exact (there is no roundoff error), but some care may be necessary when working with large numbers, because fixed-width integer types can overflow.

16 bit unsigned

print(np.iinfo(np.uint16))

16 bit signed

print(np.iinfo(np.int16))

32 bit signed

print(np.iinfo(np.int32))
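Unlike Python's built-in integers (which have arbitrary precision), fixed-width NumPy integers wrap around when a result does not fit; a small illustration (the exact behaviour, e.g. whether a warning is emitted, may vary between NumPy versions):

a = np.array([32767], dtype=np.int16)   # largest value representable as int16
print(a + 1)                             # [-32768]: silent wrap-around (overflow)
b = np.array([60000], dtype=np.uint16)
print(b + 10000)                         # [4464]: the result is taken modulo 2**16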

Truncation error

The discrepancy between the true answer and the answer obtained by a numerical method, regardless of the roundoff error. Truncation error would persist even if we had an infinitely accurate representation of numbers.

Exercise 02.6: Construct a Taylor series of the function

\[ f(x) = e^x \]

using the first 3 terms.

  1. Plot the Taylor series in \( x \in [-1, 1] \) and compare the approximation with \( f(x) \).

  2. Calculate and plot the absolute and relative errors of the approximation. Find the maximum values.

# add your code here

Combination of errors

In general, the roundoff errors and the truncation errors may combine.

Exercise 02.7: Calculate the derivative of the function

\[ f(x) = \sin{x} \]

in \( x \in [-2 \pi, 2 \pi] \) using finite differences with forward and central schemes, respectively, i.e.,

\[ f_{f}^{\prime}(x) = \frac{f(x + h) - f(x)}{h}, \qquad f_{c}^{\prime}(x) = \frac{f(x + h) - f(x - h)}{2 h}, \]

where \( h \) is a finite step.

  1. Compare the approximations with \( f^{\prime}(x) = \cos{x}.\)

  2. Calculate the relative error of the approximations above for \( x = 1 \) and \( h = 2^{-n} \), where \( n \in \{ 1, \dots, 60 \} \). Find the optimal value of \( h \) for each approximation.

# add your code here

The relative error drops faster for the central scheme, because it is second-order accurate, i.e.,

\[ f(x + h) = f(x) + h f^{\prime}(x) + \frac{h^2}{2} f^{\prime\prime}(x) + \mathcal{O}(h^3), \]
\[ f(x - h) = f(x) - h f^{\prime}(x) + \frac{h^2}{2} f^{\prime\prime}(x) + \mathcal{O}(h^3), \]
\[ \to f^{\prime}(x) = \frac{f(x + h) - f(x - h)}{2 h} + \mathcal{O}(h^2), \]

while the forward scheme is first-order accurate, i.e.,

\[ f(x + h) = f(x) + h f^{\prime}(x) + \mathcal{O}(h^2) \ \to \ f^{\prime}(x) = \frac{f(x + h) - f(x)}{h} + \mathcal{O}(h). \]
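The predicted orders can be checked numerically; a quick sketch using \( f(x) = e^x \) at \( x = 0 \) (exact derivative equal to 1), a test case chosen here purely for illustration:

for h in [1.0e-1, 1.0e-2, 1.0e-3]:
    fwd = (np.exp(h) - np.exp(0.0)) / h          # forward difference
    cen = (np.exp(h) - np.exp(-h)) / (2.0 * h)   # central difference
    print(h, abs(fwd - 1.0), abs(cen - 1.0))
# the forward error shrinks ~10x per decade in h, the central error ~100x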

Numerical stability

A numerical method may magnify the roundoff error introduced into the computation at an early stage. Such a method is called unstable.

Exercise 02.8: Solve the 1st order ordinary differential equation

\[ y^{\prime}(x) + y(x) = 0, \quad y(0) = 1 \]

in \( x \in [0, 10] \) using forward and central differencing schemes with \( h = 0.1 \). Compare both results with the analytical solution \( y(x) = \exp(-x) \).

# add your code here

Note: Even though the central differencing scheme has a higher order of accuracy than the forward scheme, it is not always better. The suitability has to be examined for each problem separately.

Exercise 02.9: Using the half-precision floating-point format, calculate the first 20 integer powers of the number called “golden mean”,

\[ \phi = \frac{\sqrt{5} - 1}{2}, \]

using the following recursive algorithm (which follows from \( \phi^2 = 1 - \phi \)),

\[ \phi^{n+1} = \phi^{n-1} - \phi^{n}. \]

Compare the calculated values with the ones obtained by the simple multiplication.

# add your code here

Condition number

The condition number measures how much the output value of a function can change for a small change in the input argument (i.e., its sensitivity to input error). Given the problem \( f \) and the input \( x \), the condition number \( C_p \) is defined as

\[ C_p = \frac{\| \delta f(x) \| \, / \, \| f(x) \|}{\| \delta x \| \, / \, \| x \|}. \]
  • A problem with \( C_p \approx 1 \) is said to be well-conditioned

  • A problem with \( C_p > 100 \) is said to be ill-conditioned

  • A problem with \( C_p > \varepsilon^{-1} \) is not solvable within the specified precision
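For a linear system \( A x = b \), the condition number of the matrix \( A \) can be computed with np.linalg.cond; a small illustration with a nearly singular matrix (chosen here for illustration, not the matrix from the exercise below):

A = np.array([[1.0, 2.0],
              [2.0, 4.001]])   # rows are almost linearly dependent
print(np.linalg.cond(A))       # large value -> the system is ill-conditioned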

Exercise 02.10: Find the condition number of the following system of equations,

\[ x + \alpha y = 1, \]
\[ \alpha x + y = 0, \]

depending on the parameter \( \alpha \in [-2, 2] \). Plot the result and decide for which \( \alpha \) the system is ill-conditioned.

# add your code here