Numerical methods - arato.inf.unideb.hu

Numerical methods

Literature

- Stoyan Gisbert, Agnes Baran, Elementary Numerical Mathematics for Programmers and Engineers, Birkhäuser, 2016

- William H. Press, Brian P. Flannery, Saul A. Teukolsky, William T. Vetterling, Numerical Recipes, Cambridge Univ. Press, Cambridge, New York, 1986, 1987, 1988, 1989, ... (computer programs are available in FORTRAN, Pascal and C)

e-mail: [email protected]

web: https://arato.inf.unideb.hu/noszaly.csaba/numen

consulting hours: later

Topics

- Floating point numbers

- Norms, condition numbers

- Systems of linear equations

- Least square approximations

- Interpolations

- Eigenvalue problems

- Nonlinear equations, systems of nonlinear equations

- Approximation of integrals

Floating point numbers

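Example.

$a = 10$:

$$0.3721 = \frac{3}{10} + \frac{7}{10^2} + \frac{2}{10^3} + \frac{1}{10^4}$$

$$21.65 = 0.2165 \cdot 10^2 = \left(\frac{2}{10} + \frac{1}{10^2} + \frac{6}{10^3} + \frac{5}{10^4}\right) \cdot 10^2$$

$a = 2$ (the digits on the left are binary):

$$0.1101 = \frac{1}{2} + \frac{1}{2^2} + \frac{0}{2^3} + \frac{1}{2^4}$$

$$0.001011 = 0.1011 \cdot 2^{-2} = \left(\frac{1}{2} + \frac{0}{2^2} + \frac{1}{2^3} + \frac{1}{2^4}\right) \cdot 2^{-2}$$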


The form of the nonzero floating point numbers is

$$\pm a^k\left(\frac{m_1}{a} + \frac{m_2}{a^2} + \cdots + \frac{m_t}{a^t}\right),$$

where

$a > 1$ is the base (or radix) of the floating point representation (an integer),

$\pm$ is the sign,

$t > 1$ is the length of the mantissa of the representation,

$k$ is the exponent (an integer), $k \in [k_-, k_+]$, where the lower limit $k_- < 0$ and the upper limit $k_+ > 0$ are given by the representation.
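The expansions in the example above can be checked with exact rational arithmetic. The following is a minimal sketch; the helper name `fl_value` and its argument order are assumptions made for illustration, not part of the course material.

```python
from fractions import Fraction

def fl_value(sign, k, digits, a):
    """Exact value of sign * a**k * (m1/a + m2/a**2 + ... + mt/a**t)."""
    mantissa = sum(Fraction(m, a**i) for i, m in enumerate(digits, start=1))
    return sign * Fraction(a) ** k * mantissa

print(fl_value(+1, 0, [3, 7, 2, 1], 10))   # 3721/10000, i.e. 0.3721
print(fl_value(+1, 2, [2, 1, 6, 5], 10))   # 433/20, i.e. 21.65
print(fl_value(+1, 0, [1, 1, 0, 1], 2))    # 13/16, binary 0.1101
print(fl_value(+1, -2, [1, 0, 1, 1], 2))   # 11/64, binary 0.001011
```

Using `Fraction` avoids any rounding in the check itself, so the printed values are the exact numbers represented by the digit strings.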


The mantissa is

$$\frac{m_1}{a} + \frac{m_2}{a^2} + \cdots + \frac{m_t}{a^t},$$

a fraction whose value is always below 1.

The values of $a$, $t$, $k_-$, $k_+$ uniquely determine the set of floating point numbers.

Normalized numbers: $1 \le m_1 \le a - 1$.


The following brief notation will also be used: $[\pm|k|m_1, \ldots, m_t]$.

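The largest floating point number is

$$M_\infty = a^{k_+}\left(\frac{a-1}{a} + \frac{a-1}{a^2} + \cdots + \frac{a-1}{a^t}\right) = a^{k_+}\left(1 - \frac{1}{a} + \frac{1}{a} - \frac{1}{a^2} + \cdots + \frac{1}{a^{t-1}} - \frac{1}{a^t}\right) = a^{k_+}\left(1 - a^{-t}\right) \approx a^{k_+}.$$

The smallest is $-M_\infty$.

The floating point numbers form a discrete subset of the rational numbers in the interval $[-M_\infty, M_\infty]$.

This range depends mainly on the value of $k_+$ (and on $a$).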

The smallest positive normalized floating point number is

$$\varepsilon_0 = a^{k_-}\left(\frac{1}{a} + 0 + \cdots + 0\right) = a^{k_- - 1}.$$

For non-normalized (subnormal) numbers $\varepsilon_0 = a^{k_- - t}$.

The zero is: $k = 0$, $m_i = 0$ for $i = 1, \ldots, t$ (non-normalized).

In the interval $(-\varepsilon_0, \varepsilon_0)$ there is only one floating point number, the zero. (The interval $(-\varepsilon_0, \varepsilon_0)$ is called the black hole.)

The number 1 is always an exact floating point number:

$$1 = a^1 \cdot \frac{1}{a}, \quad \text{or} \quad 1 = [+|1|1, 0, \ldots, 0].$$

The right neighbour of 1 is

$$1 + \varepsilon_1 = [+|1|1, 0, \ldots, 0, 1],$$

or

$$1 + \varepsilon_1 = a\left(\frac{1}{a} + 0 + \cdots + 0 + \frac{1}{a^t}\right) = 1 + a^{1-t},$$

i.e. the distance between the number 1 and its right neighbour is

$$\varepsilon_1 = a^{1-t}$$

(we call it machine epsilon).
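The machine epsilon can also be found experimentally. The sketch below assumes that Python's `float` is an IEEE double with round-to-nearest, i.e. $a = 2$ and $t = 53$ in the notation above (counting the hidden bit); the function name `machine_epsilon` is chosen here for illustration.

```python
import sys

def machine_epsilon():
    """Halve eps as long as 1 + eps/2 is still distinguishable from 1."""
    eps = 1.0
    while 1.0 + eps / 2 > 1.0:
        eps /= 2
    return eps

eps1 = machine_epsilon()
print(eps1)                             # distance from 1 to its right neighbour
print(eps1 == 2.0 ** (1 - 53))          # a**(1 - t) with a = 2, t = 53
print(eps1 == sys.float_info.epsilon)   # value reported by the interpreter
```

On a platform that satisfies the assumption, both comparisons should print `True`.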

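Example.

The positive normalized floating point numbers in the case of $a = 2$, $t = 4$, $k_- = -3$, $k_+ = 2$:

| mantissa | k = 0 | k = 1 | k = 2 | k = −1 | k = −2 | k = −3 |
|---|---|---|---|---|---|---|
| 0.1000 | 8/16 | 8/8 | 8/4 | 8/32 | 8/64 | 8/128 |
| 0.1001 | 9/16 | 9/8 | 9/4 | 9/32 | 9/64 | 9/128 |
| 0.1010 | 10/16 | 10/8 | 10/4 | 10/32 | 10/64 | 10/128 |
| 0.1011 | 11/16 | 11/8 | 11/4 | 11/32 | 11/64 | 11/128 |
| 0.1100 | 12/16 | 12/8 | 12/4 | 12/32 | 12/64 | 12/128 |
| 0.1101 | 13/16 | 13/8 | 13/4 | 13/32 | 13/64 | 13/128 |
| 0.1110 | 14/16 | 14/8 | 14/4 | 14/32 | 14/64 | 14/128 |
| 0.1111 | 15/16 | 15/8 | 15/4 | 15/32 | 15/64 | 15/128 |

$M_\infty = 2^2\left(1 - 2^{-4}\right) = \frac{15}{4}$ and $\varepsilon_0 = 2^{-3-1} = \frac{1}{16}$ $\left(= \frac{8}{128}\right)$.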
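The table can be regenerated by enumerating all mantissas and exponents. This is a small sketch with exact fractions; the function name `normalized_positive` and its parameter names are assumptions made for this illustration.

```python
from fractions import Fraction

def normalized_positive(a, t, k_min, k_max):
    """All positive normalized numbers a**k * (m / a**t) with leading digit m1 >= 1."""
    numbers = []
    for k in range(k_min, k_max + 1):
        # Integer mantissas m with m1 >= 1, i.e. a**(t-1) <= m <= a**t - 1.
        for m in range(a ** (t - 1), a ** t):
            numbers.append(Fraction(a) ** k * Fraction(m, a ** t))
    return sorted(numbers)

nums = normalized_positive(a=2, t=4, k_min=-3, k_max=2)
print(len(nums), min(nums), max(nums))   # 48 1/16 15/4
```

The count 48 is the 8 mantissas times the 6 exponents of the table, and the extreme values agree with $\varepsilon_0$ and $M_\infty$ computed above.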

The IEEE standard for binary floating point arithmetic specifies:

| | single precision | double precision |
|---|---|---|
| total length | 32 bits | 64 bits |
| mantissa t + 1 | 23 + 1 bits | 52 + 1 bits |
| length of exponent | 8 bits | 11 bits |
| $\varepsilon_1$ | $\approx 1.19 \cdot 10^{-7}$ | $\approx 2.22 \cdot 10^{-16}$ |
| $M_\infty$ | $\approx 10^{38}$ | $\approx 10^{308}$ |

Since $m_1$ is always 1 in normalized binary form, it does not have to be stored (hidden bit). The sign of the number always occupies one bit (the +1 in the table).
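The three fields of the single precision format can be inspected directly. The sketch below uses Python's standard `struct` module to round a number to IEEE single precision and split the 32 bits; the helper name `float32_fields` is an assumption made for illustration.

```python
import struct

def float32_fields(x):
    """Split x, rounded to IEEE single precision, into (sign, exponent field, fraction field)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                  # 1 sign bit
    exponent = (bits >> 23) & 0xFF     # 8 stored exponent bits (biased)
    fraction = bits & 0x7FFFFF         # 23 stored mantissa bits, hidden bit omitted
    return sign, exponent, fraction

print(float32_fields(1.0))     # (0, 127, 0)
print(float32_fields(-0.75))   # (1, 126, 4194304)
```

For 1.0 the stored mantissa bits are all zero, because the leading 1 is the hidden bit; for −0.75 only the first stored bit is set (4194304 = 2^22).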

Examples

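1. How large is the range of the exponent in single precision arithmetic?

The exponent is 8 bits long, and 8 bits give $2^8 = 256$ different values. The exponent can be negative as well, i.e.

$$k \in [-127, 128], \quad k_- = -127, \quad k_+ = 128$$

(instead of $k$, the number $k + 127$ is stored).

Then $M_\infty \approx 2^{128} \approx 3.4 \times 10^{38}$.

This count is idealized: IEEE 754 reserves the smallest and the largest exponent code for special values (zero and subnormal numbers, infinities and NaN), so the exponent range of the normalized numbers is slightly narrower.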

2. How large is the range of the exponent in double precision arithmetic?

The exponent field is 11 bits long, and $2^{11} = 2048$, so

$$k \in [-1023, 1024], \quad k_- = -1023, \quad k_+ = 1024.$$

Therefore $M_\infty \approx 2^{1024} \approx 1.8 \times 10^{308} \approx 10^{308}$.
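The double precision limits can be read off in Python, whose `float` is assumed here to be an IEEE double. `sys.float_info` uses the same convention as these notes (mantissa below 1), so its `max_exp` plays the role of $k_+$.

```python
import math
import sys

info = sys.float_info
print(info.radix, info.mant_dig)   # base a and mantissa length t (hidden bit included)
print(info.max_exp)                # k_plus
print(info.min_exp)                # smallest exponent of a normalized number

# Largest double written as a**k_plus * (1 - a**(-t)) with a = 2, t = 53, k_plus = 1024.
largest = math.ldexp(1.0 - 2.0 ** -53, 1024)
print(largest == info.max)
print(info.max)
```

On such a platform `max_exp` is 1024, while `min_exp` is −1021 rather than −1023, because some exponent codes are reserved for special values.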

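Let $0 < x < M_\infty$ be a normalized floating point number:

$$0 < x = a^k\left(\frac{m_1}{a} + \frac{m_2}{a^2} + \cdots + \frac{m_t}{a^t}\right) < M_\infty,$$

then the next floating point number larger than $x$ is $\bar{x} = x + a^{k-t}$, i.e.

$$\bar{x} = x + a^k\left(0 + 0 + \cdots + 0 + \frac{1}{a^t}\right) = x + a^{k-t}.$$

Thus the distance between the two numbers is $\delta x = \bar{x} - x = a^{k-t}$. Since $x$ is normalized, $m_1 \ge 1$ and therefore $x \ge a^{k-1}$, which gives

$$\bar{x} - x = \delta x = a^{k-t} = a^{k-1+1-t} = a^{k-1} a^{1-t} = a^{k-1} \varepsilon_1 \le x \varepsilon_1.$$

If $-M_\infty < x < 0$ and $\bar{x}$ is the left neighbour of $x$, then

$$|\bar{x} - x| \le \varepsilon_1 |x|.$$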
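The bound on the distance to the right neighbour can be observed for IEEE doubles. The sketch below assumes Python 3.9 or later, where `math.nextafter` and `math.ulp` are available; the sample values of `x` are arbitrary.

```python
import math
import sys

eps1 = sys.float_info.epsilon   # a**(1 - t) = 2**-52 for an IEEE double

for x in [1.0, 1.5, 3.0, 1000.0, 1e-10, 1e10]:
    right = math.nextafter(x, math.inf)   # right neighbour of x
    gap = right - x                       # distance to the right neighbour
    assert gap == math.ulp(x)
    assert gap <= eps1 * x                # the bound derived above
    print(x, gap, gap / x)
```

The last printed column is the relative gap; it stays between `eps1 / 2` and `eps1`, depending on where `x` lies between two consecutive powers of 2.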

Rounding

Rounding at input:

Let $x$ be a real number in the range $x \in [-M_\infty, M_\infty]$, and denote by $\mathrm{fl}(x)$ the floating point number assigned to $x$, either by rounding or by chopping. In both cases $\mathrm{fl}(x) = 0$ if $x$ falls into the black hole.

In the case of rounding:

$$\mathrm{fl}(x) = \begin{cases} 0, & \text{if } |x| < \varepsilon_0, \\ \text{the floating point number closest to } x, & \text{if } |x| \ge \varepsilon_0. \end{cases}$$

In the case of chopping:

$$\mathrm{fl}(x) = \begin{cases} 0, & \text{if } |x| < \varepsilon_0, \\ \text{the floating point number closest to } x \text{ towards } 0, & \text{if } |x| \ge \varepsilon_0. \end{cases}$$

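The error due to the representation

at rounding is:

$$|x - \mathrm{fl}(x)| \le \begin{cases} \varepsilon_0, & \text{if } |x| < \varepsilon_0, \\ \frac{1}{2}\varepsilon_1 |x|, & \text{if } |x| \ge \varepsilon_0; \end{cases}$$

at chopping is:

$$|x - \mathrm{fl}(x)| \le \begin{cases} \varepsilon_0, & \text{if } |x| < \varepsilon_0, \\ \varepsilon_1 |x|, & \text{if } |x| \ge \varepsilon_0. \end{cases}$$

The error bound is the same in both cases if $x$ falls into the black hole.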
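The rounding bound can be checked for input conversion to IEEE doubles. The sketch below assumes that converting a `Fraction` to `float` rounds to the nearest double (which is how CPython behaves); the helper name `input_error` and the sample values are chosen for illustration.

```python
from fractions import Fraction
import sys

eps1 = Fraction(sys.float_info.epsilon)   # exactly 2**-52

def input_error(x):
    """Exact |x - fl(x)| when the rational number x is rounded to a double."""
    return abs(x - Fraction(float(x)))

for x in [Fraction(1, 10), Fraction(1, 3), Fraction(22, 7)]:
    err = input_error(x)
    print(x, float(err), err <= eps1 / 2 * abs(x))
```

Each line should end with `True`: the representation error stays below $\frac{1}{2}\varepsilon_1|x|$.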

Rounding and chopping at basic arithmetical operations:

Example 1:

$a = 10$, $t = 3$, $k_- = -2$, $k_+ = 2$, $x = 0.425 \cdot 10^{-1}$, $y = 0.677 \cdot 10^{-2}$.

$\mathrm{fl}(x + y) = ?$

To add the numbers we need a common exponent, therefore we shift $y$: $y = 0.0677 \cdot 10^{-1}$.

$$x + y = 0.425 \cdot 10^{-1} + 0.0677 \cdot 10^{-1} = 0.4927 \cdot 10^{-1}$$

$$\mathrm{fl}(x + y) = \begin{cases} 0.492 \cdot 10^{-1}, & \text{chopping}, \\ 0.493 \cdot 10^{-1}, & \text{regular rounding}. \end{cases}$$

Example 2:

$a = 10$, $t = 3$, $k_- = -2$, $k_+ = 2$, $x = 0.367 \cdot 10^{-2}$, $y = 0.682 \cdot 10^{-2}$.

$\mathrm{fl}(x + y) = ?$

$$x + y = 0.367 \cdot 10^{-2} + 0.682 \cdot 10^{-2} = 1.049 \cdot 10^{-2} = 0.1049 \cdot 10^{-1}$$

$$\mathrm{fl}(x + y) = \begin{cases} 0.104 \cdot 10^{-1}, & \text{chopping}, \\ 0.105 \cdot 10^{-1}, & \text{regular rounding}. \end{cases}$$

Rounding or chopping happens when the result is stored in a memory location. The arithmetic register is generally longer than the memory locations.
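Both examples can be reproduced with Python's `decimal` module, which computes the exact sum and then rounds it to the precision of the context. In this sketch the two contexts `chop` and `nearest` are names chosen for illustration; the exponent limits $k_-$, $k_+$ are not modelled.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_UP

chop = Context(prec=3, rounding=ROUND_DOWN)         # keep t = 3 digits, drop the rest
nearest = Context(prec=3, rounding=ROUND_HALF_UP)   # keep t = 3 digits, round half up

examples = [
    (Decimal("0.425E-1"), Decimal("0.677E-2")),   # Example 1
    (Decimal("0.367E-2"), Decimal("0.682E-2")),   # Example 2
]
for x, y in examples:
    print(chop.add(x, y), nearest.add(x, y))
# prints:
# 0.0492 0.0493
# 0.0104 0.0105
```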

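Basic arithmetical operations.

Denote by $\triangle$ one of the four basic operations, and let $x$ and $y$ be floating point numbers. Assume that $|x \triangle y| < M_\infty$, and that the computer calculates the result of the operation without error and then assigns a floating point number to the result.

In the case of rounding:

$$|\mathrm{fl}(x \triangle y) - x \triangle y| \le \begin{cases} \varepsilon_0, & \text{if } |x \triangle y| < \varepsilon_0, \\ \frac{1}{2}\varepsilon_1 |x \triangle y|, & \text{if } |x \triangle y| \ge \varepsilon_0. \end{cases}$$

In the case of chopping (truncating):

$$|\mathrm{fl}(x \triangle y) - x \triangle y| \le \begin{cases} \varepsilon_0, & \text{if } |x \triangle y| < \varepsilon_0, \\ \varepsilon_1 |x \triangle y|, & \text{if } |x \triangle y| \ge \varepsilon_0. \end{cases}$$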
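The rounding bound for the four operations can be verified on IEEE doubles by comparing each computed result with the exact rational result. This is a sketch under the assumption that the hardware rounds every basic operation to the nearest double; the function name `rounding_error_ok` and the operands 0.1 and 0.7 are illustrative choices.

```python
from fractions import Fraction
import operator
import sys

half_eps1 = Fraction(sys.float_info.epsilon) / 2

def rounding_error_ok(op, x, y):
    """Check |fl(x op y) - (x op y)| <= (eps1 / 2) * |x op y| with exact rationals."""
    exact = op(Fraction(x), Fraction(y))   # error-free result of the operation
    computed = Fraction(op(x, y))          # result rounded by the hardware
    return abs(computed - exact) <= half_eps1 * abs(exact)

x, y = 0.1, 0.7
for op in (operator.add, operator.sub, operator.mul, operator.truediv):
    print(op.__name__, rounding_error_ok(op, x, y))
```

Here `Fraction(x)` is the exact value of the double `x`, so only the rounding of the result is measured, not the input error of writing 0.1 in binary.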

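Propagation of errors in arithmetic operations

We consider the four basic arithmetic operations separately: the addition, subtraction, multiplication and division of positive floating point numbers $x$ and $y$. We assume that $x$ and $y$ already have errors:

$$\delta(x) = |x - \mathrm{fl}(x)|, \qquad \delta(y) = |y - \mathrm{fl}(y)|.$$

We use the theorem valid for the error of a continuously differentiable function $f(x, y)$, namely, to first order in $\delta(x)$ and $\delta(y)$,

$$\delta(f(x, y)) = |f(\mathrm{fl}(x), \mathrm{fl}(y)) - f(x, y)| \le \left|\frac{\partial f}{\partial x}\right| \delta(x) + \left|\frac{\partial f}{\partial y}\right| \delta(y).$$

In the formulas that follow, the error of a result is identified with this bound. We assume that the error comes only from the input errors of the arguments; the calculation of $f(x, y)$ does not introduce additional errors.

Absolute error of the addition:

$$f(x, y) = x + y, \qquad \frac{\partial f}{\partial x} = 1, \qquad \frac{\partial f}{\partial y} = 1,$$

$$\delta(x + y) = \delta(x) + \delta(y).$$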

Absolute error of the subtraction:

$$f(x, y) = x - y, \qquad \frac{\partial f}{\partial x} = 1, \qquad \frac{\partial f}{\partial y} = -1, \qquad \left|\frac{\partial f}{\partial y}\right| = 1,$$

$$\delta(x - y) = \delta(x) + \delta(y).$$

Note that this error is the same as that of the addition.


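Absolute error of the multiplication:

$$f(x, y) = x * y, \qquad \frac{\partial f}{\partial x} = y, \qquad \frac{\partial f}{\partial y} = x,$$

$$\delta(x * y) = |y|\,\delta(x) + |x|\,\delta(y) \approx |\mathrm{fl}(y)|\,\delta(x) + |\mathrm{fl}(x)|\,\delta(y).$$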

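Absolute error of the division:

$$f(x, y) = x / y, \qquad \frac{\partial f}{\partial x} = \frac{1}{y}, \qquad \frac{\partial f}{\partial y} = -\frac{x}{y^2}, \qquad \left|\frac{\partial f}{\partial y}\right| = \frac{|x|}{|y^2|},$$

$$\delta(x / y) = \frac{\delta x}{|y|} + \delta y\,\frac{|x|}{|y^2|} \approx \frac{|\mathrm{fl}(y)|\,\delta x + |\mathrm{fl}(x)|\,\delta y}{|\mathrm{fl}(y)|^2}.$$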
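The four absolute error formulas can be compared with the actual worst-case errors on a small example. In this sketch the values of `x`, `y`, `dx`, `dy` are hypothetical, and exact fractions are used so that only the propagated input error is visible.

```python
from fractions import Fraction

def abs_error_bounds(x, y, dx, dy):
    """First-order absolute error bounds for x+y, x-y, x*y and x/y (x, y > 0)."""
    return {
        "add": dx + dy,
        "sub": dx + dy,
        "mul": y * dx + x * dy,
        "div": dx / y + x * dy / y**2,
    }

x, y = Fraction(2), Fraction(3)
dx, dy = Fraction(1, 1000), Fraction(2, 1000)
bounds = abs_error_bounds(x, y, dx, dy)

# Worst-case actual errors when the inputs are off by exactly dx and dy.
actual_mul = (x + dx) * (y + dy) - x * y
actual_div = (x + dx) / (y - dy) - x / y
print(float(bounds["mul"]), float(actual_mul))
print(float(bounds["div"]), float(actual_div))
```

The actual errors differ from the formulas only by second-order terms; for the product the difference is exactly `dx * dy`.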


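The relative error is the ratio of the absolute error and the modulus of the result.

Relative error of the addition:

$$\frac{\delta(x + y)}{|\mathrm{fl}(x) + \mathrm{fl}(y)|} = \frac{\delta x + \delta y}{|\mathrm{fl}(x) + \mathrm{fl}(y)|} = \frac{\delta x + \delta y}{|\mathrm{fl}(x)| + |\mathrm{fl}(y)|},$$

where the last equality holds because $x$ and $y$ are positive.

Relative error of the subtraction:

$$\frac{\delta(x - y)}{|\mathrm{fl}(x) - \mathrm{fl}(y)|} = \frac{\delta x + \delta y}{|\mathrm{fl}(x) - \mathrm{fl}(y)|}.$$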


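If the first $r$ digits of $x$ and $y$ are the same, then we lose $r$ significant digits from the result.

Example: Assume that $a = 10$, $t = 8$.

$$\mathrm{fl}(x) = 0.76545421 \times 10^1, \qquad \delta x = 10^{-7},$$

$$\mathrm{fl}(y) = 0.76544200 \times 10^1, \qquad \delta y = 10^{-7}.$$

Then the difference is

$$z = \mathrm{fl}(x) - \mathrm{fl}(y) = 0.00001221 \times 10^1 = 0.1221 \times 10^{-3} \approx 10^{-4}.$$

The absolute error of $z$ is

$$\delta(x - y) = \delta x + \delta y = 10^{-7} + 10^{-7}.$$

The relative errors of the terms are:

$$\frac{\delta x}{\mathrm{fl}(x)} = \frac{10^{-7}}{7.6} \approx 10^{-8}, \qquad \frac{\delta y}{\mathrm{fl}(y)} = \frac{10^{-7}}{7.6} \approx 10^{-8}.$$

The relative error of the result is:

$$\frac{\delta x + \delta y}{|\mathrm{fl}(x) - \mathrm{fl}(y)|} \approx \frac{2 \times 10^{-7}}{10^{-4}} = 2 \times 10^{-3}.$$

The relative error of the result is larger than that of the terms by a factor of about $10^5$. This warns us that subtraction is a dangerous operation in numerical calculations.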
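The numbers of this example can be recomputed without rounding $z$ to $10^{-4}$. The sketch below uses exact fractions; the variable names are chosen for illustration.

```python
from fractions import Fraction

fx, fy = Fraction("7.6545421"), Fraction("7.6544200")
dx = dy = Fraction(1, 10**7)

z = fx - fy
rel_x = dx / fx                # relative error of a term
rel_z = (dx + dy) / abs(z)     # relative error of the difference

print(float(z))                # 0.0001221
print(float(rel_x))            # about 1.3e-08
print(float(rel_z))            # about 1.6e-03
print(float(rel_z / rel_x))    # amplification factor, about 1.25e+05
```

With the exact value of $z$ the relative error of the result is about $1.6 \times 10^{-3}$ instead of the rounded $2 \times 10^{-3}$; the amplification factor is still of order $10^5$.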


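The relative error of the multiplication:

$$\frac{\delta(x \times y)}{|\mathrm{fl}(x)||\mathrm{fl}(y)|} = \frac{|\mathrm{fl}(x)|\,\delta y + |\mathrm{fl}(y)|\,\delta x}{|\mathrm{fl}(x)||\mathrm{fl}(y)|} = \frac{\delta y}{|\mathrm{fl}(y)|} + \frac{\delta x}{|\mathrm{fl}(x)|}.$$

The relative error of the division:

$$\frac{\delta(x / y)}{\left|\frac{\mathrm{fl}(x)}{\mathrm{fl}(y)}\right|} = \frac{\;\dfrac{|\mathrm{fl}(x)|\,\delta y + |\mathrm{fl}(y)|\,\delta x}{|\mathrm{fl}(y)|}\;}{|\mathrm{fl}(x)|} = \frac{|\mathrm{fl}(x)|\,\delta y + |\mathrm{fl}(y)|\,\delta x}{|\mathrm{fl}(x)||\mathrm{fl}(y)|} = \frac{\delta y}{|\mathrm{fl}(y)|} + \frac{\delta x}{|\mathrm{fl}(x)|}.$$

Note that for multiplication and division the relative error of the result is the sum of the relative errors of the operands $x$ and $y$.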
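This rule can be checked on a small example with exact fractions. The values of `x`, `y`, `dx`, `dy` below are hypothetical, and the worst case is assumed (both inputs too large for the product; `x` too large and `y` too small for the quotient).

```python
from fractions import Fraction

x, y = Fraction(2), Fraction(3)
dx, dy = Fraction(1, 1000), Fraction(2, 1000)
rel_sum = dx / x + dy / y   # sum of the relative errors of the operands

# Worst-case relative errors of the product and of the quotient.
rel_mul = ((x + dx) * (y + dy) - x * y) / (x * y)
rel_div = ((x + dx) / (y - dy) - x / y) / (x / y)

print(float(rel_sum), float(rel_mul), float(rel_div))
```

All three printed values agree to about three significant digits; the remaining difference is a second-order term such as `(dx / x) * (dy / y)` for the product.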
