Numerical methods
Literature
- Gisbert Stoyan, Agnes Baran, Elementary Numerical Mathematics for Programmers and Engineers, Birkhäuser, 2016
- William H. Press, Brian P. Flannery, Saul A. Teukolsky, William T. Vetterling, Numerical Recipes, Cambridge Univ. Press, Cambridge, New York, 1986, 1987, 1988, 1989, ... (computer programs are available in FORTRAN, Pascal, and C)
e-mail: [email protected]
web: https://arato.inf.unideb.hu/noszalycsaba/numen
consulting hours: later
Topics
- Floating point numbers
- Norms, condition numbers
- Systems of linear equations
- Least squares approximations
- Interpolation
- Eigenvalue problems
- Nonlinear equations, systems of nonlinear equations
- Approximation of integrals
Floating point numbers
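Example. $a = 10$:
$$0.3721 = \frac{3}{10} + \frac{7}{10^2} + \frac{2}{10^3} + \frac{1}{10^4}$$
$$21.65 = 0.2165 \cdot 10^2 = \left(\frac{2}{10} + \frac{1}{10^2} + \frac{6}{10^3} + \frac{5}{10^4}\right) \cdot 10^2$$
$a = 2$ (the numbers on the left are written in binary):
$$0.1101 = \frac{1}{2} + \frac{1}{2^2} + \frac{0}{2^3} + \frac{1}{2^4}$$
$$0.001011 = 0.1011 \cdot 2^{-2} = \left(\frac{1}{2} + \frac{0}{2^2} + \frac{1}{2^3} + \frac{1}{2^4}\right) \cdot 2^{-2}$$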
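The digit expansions above can be reproduced mechanically. The following is a minimal sketch, not part of the original slides; the helper name `normalized_digits` is hypothetical, and exact rational arithmetic is used so that no rounding interferes with the digits.

```python
from fractions import Fraction

def normalized_digits(x, base, t):
    """Return (k, [m1, ..., mt]) such that x ~ base**k * sum(m_i / base**i), m1 != 0.

    x must be a positive Fraction; digits beyond the t-th are chopped.
    """
    if x <= 0:
        raise ValueError("x must be positive")
    k = 0
    # Scale x into [1/base, 1) so that the first digit is nonzero.
    while x >= 1:
        x /= base
        k += 1
    while x < Fraction(1, base):
        x *= base
        k -= 1
    digits = []
    for _ in range(t):
        x *= base
        d = int(x)      # next digit, 0 <= d <= base - 1
        digits.append(d)
        x -= d
    return k, digits

print(normalized_digits(Fraction("0.3721"), 10, 4))  # (0, [3, 7, 2, 1])
print(normalized_digits(Fraction("21.65"), 10, 4))   # (2, [2, 1, 6, 5])
print(normalized_digits(Fraction(11, 64), 2, 4))     # (-2, [1, 0, 1, 1])
```

The last call uses 11/64, which is the binary number 0.001011 from the example.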
The form of the nonzero floating point numbers is:
$$\pm a^k \left(\frac{m_1}{a} + \frac{m_2}{a^2} + \cdots + \frac{m_t}{a^t}\right),$$
where
$a > 1$ is the base (or radix) of the floating point representation (an integer),
$\pm$ is the sign,
$t > 1$ is the length of the mantissa of the representation,
$m_1, \ldots, m_t$ are the digits of the mantissa (integers with $0 \le m_i \le a - 1$),
$k$ is the exponent (an integer), $k \in [k_-, k_+]$, where the lower limit $k_- < 0$ and the upper limit $k_+ > 0$ are given by the representation.
The mantissa is
$$\frac{m_1}{a} + \frac{m_2}{a^2} + \cdots + \frac{m_t}{a^t},$$
a fraction whose value is always below 1.
The values of $a$, $t$, $k_-$, $k_+$ uniquely determine the set of floating point numbers.
Normalized numbers: $1 \le m_1 \le a - 1$.
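The following brief notation will also be used: $[\pm|k|m_1, \ldots, m_t]$.
The largest floating point number is:
$$M_\infty = a^{k_+}\left(\frac{a-1}{a} + \frac{a-1}{a^2} + \cdots + \frac{a-1}{a^t}\right) = a^{k_+}\left(1 - \frac{1}{a} + \frac{1}{a} - \frac{1}{a^2} + \cdots + \frac{1}{a^{t-1}} - \frac{1}{a^t}\right) = a^{k_+}\left(1 - a^{-t}\right) \approx a^{k_+}.$$
The smallest is $-M_\infty$.
The floating point numbers form a discrete subset of the rational numbers in the interval
$$[-M_\infty, M_\infty].$$
This range depends mainly on the value of $k_+$ (and on $a$).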
The smallest positive normalized floating point number is
$$\varepsilon_0 = a^{k_-}\left(\frac{1}{a} + 0 + \cdots + 0\right) = a^{k_- - 1}.$$
For non-normalized (subnormal) numbers $\varepsilon_0 = a^{k_- - t}$.
The zero is: $k = 0$, $m_i = 0$ for $i = 1, \ldots, t$ (non-normalized).
In the interval $(-\varepsilon_0, \varepsilon_0)$ there is only one number, the zero. ($(-\varepsilon_0, \varepsilon_0)$ is called the black hole.)
The number 1 is always an exact floating point number:
$$1 = a^1 \cdot \frac{1}{a}, \quad \text{or} \quad 1 = [+|1|1, 0, \ldots, 0].$$
The right neighbour of 1 is
$$1 + \varepsilon_1 = [+|1|1, 0, \ldots, 0, 1],$$
or:
$$1 + \varepsilon_1 = a\left(\frac{1}{a} + 0 + \cdots + 0 + \frac{1}{a^t}\right) = 1 + a^{1-t},$$
i.e. the distance between the number 1 and its right neighbour is
$$\varepsilon_1 = a^{1-t}$$
(we call it the machine epsilon).
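The machine epsilon can be found experimentally. The sketch below is an illustration under the assumption that Python floats are IEEE double precision numbers ($a = 2$, $t = 53$ with the hidden bit); the function name `machine_epsilon` is hypothetical.

```python
import sys

def machine_epsilon():
    """Halve a candidate gap until 1 + gap/2 can no longer be told apart from 1."""
    eps = 1.0
    while 1.0 + eps / 2 > 1.0:
        eps /= 2
    return eps

eps = machine_epsilon()
print(eps == 2.0 ** -52)               # a**(1 - t) with a = 2, t = 53
print(eps == sys.float_info.epsilon)   # the value reported by the interpreter
print(1.0 + eps > 1.0, 1.0 + eps / 2 == 1.0)
```

The loop stops at the first gap whose half is rounded away when added to 1, which is exactly the distance between 1 and its right neighbour.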
Example.
The positive normalized floating point numbers in the case of $a = 2$, $t = 4$, $k_- = -3$, $k_+ = 2$:

mantissa   k = 0    k = 1    k = 2    k = −1   k = −2   k = −3
0.1000     8/16     8/8      8/4      8/32     8/64     8/128
0.1001     9/16     9/8      9/4      9/32     9/64     9/128
0.1010     10/16    10/8     10/4     10/32    10/64    10/128
0.1011     11/16    11/8     11/4     11/32    11/64    11/128
0.1100     12/16    12/8     12/4     12/32    12/64    12/128
0.1101     13/16    13/8     13/4     13/32    13/64    13/128
0.1110     14/16    14/8     14/4     14/32    14/64    14/128
0.1111     15/16    15/8     15/4     15/32    15/64    15/128

$$M_\infty = 2^2\left(1 - 2^{-4}\right) = \frac{15}{4} \quad \text{and} \quad \varepsilon_0 = 2^{-3-1} = \frac{1}{16} \left(= \frac{8}{128}\right)$$
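The table can be regenerated by enumerating all mantissas and exponents. This is a small sketch using exact fractions; the function name `positive_normalized` is hypothetical and not from the slides.

```python
from fractions import Fraction

def positive_normalized(a, t, k_min, k_max):
    """All positive normalized numbers a**k * (m / a**t) with a**(t-1) <= m < a**t."""
    numbers = []
    for k in range(k_min, k_max + 1):
        # m encodes the digits m1...mt as one integer; m >= a**(t-1) means m1 >= 1.
        for m in range(a ** (t - 1), a ** t):
            numbers.append(Fraction(a) ** k * Fraction(m, a ** t))
    return sorted(numbers)

nums = positive_normalized(2, 4, -3, 2)
print(len(nums))           # 48 = 8 mantissas x 6 exponents
print(nums[0], nums[-1])   # 1/16 15/4
```

The smallest and largest entries agree with $\varepsilon_0 = 1/16$ and $M_\infty = 15/4$ computed above.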
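The IEEE standard of floating point arithmetic for binary systems specifies:

                      single precision              double precision
total length          32 bits                       64 bits
mantissa + sign       23 + 1 bits                   52 + 1 bits
length of exponent    8 bits                        11 bits
$\varepsilon_1$       $\approx 1.19 \cdot 10^{-7}$  $\approx 2.22 \cdot 10^{-16}$
$M_\infty$            $\approx 10^{38}$             $\approx 10^{308}$

Since in normalized form $m_1$ is always 1, it need not be stored (hidden bit). The 23 and 52 stored mantissa bits therefore correspond to $t = 24$ and $t = 53$, which gives the values $\varepsilon_1 = 2^{1-t}$ in the table. The sign of the number is always one bit long (the +1 in the table).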
Examples
1. How large is the range of the exponent in single precision arithmetic?
The exponent is 8 bits long. 8 bits $\Longrightarrow 2^8 = 256$ numbers. The exponent can be negative as well, i.e.
$$k \in [-127, 128], \quad k_- = -127, \quad k_+ = 128.$$
(Instead of $k$, the number $k + 127$ is stored.)
Then $M_\infty \approx 2^{128} \approx 3.4 \times 10^{38}$.
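2. How large is the range of the exponent in double precision arithmetic?
The exponent field is 11 bits long. $2^{11} = 2048$,
$$k \in [-1023, 1024], \quad k_- = -1023, \quad k_+ = 1024.$$
Therefore $M_\infty \approx 2^{1024} \approx 1.8 \times 10^{308} \approx 10^{308}$.
Note: in the actual IEEE 754 formats the all-zeros and all-ones exponent codes are reserved (for zero and subnormal numbers, and for infinities and NaN), so the exponent range available to normalized numbers is slightly narrower at the lower end than this count suggests. The values of $M_\infty$ are not affected.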
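The double precision parameters can be inspected directly. The sketch below assumes that Python floats are IEEE doubles and uses `sys.float_info`, whose `max_exp` and `min_exp` follow the same "mantissa below 1" convention as these notes. On such a platform `min_exp` is reported as −1021 rather than −1023, because two exponent codes are reserved.

```python
import sys
from fractions import Fraction

t = sys.float_info.mant_dig        # 53: mantissa length including the hidden bit
k_plus = sys.float_info.max_exp    # 1024
k_minus = sys.float_info.min_exp   # -1021 for normalized numbers

# M_inf = a^{k+} (1 - a^{-t}), evaluated exactly with rationals.
m_inf = Fraction(2) ** k_plus * (1 - Fraction(1, 2 ** t))
print(m_inf == Fraction(sys.float_info.max))

# eps_0 = a^{k- - 1}, the smallest positive normalized number.
eps_0 = Fraction(2) ** (k_minus - 1)
print(eps_0 == Fraction(sys.float_info.min))

print(sys.float_info.max, sys.float_info.min)
```

Both comparisons are done with exact fractions, so they test the formulas for $M_\infty$ and $\varepsilon_0$ without any rounding.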
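Let $0 < x < M_\infty$ be a normalized floating point number:
$$0 < x = a^k\left(\frac{m_1}{a} + \frac{m_2}{a^2} + \cdots + \frac{m_t}{a^t}\right) < M_\infty,$$
then the next floating point number larger than $x$ is $\bar{x} = x + a^{k-t}$, i.e.
$$\bar{x} = x + a^k\left(0 + 0 + \cdots + 0 + \frac{1}{a^t}\right) = x + a^{k-t}.$$
Thus the distance between the two numbers is $\delta x = \bar{x} - x = a^{k-t}$. Since $x \ge a^{k-1}$ for a normalized number,
$$\bar{x} - x = \delta x = a^{k-t} = a^{k-1+1-t} = a^{k-1}a^{1-t} = a^{k-1}\varepsilon_1 \le x\,\varepsilon_1.$$
If $-M_\infty < x < 0$ and $\bar{x}$ is the left neighbour of $x$, then
$$|\bar{x} - x| \le \varepsilon_1 |x|.$$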
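The bound on the spacing can be observed for double precision numbers. This is a sketch assuming Python 3.9 or later (for `math.nextafter`) and IEEE doubles; the sample values of `x` are arbitrary.

```python
import math
import sys

eps1 = sys.float_info.epsilon   # a**(1 - t) = 2**-52 for doubles

for x in [1.0, 1.5, 3.0, 1e-5, 12345.678, 1e300]:
    gap = math.nextafter(x, math.inf) - x   # distance to the right neighbour
    assert gap <= eps1 * x                  # the bound derived above
    print(f"x = {x:<12g} gap = {gap:.3e}  gap / x = {gap / x:.3e}")
```

The ratio `gap / x` stays between $\varepsilon_1 / 2$ and $\varepsilon_1$, with equality to $\varepsilon_1$ at powers of the base such as $x = 1$.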
Rounding
Rounding at input:
Let $x$ be a real number in the range $x \in [-M_\infty, M_\infty]$, and denote by $\mathrm{fl}(x)$ the floating point number associated with $x$, either by rounding or by chopping. In both cases $\mathrm{fl}(x) = 0$ if $x$ falls into the black hole.
In the case of rounding:
$$\mathrm{fl}(x) = \begin{cases} 0, & \text{if } |x| < \varepsilon_0 \\ \text{the floating point number closest to } x, & \text{if } |x| \ge \varepsilon_0 \end{cases}$$
In the case of chopping:
$$\mathrm{fl}(x) = \begin{cases} 0, & \text{if } |x| < \varepsilon_0 \\ \text{the floating point number closest to } x \text{ towards } 0, & \text{if } |x| \ge \varepsilon_0 \end{cases}$$
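The error due to the representation
with rounding is:
$$|x - \mathrm{fl}(x)| \le \begin{cases} \varepsilon_0, & \text{if } |x| < \varepsilon_0 \\ \frac{1}{2}\varepsilon_1 |x|, & \text{if } |x| \ge \varepsilon_0 \end{cases}$$
with chopping is:
$$|x - \mathrm{fl}(x)| \le \begin{cases} \varepsilon_0, & \text{if } |x| < \varepsilon_0 \\ \varepsilon_1 |x|, & \text{if } |x| \ge \varepsilon_0 \end{cases}$$
The error bound is the same in both cases if $x$ falls into the black hole.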
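The rounding bound can be checked exactly for decimal inputs. The sketch below assumes IEEE doubles with round-to-nearest conversion of literals, as in CPython; the helper name `input_rounding_error` and the sample inputs are illustrative choices.

```python
import sys
from fractions import Fraction

eps1 = Fraction(sys.float_info.epsilon)   # exactly 2**-52

def input_rounding_error(text):
    """Exact |x - fl(x)| and the bound eps1/2 * |x| for a decimal literal."""
    x = Fraction(text)             # the real number, represented exactly
    fl_x = Fraction(float(text))   # the double it is rounded to, also exact
    return abs(x - fl_x), eps1 / 2 * abs(x)

for text in ["0.1", "0.3721", "21.65", "1e-3"]:
    err, bound = input_rounding_error(text)
    print(text, float(err), float(bound), err <= bound)
```

Since both the real number and its floating point image are converted to exact fractions, the comparison `err <= bound` is itself free of rounding error.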
Rounding and chopping in basic arithmetic operations:
Example 1:
$a = 10$, $t = 3$, $k_- = -2$, $k_+ = 2$
$x = 0.425 \cdot 10^{-1}$, $y = 0.677 \cdot 10^{-2}$
$\mathrm{fl}(x + y) = ?$
To add the numbers we need a common exponent, therefore we shift $y$:
$$y \to y = 0.0677 \cdot 10^{-1}$$
$$x + y = 0.425 \cdot 10^{-1} + 0.0677 \cdot 10^{-1} = 0.4927 \cdot 10^{-1}$$
$$\mathrm{fl}(x + y) = \begin{cases} 0.492 \cdot 10^{-1}, & \text{chopping} \\ 0.493 \cdot 10^{-1}, & \text{regular rounding} \end{cases}$$
Example 2:
$a = 10$, $t = 3$, $k_- = -2$, $k_+ = 2$
$x = 0.367 \cdot 10^{-2}$, $y = 0.682 \cdot 10^{-2}$
$\mathrm{fl}(x + y) = ?$
$$x + y = 0.367 \cdot 10^{-2} + 0.682 \cdot 10^{-2} = 1.049 \cdot 10^{-2} = 0.1049 \cdot 10^{-1}$$
$$\mathrm{fl}(x + y) = \begin{cases} 0.104 \cdot 10^{-1}, & \text{chopping} \\ 0.105 \cdot 10^{-1}, & \text{regular rounding} \end{cases}$$
Rounding or chopping happens when we store the result in a memory location. The arithmetic register is generally longer than the memory locations.
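Both examples can be replayed with Python's `decimal` module, which adds the operands exactly and then rounds the result to the context precision. This is a sketch only: the exponent limits $k_-$ and $k_+$ are not modelled, and the helper name `fl_add` is hypothetical.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_UP

chop = Context(prec=3, rounding=ROUND_DOWN)         # keep 3 digits, drop the rest
nearest = Context(prec=3, rounding=ROUND_HALF_UP)   # keep 3 digits, round to nearest

def fl_add(x, y, ctx):
    """Add exactly, then round the result to the precision of the context."""
    return ctx.add(Decimal(x), Decimal(y))

# Example 1: x = 0.425e-1, y = 0.677e-2
print(fl_add("0.0425", "0.00677", chop))       # 0.0492
print(fl_add("0.0425", "0.00677", nearest))    # 0.0493
# Example 2: x = 0.367e-2, y = 0.682e-2
print(fl_add("0.00367", "0.00682", chop))      # 0.0104
print(fl_add("0.00367", "0.00682", nearest))   # 0.0105
```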
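Basic arithmetic operations.
Denote by $\triangle$ one of the four basic operations, and let $x$ and $y$ be floating point numbers. Assume that $|x \triangle y| < M_\infty$, and moreover that the computer calculates the result of the operation without error and then assigns a floating point number to the result.
In the case of rounding:
$$|\mathrm{fl}(x \triangle y) - x \triangle y| \le \begin{cases} \varepsilon_0, & \text{if } |x \triangle y| < \varepsilon_0 \\ \frac{1}{2}\varepsilon_1 |x \triangle y|, & \text{if } |x \triangle y| \ge \varepsilon_0 \end{cases}$$
In the case of chopping:
$$|\mathrm{fl}(x \triangle y) - x \triangle y| \le \begin{cases} \varepsilon_0, & \text{if } |x \triangle y| < \varepsilon_0 \\ \varepsilon_1 |x \triangle y|, & \text{if } |x \triangle y| \ge \varepsilon_0 \end{cases}$$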
Propagation of errors in arithmetic operations
We consider the four basic arithmetic operations separately: the addition, subtraction, multiplication and division of positive floating point numbers $x$ and $y$. We assume that $x$ and $y$ already have errors:
$$\delta(x) = |x - \mathrm{fl}(x)|, \qquad \delta(y) = |y - \mathrm{fl}(y)|.$$
We use the theorem valid for the error of a continuously differentiable function $f(x, y)$, namely that to first order in $\delta(x)$ and $\delta(y)$
$$|f(\mathrm{fl}(x), \mathrm{fl}(y)) - f(x, y)| \le \left|\frac{\partial f}{\partial x}\right| \delta(x) + \left|\frac{\partial f}{\partial y}\right| \delta(y),$$
and we take the right-hand side as the error estimate $\delta(f(x, y))$.
We assume that the error comes only from the input errors of the arguments; the calculation of $f(x, y)$ does not introduce additional errors.
Absolute error of the addition:
$$f(x, y) = x + y, \qquad \frac{\partial f}{\partial x} = 1, \qquad \frac{\partial f}{\partial y} = 1,$$
$$\delta(x + y) = \delta(x) + \delta(y).$$
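Absolute error of the subtraction:
$$f(x, y) = x - y, \qquad \frac{\partial f}{\partial x} = 1, \qquad \frac{\partial f}{\partial y} = -1, \qquad \left|\frac{\partial f}{\partial y}\right| = 1,$$
$$\delta(x - y) = \delta(x) + \delta(y).$$
Note that this error is the same as that of the addition.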
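Absolute error of the multiplication:
$$f(x, y) = x \cdot y, \qquad \frac{\partial f}{\partial x} = y, \qquad \frac{\partial f}{\partial y} = x,$$
$$\delta(x \cdot y) = |y|\,\delta(x) + |x|\,\delta(y) \approx |\mathrm{fl}(y)|\,\delta(x) + |\mathrm{fl}(x)|\,\delta(y).$$
Absolute error of the division:
$$f(x, y) = x / y, \qquad \frac{\partial f}{\partial x} = \frac{1}{y}, \qquad \frac{\partial f}{\partial y} = -\frac{x}{y^2}, \qquad \left|\frac{\partial f}{\partial y}\right| = \frac{|x|}{|y^2|},$$
$$\delta(x / y) = \frac{\delta x}{|y|} + \delta y\,\frac{|x|}{|y^2|} \approx \frac{|\mathrm{fl}(y)|\,\delta x + |\mathrm{fl}(x)|\,\delta y}{|\mathrm{fl}(y)|^2}.$$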
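The four absolute error formulas can be compared with directly perturbed inputs. The sketch below is an illustration with arbitrarily chosen values; the function name `abs_error_bounds` is hypothetical, and the perturbation signs are picked so that the input errors reinforce each other.

```python
def abs_error_bounds(x, y, dx, dy):
    """First-order absolute error estimates for x+y, x-y, x*y, x/y (y != 0)."""
    return {
        "add": dx + dy,
        "sub": dx + dy,
        "mul": abs(y) * dx + abs(x) * dy,
        "div": (abs(y) * dx + abs(x) * dy) / y ** 2,
    }

x, y, dx, dy = 3.0, 2.0, 1e-6, 2e-6   # assumed values and input errors
bounds = abs_error_bounds(x, y, dx, dy)

# Actual change of the result for the worst-case signs of the perturbations.
actual = {
    "add": abs((x + dx) + (y + dy) - (x + y)),
    "sub": abs((x + dx) - (y - dy) - (x - y)),
    "mul": abs((x + dx) * (y + dy) - x * y),
    "div": abs((x + dx) / (y - dy) - x / y),
}
for op in bounds:
    print(op, bounds[op], actual[op])
```

The estimates and the actual changes agree up to second-order terms such as `dx * dy`, which the first-order formulas neglect.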
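The relative error is the ratio of the absolute error and the modulus of the result.
Relative error of the addition (the last equality holds because $x$ and $y$ are positive):
$$\frac{\delta(x + y)}{|\mathrm{fl}(x) + \mathrm{fl}(y)|} = \frac{\delta x + \delta y}{|\mathrm{fl}(x) + \mathrm{fl}(y)|} = \frac{\delta x + \delta y}{|\mathrm{fl}(x)| + |\mathrm{fl}(y)|}$$
Relative error of the subtraction:
$$\frac{\delta(x - y)}{|\mathrm{fl}(x) - \mathrm{fl}(y)|} = \frac{\delta x + \delta y}{|\mathrm{fl}(x) - \mathrm{fl}(y)|}$$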
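If the first $r$ digits of $x$ and $y$ are the same, then we lose $r$ significant digits from the result.
Example:
Assume that $a = 10$, $t = 8$.
$$\mathrm{fl}(x) = 0.76545421 \times 10^1, \qquad \delta x = 10^{-7}$$
$$\mathrm{fl}(y) = 0.76544200 \times 10^1, \qquad \delta y = 10^{-7}$$
Then the difference is
$$z = \mathrm{fl}(x) - \mathrm{fl}(y) = 0.00001221 \times 10^1 = 0.1221 \times 10^{-3} \approx 10^{-4}.$$
The absolute error of $z$ is
$$\delta(x - y) = \delta x + \delta y = 10^{-7} + 10^{-7} = 2 \times 10^{-7}.$$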
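The relative errors of the terms are:
$$\frac{\delta x}{\mathrm{fl}(x)} = \frac{10^{-7}}{7.6} \approx 10^{-8}, \qquad \frac{\delta y}{\mathrm{fl}(y)} = \frac{10^{-7}}{7.6} \approx 10^{-8}.$$
The relative error of the result is:
$$\frac{\delta x + \delta y}{|\mathrm{fl}(x) - \mathrm{fl}(y)|} = \frac{2 \times 10^{-7}}{10^{-4}} = 2 \times 10^{-3}.$$
The relative error of the result is larger than that of the terms by a factor of about $10^5$.
This warns us that subtraction is a dangerous operation in numerical calculations.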
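The numbers of this example can be recomputed with decimal arithmetic, which represents the eight-digit operands exactly. This is a sketch that uses the unrounded value of $|z|$ instead of the estimate $10^{-4}$, so the printed ratios differ slightly from the rounded figures above.

```python
from decimal import Decimal

fl_x, fl_y = Decimal("7.6545421"), Decimal("7.6544200")
dx = dy = Decimal("1e-7")

z = fl_x - fl_y              # exact in decimal arithmetic
rel_x = dx / fl_x            # relative error of the term x
rel_z = (dx + dy) / abs(z)   # relative error estimate of the difference

print(z)                     # 0.0001221
print(float(rel_x), float(rel_z), float(rel_z / rel_x))
```

The last printed value is the amplification factor of the relative error, which is of the order $10^5$ as stated.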
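The relative error of the multiplication:
$$\frac{\delta(x \times y)}{|\mathrm{fl}(x)||\mathrm{fl}(y)|} = \frac{|\mathrm{fl}(x)|\,\delta y + |\mathrm{fl}(y)|\,\delta x}{|\mathrm{fl}(x)||\mathrm{fl}(y)|} = \frac{\delta y}{|\mathrm{fl}(y)|} + \frac{\delta x}{|\mathrm{fl}(x)|}$$
The relative error of the division:
$$\frac{\delta(x / y)}{\left|\frac{\mathrm{fl}(x)}{\mathrm{fl}(y)}\right|} = \frac{\;\dfrac{|\mathrm{fl}(x)|\,\delta y + |\mathrm{fl}(y)|\,\delta x}{|\mathrm{fl}(y)|}\;}{|\mathrm{fl}(x)|} = \frac{|\mathrm{fl}(x)|\,\delta y + |\mathrm{fl}(y)|\,\delta x}{|\mathrm{fl}(x)||\mathrm{fl}(y)|} = \frac{\delta y}{|\mathrm{fl}(y)|} + \frac{\delta x}{|\mathrm{fl}(x)|}$$
Note that for multiplication and division the relative error of the result is the sum of the relative errors of the operands $x$ and $y$.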
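The additivity of relative errors can be observed numerically. The sketch below perturbs two arbitrarily chosen operands by assumed relative errors `rx` and `ry`; the helper name `rel_error` is hypothetical.

```python
def rel_error(approx, exact):
    """Relative error of approx with respect to exact."""
    return abs(approx - exact) / abs(exact)

x, y = 3.0, 7.0
rx, ry = 1e-6, 2e-6   # assumed relative errors of the operands

# Worst case for the product: both operands too large.
prod_err = rel_error((x * (1 + rx)) * (y * (1 + ry)), x * y)
# Worst case for the quotient: numerator too large, denominator too small.
quot_err = rel_error((x * (1 + rx)) / (y * (1 - ry)), x / y)

print(prod_err, quot_err, rx + ry)
```

Both measured relative errors agree with `rx + ry` up to second-order terms, in line with the first-order formulas.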