Numbers in scientific computing
- Integers: ..., −2, −1, 0, 1, 2, ...
- Rational numbers: 1/3, 22/7: not often encountered
- Real numbers: 0, 1, −1.5, 2/3, √2, log 10, ...
- Complex numbers: 1 + 2i, √3 − √5 i, ...
Computers use a finite number of bits to represent numbers, so only a finite number of numbers can be represented, and no irrational numbers (and not even all rational numbers).
Integers
Scientific computation mostly uses real numbers. Integers are mostly used for array indexing.
16/32/64 bit: short, int, long, long long in C; the sizes are not standardized, so use sizeof(long) et cetera. (Also unsigned int et cetera.)

INTEGER*2/4/8 in Fortran
Negative integers
Use of a sign bit: typically the first bit:

    s i_1 ... i_n

Simplest solution: for n > 0 store fl(n) = 0, i_1, ..., i_31, and then fl(−n) = 1, i_1, ..., i_31: the same magnitude bits with the sign bit flipped.

Problem: this gives both +0 and −0; also impractical in other ways.
Sign bit
bitstring          00⋯0 ... 01⋯1     10⋯0 ... 11⋯1
as unsigned int    0    ... 2^31−1   2^31 ... 2^32−1
as naive signed    0    ... 2^31−1   −0   ... −2^31+1
2’s complement
Better solution: if 0 ≤ n ≤ 2^31−1, then fl(n) = 0, i_1, ..., i_31; if 1 ≤ n ≤ 2^31, then fl(−n) = fl(2^32 − n).

bitstring                00⋯0 ... 01⋯1     10⋯0 ... 11⋯1
as unsigned int          0    ... 2^31−1   2^31  ... 2^32−1
as 2's comp. integer     0    ... 2^31−1   −2^31 ... −1
Subtraction in 2’s complement
Subtraction m − n is easy.

- Case m < n: observe that −n has the bit pattern of 2^32 − n. Also, m + (2^32 − n) = 2^32 − (n − m) where 0 < n − m < 2^31 − 1, so 2^32 − (n − m) is the 2's complement bit pattern of m − n.
- Case m > n: the bit pattern for −n is 2^32 − n, so m − n as unsigned is m + 2^32 − n = 2^32 + (m − n). Here m − n > 0. The 2^32 is an overflow bit; ignore it.
Floating point numbers
Analogous to scientific notation x = 6.022 · 10^23:

    x = ± Σ_{i=0}^{t−1} d_i β^{−i} · β^e

- a sign bit
- β is the base of the number system
- 0 ≤ d_i ≤ β − 1 are the digits of the mantissa; with the radix point after the first digit, the mantissa is < β
- e ∈ [L, U] is the exponent
Some examples
                        β    t    L        U
IEEE single (32 bit)    2    24   −126     127
IEEE double (64 bit)    2    53   −1022    1023
Old Cray 64 bit         2    48   −16383   16384
IBM mainframe 32 bit    16   6    −64      63
packed decimal          10   50   −999     999

BCD is tricky: 3 decimal digits in 10 bits.

(We will often use β = 10 in the examples because it is easier for humans to read, but all practical computers use β = 2.)
Normalized numbers
Require the first digit in the mantissa to be nonzero. Equivalent: the mantissa part satisfies 1 ≤ x_m < β.

This gives a unique representation for each number; also, in binary the first digit is then always 1, so we don't need to store it.

With normalized numbers, the underflow threshold is 1 · β^L; 'gradual underflow' is possible, but usually not efficient.
IEEE 754
sign   exponent     mantissa
s      e_1 ⋯ e_8    s_1 ... s_23
31     30 ⋯ 23      22 ⋯ 0

(e_1 ⋯ e_8)         numerical value
(0⋯0)      = 0      ±0.s_1⋯s_23 × 2^−126
(0⋯01)     = 1      ±1.s_1⋯s_23 × 2^−126
(0⋯010)    = 2      ±1.s_1⋯s_23 × 2^−125
⋯
(01111111) = 127    ±1.s_1⋯s_23 × 2^0
(10000000) = 128    ±1.s_1⋯s_23 × 2^1
⋯
(11111110) = 254    ±1.s_1⋯s_23 × 2^127
(11111111) = 255    ±∞ if s_1⋯s_23 = 0, otherwise NaN
Representation error
Error between a number x and its representation x̃:

absolute: x − x̃ or |x − x̃|
relative: (x − x̃)/x or |(x − x̃)/x|

Equivalent: x̃ = x ± ε ⇔ |x − x̃| ≤ ε ⇔ x̃ ∈ [x − ε, x + ε].

Also: x̃ = x(1 + ε) is often shorthand for |(x̃ − x)/x| ≤ ε.
Machine precision
Any real number can be represented to a certain precision: x̃ = x(1 + ε), where

truncation: ε = β^(−t+1)
rounding: ε = ½ β^(−t+1)

This is called the machine precision: the maximum relative error.

32-bit single precision: mp ≈ 10^−7
64-bit double precision: mp ≈ 10^−16

This is the maximum attainable accuracy.

Another definition of machine precision: the smallest number ε such that 1 + ε > 1.
Addition
1. align exponents
2. add mantissas
3. adjust exponent to normalize
Example: 1.00 + 2.00 × 10^−2 = 1.00 + .02 = 1.02. This is exact, but what happens with 1.00 + 2.55 × 10^−2?

Example: 5.00 × 10^1 + 5.04 = (5.00 + 0.504) × 10^1 → 5.50 × 10^1

Any error comes from truncating the mantissa: if x is the true sum and x̃ the computed sum, then x̃ = x(1 + ε) with |ε| < 10^−2.
The ‘correctly rounded arithmetic’ model
Assumption (enforced by IEEE 754):
The numerical result of an operation is the rounding
of the exactly computed result.
fl(x_1 ⊙ x_2) = (x_1 ⊙ x_2)(1 + ε)

where ⊙ = +, −, ∗, /
Note: this holds only for a single operation!
Guard digits
Correctly rounding is not trivial, especially for subtraction.
Example: t = 2, β = 10: 1.0 − 9.5 × 10^−1, exact result 0.05 = 5.0 × 10^−2.

- Simple approach: 1.0 − 9.5 × 10^−1 = 1.0 − 0.9 = 0.1 = 1.0 × 10^−1
- Using a 'guard digit': 1.0 − 9.5 × 10^−1 = 1.0 − 0.95 = 0.05 = 5.0 × 10^−2, exact.

In general 3 extra bits are needed.
Error propagation under addition
Let us consider s = x_1 + x_2, and assume that x_i is represented by x̃_i = x_i(1 + ε_i). Then

s̃ = (x̃_1 + x̃_2)(1 + ε_3)
  = x_1(1 + ε_1)(1 + ε_3) + x_2(1 + ε_2)(1 + ε_3)
  ≈ x_1(1 + ε_1 + ε_3) + x_2(1 + ε_2 + ε_3)
  ≈ s(1 + 2ε)

⇒ errors are added.

Assumptions: all ε_i of approximately equal size and small; x_i > 0.
Multiplication
1. add exponents
2. multiply mantissas
3. adjust exponent
Example: .123 × .567 × 10^1 = .069741 × 10^1 → .69741 × 10^0 → .697 × 10^0.
What happens with relative errors?
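A sketch of the answer, in the same notation as the addition case (ε_1, ε_2 the representation errors, ε_3 the rounding of the multiplication itself):

```latex
\tilde{x}_1 \tilde{x}_2 (1+\epsilon_3)
  = x_1(1+\epsilon_1)\, x_2(1+\epsilon_2)\,(1+\epsilon_3)
  \approx x_1 x_2 (1+\epsilon_1+\epsilon_2+\epsilon_3)
  \approx x_1 x_2 (1 + 3\epsilon)
```

So relative errors are added under multiplication too.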
Subtraction
Correct rounding only applies to a single operation.
Example: 1.24 − 1.23 = 0.01 → 1.00 × 10^−2: the result is exact, but has only one significant digit.

What if 1.24 = fl(1.244) and 1.23 = fl(1.225)? The correct result is 1.9 × 10^−2; almost 100% error.
- Cancellation leads to loss of precision
- subsequent operations with this result are inaccurate
- this can not be fixed with guard digits and such
- ⇒ avoid subtracting numbers that are likely to be close.
Example: ax^2 + bx + c = 0 → x = (−b ± √(b^2 − 4ac)) / 2a

Suppose b > 0 and b^2 ≫ 4ac; then the '+' solution will be inaccurate.

Better: compute x_− = (−b − √(b^2 − 4ac)) / 2a and use x_+ · x_− = c/a.
Serious example
Evaluate Σ_{n=1}^{10000} 1/n^2 = 1.644834 in 6 digits: machine precision is 10^−6 in single precision.

The first term is 1, so the partial sums are ≥ 1, so 1/n^2 < 10^−6 gets ignored ⇒ the last 7000 terms (or more) are ignored ⇒ the sum is 1.644725: 4 correct digits.

Solution: sum in reverse order; exact result in single precision.
Why? Consider the ratio of two consecutive terms:

    n^2 / (n − 1)^2 = n^2 / (n^2 − 2n + 1) = 1 / (1 − 2/n + 1/n^2) ≈ 1 + 2/n

With aligned exponents:

    n − 1:  .00⋯0 10⋯00
    n:      .00⋯0 10⋯01 0⋯0
                  k = log(n/2) positions

The last digit in the smaller number is not lost if n < 2/ε.
Stability of linear system solving
Problem: solve Ax = b, where b is inexact.

    A(x + Δx) = b + Δb.

Since Ax = b, we get AΔx = Δb. From this:

    { Δx = A^−1 Δb        { ‖A‖ ‖x‖ ≥ ‖b‖
    {                 ⇒   {
    { Ax = b              { ‖Δx‖ ≤ ‖A^−1‖ ‖Δb‖

    ⇒ ‖Δx‖/‖x‖ ≤ ‖A‖ ‖A^−1‖ ‖Δb‖/‖b‖

The quantity ‖A‖ ‖A^−1‖ is the 'condition number'. Attainable accuracy depends on matrix properties.
Consequences of roundoff

Multiplication and addition are not associative: problems for parallel computations.
Operations with “same” outcomes are not equally stable:matrix inversion is unstable, elimination is stable
Complex numbers
Two real numbers: real and imaginary part.
Storage:
- Store real/imaginary parts adjacent: easy to pass the address of one number
- Store an array of real parts, then an array of imaginary parts: better for stride-1 access if only the real parts are needed. There are other considerations.