Floating Point Computation

Jyun-Ming Chen

Spring 2013

Page 1

Floating Point Computation

Jyun-Ming Chen

Page 2

Contents

• Sources of Computational Error

• Computer Representation of (floating-point) Numbers

• Efficiency Issues

Page 3

Sources of Computational Error

• In converting a mathematical problem into a numerical one, errors are introduced due to limited computational resources:

– round-off error (limited precision of representation)

– truncation error (limited time for computation)

• Misc.:

– Error in the original data

– Blunder: a mistake made through stupidity, ignorance, or carelessness; a programming or data-input error

– Propagated error

Page 4

Supplement: Error Classification (Hildebrand)

• Gross error: caused by human or mechanical mistakes

• Roundoff error: the consequence of using a number specified by n correct digits to approximate a number which requires more than n digits (generally infinitely many digits) for its exact specification.

• Truncation error: any error which is neither a gross error nor a roundoff error.

• Frequently, a truncation error corresponds to the fact that, whereas an exact result would be afforded (in the limit) by an infinite sequence of steps, the process is truncated after a certain finite number of steps.

Page 5

Common Measures of Error

• Definitions:

– total error = round-off error + truncation error

– absolute error = | numerical – exact |

– relative error = absolute error / | exact |

• If the exact value is zero, the relative error is undefined

Page 6

Ex: Round off error

A floating-point representation consists of a finite number of digits.

The approximation of the real numbers on the number line is therefore discrete!

[Figure: discrete floating-point values marked on the real number line R]

Page 7

Watch out for printf !!

• By default, “%f” prints 6 digits after the decimal point; digits beyond the type's precision are not meaningful.

Page 8

Ex: Numerical Differentiation

• Evaluating first derivative of f(x)

Taylor expansion: f(x+h) = f(x) + h f'(x) + (h^2/2) f''(ξ)

⇒ f'(x) = [f(x+h) − f(x)] / h − (h/2) f''(ξ)

f'(x) ≈ [f(x+h) − f(x)] / h, for small h

Truncation error: (h/2) f''(ξ)

Page 9

Numerical Differentiation (cont)

• Select a problem with known answer– So that we can evaluate the error!

f(x) = x^3, f'(x) = 3x^2

f'(10) = 300

Page 10

Numerical Differentiation (cont)

• Error analysis

– as h decreases, the truncation error decreases

• What happened at h = 0.00001?!

Page 11

Ex: Polynomial Deflation

• F(x) is a polynomial with 20 real roots

• Use any method to numerically find one root, then deflate the polynomial to degree 19

• Solve another root, and deflate again, and again, …

• The accuracy of the roots obtained is getting worse each time due to error propagation

f(x) = (x − 1)(x − 2)⋯(x − 20)

Page 12

Computer Representation of Floating Point Numbers

Decimal-binary conversion

Floating point vs. fixed point

Standard: IEEE 754 (1985)

Page 13

Decimal-Binary Conversion

• Ex: 29 (base 10): repeatedly divide by 2; the remainders are the binary digits a0, a1, …, least significant first

29 ÷ 2 = 14, remainder 1 → a0 = 1

14 ÷ 2 = 7, remainder 0 → a1 = 0

7 ÷ 2 = 3, remainder 1 → a2 = 1

3 ÷ 2 = 1, remainder 1 → a3 = 1

1 ÷ 2 = 0, remainder 1 → a4 = 1

29 (base 10) = 11101 (base 2)

Page 14

Fraction Binary Conversion

• Ex: 0.625 (base 10)

• Repeatedly multiply the fraction by 2; the integer parts are the binary digits a1, a2, a3, …

0.625 × 2 = 1.250 → a1 = 1

0.250 × 2 = 0.500 → a2 = 0

0.500 × 2 = 1.000 → a3 = 1, a4 = a5 = … = 0

0.625 (base 10) = 0.101 (base 2)

Page 15

• Computing 0.625:

0.625 × 2 = 1.250

0.250 × 2 = 0.500

0.500 × 2 = 1.000

0.625 (base 10) = 0.101 (base 2)

• How about 0.1 (base 10)?

0.1 (base 10) = 0.000110011… (base 2); the group 0011 repeats forever, so 0.1 has no exact binary representation

Page 16

Floating vs. Fixed Point

• Decimal, 6 digits (positive numbers)

– fixed point, with 5 digits after the decimal point:

• 0.00001, … , 9.99999

– floating point, 2 digits for the (base-10) exponent and 4 digits for the mantissa (accuracy):

• 0.001 × 10^0, … , 9.999 × 10^99

• Comparison:

– fixed point: fixed accuracy; simple arithmetic (used in systems without an FPU)

– floating point: trades accuracy for a larger range of representation

Page 17

Floating Point Representation

• A value is represented as ± f × β^e

• Fraction (mantissa), f

– usually normalized so that 1/β ≤ f < 1

• Base, β

– 2 for personal computers

– 16 for mainframes

– …

• Exponent, e

Page 18

IEEE 754-1985

• Purpose: make floating-point systems portable

• Defines: the number representation, how calculations are performed, exceptions, …

• Single-precision (32-bit)

• Double-precision (64-bit)

Page 19

Number Representation

• S: sign of mantissa

• Range (roughly)

– Single: 10^-38 to 10^38

– Double: 10^-307 to 10^307

• Precision (roughly)

– Single: 7-8 significant decimal digits

– Double: 15 significant decimal digits

Why about 10^308 for double: 2^1024 ≈ 10^308, since log10 2^1024 = 1024 log10 2 ≈ 308.25

Page 20

Significant Digits

• In binary sense, 24 bits are significant (with implicit one – next page)

• In decimal sense, roughly 7-8 decimal significant digits

• When you write your program, make sure the printed results carry only the meaningful significant digits.

[Figure: the 24-bit significand 1.f, whose last stored bit has weight 2^-23]

Page 21

Implicit One

• The normalized mantissa always has the form 1.f (i.e., 1.0 ≤ mantissa < 2.0)

– only the fractional part is stored, gaining one extra bit of precision

• Ex: 3.5

3.5 (base 10) = 11.1 (base 2) = 1.11 × 2^1

Page 22

Exponent Bias

• Ex: in single precision, exponent has 8 bits– 0000 0000 (0) to 1111 1111 (255)

• Add an offset so that both + and − exponents can be represented

– effective exponent = biased exponent − bias

– bias value: 32-bit (127); 64-bit (1023)

– Ex: 32-bit

• 1000 0000 (128): effective exp.=128-127=1

Page 23

Ex: Convert – 3.5 to 32-bit FP Number

−3.5 < 0, so s = 1

3.5 (base 10) = 11.1 (base 2) = 1.11 × 2^1

e = 1 + 127 = 128 = 1000 0000 (base 2)

m = 1100 0000 0000 0000 0000 000 (base 2), with the implicit leading 1 dropped

Bit pattern: 11000000 01100000 00000000 00000000

Page 24

Examine Bits of FP Numbers

• Explain how this program works

Page 25

The “Examiner”

• Use the previous program to

– observe how ME works

– test subnormal behavior on your computer/compiler

– convince yourself why the subtraction of two nearly equal numbers produces lots of error

– examine NaN: Not-a-Number!?

Page 26

Design Philosophy of IEEE 754

• [s|e|m]

• S first: whether the number is + or − can be tested easily

• E before M: simplifies sorting

• Negative exponents represented by a bias (not 2's complement) for ease of sorting

– [biased rep] −1, 0, 1 = 126, 127, 128

– [2's compl.] −1, 0, 1 = 0xFF, 0x00, 0x01

• 2's complement would need more complicated logic for sorting and increment/decrement

Page 27

Exceptions

• Overflow:

– ±INF: when a number exceeds the range of representation

• Underflow:

– numbers too close to zero are treated as zero

• Dwarf:

– the smallest representable number in the FP system

• Machine Epsilon (ME):

– a number with computational significance (more later)

Page 28

Extremities

• E = (1…1)

– M = (0…0): infinity

– M not all zeros: NaN (Not a Number)

• E = (0…0)

– M = (0…0): clean zero

– M not all zeros: dirty zero (see next page)

More later

Page 29

Not-a-Number

• Numerical exceptions

– sqrt of a negative number

– invalid domain of trigonometric functions

– …

• Often causes the program to stop running

Page 30

Extremities (32-bit)

• Max:

0 11111110 11111111111111111111111

(1.111…1) (base 2) × 2^(254−127) = (10 − 0.000…1) (base 2) × 2^127 ≈ 2^128

• Min (w/o stepping into dirty zero):

0 00000001 00000000000000000000000

(1.000…0) (base 2) × 2^(1−127) = 2^-126

Page 31

Dirty-Zero (a.k.a. denormals)

• No “implicit one”

• IEEE 754 did not specify compatibility for denormals

• If you are not sure how to handle them, stay away from them; scale your problem properly

– “Many problems can be solved by pretending they do not exist”

a.k.a.: also known as

Page 32

Dirty-Zero (cont)

00000000 10000000 00000000 00000000 → 2^-126 (smallest normalized)

00000000 01000000 00000000 00000000 → 2^-127

00000000 00100000 00000000 00000000 → 2^-128

00000000 00010000 00000000 00000000 → 2^-129

(Dwarf: the smallest representable)

[Figure: number line R; the denormals fill the gap between 0 and 2^-126, with the dwarf at the bottom]

Page 33

Dwarf (32-bit)

0 00000000 00000000000000000000001

Value: 2^-149

Page 34

Machine Epsilon (ME)

• Definition

– the smallest non-zero number that makes a difference when added to 1.0 on your working platform

• This is not the same as the dwarf

Page 35

Computing ME (32-bit)

Start from eps = 1 and keep halving: 1 + eps gets closer and closer to 1.0 until the sum becomes indistinguishable from 1.0.

ME: (00111111 10000000 00000000 00000001) − 1.0 = 2^-23 ≈ 1.19 × 10^-7

Page 36

Effect of ME

Page 37

Significance of ME

• Never terminate an iteration by testing whether two FP numbers are equal.

• Instead, test whether |x − y| < ME

Page 38

Machine Epsilon (Wikipedia)


Machine epsilon gives an upper bound on the relative error due to rounding in floating point arithmetic.

Page 39

Numerical Scaling

• Number density: there are as many IEEE 754 numbers in [1.0, 2.0] as there are in [256, 512]

• Revisit:

– “round-off” error

– ME: a measure of real-number density near 1.0

• Implication:

– scale your problem so that intermediate results lie between 1.0 and 2.0 (where numbers are dense, and where round-off error is smallest)

[Figure: number line R showing denser spacing near 1.0]

Page 40

Scaling (cont)

• Performing computation on denser portions of the real line minimizes round-off error

– but don't overdo it; switching to double precision easily increases the precision

– the densest part is near the subnormals, if density is defined as numbers per unit length

Page 41

How Subtraction is Performed on Your PC

• Steps:

– convert to base 2

– equalize the exponents by adjusting the mantissa values; truncate the bits that do not fit

– subtract the mantissas

– normalize

Page 42

Subtraction of Nearly Equal Numbers

• Base 10: 1.24446 − 1.24445 = 0.00001; six significant digits collapse to one

• After the result is normalized, the trailing mantissa bits carry no information

– significant loss of accuracy (most bits are unreliable)

Page 43

Theorem of Loss of Precision

• Let x, y be normalized floating-point machine numbers with x > y > 0

• If 2^(−p) ≤ 1 − y/x ≤ 2^(−q), then at most p, at least q significant binary bits are lost in the subtraction x − y.

• Interpretation:

– “When two numbers are very close, their subtraction introduces a lot of numerical error.”

Page 44

Implications

• When you program: • You should write these instead:

f(x) = sqrt(x^2 + 1) − 1  ⇒  f(x) = [sqrt(x^2 + 1) − 1] · [sqrt(x^2 + 1) + 1] / [sqrt(x^2 + 1) + 1] = x^2 / [sqrt(x^2 + 1) + 1]

g(x) = ln(x) − 1  ⇒  g(x) = ln(x) − ln(e) = ln(x/e)

Every FP operation introduces error, but the subtraction of nearly equal numbers is the worst and should be avoided whenever possible

Page 45

Efficiency Issues

• Horner Scheme

• program examples

Page 46

Horner Scheme

• For polynomial evaluation

• Compare efficiency

Page 47

Accuracy vs. Efficiency

Page 48

Good Coding Practice

Page 49

Storing Multidimensional Array in Linear Memory

C and others: row-major order (the rightmost index varies fastest in memory)

Fortran, MATLAB: column-major order (the leftmost index varies fastest in memory)

Page 50

On Accessing Arrays …

Which one is more efficient?

Page 51

Issues of PI

• 3.14 is often not accurate enough– 4.0*atan(1.0) is a good substitute

Page 52

Compare:

Page 53

Exercise

• Explain why 100,000.1 − 100,000 does not evaluate to exactly 0.1

• Explain why the harmonic series

1 + 1/2 + 1/3 + 1/4 + … = Σ (n = 1 to ∞) 1/n

converges when implemented numerically

Page 54

Exercise

• Why does Me( ) not work as advertised?

• Construct the 64-bit version of everything

– Bit-Examiner

– Dme( )

• 32-bit: int and float. Can every int be exactly represented by a float (if converted)?

Page 55

Understanding Your Platform

[Table: sizeof() results for the basic C types (char, short, int, long, float, double, pointer) on this platform]

Memory word: 4 bytes on 32-bit machines

Page 56

Padding

How about …

Page 57

Data Alignment (data structure padding)

• Padding is only inserted when a structure member is followed by a member with a larger alignment requirement or at the end of the structure.

• Alignment requirement:

Page 58

Ex: Padding

// for Data2 to align on a 2-byte boundary

// no padding required; already on 4-byte boundary

// final padding to align a 4-byte boundary

sizeof (struct MixedData) = 12 bytes

Page 59

Data Alignment (cont)

• By changing the ordering of members in a structure, it is possible to change the amount of padding required to maintain alignment.

• Direct the compiler to ignore data alignment (align it on a 1-byte boundary)

#pragma pack(push) pushes the current alignment setting onto a stack; #pragma pack(pop) restores it

Page 60

#include <stdio.h>

struct pad1 {
    char data1;
    short data2;
    int data3;
    char data4;
};

struct pad2 {
    int data3;
    short data2;
    char data1;
    char data4;
};

#pragma pack(push)
#pragma pack(1)
struct pad3 {
    char data1;
    short data2;
    int data3;
    char data4;
};
#pragma pack(pop)

int main(void)
{
    printf("pad1 size: %d\n", (int)sizeof(struct pad1));
    printf("pad2 size: %d\n", (int)sizeof(struct pad2));
    printf("pad3 size: %d\n", (int)sizeof(struct pad3));
    return 0;
}

Output: pad1 size: 12, pad2 size: 8, pad3 size: 8