Systems Programming and Computer Architecture - ETH · Floating Point Recap for the Assignment Systems Programming and Computer Architecture 2

Systems Programming and Computer Architecture

(252-0061-00)

Session 9

Floating Point

© Systems Group | Department of Computer Science | ETH Zürich



Floating Point

Recap for the Assignment



Floating Point Representation

Numerical Form

• "Scientific notation"
– Sign $s \in \{0, 1\}$
– Significand $M \in [1.0, 2.0)$
– Exponent $E$

• $F = (-1)^s \cdot M \cdot 2^E$

Encoding

• "Bit pattern"
– MSB is the sign bit s
– exp encodes E, but exp ≠ E
– frac encodes M, but frac ≠ M


Field widths in bits:

Format     s   exp   frac
float      1   8     23
double     1   11    52
extended   1   15    63/64

Casting

Integer Types

• What happens here?

    unsigned int foo;
    long bar = (long) foo;

Floats

• What happens here?

    int i;
    long long l;
    float f;
    double d;

    i = (int) f;
    i = (int) d;
    f = (float) d;
    d = (double) i;
    f = (float) f;

Floats <-> Integers

• Casting between floats, doubles and integers generally changes the bit representation!


    int i = 0xABCDABCD;
    float f = (float)i;       /* value conversion: the bit pattern changes */

    int *i2 = (int *)&f;      /* reinterpret the bits of f as an int */

    printf("%x, %x\n", i, *i2);

    // Prints
    // abcdabcd, cea864a8
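The pointer cast in the snippet above reads a float object through an int pointer, which optimizing compilers may treat as a strict-aliasing violation. A minimal sketch of the same experiment using memcpy follows; the helper names `float_bits` and `bits_after_cast` are hypothetical, and the code assumes a 32-bit IEEE 754 `float`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the raw bit pattern of a float.
 * Assumes float is a 32-bit IEEE 754 single. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    assert(sizeof f == sizeof bits);
    memcpy(&bits, &f, sizeof bits);   /* copies bytes, no value conversion */
    return bits;
}

/* Convert an integer to float by value, then return the float's bits. */
uint32_t bits_after_cast(int32_t i)
{
    return float_bits((float)i);
}
```

Called with the value from the slide (0xABCDABCD read as a signed 32-bit integer, i.e. -1412584499), `bits_after_cast` yields 0xcea864a8: same number, completely different bit pattern.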

Floats <-> Integers


From           To       Description

double/float   int      Truncates the fractional part. Out-of-range values and NaN typically yield TMin on x86; in C the result is undefined.

    float f = 1.12345;
    float f2 = 1.999999;
    (int)f = ?
    (int)f2 = ?

int            double   In general, the conversion is exact iff the integer needs fewer than 54 bits.

    long long l = 0x7FFFFFFFFFFFFFFF;
    long long l2 = 0xFFFFFFFF;
    double d = (double)l;
    double d2 = (double)l2;
    l == (long long)d ?
    l2 == (long long)d2 ?

int            float    Rounds according to the rounding mode.

    float f2 = 1.50f;
    float f3 = 1.50f;
    printf("%f, %i, %i\n", f2+f3, (int)(f2+f3), (int)f2 + (int)f3);
    // 3.000000, 3, 2
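The conversions in the table can be checked with a small sketch like the one below. The function names are hypothetical. The round-trip helper deliberately restricts its input range: 0x7FFFFFFFFFFFFFFF rounds up to 2^63 as a double, and converting that back to `long long` is out of range, so it is not tested this way.

```c
#include <assert.h>
#include <stdint.h>

/* float -> int drops the fractional part (rounds toward zero).
 * Only meaningful when the truncated value fits in an int. */
int truncate_to_int(float f)
{
    return (int)f;
}

/* Returns 1 if converting v to double and back preserves the value.
 * Restricted to |v| <= 2^62 so the conversion back is always in range. */
int survives_double_round_trip(long long v)
{
    assert(v >= -(1LL << 62) && v <= (1LL << 62));
    double d = (double)v;            /* may round: double keeps 53 bits */
    return (long long)d == v;
}
```

For example, `survives_double_round_trip(0xFFFFFFFF)` holds, while `(1LL << 53) + 1` does not survive because it needs 54 bits.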

Normalized / Denormalized

• Normalized: exp ∉ {000…0, 111…1}
– Good for bigger values
– Not equi-spaced

• Denormalized: exp == 000…0
– Good for very small values
– Equi-spaced, significand M ∈ [0, 1 − eps] with either sign
– And zero

Exponent (normalized)

• There must be a way to express negative exponents -> encode the exponent as a biased value
– E = exp − Bias

• Bias = $2^{e-1} - 1$, where e is the number of exponent bits:
– For single precision?
– For double precision?

• For normalized values, exp is never all zeros or all ones!
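As a quick check of the bias formula, here is a minimal sketch; the function names are hypothetical. With 8 exponent bits (single precision) the bias is 127, with 11 bits (double precision) it is 1023.

```c
#include <assert.h>

/* Bias of a format with exp_bits exponent bits: 2^(exp_bits - 1) - 1. */
int fp_bias(int exp_bits)
{
    assert(exp_bits >= 2 && exp_bits <= 30);
    return (1 << (exp_bits - 1)) - 1;
}

/* Unbiased exponent E of a normalized value whose stored field is exp_field. */
int fp_unbiased_exponent(int exp_field, int exp_bits)
{
    return exp_field - fp_bias(exp_bits);
}
```

For single precision, the normalized exponent fields 1 to 254 therefore cover E from -126 to 127.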

Significand (normalized)

• We know that $M \in [1.0, 2.0)$
– We always have one leading 1…

• Remove that leading 1 to save one bit!

• What are the max and min values for the significand?

Exponent (denormalized)

• There must be a way to express values very close to 0: the exponent must be as negative as possible.

• exp is all zeros, and the exponent is evaluated as
– E = 1 − Bias

Significand (denormalized)

• We are close to zero: $M \in [0.0, 1.0)$
– We always have one leading 0…

Special Values

Fraction    Exponent   Description
000…0       111…1      Infinity (+ / -): the result if an operation overflows
!= 000…0    111…1      Not-a-Number (NaN): no numeric value can be determined, e.g. sqrt(-1)
000…0       000…0      Zero: all bits zero, like integer zero (there is also a -0 in float)

-0 ?

In IEEE arithmetic, it is natural to define log 0 = -∞ and log x to be a NaN when x < 0. Suppose that x represents a small negative number that has underflowed to zero. Thanks to signed zero, x will be negative, so log can return a NaN. However, if there were no signed zero, the log function could not distinguish an underflowed negative number from 0, and would therefore have to return -∞.

Tiny floating point example

• 8-bit floating point representation
– The sign bit is the most significant bit.
– The next four bits are the exponent, with a bias of 7.
– The last three bits are the frac.

• Same general form as IEEE format
– Normalized, denormalized
– Representation of 0, NaN, infinity

Field   s   exp   frac
Bits    1   4     3

Typical exam question
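To make the 8-bit format concrete, the following sketch decodes a bit pattern into a `double` for inspection. The names `tiny_decode` and `scale_pow2` are hypothetical; the field layout and the bias of 7 are taken from the description above.

```c
#include <math.h>     /* INFINITY and NAN macros only */
#include <stdint.h>

/* Multiply x by 2^e using repeated doubling or halving. */
static double scale_pow2(double x, int e)
{
    while (e > 0) { x *= 2.0; e--; }
    while (e < 0) { x /= 2.0; e++; }
    return x;
}

/* Decode the 8-bit format: 1 sign bit, 4 exp bits (bias 7), 3 frac bits. */
double tiny_decode(uint8_t bits)
{
    int sign = (bits >> 7) & 0x1;
    int exp  = (bits >> 3) & 0xF;
    int frac = bits & 0x7;
    double value;

    if (exp == 0xF) {
        /* special values: infinity if frac is zero, NaN otherwise */
        value = (frac == 0) ? INFINITY : NAN;
    } else if (exp == 0) {
        /* denormalized: M = 0.frac, E = 1 - bias = -6 */
        value = scale_pow2(frac / 8.0, 1 - 7);
    } else {
        /* normalized: M = 1.frac, E = exp - bias */
        value = scale_pow2(1.0 + frac / 8.0, exp - 7);
    }
    return sign ? -value : value;
}
```

With this layout the largest finite value is 0x77 (240) and the smallest positive denormalized value is 0x01 (1/512).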

Conversion

• Step 1: Normalize the Numbers

• Step 2: Round to fit in fraction

• Step 3: Post-normalize to deal with rounding effects


Value   Binary
128     1000 0000
13      0000 1101

(The original slides label the second value 15; its bit pattern 0000 1101 is 13.)

Conversion

• Step 1: Normalize the Numbers

– Set the binary point so that the number has the form 1.xxx (leading 1)
– Start with exponent 7 and decrement it for every left shift

Value   Binary      Fraction    Exponent
128     1000 0000   1.0000000   7 (no shift)
13      0000 1101   1.1010000   3 (4 shifts)

Conversion

• Step 2: Round to fit in fraction

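– We have 3-bit fractions
– GRS = guard, round and sticky bits: guard is the last bit kept, round is the first bit dropped, sticky is the OR of the remaining dropped bits

Value   Fraction    GRS   Rounded
128     1.0000000   000   1.000
13      1.1010000   100   1.101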

Conversion

• Step 3: Post-normalize to deal with rounding effects

– Overflow in the fraction due to rounding? (Not here)
– If so, shift right and increment the exponent

Value   Binary
128     1000 0000
13      0000 1101
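The three steps can be sketched for small non-negative integers as follows. This is an illustration under assumptions, not the reference solution: the name `tiny_encode_u8` is hypothetical, the target is the 8-bit format with 4 exponent bits (bias 7) and 3 fraction bits, and rounding is round-to-nearest-even.

```c
#include <assert.h>
#include <stdint.h>

/* Encode an integer 0..255 in the 8-bit format (1 sign, 4 exp, 3 frac
 * bits, bias 7). Values too large after rounding become +infinity. */
uint8_t tiny_encode_u8(uint8_t value)
{
    if (value == 0)
        return 0;                       /* zero is all bits zero */

    /* Step 1: normalize. Read value as b.bbbbbbb with E = 7 and
     * shift left until the leading bit is 1. */
    unsigned m = value;
    int e = 7;
    while ((m & 0x80) == 0) {
        m <<= 1;
        e--;
    }
    assert(e >= 0 && e <= 7);

    /* Step 2: round to 3 fraction bits by dropping the low 4 bits. */
    unsigned kept    = m >> 4;          /* 1.fff as a 4-bit integer */
    unsigned dropped = m & 0xF;
    if (dropped > 0x8 || (dropped == 0x8 && (kept & 1)))
        kept++;                         /* round up; ties go to even */

    /* Step 3: post-normalize if rounding carried out to 10.000. */
    if (kept == 0x10) {
        kept >>= 1;
        e++;
    }

    if (e + 7 >= 15)
        return 0x78;                    /* exponent overflow: +infinity */
    return (uint8_t)(((unsigned)(e + 7) << 3) | (kept & 0x7));
}
```

For the two slide values this gives 0x70 for 128 (exp 1110, frac 000) and 0x55 for 13 (exp 1010, frac 101).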

A possible Exam Question?


• You have an 8-bit floating point representation with 3 fraction bits.
– Give the floating point representations of
  • 138
  • 63

Multiplication

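• Exact result:
– $F_{new} = (-1)^{s_1 \oplus s_2} \cdot (M_1 \cdot M_2) \cdot 2^{E_1 + E_2}$
– Normalize with M = M1 · M2 and E = E1 + E2: while (M >= 2) { M = M >> 1; E++; }
– Round M to fit the fraction bits
– Check whether the exponent is still in range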
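A minimal sketch of these multiplication steps on an unpacked representation is shown below. The type `unpacked_t` and the function `unpacked_mul` are hypothetical and not part of the assignment skeleton. The sketch assumes normalized, non-zero inputs with a 24-bit significand (leading 1 included) and uses round-to-nearest-even; the exponent range check is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical unpacked value: (-1)^sign * (mant / 2^23) * 2^exp,
 * with mant in [2^23, 2^24), i.e. M in [1.0, 2.0). */
typedef struct {
    unsigned sign;      /* 0 or 1 */
    int      exp;       /* unbiased exponent E */
    uint32_t mant;      /* 24 bits including the leading 1 */
} unpacked_t;

unpacked_t unpacked_mul(unpacked_t a, unpacked_t b)
{
    assert(a.mant >= 0x800000 && a.mant <= 0xFFFFFF);
    assert(b.mant >= 0x800000 && b.mant <= 0xFFFFFF);

    unpacked_t r;
    r.sign = a.sign ^ b.sign;
    r.exp  = a.exp + b.exp;

    /* Exact product: M1 * M2 = prod / 2^46, which lies in [1.0, 4.0). */
    uint64_t prod = (uint64_t)a.mant * b.mant;

    /* Normalize: if M1 * M2 >= 2, drop one more bit and increment E. */
    int shift = 23;
    if (prod >= (1ULL << 47)) {
        shift = 24;
        r.exp++;
    }

    /* Round to nearest, ties to even, while dropping `shift` bits. */
    uint64_t kept    = prod >> shift;
    uint64_t dropped = prod & ((1ULL << shift) - 1);
    uint64_t half    = 1ULL << (shift - 1);
    if (dropped > half || (dropped == half && (kept & 1)))
        kept++;

    /* Post-normalize if rounding carried out to 2.0. */
    if (kept == (1ULL << 24)) {
        kept >>= 1;
        r.exp++;
    }

    r.mant = (uint32_t)kept;
    return r;           /* caller still checks that exp is in range */
}
```

For example, 1.5 (mant 0xC00000, exp 0) times -3.0 (sign 1, mant 0xC00000, exp 1) gives sign 1, exp 2, mant 0x900000, i.e. -4.5.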

Addition

• Signed align and add (assume E1 > E2)
– Align the binary points: shift M1 left by E1 − E2 (equivalently, shift M2 right)
– Add the signed significands (M together with its sign s)
– Shift and adjust the exponent until M is in [1.0, 2.0)
– Round
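The addition steps can be sketched in the same style. Again, `unpacked_t` and `unpacked_add` are hypothetical names, inputs are assumed to be normalized and non-zero, a result with `mant == 0` stands for zero, and rounding is round-to-nearest-even. When the exponents differ by more than 32, the smaller operand is far below half a unit in the last place of the larger one, so the larger operand is returned unchanged.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical unpacked value: (-1)^sign * (mant / 2^23) * 2^exp. */
typedef struct {
    unsigned sign;      /* 0 or 1 */
    int      exp;       /* unbiased exponent E */
    uint32_t mant;      /* 24 bits incl. leading 1; 0 means zero */
} unpacked_t;

unpacked_t unpacked_add(unpacked_t a, unpacked_t b)
{
    assert(a.mant >= 0x800000 && a.mant <= 0xFFFFFF);
    assert(b.mant >= 0x800000 && b.mant <= 0xFFFFFF);

    /* Make a the operand with the larger exponent (E1 >= E2). */
    if (a.exp < b.exp) { unpacked_t t = a; a = b; b = t; }
    int d = a.exp - b.exp;
    if (d > 32)
        return a;                       /* b cannot change the result */

    /* Align: shift M1 left by E1 - E2, then add the signed values. */
    int64_t m1  = (int64_t)a.mant << d; /* below 2^56, fits easily */
    int64_t m2  = (int64_t)b.mant;
    int64_t sum = (a.sign ? -m1 : m1) + (b.sign ? -m2 : m2);

    unpacked_t r = { 0, 0, 0 };
    if (sum == 0)
        return r;                       /* exact cancellation: +0 */
    r.sign = sum < 0;
    r.exp  = b.exp;                     /* sum is scaled like operand b */
    uint64_t mag = (uint64_t)(sum < 0 ? -sum : sum);

    /* Too small: shift left until the leading 1 is at bit 23 (exact). */
    while (mag < (1ULL << 23)) { mag <<= 1; r.exp--; }

    /* Too large: drop the extra low bits, rounding ties to even. */
    int extra = 0;
    while ((mag >> extra) >= (1ULL << 24))
        extra++;
    if (extra > 0) {
        uint64_t kept    = mag >> extra;
        uint64_t dropped = mag & ((1ULL << extra) - 1);
        uint64_t half    = 1ULL << (extra - 1);
        if (dropped > half || (dropped == half && (kept & 1)))
            kept++;
        if (kept == (1ULL << 24)) { kept >>= 1; extra++; }
        mag = kept;
        r.exp += extra;
    }
    r.mant = (uint32_t)mag;
    return r;
}
```

For example, 2.0 + (-1.5) yields exp -1 and mant 0x800000, i.e. 0.5, and 2^24 + 1.0 rounds back to 2^24.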


http://meseec.ce.rit.edu/eecc250-winter99/250-1-27-2000.pdf

• What Every Computer Scientist Should Know About Floating-Point Arithmetic:

• http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

Assignment 08

Floating Point



Now it's your turn!

• Implement your floating point handler in C!

– No use of floats/doubles

– Use the given skeleton



Your float_t

• You will represent the float as a struct

• Challenge: Can you use bit fields for this and simply cast the pointer?


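    #include <stdint.h>

    /* Note: <math.h> also defines a type named float_t,
       so do not include it alongside this definition. */
    typedef struct float_t {
        uint8_t sign;
        uint8_t exponent;
        uint32_t mantissa;
    } float_t;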

Conversion

• The only time you are allowed to use floats is when converting them to and from your float_t

    float_t fp_encode(float x);

    float fp_decode(float_t x);
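One possible shape for these two conversion functions is sketched below. It is not the course's reference solution. The struct is named `my_float_t` here (a hypothetical name, chosen to avoid the `float_t` that `<math.h>` defines), the fields are assumed to hold the raw IEEE 754 single-precision fields (biased exponent, 23-bit fraction), and `float` is assumed to be 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the assignment's float_t struct. */
typedef struct {
    uint8_t  sign;       /* 1 bit */
    uint8_t  exponent;   /* 8-bit biased exponent field */
    uint32_t mantissa;   /* 23-bit fraction field */
} my_float_t;

/* Split the bit pattern of x into its three fields. */
my_float_t fp_encode(float x)
{
    uint32_t bits;
    my_float_t r;
    assert(sizeof x == sizeof bits);
    memcpy(&bits, &x, sizeof bits);
    r.sign     = (uint8_t)(bits >> 31);
    r.exponent = (uint8_t)((bits >> 23) & 0xFF);
    r.mantissa = bits & 0x7FFFFF;
    return r;
}

/* Reassemble the three fields into a float. */
float fp_decode(my_float_t x)
{
    uint32_t bits = ((uint32_t)(x.sign & 1) << 31)
                  | ((uint32_t)x.exponent << 23)
                  | (x.mantissa & 0x7FFFFF);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

With this layout, `fp_encode(1.0f)` has exponent 127 and mantissa 0, and decoding an encoded value returns the original float bit for bit.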

Approach

• Create some float numbers and convert them into your float_t. Choose good representatives.

• Perform some additions, multiplications and negations, both with your implemented functions and with the native floats.

• Compare the results at the end.



Approach Example


    #include <assert.h>

    int main(void) {
        float f1 = 1.123f;
        float f2 = 550;
        float f3;
        float_t ft1 = fp_encode(f1);
        float_t ft2 = fp_encode(f2);
        float_t ft3;

        f3 = f1 + f2;                /* reference result with native floats */
        ft3 = fp_add(ft1, ft2);      /* result from your implementation */

        assert(f3 == fp_decode(ft3));
        return 0;
    }

Submission

• Once you have committed your final solution, write an e-mail to me!
– Subject: [CASP] Submission
– Content: Briefly describe what is working / what is not working
– Make sure your solution compiles! (with -Wall)
– You can also submit your last homework

• Have a nice weekend…

