Systems Programming and Computer Architecture
(252-0061-00)
Session 9
Floating Point
© Systems Group | Department of Computer Science | ETH Zürich
Floating Point
Recap for the Assignment
Floating Point Representation
Numerical Form
• “Scientific Notation”
– Sign s ∈ {0, 1}
– Significand M ∈ [1.0, 2.0)
– Exponent E
• $F = (-1)^s \cdot M \cdot 2^E$
Encoding
• “Bit Pattern”
– MSB is sign bit s
– exp ≈ E, but exp ≠ E
– frac ≈ M, but frac ≠ M
          s   exp   frac
float     1   8     23
double    1   11    52
extended  1   15    63/64
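The field layout for single precision can be made concrete with a few shifts and masks. This is a minimal sketch, assuming `float` is an IEEE 754 32-bit value; `struct float_fields` and `split_float` are hypothetical names introduced only for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of a single-precision value. */
struct float_fields {
    uint32_t s;    /* 1 sign bit */
    uint32_t exp;  /* 8 exponent bits (still biased) */
    uint32_t frac; /* 23 fraction bits */
};

/* Split a float into s, exp and frac. Assumes float is IEEE 754 binary32. */
struct float_fields split_float(float x)
{
    uint32_t bits;
    struct float_fields f;

    memcpy(&bits, &x, sizeof bits);  /* copy the bit pattern, no value conversion */
    f.s    = bits >> 31;
    f.exp  = (bits >> 23) & 0xFFu;
    f.frac = bits & 0x7FFFFFu;
    return f;
}
```

For example, `split_float(1.0f)` yields s = 0, exp = 127, frac = 0.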
Casting
Integer Types
• What happens here?
    unsigned int foo;
    long bar = (long) foo;
Floats
• What happens here?
    int i;
    long long l;
    float f;
    double d;

    i = (int) f;
    i = (int) d;
    f = (float) d;
    d = (double) i;
    f = (float) f;
Floats <-> Integers
• Casting between floats, doubles and integers generally changes the bit representation!
    int i = 0xABCDABCD;
    float f = (float)i;

    int *i2 = (int *)&f;

    printf("%x, %x\n", i, *i2);

    // Prints
    // abcdabcd, cea864a8
Floats <-> Integers
From           To       Description
double/float   int      Truncates the fractional part; out of range or NaN -> TMin
    float f = 1.12345;
    float f2 = 1.999999;
    (int)f = ?
    (int)f2 = ?
int            double   In general, exact conversion iff the int has fewer than 54 bits
    long long l = 0x7FFFFFFFFFFFFFFF;
    long long l2 = 0xFFFFFFFF;
    double d = (double)l;
    double d2 = (double)l2;
    l == (long long)d; l2 == (long long)d2;
int            float    Will round according to rounding mode

    float f2 = 1.50f;
    float f3 = 1.50f;
    printf("%f, %i, %i\n", f2+f3, (int)(f2+f3), (int)f2 + (int)f3);
    // 3.000000, 3, 2
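The two conversion rules in the table can be checked with small helpers. This is a sketch only, assuming `double` has a 53-bit significand (IEEE 754 binary64); the function names are made up for this example, and the round-trip helper must only be called with values well inside the `long long` range so that the cast back is defined.

```c
/* Returns 1 if v survives a round trip through double unchanged.
 * Only call this with |v| <= 2^62 so the cast back stays in range. */
int roundtrips_via_double(long long v)
{
    double d = (double)v;   /* may round: only 53 significand bits available */
    return (long long)d == v;
}

/* Float to int conversion truncates toward zero
 * (the value must fit into an int). */
int truncate_to_int(float f)
{
    return (int)f;
}
```

With these, `0xFFFFFFFF` and 2^53 round-trip exactly, while 2^53 + 1 does not, and truncating 1.5f twice and adding gives 2 although 1.5f + 1.5f truncates to 3.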
Normalized / Denormalized
• Normalized: exp ∉ {000…0, 111…1}
– Good for bigger values
– Not equi-spaced
• Denormalized: exp == 000…0
– Good for very small values
– Equi-spaced; signed significand in [-1 + eps, 1 - eps]
– And zero
Exponent
NORMALIZED!
• There must be a way to express negative exponents -> encode as a biased value
– E = exp - Bias
• Bias = $2^{e-1} - 1$, where e is the number of exponent bits:
– For single precision?
– For double precision?
• For normalized values, exp is never all zeros or all ones!
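The bias formula is easy to evaluate for any exponent width. A minimal sketch, with hypothetical function names, assuming the exponent field is between 1 and 31 bits wide:

```c
/* Bias for an exponent field that is e bits wide: 2^(e-1) - 1. */
unsigned bias_for_exp_bits(unsigned e)
{
    return (1u << (e - 1)) - 1u;
}

/* Unbiased exponent E of a normalized value: E = exp - Bias. */
int unbiased_exponent(unsigned exp, unsigned e)
{
    return (int)exp - (int)bias_for_exp_bits(e);
}
```

`bias_for_exp_bits(8)` is 127 (single precision) and `bias_for_exp_bits(11)` is 1023 (double precision).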
Significand
NORMALIZED!
• We know that M ∈ [1.0, 2.0)
– We always have one leading 1…
• Remove that leading 1 to save one bit!
• What are the max and min values for the significand?
Exponent
DENORMALIZED
• There must be a way to express values very close to 0: the exponent must be as negative as possible.
• exp is all zeros and the exponent is evaluated as
– E = -Bias + 1
Significand
DENORMALIZED
• We are close to zero: M ∈ [0.0, 1.0)
– We always have one leading 0…
Special Values
Fraction    Exponent   Description
000…0       111…1      Infinity (+ / -): if an operation overflows
!= 000…0    111…1      Not-a-Number (NaN): no numeric value can be determined, e.g. sqrt(-1)
000…0       000…0      Zero: all bits zero, like integer zero (there is also a -0 in float)
-0 ?
In IEEE arithmetic, it is natural to define log 0 = -∞ and log x to be a NaN when x < 0. Suppose that x represents a small negative number that has underflowed to zero. Thanks to signed zero, x will be negative, so log can return a NaN. However, if there were no signed zero, the log function could not distinguish an underflowed negative number from 0, and would therefore have to return -∞.
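That +0 and -0 compare equal but are distinct bit patterns can be seen by looking at the sign bit directly. A small sketch, assuming `float` is IEEE 754 binary32; the helper names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Bit pattern of a float (assumes IEEE 754 binary32). */
uint32_t float_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* 1 if the sign bit is set. This is also true for -0.0f,
 * even though -0.0f == 0.0f evaluates to true. */
int has_sign_bit(float x)
{
    return (int)(float_bits(x) >> 31);
}
```

Here `0.0f == -0.0f` holds, yet `float_bits(-0.0f)` is `0x80000000` while `float_bits(0.0f)` is `0`, which is the information a function like log needs to tell the two apart.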
Tiny floating point example
• 8-bit floating point representation
– the sign bit is in the most significant bit.
– the next four bits are the exponent, with a bias of 7.
– the last three bits are the frac
• Same general form as IEEE Format
– normalized, denormalized
– representation of 0, NaN, infinity
s exp frac
1 4 3
Typical exam question
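Decoding this 8-bit format touches all three cases (normalized, denormalized, special values). The following is a sketch for illustration only: it returns a `double` for readability, which the assignment itself would not allow, and `tiny_decode` is a made-up name.

```c
#include <math.h>    /* only for the INFINITY and NAN macros */
#include <stdint.h>

/* Decode the 8-bit format (1 sign, 4 exp, 3 frac, bias 7) into a double. */
double tiny_decode(uint8_t bits)
{
    unsigned s    = bits >> 7;
    unsigned exp  = (bits >> 3) & 0xFu;
    unsigned frac = bits & 0x7u;
    double sign   = s ? -1.0 : 1.0;
    double m;
    int e, i;

    if (exp == 0xFu)                      /* special values */
        return frac ? NAN : sign * INFINITY;

    if (exp == 0) {                       /* denormalized: M = 0.frac, E = 1 - Bias */
        m = frac / 8.0;
        e = 1 - 7;
    } else {                              /* normalized: M = 1.frac, E = exp - Bias */
        m = 1.0 + frac / 8.0;
        e = (int)exp - 7;
    }

    /* scale by 2^e with plain loops, so nothing from libm is called */
    for (i = 0; i < e; i++) m *= 2.0;
    for (i = 0; i > e; i--) m /= 2.0;
    return sign * m;
}
```

For instance, `0x38` (0 0111 000) decodes to 1.0, `0x77` (0 1110 111) to 240.0, the largest finite value, and `0x01` to 2^-9, the smallest positive denormalized value.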
Conversion
• Step 1: Normalize the Numbers
• Step 2: Round to fit in fraction
• Step 3: Post-normalize to deal with rounding effects
Value Binary
128 1000 0000
15 0000 1101
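Note: 0000 1101 is the binary representation of 13, so the row labelled 15 in this example actually works with the value 13 (15 := 13).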
Conversion
• Step 1: Normalize the Numbers
– Set the binary point so that the number has a leading 1 (1.xxx…)
– Start with exponent = 7, decrement it for every left shift needed
Value Binary Fraction Exponent
128 1000 0000 1.0000000 7 (no shift)
15 0000 1101 1.1010000 3 (4 shifts)
Conversion
• Step 2: Round to fit in fraction
– We have 3 bit fractions
Value Fraction GRS Rounded
128 1.0000000 000 1.000
15 1.1010000 100 1.101
Conversion
• Step 3: Post-normalize to deal with rounding effects
– Overflow in fraction due to rounding? (Not here)
– Shift right and increment exponent
Value Binary
128 1000 0000
15 0000 1101
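The three conversion steps can be sketched in integer-only C for the 8-bit format (1 sign, 4 exp, 3 frac, bias 7). This is one possible formulation, not the reference solution: `tiny_encode_uint` is a hypothetical name, only non-negative integers are handled, and the rounding test compares the cut-off bits against one half (round to nearest even), which gives the same decisions as the guard/round/sticky bits in the table.

```c
#include <stdint.h>

/* Encode a non-negative integer into the 8-bit format
 * (1 sign, 4 exp, 3 frac, bias 7). */
uint8_t tiny_encode_uint(unsigned v)
{
    int e = 0;
    unsigned m, rest, half, shift;

    if (v == 0)
        return 0x00;

    /* Step 1: normalize, i.e. find E such that v = 1.xxx * 2^E */
    while ((v >> e) > 1u)
        e++;

    /* Step 2: keep the leading 1 plus 3 fraction bits, round to nearest even */
    if (e > 3) {
        shift = (unsigned)e - 3u;
        m     = v >> shift;                /* 1xxx */
        rest  = v & ((1u << shift) - 1u);  /* bits that are cut off */
        half  = 1u << (shift - 1u);
        if (rest > half || (rest == half && (m & 1u)))
            m++;
    } else {
        m = v << (3 - e);                  /* exact, nothing to round */
    }

    /* Step 3: post-normalize if rounding carried into 10.000 */
    if (m == 16u) {
        m = 8u;
        e++;
    }

    if (e > 7)                             /* too large: +infinity */
        return 0x78;

    return (uint8_t)(((unsigned)(e + 7) << 3) | (m & 7u));
}
```

For the two values of the walk-through, 128 encodes to `0x70` (0 1110 000) and 13 to `0x55` (0 1010 101).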
A possible Exam Question?
• You have an 8-bit floating point representation with 3 fraction bits.
– Give the floating point representations of
• 138
• 63
Multiplication
• Exact result:
– $F_{new} = (-1)^{s_1 \oplus s_2} \cdot M_1 \cdot M_2 \cdot 2^{E_1 + E_2}$
– With M = M1 ⋅ M2 and E = E1 + E2: while (M ≥ 2) { M = M >> 1; E++; }
– Round M to fit fraction bits
– Check if exponent still in range
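A sketch of these multiplication steps for normalized values with 3 fraction bits, using integers only. The `struct unpacked` type and `tiny_mul` are hypothetical and not part of the assignment skeleton; the significand is stored as the integer 1xxx (8 to 15), the normalization shift and the rounding are done in one go so no bits are lost before rounding, and the final range check on the exponent is left to the caller.

```c
/* Hypothetical unpacked value: (-1)^s * (m / 8) * 2^e,
 * with m in 8..15, i.e. 1.xxx in binary. */
struct unpacked {
    unsigned s;
    int e;
    unsigned m;
};

struct unpacked tiny_mul(struct unpacked a, struct unpacked b)
{
    struct unpacked r;
    unsigned p, rest, half, shift;

    r.s = a.s ^ b.s;   /* sign: s1 XOR s2 */
    r.e = a.e + b.e;   /* exponent: E1 + E2 */
    p   = a.m * b.m;   /* exact product with 6 fraction bits, 64..225 */

    /* normalize: a product >= 2.0 (128 with 6 fraction bits) bumps E */
    shift = 3;         /* going from 6 to 3 fraction bits */
    if (p >= 128u) {
        shift = 4;
        r.e++;
    }

    /* round M to 3 fraction bits, nearest even */
    r.m  = p >> shift;
    rest = p & ((1u << shift) - 1u);
    half = 1u << (shift - 1u);
    if (rest > half || (rest == half && (r.m & 1u)))
        r.m++;

    /* post-normalize if rounding carried into 10.000 */
    if (r.m == 16u) {
        r.m = 8u;
        r.e++;
    }
    return r;          /* caller still has to check that r.e is in range */
}
```

As an example, 1.5 ⋅ 1.5 (m = 12, e = 0 for both operands) gives m = 9, e = 1, which is 1.125 ⋅ 2 = 2.25.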
Addition
• Signed align and add (assume E1 > E2)
– Shift the first operand by the difference of the exponents, E1 - E2, so that the binary points line up
– Add the signed significands (M together with the sign s)
– Apply shift and exponent adjustments until M ∈ [1.0, 2.0)
– Round
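A sketch of signed align-and-add for values with 3 fraction bits, again using integers only. All names are hypothetical. It assumes normalized inputs whose exponents differ by less than 32, aligns by shifting the operand with the larger exponent to the left (which is exact), and does not handle denormalized results or exponent range checks.

```c
/* Hypothetical unpacked value: (-1)^s * (m / 8) * 2^e.
 * m is 1xxx in binary (8..15) for normalized values, 0 for zero. */
struct unpacked {
    unsigned s;
    int e;
    unsigned m;
};

struct unpacked tiny_add(struct unpacked a, struct unpacked b)
{
    struct unpacked r, t;
    long long sa, sb, sum;
    unsigned long long mag, rest, half;
    int shift = 0;

    if (a.e < b.e) { t = a; a = b; b = t; }   /* ensure E1 >= E2 */

    /* align: express both operands at exponent E2, then add with signs */
    sa = (long long)a.m << (a.e - b.e);
    sb = (long long)b.m;
    if (a.s) sa = -sa;
    if (b.s) sb = -sb;
    sum = sa + sb;

    r.s = sum < 0;
    r.e = b.e;
    mag = (unsigned long long)(sum < 0 ? -sum : sum);
    if (mag == 0) { r.s = 0; r.e = 0; r.m = 0; return r; }

    while ((mag >> shift) >= 16u)             /* too many bits: shift right */
        shift++;
    if (shift > 0) {
        rest = mag & ((1ull << shift) - 1u);  /* bits that are cut off */
        half = 1ull << (shift - 1);
        mag >>= shift;
        r.e += shift;
        if (rest > half || (rest == half && (mag & 1u)))
            mag++;                            /* round to nearest even */
        if (mag == 16u) { mag = 8u; r.e++; }  /* post-normalize */
    } else {
        while (mag < 8u) { mag <<= 1; r.e--; } /* cancellation: shift left */
    }
    r.m = (unsigned)mag;
    return r;
}
```

For example, adding 128 (m = 8, e = 7) and 13 (m = 13, e = 3) yields m = 9, e = 7, that is 144, the representable value closest to 141.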
http://meseec.ce.rit.edu/eecc250-winter99/250-1-27-2000.pdf
• What Every Computer Scientist Should Know About Floating-Point Arithmetic:
• http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
Assignment 08
Floating Point
Now it's your turn!
• Implement your floating point handler in C!
– No use of floats/doubles
– Use the given skeleton
Your float_t
• You will represent the float as a struct
• Challenge: Can you use bit fields for this and simply cast the pointer?
    #include <stdint.h>

    typedef struct float_t {
        uint8_t sign;
        uint8_t exponent;
        uint32_t mantissa;
    } float_t;
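For the bit-field challenge, one possible direction is sketched below. It relies on an assumption the C standard does not guarantee: a little-endian target where the compiler allocates bit fields starting at the least significant bit, as GCC and Clang do on x86-64. The names `struct float_bits_view` and `view_float` are hypothetical, and `memcpy` is used instead of a pointer cast to stay clear of aliasing problems.

```c
#include <stdint.h>
#include <string.h>

/* Assumption: bit fields are allocated from the least significant bit
 * upwards on a little-endian machine. This layout is
 * implementation-defined, so check it on your own platform. */
struct float_bits_view {
    uint32_t frac : 23;
    uint32_t exp  : 8;
    uint32_t sign : 1;
};

/* Fails to compile if the struct does not have the size of a float. */
_Static_assert(sizeof(struct float_bits_view) == sizeof(float),
               "bit-field struct must be exactly as large as a float");

/* Overlay the bit-field struct on the bits of a float. */
struct float_bits_view view_float(float x)
{
    struct float_bits_view v;
    memcpy(&v, &x, sizeof v);
    return v;
}
```

On such a platform, `view_float(-2.5f)` gives sign = 1, exp = 128 and frac = 0x200000.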
Conversion
• The only time you are allowed to use floats is when converting to and from your float_t
    float_t fp_encode(float x);
    float fp_decode(float_t x);
Approach
• Create some float numbers and convert them into your float_t. Choose good representatives.
• Perform some additions, multiplications and negations, both with your implemented functions and with the native floats.
• Compare the results at the end.
Approach Example
    #include <assert.h>

    int main(void) {
        float f1 = 1.123f;
        float f2 = 550.0f;
        float f3;
        float_t ft1 = fp_encode(f1);
        float_t ft2 = fp_encode(f2);
        float_t ft3;

        f3 = f1 + f2;                  // reference result with native floats
        ft3 = fp_add(ft1, ft2);        // same operation with your implementation

        assert(f3 == fp_decode(ft3));  // both results must match exactly
        return 0;
    }
Submission
• Once you have committed your final solution, write an e-mail to me!
– Subject: [CASP] Submission
– Content: Briefly describe what is working / what is not working
– Make sure your solution compiles! (with -Wall)
– You can also submit your last homework
• Have a nice weekend…