Floating Point Arithmetic 2

8/2/2019 Floating Point Arithmetic 2




A method of representing real numbers that can support a wide range of values. A typical number that can be represented exactly is of the form:

$\text{significant digits} \times \text{base}^{\text{exponent}}$

The term floating point refers to the fact that the radix point can "float", i.e., it can be placed anywhere relative to the significant digits of the number.



Floating point numbers approximate real numbers

Floating point numbers have a large dynamic range



IEEE 754 is a standard for floating point arithmetic. This standard specifies how single precision (32 bit) and double precision (64 bit) floating point numbers are to be represented, as well as how arithmetic should be carried out on them.



The IEEE 754 standard specifies a binary32 as having:

• Sign bit: 1 bit
• Exponent width: 8 bits
• Significand precision: 24 bits (23 explicitly stored)

The base is 2



The sign bit determines the sign of the number, which is the sign of the significand as well.

Sign bit = 0 if the number is positive
Sign bit = 1 if the number is negative



The exponent field needs to represent both positive and negative exponents. To do this, a bias of 127 is added to the actual exponent in order to get the stored exponent.

Thus, an exponent of zero means that 127 is stored in the exponent field. A stored value of 200 indicates an exponent of (200 - 127), or 73.

Exponents of -127 (stored value 0, all 0s) and +128 (stored value 255, all 1s) are reserved for special numbers.
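The bias rule above can be sketched in a few lines. The function names `stored_exponent` and `actual_exponent` are hypothetical helpers introduced only for illustration, and the range checks assume normal (non-special) numbers.

```python
BIAS = 127  # binary32 exponent bias, as stated above


def stored_exponent(actual: int) -> int:
    """Return the 8-bit field value for the exponent of a normal number."""
    if not -126 <= actual <= 127:
        raise ValueError("outside the normal binary32 exponent range")
    return actual + BIAS


def actual_exponent(stored: int) -> int:
    """Remove the bias from a stored field value of a normal number."""
    if not 1 <= stored <= 254:
        raise ValueError("0 and 255 are reserved for special numbers")
    return stored - BIAS


print(stored_exponent(0))    # 127
print(actual_exponent(200))  # 73
```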



The significand is also known as the 'mantissa'.

The true significand includes 23 fraction bits to the right of the binary point and an implicit leading bit with value 1, unless the exponent is stored with all zeros. Thus only 23 fraction bits of the significand appear in the memory format, but the total precision is 24 bits.



The bits are laid out as follows:

Bit 31    Bits 30-23    Bits 22-0
sign      exponent      significand (fraction)
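As a rough illustration of this layout, the three fields can be pulled out of a value with shifts and masks. This sketch uses Python's standard `struct` module to obtain the 32-bit pattern; `float32_fields` is a hypothetical helper name, not part of any library.

```python
import struct


def float32_fields(x: float) -> tuple[int, int, int]:
    """Round x to binary32 and return (sign, stored exponent, fraction)."""
    # Pack as a big-endian 32-bit float, then reread the bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # bit 31
    exponent = (bits >> 23) & 0xFF   # bits 30-23
    fraction = bits & 0x7FFFFF       # bits 22-0
    return sign, exponent, fraction


print(float32_fields(1.0))   # (0, 127, 0)
print(float32_fields(-2.0))  # (1, 128, 0)
```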



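The value $v$ of the number represented in single precision format, with sign bit $s$, stored exponent $e$ and fraction $f$, is as follows:

(a) If $e = 255$ and $f \neq 0$, then $v = \text{NaN}$.
(b) If $e = 255$ and $f = 0$, then $v = (-1)^{s} \, \infty$.
(c) If $0 < e < 255$, then $v = (-1)^{s} \, 2^{e-127} \, (1.f)$.
(d) If $e = 0$ and $f \neq 0$, then $v = (-1)^{s} \, 2^{-126} \, (0.f)$.
(e) If $e = 0$ and $f = 0$, then $v = (-1)^{s} \, 0$ (zero).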
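The five cases can be written out directly as a small decoder. This is a minimal sketch that takes the 32-bit pattern as an integer; `decode_binary32` is a hypothetical name, and the result is held in a Python float (a double), which can represent every binary32 value exactly.

```python
import math


def decode_binary32(bits: int) -> float:
    """Apply cases (a)-(e) to a 32-bit pattern given as an integer."""
    s = bits >> 31
    e = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    sign = -1.0 if s else 1.0
    if e == 255:
        # (a) NaN when f != 0, (b) signed infinity when f == 0
        return math.nan if f else sign * math.inf
    if e == 0:
        # (d) denormalized value 0.f x 2**-126; (e) signed zero when f == 0
        return sign * (f / 2**23) * 2.0**-126
    # (c) normal value 1.f x 2**(e - 127)
    return sign * (1 + f / 2**23) * 2.0 ** (e - 127)


print(decode_binary32(0x3F800000))  # 1.0
print(decode_binary32(0x7F800000))  # inf
```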



In order to maximize the quantity of representable numbers, floating-point numbers are typically stored in normalized form. This basically puts the radix point after the first non-zero digit. In normalized form, five is represented as $5.0 \times 10^{0}$.
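The same idea in base two can be sketched with `math.frexp` from the standard library. `frexp` returns a significand in [0.5, 1), so the hypothetical helper `normalize` below rescales it to the [1, 2) convention used on these slides.

```python
import math


def normalize(x: float) -> tuple[float, int]:
    """Return (m, e) with x == m * 2**e and 1 <= |m| < 2."""
    if x == 0 or not math.isfinite(x):
        raise ValueError("only nonzero finite values can be normalized")
    m, e = math.frexp(x)   # 0.5 <= |m| < 1
    return m * 2, e - 1    # rescale so that 1 <= |m| < 2


print(normalize(5.0))  # (1.25, 2), i.e. 1.25 x 2**2
```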



A nice little optimization is available to us in base two, since the only possible non-zero digit is 1. Thus, we can just assume a leading digit of 1, and don't need to represent it explicitly. As a result, the mantissa has effectively 24 bits of resolution, by way of 23 fraction bits.
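One way to observe the 24 bits of resolution is to round-trip integers through single precision. The sketch below assumes Python's `struct` module for the conversion; `to_float32` is a hypothetical helper. Integers that need 24 significant bits survive, while $2^{24} + 1$, which needs 25, does not.

```python
import struct


def to_float32(x: float) -> float:
    """Round x to the nearest binary32 value and return it as a Python float."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


print(to_float32(2.0**24 - 1) == 2.0**24 - 1)  # True
print(to_float32(2.0**24 + 1) == 2.0**24 + 1)  # False
```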



The storage format of double precision is as shown:

• Sign bit: 1 bit
• Exponent width: 11 bits
• Significand precision: 53 bits (52 explicitly stored)

The bias for the exponent is 1023

Bit 63    Bits 62-52    Bits 51-0
sign      exponent      significand (fraction)
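The double precision layout can be inspected the same way as single precision, only with wider fields. A minimal sketch, assuming the standard `struct` module; `float64_fields` is a hypothetical helper name.

```python
import struct


def float64_fields(x: float) -> tuple[int, int, int]:
    """Return (sign, stored exponent, fraction) of a double precision value."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                  # bit 63
    exponent = (bits >> 52) & 0x7FF    # bits 62-52
    fraction = bits & ((1 << 52) - 1)  # bits 51-0
    return sign, exponent, fraction


print(float64_fields(1.0))   # (0, 1023, 0)
print(float64_fields(-2.0))  # (1, 1024, 0)
```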



Convert the following single-precision IEEE 754 number into a floating-point decimal value.

1 10000001 10110011001100110011010

First, put the bits in three groups.

Bit 31 (the leftmost bit) shows the sign of the number.
Bits 23-30 (the next 8 bits) are the exponent.
Bits 0-22 (on the right) give the fraction.



Now, look at the sign bit.

If this bit is a 1, the number is negative, otherwise positive. Here this bit is 1, so the number is negative.

Get the exponent and the correct bias. The exponent is simply a positive binary number: $10000001_2 = 129_{10}$

Remember that we will have to subtract a bias from this exponent to find the power of 2. Since this is a single-precision number, the bias is 127.



Convert the fraction string into base ten. This is the trickiest step. The binary string represents a fraction, so conversion is a little different. Binary fractions look like this:

$0.1_2 = 1/2 = 2^{-1}$
$0.01_2 = 1/4 = 2^{-2}$
$0.001_2 = 1/8 = 2^{-3}$



So, for this example, we multiply each digit by the corresponding power of 2:

$0.10110011001100110011010_2 = 1 \cdot 2^{-1} + 0 \cdot 2^{-2} + 1 \cdot 2^{-3} + 1 \cdot 2^{-4} + 0 \cdot 2^{-5} + 0 \cdot 2^{-6} + \dots$

$0.10110011001100110011010_2 = 1/2 + 1/8 + 1/16 + \dots$

Note that this number is just an approximation of some decimal number. There will most likely be some error. In this case, the fraction is about 0.7000000476837158.
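The digit-by-digit sum is easy to mechanize. The sketch below adds $2^{-k}$ for every 1 found at position $k$ after the binary point; `binary_fraction_to_float` is a hypothetical helper, and the sum is exact here because 23 fraction bits fit comfortably in a Python float.

```python
def binary_fraction_to_float(bits: str) -> float:
    """Interpret a string of 0s and 1s as the digits after the binary point."""
    total = 0.0
    for position, bit in enumerate(bits, start=1):
        if bit not in "01":
            raise ValueError("expected only binary digits")
        if bit == "1":
            total += 2.0 ** -position  # digit k contributes 2**-k
    return total


print(binary_fraction_to_float("101"))  # 0.625
print(binary_fraction_to_float("10110011001100110011010"))
```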



This is all the information we need. We can put these numbers in the expression:

$(-1)^{\text{sign bit}} \times (1 + \text{fraction}) \times 2^{\text{exponent} - \text{bias}} = (-1)^{1} \times 1.7000000476837158 \times 2^{129 - 127} \approx -6.8$

The answer is approximately -6.8.
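The whole conversion can be cross-checked in code. This sketch evaluates the expression above on the example bit string and compares the result with what Python's `struct` module reads from the same 32 bits. `decode_bit_string` is a hypothetical helper and handles normal numbers only.

```python
import struct


def decode_bit_string(bits: str) -> float:
    """Evaluate (-1)**sign * (1 + fraction) * 2**(exponent - 127)."""
    bits = bits.replace(" ", "")
    if len(bits) != 32:
        raise ValueError("expected 32 bits")
    sign = int(bits[0], 2)
    exponent = int(bits[1:9], 2)
    fraction = int(bits[9:], 2) / 2**23
    return (-1) ** sign * (1 + fraction) * 2.0 ** (exponent - 127)


pattern = "1 10000001 10110011001100110011010"
value = decode_bit_string(pattern)
raw = int(pattern.replace(" ", ""), 2).to_bytes(4, "big")
(reference,) = struct.unpack(">f", raw)
print(value, value == reference)
```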



Convert 0.1015625 to IEEE 32-bit floating point format. Converting:

0.1015625 × 2 = 0.203125    0    Generate 0 and continue.
0.203125 × 2  = 0.40625     0    Generate 0 and continue.
0.40625 × 2   = 0.8125      0    Generate 0 and continue.
0.8125 × 2    = 1.625       1    Generate 1 and continue with the rest.
0.625 × 2     = 1.25        1    Generate 1 and continue with the rest.
0.25 × 2      = 0.5         0    Generate 0 and continue.
0.5 × 2       = 1.0         1    Generate 1 and nothing remains.

So $0.1015625_{10} = 0.0001101_2$.
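The repeated doubling procedure can be sketched as a short loop. This version uses `fractions.Fraction` so that each doubling is exact; `fraction_to_binary` and its `max_bits` cutoff are assumptions added for illustration, the cutoff being needed because most decimal fractions never terminate in binary.

```python
from fractions import Fraction


def fraction_to_binary(x, max_bits: int = 32) -> str:
    """Convert 0 <= x < 1 to a binary string by repeated doubling."""
    x = Fraction(x)
    if not 0 <= x < 1:
        raise ValueError("expected a value in [0, 1)")
    digits = []
    while x and len(digits) < max_bits:
        x *= 2
        bit = int(x)          # the integer part is the next binary digit
        digits.append(str(bit))
        x -= bit              # continue with the rest
    return "0." + ("".join(digits) or "0")


print(fraction_to_binary("0.1015625"))  # 0.0001101
```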



Normalize: $0.0001101_2 = 1.101_2 \times 2^{-4}$.

The mantissa is 10100000000000000000000, the exponent is $-4 + 127 = 123 = 01111011_2$, and the sign bit is 0. So 0.1015625 is

00111101110100000000000000000000
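As a check on the hand encoding, the three pieces can be assembled into a bit string and compared with the pattern that the `struct` module produces for the same value. `assemble_binary32` is a hypothetical helper that assumes a normal number whose fraction fits in 23 bits.

```python
import struct


def assemble_binary32(sign: int, exponent: int, fraction_bits: str) -> str:
    """Concatenate sign, biased exponent and zero-padded fraction bits."""
    stored = exponent + 127  # add the bias
    return f"{sign:01b}{stored:08b}{fraction_bits.ljust(23, '0')}"


pattern = assemble_binary32(0, -4, "101")
(reference,) = struct.unpack(">I", struct.pack(">f", 0.1015625))
print(pattern)                         # 00111101110100000000000000000000
print(pattern == f"{reference:032b}")  # True
```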



Rounding binary fractional numbers

• "Even" when the least significant bit is 0
• Halfway when the bits to the right of the rounding position = $100\ldots_2$

Examples: round to nearest 1/4 (2 bits right of binary point)

Value     Binary      Rounded (binary)    Action           Rounded value
2 3/32    10.00011    10.00               (< 1/2, down)    2
2 3/16    10.00110    10.01               (> 1/2, up)      2 1/4
2 7/8     10.11100    11.00               (1/2, up)        3
2 5/8     10.10100    10.10               (1/2, down)      2 1/2
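The four examples can be reproduced with exact rational arithmetic. This is a minimal sketch of round-to-nearest with ties going to the even neighbor; `round_half_even` is a hypothetical helper, and `fraction_bits=2` corresponds to rounding to the nearest 1/4.

```python
from fractions import Fraction


def round_half_even(value: Fraction, fraction_bits: int) -> Fraction:
    """Round value to a multiple of 2**-fraction_bits, ties to even."""
    scaled = value * 2**fraction_bits
    lower = scaled.numerator // scaled.denominator  # floor
    remainder = scaled - lower                      # in [0, 1)
    half = Fraction(1, 2)
    # Round up when above halfway, or exactly halfway with an odd last bit.
    if remainder > half or (remainder == half and lower % 2 == 1):
        lower += 1
    return Fraction(lower, 2**fraction_bits)


print(round_half_even(Fraction(23, 8), 2))  # 3    (2 7/8, halfway, up)
print(round_half_even(Fraction(21, 8), 2))  # 5/2  (2 5/8, halfway, down)
```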



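Operands:

$(-1)^{s_1} M_1 \, 2^{E_1}$
$(-1)^{s_2} M_2 \, 2^{E_2}$

Assume $E_1 > E_2$

Exact result: $(-1)^{s} M \, 2^{E}$

• Sign $s$, significand $M$: result of signed align & add
• Exponent $E$: $E_1$

Fixing:

• If $M \geq 2$, shift $M$ right, increment $E$
• If $M < 1$, shift $M$ left $k$ positions, decrement $E$ by $k$
• Overflow if $E$ out of range
• Round $M$ to fit frac precision

[Figure: $(-1)^{s_2} M_2$ is shifted right by $E_1 - E_2$ positions to align with $(-1)^{s_1} M_1$; the two are added to give $(-1)^{s} M$.]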
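The align, add, fix and round steps can be sketched with exact fractions standing in for the hardware significand registers. Everything named here (`fp_add`, the `(sign, M, E)` tuple layout, the `precision` parameter) is an assumption made for illustration, and the overflow check on $E$ is deliberately left out.

```python
from fractions import Fraction


def fp_add(a, b, precision: int = 24):
    """Add two values given as (sign, M, E), meaning (-1)**sign * M * 2**E."""
    if a[2] < b[2]:
        a, b = b, a  # make sure E1 >= E2
    (s1, m1, e1), (s2, m2, e2) = a, b
    # Signed align & add: shift the smaller operand right by E1 - E2.
    m = (-1) ** s1 * m1 + (-1) ** s2 * m2 / 2 ** (e1 - e2)
    if m == 0:
        return (0, Fraction(0), 0)
    s, m, e = (1 if m < 0 else 0), abs(m), e1
    while m >= 2:  # fix: shift right, increment E
        m, e = m / 2, e + 1
    while m < 1:   # fix: shift left, decrement E
        m, e = m * 2, e - 1
    # Round M to `precision` bits; round() on a Fraction rounds ties to even.
    scale = 2 ** (precision - 1)
    m = Fraction(round(m * scale), scale)
    if m >= 2:     # rounding may carry out of the top bit
        m, e = m / 2, e + 1
    return (s, m, e)


# 1.5 x 2**1 + 1.0 x 2**0 = 4 = 1.0 x 2**2
print(fp_add((0, Fraction(3, 2), 1), (0, Fraction(1), 0)))
```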



  3.25 x 10 ** 3
+ 2.63 x 10 ** -1
-----------------

first step: align decimal points
second step: add

  3.25     x 10 ** 3
+ 0.000263 x 10 ** 3
--------------------
= 3.250263 x 10 ** 3
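The two steps of this decimal example can be mirrored with the standard `decimal` module, whose `scaleb` method shifts the decimal point without changing the digits. `add_scientific` is a hypothetical helper; it keeps the larger exponent and does not renormalize the sum.

```python
from decimal import Decimal


def add_scientific(m1: str, e1: int, m2: str, e2: int):
    """Add m1 x 10**e1 and m2 x 10**e2, returning (mantissa, exponent)."""
    a, b = Decimal(m1), Decimal(m2)
    if e1 < e2:
        a, e1, b, e2 = b, e2, a, e1  # keep the larger exponent first
    aligned = b.scaleb(e2 - e1)      # first step: align decimal points
    return a + aligned, e1           # second step: add


print(add_scientific("3.25", 3, "2.63", -1))  # (Decimal('3.250263'), 3)
```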
