
MATH4414: An Introduction to Floating Point Arithmetic

Pat Quillen

Boston College

30 January 2018

A motivating example

What is the value of

1 − 3*(4/3 − 1)

according to Matlab?

2.220446049250313e-016

Why?? Essentially because 4/3 cannot be represented exactly by a binary number with finitely many terms. By the way... this behavior is not specific to Matlab.
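You can check this yourself at the Matlab prompt (Octave and any other IEEE double arithmetic behave the same way; the exact text of the printed output varies slightly across versions):

    >> format long
    >> 1 - 3*(4/3 - 1)
    ans =
         2.220446049250313e-16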

Example (continued)

Notice that

    4/3 = 1/(3/4) = 1/(1 − 1/4) = ∑_{k=0}^∞ (1/4)^k

That is,

    4/3 = 1 + 1/2^2 + 1/2^4 + 1/2^6 + · · ·

or, in binary,

    4/3 = 1.010101010101 · · ·

which, again, is not exactly representable by finitely many terms.

Floating Point Representation

In binary computers, most floating point numbers are represented as

    (−1)^s · 2^e · (1 + f)

where

- s is represented by one bit (called the sign bit).
- e is the exponent.
- f is the mantissa.

For double precision numbers, e is an eleven-bit number and f is a fifty-two-bit number.

Floating Point Exponent

- As e is represented by 11 bits, it can range in value from 0 to 2^11 − 1 = 2047.
- Negative exponents are represented by biasing e when stored.
- The double precision bias is 2^10 − 1 = 1023. Thus, −1023 ≤ e ≤ 1024.
- The extreme values e = −1023 (stored as e_b = 0) and e = 1024 (stored as e_b = 2047) are special, so −1022 ≤ e ≤ 1023 is the valid range of the exponent (see the check below).
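A quick sanity check of this range in Matlab uses the built-in constants realmin and realmax, the smallest and largest normalized doubles:

    >> log2(realmin)        % realmin = 2^-1022, the smallest normalized double
    ans =
       -1022
    >> log2(realmax)        % realmax = (2 - 2^-52)*2^1023, just below 2^1024
    ans =
        1024

The second prints 1024 only because the exact logarithm, slightly below 1024, rounds to 1024 in double precision.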

Floating Point Mantissa

- f limits the precision of the floating point number.
- 0 ≤ f < 1.
- The format 2^e (1 + f) provides an implicitly stored 1, so doubles actually have 53 bits of precision.
- 2^52 · f is an integer ⇒ gaps between successive doubles.

For example, all integers up to 2^53 are exactly representable as floating point numbers, but 2^53 + 1 is not.
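This gap is easy to observe in Matlab; above 2^53 the spacing between consecutive doubles is 2, so adding 1 is invisible:

    >> 2^53 == 2^53 + 1     % 2^53 + 1 is not representable; the sum rounds back
    ans =
         1
    >> 2^53 == 2^53 + 2     % 2^53 + 2 is representable
    ans =
         0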

Examples

The number 1 is represented as

    (−1)^0 · 2^0 · (1 + 0).

That is, s = 0, e = 0, f = 0. Adding the bias (1023), the biased value of e is e_b = 1023.

You can use format hex in Matlab to see the bit pattern of the floating point number in hexadecimal. The first three hex digits (12 bits) represent the sign bit and the biased exponent, and the remaining 13 hex digits (52 bits) represent the mantissa.
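For example:

    >> format hex
    >> 1
    ans =
       3ff0000000000000

(The function num2hex(1) returns the same string without changing the display format.)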

Examples

- In the case of the number 1, s = 0 and e_b = 01111111111, so the first three hex digits are 001111111111 = 3ff, and 1 is represented by

      3ff0000000000000

- For 4/3, f = 0.01010101 · · · 0101, or 55 · · · 5 in hex. As with 1, 4/3 has e = 0, and so it has representation

      3ff5555555555555

  which is just slightly smaller than 4/3.

- The real number 0.1 has e = −4 and f = 0.10011001 · · · 10011010, and thus has representation

      3fb999999999999a

  which is just slightly larger than 0.1.

Round-off

Since fl(4/3) ≠ 4/3 (where fl(x) stands for "the floating point representation of x"), we see the behavior

    1 − 3*(4/3 − 1) ≠ 0.

All of the operations except the division are performed without error, and the special value

    ε = 2^−52

is the result. ε is referred to as machine epsilon, or the unit-roundoff, and it is the distance between 1 and the next largest floating point number.
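Matlab exposes this value as the built-in eps, which makes the claim easy to verify:

    >> eps == 2^-52
    ans =
         1
    >> (1 + eps) - 1        % 1 + eps is the next double after 1, so this is exact
    ans =
         2.2204e-16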

Two Important Definitions

Suppose p* is an approximation to p.

Definition. The absolute error in the approximation is |p − p*|.

Definition. The relative error in the approximation is |p − p*| / |p|, provided that p ≠ 0.

In a normed linear space, the absolute value | · | is replaced by a norm ‖ · ‖, but we'll talk about this later.

Example

For example, if p = 0.4 and p* = 0.404, the absolute error in the approximation is

    |p − p*| = |0.4 − 0.404| = 0.004

while the relative error is

    |p − p*| / |p| = |0.4 − 0.404| / |0.4| = 0.004 / 0.4 = 0.01

On the other hand, with p = 400 and p* = 404, we have an absolute error of 4, but a relative error of 0.01.
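The same computation carries over directly to Matlab (the displayed digits are rounded by the default format short):

    >> p = 0.4; pstar = 0.404;
    >> abs(p - pstar)               % absolute error
    ans =
        0.0040
    >> abs(p - pstar)/abs(p)        % relative error
    ans =
        0.0100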

Representation Error

Again, floating-point numbers are represented as

    (−1)^s · 2^e · (1 + f)

where s ∈ {0, 1}, e is some integer, and 0 ≤ f < 1.

Let's suppose that a number p is represented exactly by

    p = 2^e (1 + f)

for some f with possibly infinitely many digits following the binary point. That is,

    f = 0.b_1 b_2 · · · b_k b_{k+1} · · ·

where b_i ∈ {0, 1}.

Representation Error

Suppose further that p* is an approximation to p obtained by simply chopping off the bits of f after the first k bits. That is,

    p* = 2^e (1 + f*)

where

    f* = 0.b_1 b_2 · · · b_k

Question. What are the absolute and relative errors in the approximation p* of p?

Representation Error

The absolute error here is

    |p − p*| = |2^e (1 + f) − 2^e (1 + f*)|
             = 2^e |f − f*|
             = 2^e |(0.b_1 b_2 · · · b_k b_{k+1} · · ·) − (0.b_1 b_2 · · · b_k)|
             = 2^e |0.0 · · · 0 b_{k+1} · · ·| ≤ 2^e · 2^−k = 2^(e−k).

In turn, then, the relative error would be

    |p − p*| / |p| ≤ 2^(e−k) / |p| ≤ 2^(e−k) / 2^e = 2^−k

since 2^e ≤ |p| < 2^(e+1).

Representation Error

In the case of rounding, the bound on absolute error is still

    |p − p*| = 2^e |f − f*|

but this time |f − f*| ≤ 2^−(k+1). This is because in the case of rounding

    f* = 0.b_1 b_2 · · · b_k + { 0        if b_{k+1} = 0
                               { 2^−k     if b_{k+1} = 1

Therefore |p − p*| ≤ 2^(e−(k+1)) and |p − p*| / |p| ≤ 2^−(k+1).

Rounding Modes

IEEE specifies five rounding modes, two of which we just discussed:

- Round down: fl(x) is the largest floating-point number less than or equal to x.
- Round up: fl(x) is the smallest floating-point number greater than or equal to x.
- Round towards 0: If x > 0, this is round-down. If x < 0, this is round-up. (This is chopping.)
- Round to nearest-even: Usual rounding except ties broken by rounding to even. (E.g. 1.5 rounds up to 2.0, but 2.5 also rounds to 2.0; see the check below.)
- Round to nearest-inf: Usual rounding, except ties broken by rounding up. (E.g. 1.5 rounds up to 2.0, and −1.5 rounds up to −1.0.)
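Matlab computes with round to nearest-even by default, and you can watch the tie-breaking at work where ties actually occur in double precision: 2^53 + 1 lies exactly halfway between the representable neighbors 2^53 and 2^53 + 2:

    >> 2^53 + 1 == 2^53         % the tie rounds down to the even neighbor
    ans =
         1
    >> 2^53 + 3 == 2^53 + 4     % this tie rounds up, again to the even neighbor
    ans =
         1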

Floating-point Arithmetic

The principal floating-point arithmetic operations are typically represented as ⊕, ⊗, etc. and are defined as

    x ⊕ y = fl(x + y)

where x, y are floating-point numbers, i.e. fl(x) = x. According to the IEEE standard, the result of an operation on two floating-point numbers must be the correctly rounded value of the exact result. This means that

    x ⊕ y = fl(x + y) = (x + y)(1 + δ)

where |δ| < ε or |δ| ≤ ε/2 depending on the rounding mode.

Example

Compute 2/3 ⊕ 3/4 using four-digit (decimal) chopping. Compute the relative error in the approximation to 2/3 + 3/4.

    2/3 ⊕ 3/4 = fl(fl(2/3) + fl(3/4))
              = fl(0.6666 + 0.7500)
              = fl(1.4166)
              = 1.416

Computing the relative error requires the exact value, which is 2/3 + 3/4 = 17/12, and then we compute

    |17/12 − 1.416| = |5/12 − 0.416|

Example (continued)

    |5/12 − 0.416| = |5/12 − 52/125|
                   = |625/1500 − 624/1500|
                   = 1/1500

The relative error, therefore, is

    (1/1500) / (17/12) = 1/2125 ≈ 4.706 × 10^−4

Discussion

In four-digit (decimal) floating-point, ε = 10^−3. Note that the relative error here is, in fact,

    1/2125 < 1/1000 = 10^−3.

If instead we consider rounding, we see that the relative error is

    |5/12 − 0.417| = |5/12 − 417/1000|
                   = |1250/3000 − 1251/3000|
                   = 1/3000

which is less than 1/2000.

Example

A very common example of propagation of round-off comes in the form of

    0.1 + 0.1 + 0.1

Specifically, is the above expression equal to 0.3?

No! As a matter of fact, Matlab will tell you that 0.3 is represented by

    3fd3333333333333

while 0.1 + 0.1 + 0.1 is represented by

    3fd3333333333334

The difference in the last place is due to accumulation of the difference between 0.1 and fl(0.1).
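The function num2hex makes this easy to reproduce without switching to format hex:

    >> num2hex(0.3)
    ans =
        '3fd3333333333333'
    >> num2hex(0.1 + 0.1 + 0.1)
    ans =
        '3fd3333333333334'
    >> 0.1 + 0.1 + 0.1 == 0.3
    ans =
         0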

Deadly Consequences

Numerical Disasters

1991: Patriot Missile misses Scud!
1996: Ariane Rocket explodes!

Swamping

Due to finiteness of precision, floating point addition can suffer swamping. Suppose we have two floating point numbers a = 10^5 and b = 10^−12. The quantity c = a + b is equal to a, since a and b differ by many orders of magnitude. Notice that 10^17 > 2^56, so this shouldn't be surprising.

To rectify the effects of swamping, one may compute in increasing order of magnitude. For example, try these in Matlab (results shown below):

    eps/2 + 1 − eps/2
    eps/2 − eps/2 + 1

Note: It is frequently infeasible to do this!
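With the default round-to-nearest mode, the two orderings really do give different answers:

    >> eps/2 + 1 - eps/2 == 1       % 1 + eps/2 rounds to 1; then 1 - eps/2 is exact
    ans =
         0
    >> eps/2 - eps/2 + 1 == 1       % summing the small terms first gives the exact result
    ans =
         1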

Cancellation

A phenomenon not dissimilar from swamping is cancellation, which occurs when a number is subtracted from another number of roughly the same magnitude.

For example, for values of x very near 0, the expression

    √(x + 1) − 1

suffers cancellation, as 1 swamps x in the computation of x + 1, and the subsequent subtraction results in 0.

Cancellation

To get around the effects of cancellation, one may rewrite the computation in an equivalent form that avoids the cancellation altogether. For example, computing with

    √(x + 1) − 1 = x / (√(x + 1) + 1)

avoids the cancellation for values of x near zero. Now, the only value of x that results in a zero output is 0 itself.¹

Note: Not all cancellation can be avoided, and not all cancellation is bad!

¹ Except for the smallest subnormal number.
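A minimal comparison of the two forms (x = 1e-15 is just an arbitrary small test value):

    >> x = 1e-15;
    >> sqrt(x + 1) - 1          % cancellation: only a digit or so is correct
    >> x/(sqrt(x + 1) + 1)      % equivalent form: close to the true value, about 5e-16

The first expression returns roughly 4.44e-16, off from the true value x/2 − x²/8 ≈ 5e-16 by about 11%, while the second is accurate to nearly full precision.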

Ill-conditioned problems

Ill-conditioning refers to the sensitivity of a problem to perturbation. That is, tiny changes in the input may cause extreme changes in the output.

[Figure omitted: eigenvalue plots of two matrices.] The picture shows eigenvalues of two matrices which differ only in their diagonal elements by about 10^−16. Notice that the eigenvalues in the middle are almost a whole unit apart!
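The figure is not reproduced here, but a classical experiment in the same spirit (perturbing one entry of a Jordan block, rather than the exact matrices from the picture) shows how violently eigenvalues can react to a roundoff-sized change:

    >> n = 16;
    >> A = gallery('jordbloc', n, 1);   % Jordan block: all n eigenvalues equal 1
    >> E = zeros(n); E(n,1) = 1e-16;    % a single perturbation of size 1e-16
    >> max(abs(eig(A + E) - 1))         % eigenvalues land near a circle of
                                        % radius (1e-16)^(1/16) = 0.1 around 1

A 10^−16 perturbation moves the eigenvalues by about 10^−1: fifteen orders of magnitude larger.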

Resources

- What Every Computer Scientist Should Know About Floating-Point Arithmetic by David Goldberg. Available here.
- Numerical Methods by Anne Greenbaum and Timothy P. Chartier.
- Numerical Computing with MATLAB by Cleve Moler. Available here.
- Accuracy and Stability of Numerical Algorithms by Nicholas J. Higham.
- Technical Note regarding Floating Point Arithmetic. Available here.