Compensated Algorithms
Philippe Langlois, Nicolas Louvet
Equipe DALI, Universite de Perpignan
http://webdali.univ-perp.fr
Compensated Algorithms – N. Louvet April 18, 2007 1 / 33
Motivation
Results computed in floating point arithmetic are possibly corrupted by
rounding errors.
Compensated algorithms = algorithms that correct the rounding errors
generated during the computation:
If r̂ is a computed result, how can we find a correcting term ĉ such that
r̄ = r̂ ⊕ ĉ is more accurate than r̂?
The aims of this presentation are:
- to recall the principle of so-called compensated algorithms,
- to present some details about the compensated Horner algorithm [1].
Context: IEEE-754 floating point arithmetic, rounding to the nearest, no underflow.
[1] S. Graillat, Ph. Langlois, N. Louvet. Compensated Horner Scheme. Research Report, DALI-LP2A, Université de Perpignan, Aug. 2005 (submitted).
Outline
1 Introduction
2 Principle of Compensated Algorithms
3 Sketch of proof for the compensated Horner algorithm
4 Conclusion
5 More slides
Faithful rounding with the CHS
Practical efficiency
Introduction
Why do we need compensated algorithms?
Backward stable algorithms vs. condition number
Backward stable algorithms:
The accuracy of the computed solution satisfies
accuracy ≲ condition number × u,
where
- u is the computing precision:
  IEEE-754 double, 53-bit mantissa, rounding to the nearest ⇒ u = 2⁻⁵³;
- the condition number quantifies the difficulty of solving the problem accurately.
Examples: summation, dot product, Horner algorithm, substitution for triangular system solving.
Accuracy of the Horner scheme
We consider the polynomial
p(x) = ∑_{i=0}^{n} a_i x^i, with a_i ∈ F, x ∈ F.

Algorithm (Horner scheme)
function r̂_0 = Horner(p, x)
  r̂_n = a_n
  for i = n − 1 : −1 : 0
    r̂_i = r̂_{i+1} ⊗ x ⊕ a_i
  end
Relative accuracy of the evaluation with the Horner scheme:
|Horner(p, x) − p(x)| / |p(x)| ≤ γ_{2n} cond(p, x), with γ_{2n} ≈ 2nu,
where
- u is the computing precision,
- cond(p, x) denotes the condition number of the evaluation:
  cond(p, x) = (∑_{i=0}^{n} |a_i x^i|) / |p(x)| ≥ 1.
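As a concrete illustration, here is a minimal Python sketch of the Horner recurrence above (the name `horner` and the convention that `a[i]` is the coefficient of x^i are ours); Python floats are IEEE-754 binary64 with rounding to the nearest, so each `*` and `+` below is a rounded ⊗ or ⊕.

```python
def horner(a, x):
    """Evaluate p(x) = a[0] + a[1]*x + ... + a[n]*x**n by the Horner
    scheme: r_i = r_{i+1} (*) x (+) a_i, rounding at every operation."""
    r = a[-1]                     # r_n = a_n
    for ai in reversed(a[:-1]):   # i = n-1 down to 0
        r = r * x + ai
    return r

# (x - 1)^3 expanded; near x = 1 the evaluation is ill conditioned,
# so the computed value may have few (or even no) correct digits.
p = [-1.0, 3.0, -3.0, 1.0]
print(horner(p, 1.000001))
```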
Accuracy ≲ condition number of the problem × u
[Figure: Accuracy of the polynomial evaluation (n = 50) — relative forward error vs. condition number, with reference lines u, 1/u, and γ_{2n} cond.]

How to manage ill-conditioned cases?
Compensated algorithms
Algorithms that correct the generated rounding errors
Many examples:
I Compensated summation: Neumaier (74), Sum2 in Ogita-Rump-Oishi (05)
I Compensated Horner algorithm
I Compensated substitution for triangular system solving
Accuracy as if computed in twice the working precision:
accuracy ≲ u + condition number × u²
More efficient than fixed length expansion libraries (double-double)
Not considered here: Kahan’s compensated summation (65), Priest’s
doubly compensated summation (92)
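As a hedged illustration of the first item, here is a sketch of Sum2-style compensated summation (Ogita-Rump-Oishi) in Python; the names `two_sum` and `sum2` are ours. Each `two_sum` recovers the rounding error of one addition exactly, and the accumulated errors form the correcting term.

```python
def two_sum(a, b):
    """Knuth's 2Sum: s = fl(a + b) and a + b = s + e exactly (6 flops)."""
    s = a + b
    z = s - a
    e = (a - (s - z)) + (b - z)
    return s, e

def sum2(xs):
    """Compensated summation in the style of Sum2: sum the terms,
    accumulate the exact per-addition rounding errors, compensate."""
    s = 0.0
    c = 0.0
    for x in xs:
        s, e = two_sum(s, x)
        c += e          # correcting term (itself rounded, but second order)
    return s + c

data = [1e16, 1.0, -1e16]
print(sum(data))    # 0.0: the 1.0 is lost in plain summation
print(sum2(data))   # 1.0: the rounding error is recovered
```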
Accuracy of the result ≲ u + condition number × u².
[Figure: Accuracy of polynomial evaluation with the compensated Horner scheme (n = 50) — relative forward error vs. condition number, with reference lines u, 1/u, 1/u², and u + γ_{2n}² cond.]

How is the compensated result computed?
Principle of Compensated Algorithms
The compensated result is r̄ = r̂ ⊕ ĉ. The correcting term ĉ is
an approximation of the forward error in r̂,
computed thanks to "Error-Free Transformations".
Compensated result
[Figure: forward error analysis. The exact function G maps the input x to r = G(x); its floating point implementation Ĝ maps x to the computed result r̂ = Ĝ(x). The forward error is the distance between r and r̂ in the output space R.]
Compensated result
[Figure: backward error analysis. The computed result r̂ = Ĝ(x) is also the exact image G(x + Δx) of a perturbed input; Δx is the backward error.]

Backward error analysis: identify r̂ as the exact solution of a perturbed problem,
r̂ = G(x + Δx).
Compensated result
[Figure: the correcting term. The forward error c = r − r̂ is approximated by ĉ, and the compensated result r̄ = r̂ ⊕ ĉ is closer to r.]

How to improve the quality of the computed result r̂?
- First, compute an approximation ĉ of the forward error c = r − r̂: ĉ is a correcting term for r̂.
- Then, compute the compensated result r̄ = r̂ ⊕ ĉ.
How to compute the correcting term ĉ?
ĉ is an approximation of the forward error c = r − r̂.
Classical error analysis:
Let p(x) = ∑_{i=0}^{n} a_i x^i be a polynomial with floating point coefficients.
- Backward error result: the Horner algorithm computes
  Horner(p, x) = ∑_{i=0}^{n} (1 + δ_i) a_i x^i, with |δ_i| ≤ γ_{2n} ≈ 2nu.
- Forward error result:
  c = p(x) − Horner(p, x) = −∑_{i=0}^{n} δ_i a_i x^i ⇒ |c| ≤ γ_{2n} ∑_{i=0}^{n} |a_i| |x|^i.
As the δ_i are unknown, classical error analysis does not solve our problem…
But we can do better thanks to Error-Free Transformations!
Error-free transformations
Error-Free Transformations (EFT) are properties and algorithms to compute the
rounding errors at the current working precision.
+ : (x, y) = 2Sum(a, b), 6 flops — Knuth (74),
    such that a + b = x + y and x = a ⊕ b;
× : (x, y) = 2Prod(a, b), 17 flops — Veltkamp, Dekker (71),
    such that a × b = x + y and x = a ⊗ b;
with a, b, x, y ∈ F.
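These two EFTs can be written directly in Python on IEEE-754 binary64; this is a sketch with our own names `two_sum`, `split`, `two_prod`, following the algorithms of Knuth and Veltkamp/Dekker cited above.

```python
from fractions import Fraction  # only used to check exactness below

def two_sum(a, b):
    """Knuth (74), 6 flops: x = a (+) b and a + b = x + y exactly."""
    x = a + b
    z = x - a
    y = (a - (x - z)) + (b - z)
    return x, y

def split(a):
    """Veltkamp splitting: a = hi + lo, each half short enough that
    products of halves are exact in binary64."""
    c = 134217729.0 * a          # 2**27 + 1 for binary64
    hi = c - (c - a)
    return hi, a - hi

def two_prod(a, b):
    """Veltkamp/Dekker (71), 17 flops: x = a (*) b and a*b = x + y exactly."""
    x = a * b
    ah, al = split(a)
    bh, bl = split(b)
    y = al * bl - (((x - ah * bh) - al * bh) - ah * bl)
    return x, y

# The transformations are error free: a + b and a * b are represented
# exactly as unevaluated sums x + y of two floats.
x, y = two_prod(1.1, 2.3)
assert Fraction(1.1) * Fraction(2.3) == Fraction(x) + Fraction(y)
```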
How to compute the correcting term ĉ thanks to EFT?
Algorithm (Horner scheme)
function r̂_0 = Horner(p, x)
  r̂_n = a_n
  for i = n − 1 : −1 : 0
    p̂_i = r̂_{i+1} ⊗ x   % rounding error π_i ∈ F ⇒ p̂_i = r̂_{i+1} x − π_i
    r̂_i = p̂_i ⊕ a_i     % rounding error σ_i ∈ F ⇒ r̂_i = p̂_i + a_i − σ_i
  end

For i = n − 1 : −1 : 0, we thus have r̂_i = r̂_{i+1} x + a_i − π_i − σ_i. Since r̂_n = a_n,
Horner(p, x) = ∑_{i=0}^{n} (a_i − π_i − σ_i) x^i, with π_n = σ_n = 0.
How to compute the correcting term ĉ thanks to EFT?
ĉ is an approximation of the forward error c = r − r̂.
Express r̂ w.r.t. the data and the elementary rounding errors: the Horner algorithm computes
Horner(p, x) = ∑_{i=0}^{n} (a_i − π_i − σ_i) x^i, with π_n = σ_n = 0.
Deduce an expression for the forward error c = r − r̂:
c = p(x) − Horner(p, x) = ∑_{i=0}^{n−1} (π_i + σ_i) x^i.
If we manage to find a closed-form formula for c w.r.t.
- the data,
- the elementary rounding errors (exactly computable thanks to EFT),
then we can easily compute a correcting term ĉ.
An EFT for the Horner Algorithm
We have
c = ∑_{i=0}^{n−1} (π_i + σ_i) x^i = (p_π + p_σ)(x),
with p_π(X) = ∑ π_i X^i and p_σ(X) = ∑ σ_i X^i.

Algorithm (EFT for Horner)
function [r̂_0, p_π, p_σ] = EFTHorner(p, x)
  r̂_n = a_n
  for i = n − 1 : −1 : 0
    [p̂_i, π_i] = 2Prod(r̂_{i+1}, x)
    [r̂_i, σ_i] = 2Sum(p̂_i, a_i)
    p_π[i] = π_i; p_σ[i] = σ_i
  end

This algorithm is an EFT for the Horner algorithm since
p(x) = Horner(p, x) + (p_π + p_σ)(x), the last term being exactly the forward error c.
Compensated Horner Algorithm
Since c = (p_π + p_σ)(x), we compute the correcting term ĉ as Horner(p_π ⊕ p_σ, x).
Algorithm (Compensated Horner scheme)
function r̄ = CompHorner(p, x)
  [r̂, p_π, p_σ] = EFTHorner(p, x)   % r̂ = Horner(p, x)
  ĉ = Horner(p_π ⊕ p_σ, x)
  r̄ = r̂ ⊕ ĉ
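The two algorithms above can be sketched in runnable Python (IEEE-754 binary64); the helper names, and the convention that `a[i]` is the coefficient of x^i, are our assumptions, not the authors' reference code.

```python
def two_sum(a, b):
    # Knuth's 2Sum: a + b = x + y exactly, with x = fl(a + b)
    x = a + b
    z = x - a
    return x, (a - (x - z)) + (b - z)

def two_prod(a, b):
    # Veltkamp/Dekker's 2Prod: a * b = x + y exactly, with x = fl(a * b)
    x = a * b
    c = 134217729.0 * a; ah = c - (c - a); al = a - ah
    c = 134217729.0 * b; bh = c - (c - b); bl = b - bh
    return x, al * bl - (((x - ah * bh) - al * bh) - ah * bl)

def eft_horner(a, x):
    """EFT for Horner: returns (Horner(p, x), p_pi, p_sigma) with
    p(x) = Horner(p, x) + (p_pi + p_sigma)(x) exactly."""
    n = len(a) - 1
    r = a[n]
    p_pi, p_sigma = [0.0] * n, [0.0] * n
    for i in range(n - 1, -1, -1):
        p, p_pi[i] = two_prod(r, x)
        r, p_sigma[i] = two_sum(p, a[i])
    return r, p_pi, p_sigma

def comp_horner(a, x):
    """Compensated Horner: r_bar = r_hat (+) Horner(p_pi (+) p_sigma, x)."""
    r, p_pi, p_sigma = eft_horner(a, x)
    c = 0.0
    for i in range(len(a) - 2, -1, -1):   # degree n-1 error polynomial
        c = c * x + (p_pi[i] + p_sigma[i])
    return r + c

# (x - 1)^3 at x = 1 + 2**-20: the exact value 2**-60 is tiny and the
# condition number is huge, yet the compensated result stays accurate.
p = [-1.0, 3.0, -3.0, 1.0]
print(comp_horner(p, 1.0 + 2.0 ** -20), 2.0 ** -60)
```

On such ill-conditioned inputs the plain Horner scheme may lose every digit, while the compensated result behaves as predicted by the bound u + γ_{2n}² cond(p, x).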
Next question: how to prove something about the accuracy of r̄?
(Accuracy of the result ≲ u + condition number × u².)
Difficult to answer in a general manner…
Sketch of proof for the compensated Horner algorithm
The key "ingredient" here is the EFT for the Horner algorithm,
p(x) = Horner(p, x) + (p_π + p_σ)(x), where the last term is exactly the forward error c.
Sketch of proof for the compensated Horner algorithm
We recall:
- r̂ = Horner(p, x) is the computed evaluation,
- c = (p_π + p_σ)(x) is the forward error in r̂,
- ĉ = Horner(p_π ⊕ p_σ, x) is the computed correcting term,
- r̄ = r̂ ⊕ ĉ is the compensated result.
Since the compensated result is r̄ = r̂ ⊕ ĉ,
|p(x) − r̄| = |p(x) − (1 + ε)(r̂ + ĉ)|, with |ε| ≤ u.
Using the EFT for the Horner scheme, r̂ = p(x) − c, so
|p(x) − r̄| = |p(x) − (1 + ε)(p(x) + ĉ − c)| ≤ u |p(x)| + (1 + u) |c − ĉ|.
How can we bound |c − ĉ|?
Sketch of proof for the compensated Horner algorithm
|c − ĉ| is the forward error in ĉ = Horner(p_π ⊕ p_σ, x). Then
|c − ĉ| ≤ γ_{2n−1} (p̃_π + p̃_σ)(x),
with (p̃_π + p̃_σ)(x) = ∑_{i=0}^{n−1} |π_i + σ_i| |x|^i. We bound this term generously as follows:
(p̃_π + p̃_σ)(x) ≤ γ_{2n} p̃(x),
with p̃(x) = ∑_{i=0}^{n} |a_i| |x|^i. Therefore,
|c − ĉ| ≤ γ_{2n−1} γ_{2n} p̃(x).
Nota:
γ_k = ku / (1 − ku) = ku + O(u²).
Sketch of proof for the compensated Horner algorithm
Then,
|p(x) − r̄| ≤ u |p(x)| + (1 + u) γ_{2n−1} γ_{2n} p̃(x) ≤ u |p(x)| + γ_{2n}² p̃(x).
Now we turn to relative accuracy, and we obtain the following theorem.

Theorem. Given a polynomial p with floating point coefficients and a floating point value x, let r̄ be the compensated evaluation of p(x) computed with CompHorner. Then
|p(x) − r̄| / |p(x)| ≤ u + γ_{2n}² cond(p, x).

Again, γ_{2n}² ≈ 4n²u².
Accuracy of the result ≲ u + condition number × u².
[Figure: Accuracy of polynomial evaluation (n = 25) — relative forward error vs. condition number for Horner, DDHorner, and CompHorner, with reference lines u, 1/u, 1/u², γ_{2n} cond, and u + γ_{2n}² cond.]
Conclusion
We have recalled how to define a correcting term:
correcting term = approximation of the forward error.
Compensating the Horner algorithm improves the accuracy:
- the accuracy of the compensated result is the same as if the result were computed in twice the working precision;
- remark: this is also true for compensated summation and compensated triangular system solving.
Faithful rounding with the CHS
Faithful rounding
Definition
A floating point number x̂ is said to be a faithful rounding of a real number x if
- either x̂ = x,
- or x̂ is one of the two floating point neighbours of x.
The error bound
|CompHorner(p, x) − p(x)| / |p(x)| ≤ u + γ_{2n}² cond(p, x), with γ_{2n}² ≈ 4n²u²,
is too large for reasoning about faithful rounding.
An a posteriori test
We recall:
- r̄ = CompHorner(p, x) is the compensated result,
- c = (p_π + p_σ)(x) is the exact (real) correcting term for r̂,
- ĉ = Horner(p_π ⊕ p_σ, x) is the computed (floating point) correcting term.
The following error bound on the computed correcting term holds:
|c − ĉ| ≤ fl( γ_{2n−1} Horner(|p_π| ⊕ |p_σ|, |x|) / (1 − 2(n + 1)u) ) =: β.
Then, we can perform a dynamic test for faithful rounding:

Theorem. β < (u/2) |r̄| ⇒ |c − ĉ| < (u/2) |r̄| ⇒ r̄ is a faithful rounding of p(x).
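The dynamic test can be sketched in Python on top of the EFT for Horner. The helper names, and the exact way the bound β is evaluated, are our assumptions; this only checks the sufficient condition β < (u/2)|r̄| and is not the authors' reference code.

```python
def two_sum(a, b):
    # a + b = x + y exactly, x = fl(a + b)
    x = a + b
    z = x - a
    return x, (a - (x - z)) + (b - z)

def two_prod(a, b):
    # a * b = x + y exactly, x = fl(a * b)
    x = a * b
    c = 134217729.0 * a; ah = c - (c - a); al = a - ah
    c = 134217729.0 * b; bh = c - (c - b); bl = b - bh
    return x, al * bl - (((x - ah * bh) - al * bh) - ah * bl)

def comp_horner_is_faithful(a, x):
    """Compensated Horner with the a posteriori test: returns
    (r_bar, certified); certified=True means the dynamic test proved
    that r_bar is a faithful rounding of p(x)."""
    u = 2.0 ** -53
    n = len(a) - 1
    r = a[n]
    errs = []
    for i in range(n - 1, -1, -1):     # EFT for Horner
        p, pi = two_prod(r, x)
        r, sigma = two_sum(p, a[i])
        errs.append((pi, sigma))
    c = 0.0   # Horner(p_pi + p_sigma, x)
    b = 0.0   # Horner(|p_pi| + |p_sigma|, |x|)
    for pi, sigma in errs:             # coefficients high to low already
        c = c * x + (pi + sigma)
        b = b * abs(x) + (abs(pi) + abs(sigma))
    r_bar = r + c
    g = (2 * n - 1) * u / (1 - (2 * n - 1) * u)   # gamma_{2n-1}
    beta = g * b / (1 - 2 * (n + 1) * u)
    return r_bar, beta < 0.5 * u * abs(r_bar)
```

For a well-conditioned evaluation the test typically succeeds; very close to a root the condition number blows up, the test may fail, and then nothing is certified.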
An a posteriori test
[Figure: Accuracy of polynomial evaluation with the compensated Horner scheme (n = 50) — relative forward error vs. condition number, with reference lines u, 1/u, 1/u², u + γ_{2n}² cond, and a threshold line involving (1 − u)/(2 + u), u, and γ_{2n−2}.]
Practical efficiency
Overhead to obtain more accuracy
Theoretical ratios (flops):
CompHorner / Horner ≈ 10.5    CompHornerIsFaith / Horner ≈ 13    DDHorner / Horner ≈ 14

Some practical ratios (running times²):

                                  CompHorner   CompHornerIsFaith   DDHorner
                                  / Horner     / Horner            / Horner
Pentium 4, 3.00 GHz   GCC 3.3.5   3.77         5.52                10.00
                      ICC 9.1     3.06         5.31                 8.88
Athlon 64, 2.00 GHz   GCC 4.0.1   3.89         4.43                10.48
Itanium 2, 1.4 GHz    GCC 3.4.6   3.64         4.59                 5.50
                      ICC 9.1     1.87         2.30                 8.78
                                  ∼ 2–4        ∼ 4–6               ∼ 5–10

² Average ratios for polynomials of degree 5 to 200.
Some comparisons
How do the more accurate algorithms compare to each other?

                                  CompHornerIsFaith   DDHorner       DDHorner
                                  / CompHorner        / CompHorner   / CompHornerIsFaith
Pentium 4, 3.00 GHz   GCC 3.3.5   1.46                2.66           1.89
                      ICC 9.1     1.71                2.92           1.72
Athlon 64, 2.00 GHz   GCC 4.0.1   1.14                2.70           2.38
Itanium 2, 1.4 GHz    GCC 3.4.6   1.27                1.51           1.27
                      ICC 9.1     1.24                4.67           3.84
                                  ≤ 2                 ∼ 2–5          ∼ 2–5
What is Instruction-Level Parallelism?
"All processors since about 1985, including those in the embedded space, use pipelining to overlap the execution of instructions and improve performance. This potential overlap among instructions is called instruction-level parallelism (ILP) since the instructions can be evaluated in parallel." (Hennessy & Patterson)
A wide range of techniques have been developed to exploit the parallelism available among instructions (pipelining, superscalar architectures...).
Amount of ILP available in a code:
- if two instructions are parallel, they can execute simultaneously in a pipeline;
- if two instructions are dependent, they must be executed in order.
How to determine whether an instruction is dependent on another?
Dependences between instructions
Three different types of dependences:
- control dependences,
- name dependences,
- data dependences (or true dependences).
A control dependence determines the ordering of an instruction with respect to a branch instruction.
A name dependence occurs when two instructions use the same register or memory location (name), but there is in fact no flow of data between the instructions associated with that name.
But here we are mainly interested in data dependences.
Data dependences
An instruction i is data dependent on an instruction j if either
- instruction j produces a result that may be used by instruction i, or
- there exists a chain of dependences of the first type between i and j.
[Figure: a direct data dependence j → i, and a dependence chain j → k → i.]
If two instructions are data dependent, they cannot execute simultaneously.
Dependences are properties of programs: the presence of a data dependence in an instruction sequence reflects a data dependence in the source code.
What about the dependences in CompHorner and DDHorner?
Difference between DDHorner and CompHorner
function r̄ = CompHorner(P, x)
  s_n = a_n; c_n = 0
  for i = n − 1 : −1 : 0
    [p_i, π_i] = 2Prod(s_{i+1}, x)
    [s_i, σ_i] = 2Sum(p_i, a_i)
    c_i = c_{i+1} ⊗ x ⊕ (π_i ⊕ σ_i)
  end
  r̄ = s_0 ⊕ c_0

function r̂ = DDHorner(P, x)
  sh_n = a_n; sl_n = 0
  for i = n − 1 : −1 : 0
    %% [ph_i, pl_i] = [sh_{i+1}, sl_{i+1}] ⊗ x
    [th, tl] = 2Prod(sh_{i+1}, x)
    tl = sl_{i+1} ⊗ x ⊕ tl
    [ph_i, pl_i] = Fast2Sum(th, tl)
    %% [sh_i, sl_i] = [ph_i, pl_i] ⊕ a_i
    [th, tl] = 2Sum(ph_i, a_i)
    tl = tl ⊕ pl_i
    [sh_i, sl_i] = Fast2Sum(th, tl)
  end
  r̂ = sh_0
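For comparison, here is a Python sketch of DDHorner (function names are ours): the running sum is carried as a double-double pair (sh, sl) and renormalized with Fast2Sum after every operation, which serializes the loop body.

```python
def two_sum(a, b):
    # a + b = x + y exactly, x = fl(a + b)
    x = a + b
    z = x - a
    return x, (a - (x - z)) + (b - z)

def fast_two_sum(a, b):
    # Dekker's Fast2Sum: 3 flops, assumes |a| >= |b|
    s = a + b
    return s, b - (s - a)

def two_prod(a, b):
    # a * b = x + y exactly, x = fl(a * b)
    x = a * b
    c = 134217729.0 * a; ah = c - (c - a); al = a - ah
    c = 134217729.0 * b; bh = c - (c - b); bl = b - bh
    return x, al * bl - (((x - ah * bh) - al * bh) - ah * bl)

def dd_horner(a, x):
    """Horner scheme carrying a double-double accumulator (sh, sl)."""
    sh, sl = a[-1], 0.0
    for ai in reversed(a[:-1]):
        # [ph, pl] = [sh, sl] * x
        th, tl = two_prod(sh, x)
        tl = sl * x + tl
        ph, pl = fast_two_sum(th, tl)
        # [sh, sl] = [ph, pl] + ai
        th, tl = two_sum(ph, ai)
        tl = tl + pl
        sh, sl = fast_two_sum(th, tl)
    return sh
```

Note how each line of the loop body consumes the result of the previous one: the renormalizations create a long dependence chain, which is exactly the ILP argument developed below.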
Difference between DDHorner and CompHorner
function r̄ = CompHorner'(P, x)
  s_n = a_n; c_n = 0
  for i = n − 1 : −1 : 0
    [p_i, π_i] = 2Prod(s_{i+1}, x)
    t_i = c_{i+1} ⊗ x ⊕ π_i
    [s_i, σ_i] = 2Sum(p_i, a_i)
    c_i = t_i ⊕ σ_i
  end
  r̄ = s_0 ⊕ c_0

function r̂ = DDHorner(P, x)
  sh_n = a_n; sl_n = 0
  for i = n − 1 : −1 : 0
    %% [ph_i, pl_i] = [sh_{i+1}, sl_{i+1}] ⊗ x
    [th, tl] = 2Prod(sh_{i+1}, x)
    tl = sl_{i+1} ⊗ x ⊕ tl
    [ph_i, pl_i] = Fast2Sum(th, tl)
    %% [sh_i, sl_i] = [ph_i, pl_i] ⊕ a_i
    [th, tl] = 2Sum(ph_i, a_i)
    tl = tl ⊕ pl_i
    [sh_i, sl_i] = Fast2Sum(th, tl)
  end
  r̂ = sh_0
Difference in the data flow graph
[Figure: data flow graphs of the inner loops of CompHorner and DDHorner; the nodes are floating point operations (+, −, ×) and Veltkamp splittings, the edges are data dependences.]

We represent all the data dependences in the inner loop of each algorithm.
There is more parallelism among floating point operations in CompHorner than in DDHorner.
Thus more potential ILP, and greater practical performance!