
Chapter 1

NUMERICAL ANALYSIS

When a mathematical problem can be solved analytically, its solution may be exact, but more frequently there may not be a known method of obtaining its solution, e.g.,

$$\int_0^t \frac{e^{x^2}}{1 + x^2}\,dx, \qquad -1 \le t \le 1$$

is difficult to solve. There are many examples whose solutions by analytical methods are either impossible or so complex that they are quite unsuitable for practical purposes. In this situation, the only way of obtaining an idea of the behavior of a solution is to approximate the problem in such a manner that the numbers representing the solution can be produced. The process of obtaining a solution is to reduce the original problem to a repetition of the same step or series of steps so that computations become automatic. Such a process is called a numerical method, and a numerical method which can be used to solve a problem will be called an algorithm. An algorithm is a complete and unambiguous set of procedures leading to the solution of a mathematical problem. The selection or construction of appropriate algorithms properly falls within the discipline of numerical analysis. Having decided on a specific algorithm or set of algorithms for solving the problem, numerical analysts should consider all the sources of error that may affect the results. They must consider how much accuracy is required, estimate the magnitude of the round-off and discretization errors, determine an appropriate step size or the number of iterations required, provide for adequate checks on accuracy, and make allowance for corrective action in case of non-convergence.

Numerical analysis is a way to do higher mathematics problems on a computer, a technique widely used by scientists and engineers to solve their problems.

Before starting, we consider methods for representing numbers on computers and the errors introduced by these representations.


1.0.1 The Representation of Integers

In everyday life, we use numbers based on the decimal system. Thus the number 257, for example, is expressible as

$$257 = 2 \cdot 100 + 5 \cdot 10 + 7 \cdot 1 = 2 \cdot 10^2 + 5 \cdot 10^1 + 7 \cdot 10^0$$

We call 10 the base of this system. Any integer is expressible as a polynomial in the base 10 with integral coefficients between 0 and 9. We use the notation

$$N = (a_n a_{n-1} a_{n-2} \cdots a_0)_{10} = a_n \cdot 10^n + a_{n-1} \cdot 10^{n-1} + a_{n-2} \cdot 10^{n-2} + \cdots + a_0 \cdot 10^0$$

Modern computers read pulses sent by electrical components. The state of an electrical impulse is either on or off. It is therefore convenient to represent numbers in computers in the binary system. Here the base is 2, and the integer coefficients may take the values 0 and 1. A nonnegative integer $N$ will be represented in the binary system as

$$N = (a_n a_{n-1} a_{n-2} \cdots a_0)_2 = a_n \cdot 2^n + a_{n-1} \cdot 2^{n-1} + a_{n-2} \cdot 2^{n-2} + \cdots + a_0 \cdot 2^0$$

where the coefficients $a_k$ are either 0 or 1. Note that $N$ is again represented as a polynomial, but now in the base 2. Many computers used in scientific work operate internally in the binary system. Users of computers, however, prefer to work in the more familiar decimal system. The computer then converts their inputs to base 2 (or perhaps base 16), performs base-2 arithmetic, and finally translates the answer into base 10 before printing it out. It is therefore necessary to have some means of converting from decimal to binary when submitting information to the computer, and from binary to decimal for output purposes. Conversion of a binary number to decimal may be accomplished from the above definition as

$$(11)_2 = 1 \cdot 2^1 + 1 \cdot 2^0 = 3$$
$$(1101)_2 = 1 \cdot 2^3 + 1 \cdot 2^2 + 0 \cdot 2^1 + 1 \cdot 2^0 = 13$$

and conversion of a decimal number to binary as

$$187 = (187)_{10} = 1 \cdot 10^2 + 8 \cdot 10^1 + 7 \cdot 10^0$$
$$= (1)_2 \cdot (1010)_2^2 + (1000)_2 \cdot (1010)_2^1 + (111)_2 \cdot (1010)_2^0$$
$$= (1010)_2 \cdot \big((1010)_2 + (1000)_2\big) + (111)_2$$
$$= (1010)_2 \cdot (10010)_2 + (111)_2 = (10110100)_2 + (111)_2 = (10111011)_2$$
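A sketch of both conversion directions in Python; the function names are illustrative, not from the text:

```python
# A sketch of decimal <-> binary conversion; names are illustrative.

def decimal_to_binary(n: int) -> str:
    """Convert a nonnegative decimal integer to its binary digit string
    by repeated division by 2."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % 2))  # remainder is the next binary digit
        n //= 2
    return "".join(reversed(digits))

def binary_to_decimal(bits: str) -> int:
    """Evaluate (a_n ... a_0)_2 as a polynomial in the base 2 (Horner's rule)."""
    value = 0
    for bit in bits:
        value = value * 2 + int(bit)
    return value

print(decimal_to_binary(187))         # 10111011
print(binary_to_decimal("10111011"))  # 187
```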


However, if we look into machine languages, we soon realize that other number systems, particularly the octal and hexadecimal systems, are also used. The octal and hexadecimal systems are close relatives of the binary system and can be translated to and from binary easily. Expressions in octal and hexadecimal are shorter than in binary, so they are easier for humans to read and understand. Hexadecimal also provides more efficient use of memory space for real numbers.

The octal number system, using the base 8, presents a kind of compromise between the computer-preferred binary and the people-preferred decimal system. It is easy to convert from octal to binary and back, since three binary digits make one octal digit. To convert from octal to binary, one merely replaces all octal digits by their binary equivalents; thus

$$187 = (187)_{10} = 1 \cdot 10^2 + 8 \cdot 10^1 + 7 \cdot 10^0$$
$$= (1)_8 \cdot (12)_8^2 + (10)_8 \cdot (12)_8^1 + (7)_8 \cdot (12)_8^0$$
$$= (12)_8 \cdot \big((12)_8 + (10)_8\big) + (7)_8$$
$$= (12)_8 \cdot (22)_8 + (7)_8 = (264)_8 + (7)_8 = (273)_8$$
$$= (2\;7\;3)_8 = (010\;111\;011)_2$$
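Because each octal digit maps to exactly three binary digits, the conversion is pure digit substitution; a minimal sketch:

```python
# Octal -> binary as pure digit substitution (three bits per octal digit).

OCT_TO_BIN = {str(d): format(d, "03b") for d in range(8)}

def octal_to_binary(octal_digits: str) -> str:
    return " ".join(OCT_TO_BIN[d] for d in octal_digits)

print(octal_to_binary("273"))  # 010 111 011
```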

    1.0.2 The Representation of Fractions

If $x$ is a positive real number, then its integral part $x_I$ is the largest integer less than or equal to $x$, while

$$x_F = x - x_I$$

is its fractional part. The fractional part can always be written as a decimal fraction:

$$x_F = \sum_{k=1}^{\infty} b_k \, 10^{-k}$$

where each $b_k$ is a nonnegative integer less than 10. If $b_k = 0$ for all $k$ greater than a certain integer, then the fraction is said to terminate. Thus

$$\frac{1}{4} = 0.25 = 2 \cdot 10^{-1} + 5 \cdot 10^{-2}$$

is a terminating decimal fraction since $b_k = 0$ for all $k \ge 3$, while

$$\frac{1}{3} = 0.333\bar{3} = 3 \cdot 10^{-1} + 3 \cdot 10^{-2} + 3 \cdot 10^{-3} + \cdots$$


is not. Here the symbol $\bar{3}$ means that the digit 3 is repeated forever to form an infinite decimal. If the integral part of $x$ is given as a decimal integer by

$$x_I = (a_n a_{n-1} a_{n-2} \cdots a_0)_{10}$$

and the fractional part is given by

$$x_F = \sum_{k=1}^{\infty} b_k \, 10^{-k}$$

then $x$ is obtained by writing the two parts one after the other, separated by a point, the decimal point:

$$x = (a_n a_{n-1} a_{n-2} \cdots a_0 \,.\, b_1 b_2 b_3 \cdots)_{10}$$

Completely analogously, one can write the fractional part of $x$ as a binary fraction:

$$x_F = \sum_{k=1}^{\infty} b_k \, 2^{-k}$$

where each $b_k$ is a nonnegative integer less than 2, i.e., either 0 or 1. If the integral part of $x$ is given by the binary integer

$$x_I = (a_n a_{n-1} a_{n-2} \cdots a_0)_2$$

then we write

$$x = (a_n a_{n-1} a_{n-2} \cdots a_0 \,.\, b_1 b_2 b_3 \cdots)_2$$

using a binary point.

The binary fraction $(.b_1 b_2 b_3 \cdots)_2$ for a given number $x_F$ between zero and one can be calculated as follows. If

$$x_F = \sum_{k=1}^{\infty} b_k \, 2^{-k}$$

then

$$2 x_F = \sum_{k=1}^{\infty} b_k \, 2^{-k+1} = b_1 + \sum_{k=1}^{\infty} b_{k+1} \, 2^{-k}$$

Hence $b_1$ is the integral part of $2 x_F$, while

$$2 x_F - b_1 = \sum_{k=1}^{\infty} b_{k+1} \, 2^{-k} = (2 x_F)_F$$

Repeating the argument,

$$2 (2 x_F)_F = b_2 + \sum_{k=1}^{\infty} b_{k+2} \, 2^{-k}$$


so $b_2$ is the integral part of $2 (2 x_F)_F$, and

$$2 (2 x_F)_F - b_2 = \sum_{k=1}^{\infty} b_{k+2} \, 2^{-k} = (2 (2 x_F)_F)_F$$

Therefore, repeating this procedure, we find that $b_3$ is the integral part of $2 (2 (2 x_F)_F)_F$, and so on.

    Example:-

If $x = 0.625 = x_F$, then

$$2(0.625) = 1.25, \quad \text{so } b_1 = 1$$
$$2(0.25) = 0.5, \quad \text{so } b_2 = 0$$
$$2(0.5) = 1, \quad \text{so } b_3 = 1$$

and all further $b_k$'s are zero. Hence

$$0.625 = (.101)_2$$
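The doubling procedure translates directly into a short routine; a sketch, where the helper name and the digit cap are assumptions:

```python
# A sketch of the doubling procedure: the integral part of 2*xF is the
# next binary digit, and the fractional part is carried forward.
# The digit cap guards against non-terminating fractions.

def fraction_to_binary(x_f: float, max_digits: int = 20) -> str:
    bits = []
    for _ in range(max_digits):
        if x_f == 0:
            break
        x_f *= 2
        bit = int(x_f)     # b_k = integral part of 2*xF
        bits.append(str(bit))
        x_f -= bit         # keep (2*xF)_F for the next step
    return "." + "".join(bits)

print(fraction_to_binary(0.625))  # .101
```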

Inversely, if $x = (.101)_2$, then in the decimal system, where the base is 10, the decimal fraction $(.b_1 b_2 b_3 \cdots)_{10}$ for a given number $x_F$ between zero and one can be calculated as follows. If

$$x_F = \sum_{k=1}^{\infty} b_k \, 10^{-k},$$

then, with $x_F = (.101)_2$, we multiply this number by $10 = (1010)_2$ instead of 2, i.e.,

$$10 \times x_F = 10 \times (.101)_2 = (1010)_2 \times (.101)_2 = (110.010)_2.$$

So the integral part of $10 \times x_F$ is $(110)_2 = 6$, i.e., $b_1 = 6$, and

$$(10 \times x_F)_F = (.010)_2, \qquad 10 \times (10 \times x_F)_F = (1010)_2 \times (.010)_2 = (10.10)_2.$$

Here the integral part of $10 \times (10 \times x_F)_F$ is $(10)_2 = 2$, i.e., $b_2 = 2$, and

$$(10 \times (10 \times x_F)_F)_F = (.10)_2, \qquad 10 \times (.10)_2 = (1010)_2 \times (.10)_2 = (101.0)_2.$$

The integral part is now $(101)_2 = 5$, i.e., $b_3 = 5$, and the remaining fractional part is $(.0)_2$, with

$$10 \times (.0)_2 = (1010)_2 \times (.0)_2 = (0.0)_2 = 0,$$

showing that $b_4 = 0$. Hence all subsequent $b_k$'s are zero. This shows that

$$(.101)_2 = 0.625$$


Note that if $x_F$ is a terminating binary fraction with $n$ digits, then it is also a terminating decimal fraction with $n$ digits, since

$$(.1)_2 = 0.5$$

We shall not go further into the messy field of binary representation arithmetic and its pitfalls, because much depends on the machine used, on the programs supplied by the computer manufacturer, and on the computer center; but it should be clear that the system of binary representation of numbers is going to affect our answers in many ways.

    1.1 The Three Number Systems

Besides the various bases for representing numbers (decimal, binary, and octal), there are also three distinct number systems that are used in computing machines.

First, there are the integers, or counting numbers (for example, 0, 1, 2, 3, ...), which are used to index and count and have limited usage in numerical analysis. Usually, they have the range from 0 to the largest number that can be contained in the machine's index registers.

Second, there are the fixed-point numbers, for example

367.143 258 765
593,245.678 953
0.001 236 754 56

The fixed-point number system is the one that the programmer has implicitly used during much of his own calculation, and it is the one with which he is most familiar. Perhaps the only feature that is different in hand and machine calculations is that the machine always carries the same number of digits, whereas in hand calculation the user often changes the number of figures he carries to fit the current needs of the problem.

Third, there is the floating-point number system, which is the one used in almost all practical scientific and engineering computations. This number system differs in significant ways from the fixed-point number system, and we must be aware of these differences at many stages of a long computation. Typically, the computer word length includes both the mantissa and the exponent; thus the number of digits in the mantissa of a floating-point number is less than in that of a fixed-point number.

    1.1.1 Floating-Point Arithmetic

Scientific and engineering calculations are usually carried out in floating-point arithmetic. To examine round-off error in detail, we need to understand how numeric quantities are represented in computers. In nearly all cases, numbers are stored as floating-point quantities: the computer has a finite set of values from which it chooses one to store as an approximation to the real number. The term real numbers refers to the continuous (and infinite) set of numbers on the number line. When printed as a number with a decimal point, a value is either fixed-point or floating-point, in contrast to integers.

Floating-point numbers have three parts:

1. the sign (which requires one bit);

2. the fraction part, often called the mantissa but better characterized by the name significand;

3. the exponent part, often called the characteristic.

The three parts of the number have a fixed total length that is often 32 or 64 bits (sometimes even more). The fraction part uses most of these bits, perhaps 23 to as many as 52 bits, and that number determines the precision of the representation. The exponent part uses 7 to as many as 11 bits, and this number determines the range of the values.

The general form of a floating-point number is

$$\pm\,.a_1 a_2 a_3 \cdots a_p \times B^e$$

where $a_1 \neq 0$ and the $a_i$ are digits or bits with values from zero to $B - 1$, and

$B$ = the number base that is used, usually 2, 16, or 10;
$p$ = the number of significand bits (digits), that is, the precision;
$e$ = an integer exponent, ranging from $E_{\min}$ to $E_{\max}$, with the values going from negative ($E_{\min}$) to positive ($E_{\max}$).

The significand bits (digits) constitute the fractional part of the number. In almost all cases, numbers are normalized, meaning that the fraction digits are shifted and the exponent adjusted so that $a_1$ is nonzero, e.g.,

$$27.39 \to +.2739 \times 10^2; \qquad -0.00124 \to -.1240 \times 10^{-2}; \qquad 37000 \to +.3700 \times 10^5.$$

Observe that we have normalized the fractions: the first fraction digit is nonzero. Zero is a special case; it usually has a fraction part with all zeros and a zero exponent. This kind of zero is not normalized and never can be. In hand calculators, the base is usually 10; in computers the base is often 2, but sometimes a base of 16 is used. Most computers permit two or even three types of numbers:
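A sketch of the sign/significand/exponent split for base $B = 2$, using Python's math.frexp, which returns $x = m \cdot 2^e$ with $0.5 \le |m| < 1$, i.e., a normalized binary fraction whose first digit is nonzero:

```python
# Decompose floats into normalized significand and exponent (base 2).

import math

for x in (27.39, -0.00124, 37000.0):
    m, e = math.frexp(x)
    print(f"{x:>12} = {m:+.10f} * 2**{e}")
```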


1. single precision, which uses the letter E in the exponent and is usually equivalent to seven to nine significant decimal digits;

2. double precision, which uses the letter D in the exponent instead of E and varies from 14 to 29 significant decimal digits, but is typically about 16 or 17;

3. extended precision, which may be equivalent to 19 to 20 significant decimal digits.

Calculation in double precision usually doubles the storage requirements and more than doubles the running time as compared with single precision.

Method            Largest Number    Smallest Number
IEEE    single    1.701E38          1.755E-38
        double    8.988E307         2.225E-308
        extended  6E4931            3E-4931
VAX     single    1.701E38          5.877E-39
        double-1  1.701E38          5.877E-39
        double-2  8.988E307         1.123E-308
        extended  6E4931            1E-4931
IBM     single    7.237E75          8.636E-78
        double    7.237E75          8.636E-78
        extended  7.237E75          8.636E-78

The finite range of the exponent is also a source of trouble, namely, what are called overflow and underflow, which refer respectively to numbers exceeding the largest-sized and the smallest-sized (non-zero) numbers that can be represented within the system.

It should be evident that we can replace an underflow by a zero and often not go far wrong. It is less safe to replace a positive overflow by the largest number that the system has (to prevent some subsequent overflows due to future additions).

We may wonder how, in actual practice, with a range of $10^{-38}$ to $10^{38}$ or more, we can have trouble with overflow and underflow.

Numerical methods provide estimates that are very close to the exact analytical solutions; obviously, an error is introduced into the computation. This error is not a human error, such as a blunder, mistake, or oversight, but rather a discrepancy between the exact and the approximate (computed) values. In fact, numerical analysis is a vehicle to study errors in computations. It is not a static discipline. The continuous change in this field is to devise algorithms which are both fast and accurate. These algorithms may become obsolete and may be replaced by algorithms that are more powerful. In the practice of numerical analysis it is important to be aware that computed solutions are not exact mathematical solutions, but numerical methods should be sufficiently accurate (accuracy is the number of digits to which an answer is correct), or unbiased, to meet the requirements of a particular scientific problem, and they also should be precise enough (precision is the number of digits in which a number is expressed, irrespective of the correctness of the digits). The precision of a numerical solution can be diminished in several subtle ways. Understanding these difficulties can often guide the practitioner in the proper implementation and/or development of numerical algorithms.

    1.2 Error Analysis

Error analysis is the study and evaluation of error. The accuracy of any computation is always of great importance. Every floating-point operation in a computational process may give rise to an error which, once generated, may then be amplified or reduced in subsequent operations. An error in a numerical computation is simply the difference between the actual (true) value of a quantity and its computed (approximate) value. There are three common ways to express the size of the error in a computed result: absolute error, relative error, and percentage error.

Suppose that $x^*$ is an approximation (computed value) to $x$. The error is

$$\epsilon = x - x^*.$$

    1.2.1 Absolute Error:-

The absolute error of a given result is frequently used as a measure of accuracy; the conventional definition is

$$\text{absolute error} = |\text{true value} - \text{approximate value}|, \qquad E_a = |x - x^*|$$

However, a given error is usually much more serious when the magnitude of the true value is small. For example, $1036.52 \pm 0.010$ is accurate to five significant digits and is frequently of more than adequate precision, while $0.005 \pm 0.010$ is a clear disaster.

    1.2.2 Relative Error:-

The relative error is defined as

$$\text{relative error} = \frac{|\text{true value} - \text{approximate value}|}{|\text{true value}|}$$


$$E_r = \frac{E_a}{|x|}, \qquad x \neq 0$$

If the actual value is not known, then

$$E_r = \frac{E_a}{|x^*|}, \qquad x^* \neq 0$$

is often a better indicator of the accuracy. Relative error is more independent of the scale of the value, a desirable attribute. This is particularly so when the actual value is either very small or very large. When the true value is zero, the relative error is undefined. It follows that the round-off error due to the finite fraction length in floating-point numbers is more nearly constant when expressed as relative error than when expressed as absolute error. Observe that the loss of significant digits when nearly equal floating-point numbers are subtracted produces a particularly severe relative error.

    Examples:-

1. If $x = 0.3000 \times 10^1$ and $x^* = 0.3100 \times 10^1$, the absolute error is $0.1$ and the relative error is $0.3333 \times 10^{-1}$.

2. If $x = 0.3000 \times 10^{-3}$ and $x^* = 0.3100 \times 10^{-3}$, the absolute error is $0.1 \times 10^{-4}$ and the relative error is $0.3333 \times 10^{-1}$.

3. If $x = 0.3000 \times 10^4$ and $x^* = 0.3100 \times 10^4$, the absolute error is $0.1 \times 10^3$ and the relative error is $0.3333 \times 10^{-1}$.

This example shows that the same relative error, $0.3333 \times 10^{-1}$, occurs for widely varying absolute errors. As a measure of accuracy, the absolute error may be misleading and the relative error more meaningful.

4. Consider the following three cases.

(a) Let $x = 3.141592$ and $x^* = 3.14$; then the absolute error is

$$E_{a_x} = |x - x^*| = |3.141592 - 3.14| = 0.001592$$

and the relative error is

$$E_{r_x} = \frac{|0.001592|}{|3.141592|} = 0.000507$$


(b) Let $y = 1{,}000{,}000$ and $y^* = 999{,}996$; then the absolute error is

$$E_{a_y} = |y - y^*| = |1{,}000{,}000 - 999{,}996| = 4$$

and the relative error is

$$E_{r_y} = \frac{|4|}{|1{,}000{,}000|} = 0.000004$$

(c) Let $z = 0.000012$ and $z^* = 0.000009$; then the absolute error is

$$E_{a_z} = |z - z^*| = |0.000012 - 0.000009| = 0.000003$$

and the relative error is

$$E_{r_z} = \frac{|0.000003|}{|0.000012|} = 0.25$$

In case (a) there is not too much difference between $E_{a_x}$ and $E_{r_x}$, and either could be used to determine the accuracy of $x^*$. In case (b) the value of $y$ is of magnitude $10^6$, the error $E_{a_y}$ is large, and the relative error $E_{r_y}$ is small. We would call $y^*$ a good approximation to $y$. In case (c) $z$ is of magnitude $10^{-6}$ and the error $E_{a_z}$ is the smallest of all three cases, but the relative error $E_{r_z}$ is the largest. In terms of percentage, it amounts to 25%, and thus $z^*$ is a bad approximation to $z$.
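A quick numerical check of cases (a)-(c); the helper names are illustrative only:

```python
# Compute absolute and relative errors for the three cases above.

def abs_err(true, approx):
    return abs(true - approx)

def rel_err(true, approx):
    return abs(true - approx) / abs(true)  # undefined when true == 0

for true, approx in [(3.141592, 3.14), (1_000_000, 999_996),
                     (0.000012, 0.000009)]:
    print(f"Ea = {abs_err(true, approx):.6g}, "
          f"Er = {rel_err(true, approx):.6g}")
```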

1.2.3 Percentage Error:-

Relative error expressed as a percentage is called the percentage error, defined by

$$PE = 100 \times E_r$$

In order to investigate the effect of the total error in a method, we often compute an error bound, which is a limit on how large (or small) the error can be.

    1.2.4 Significant Digits

In considering rounding errors, it is necessary to be precise in the usage of approximate digits. A significant digit in an approximate number is a digit which gives reliable information about the size of the number. In other words, significant digits are used to express accuracy, i.e., how many digits in the number have meaning. The significant digits of a (measured or calculated) quantity are the meaningful digits in it. There are conventions, which you should learn and follow, for how to express numbers so as to properly indicate their significant digits.

Any digit that is not zero is significant. Thus 549 has three significant digits and 1.892 has four significant digits.


Zeros between nonzero digits are significant. Thus 4023 has four significant digits.

Zeros to the left of the first nonzero digit are not significant. Thus 0.000034 has only two significant digits. This is more easily seen if it is written as $3.4 \times 10^{-5}$.

For numbers with decimal points, zeros to the right of a nonzero digit are significant. Thus 2.00 has three significant digits and 0.050 has two significant digits. For this reason it is important to keep the trailing zeros to indicate the actual number of significant digits.

For numbers without decimal points, trailing zeros may or may not be significant. Thus, 400 indicates only one significant digit. To indicate that the trailing zeros are significant, a decimal point must be added. For example, 400. has three significant digits, while $4 \times 10^2$ has one significant digit.

Exact numbers have an infinite number of significant digits. For example, if there are two oranges on a table, then the number of oranges is 2.000... . Defined numbers are also like this. For example, the number of centimeters per inch (2.54) has an infinite number of significant digits, as does the speed of light (299792458 m/s).

There are also specific rules for how to consistently express the uncertainty associated with a number. In general, the last significant digit in any result should be of the same order of magnitude (i.e., in the same decimal position) as the uncertainty. Also, the uncertainty should be rounded to one or two significant digits. Always work out the uncertainty after finding the number of significant digits for the actual measurement. For example,

$$9.82 \pm 0.02, \qquad 10.0 \pm 1.5, \qquad 4 \pm 1$$

The following numbers are all incorrect:

$9.82 \pm 0.02385$ is wrong, but $9.82 \pm 0.02$ is fine;
$10.0 \pm 2$ is wrong, but $10.0 \pm 2.0$ is fine;
$4 \pm 0.5$ is wrong, but $4.0 \pm 0.5$ is fine.

In practice, when doing mathematical calculations, it is a good idea to keep one more digit than is significant to reduce rounding errors. But in the end, the answer must be expressed with only the proper number of significant digits. After addition or subtraction, the result is significant only to the place determined by the largest last significant place in the original numbers. For example,

$$89.332 + 1.1 = 90.432$$


should be rounded to get 90.4 (the tenths place is the last significant place in 1.1). After multiplication or division, the number of significant digits in the result is determined by the original number with the smallest number of significant digits. For example,

$$(2.80)(4.5039) = 12.61092$$

should be rounded off to 12.6 (three significant digits, like 2.80).
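A sketch of rounding to a given number of significant digits; the round_sig helper is an assumption, not something from the text:

```python
# Round a value to a chosen number of significant digits.

import math

def round_sig(x: float, digits: int) -> float:
    if x == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(x)))  # place of the leading digit
    return round(x, digits - 1 - exponent)

print(round_sig(89.332 + 1.1, 3))   # 90.4
print(round_sig(2.80 * 4.5039, 3))  # 12.6
```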

1.2.5 Loss of Significance and Error Propagation: Condition and Instability

One of the most common (and often avoidable) ways of increasing the importance of an error is commonly called loss of significant digits. If $X$ is an approximation to $x$, then we say that $X$ approximates $x$ to $r$ significant $\beta$-digits provided the absolute error $|x - X|$ is at most $\frac{1}{2}$ in the $r$th significant $\beta$-digit of $x$. This can be expressed as

$$|x - X| \le \tfrac{1}{2}\,\beta^{s-r+1}$$

with $s$ the largest integer such that $\beta^s \le |x|$. For instance, $X = 3$ agrees with $x = \pi$ to one significant (decimal) digit, while $X = \frac{22}{7} = 3.1428\ldots$ is correct to three significant digits (as an approximation to $\pi$).

Once an error is committed, it contaminates subsequent results. This error propagation through subsequent calculations is conveniently studied in terms of the two related concepts of condition and instability.

The word condition is used to describe the sensitivity of the function value $f(x)$ to changes in the argument $x$. The condition is usually measured by the maximum relative change in the function value $f(x)$ caused by a unit relative change in the argument $x$.

An example to illustrate the avoidance of loss of significance follows.

Example:- Compare the results of computing $f(500)$ and $g(500)$ using six digits and rounding, where

$$f(x) = x\left[\sqrt{x+1} - \sqrt{x}\right], \qquad g(x) = \frac{x}{\sqrt{x+1} + \sqrt{x}}$$

$$f(500) = 500\left[\sqrt{501} - \sqrt{500}\right] = 500\,[22.3830 - 22.3607] = 500 \times 0.0223 = 11.1500$$

$$g(500) = \frac{500}{\sqrt{501} + \sqrt{500}} = \frac{500}{22.3830 + 22.3607} = \frac{500}{44.7437} = 11.1748$$

The function $g(x)$ is algebraically equivalent to $f(x)$, as

$$f(x) = x\left[\sqrt{x+1} - \sqrt{x}\right] \cdot \frac{\sqrt{x+1} + \sqrt{x}}{\sqrt{x+1} + \sqrt{x}} = \frac{x}{\sqrt{x+1} + \sqrt{x}}$$

The answer $g(500) = 11.1748$ involves less error and is the same as that obtained by rounding the true answer $11.174753\ldots$ to six digits.
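The six-digit experiment can be reproduced by forcing every intermediate result to six significant digits; a sketch, where the fl6 helper is an assumed stand-in for six-digit rounding:

```python
# Emulate six-significant-digit arithmetic to expose the cancellation.

import math

def fl6(x: float) -> float:
    return float(f"{x:.6g}")  # round to six significant digits

x = 500.0
f = fl6(x * fl6(fl6(math.sqrt(x + 1)) - fl6(math.sqrt(x))))
g = fl6(x / fl6(fl6(math.sqrt(x + 1)) + fl6(math.sqrt(x))))
print(f, g)  # 11.15 11.1748 -- the subtraction costs three digits
print(x / (math.sqrt(x + 1) + math.sqrt(x)))  # true value 11.174753...
```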

All that is required is that the person computing should use his imagination and foresee what might happen before he writes the program for a machine. As a simple rule, try to avoid subtractions (even if they appear as a sum but with the sign of one of the terms negative and the other positive).

    Example:-

$$(x + \epsilon)^{2/3} - x^{2/3} = \frac{\left[(x+\epsilon)^{2/3} - x^{2/3}\right]\left[(x+\epsilon)^{4/3} + (x+\epsilon)^{2/3} x^{2/3} + x^{4/3}\right]}{(x+\epsilon)^{4/3} + (x+\epsilon)^{2/3} x^{2/3} + x^{4/3}}$$

$$= \frac{\left[(x+\epsilon)^{2/3}\right]^3 - \left[x^{2/3}\right]^3}{(x+\epsilon)^{4/3} + (x+\epsilon)^{2/3} x^{2/3} + x^{4/3}} = \frac{2x\epsilon + \epsilon^2}{(x+\epsilon)^{4/3} + (x+\epsilon)^{2/3} x^{2/3} + x^{4/3}}$$

Other methods for rearranging an expression

Simple rearrangements will not always produce a satisfactory expression for computer evaluation, and it is necessary to use other devices that occur in the calculus course.

For small positive $x$:

$$1 - e^{-x} = x - \frac{x^2}{2!} + \frac{x^3}{3!} - \cdots$$

$$\ln(1 - x) = -\left(x + \frac{x^2}{2} + \frac{x^3}{3} + \cdots\right)$$


$$\frac{\tan x - \sin x}{x^3} = \frac{\left(x + \frac{x^3}{3} + \frac{2x^5}{15} + \cdots\right) - \left(x - \frac{x^3}{6} + \frac{x^5}{120} - \cdots\right)}{x^3}$$

$$= \frac{\left(\frac{1}{3} + \frac{1}{6}\right)x^3 + \left(\frac{2}{15} - \frac{1}{120}\right)x^5 + \cdots}{x^3} = \frac{1}{2} + \frac{x^2}{8} + \cdots$$

Another technical device of less practical use but of great theoretical value is the mean-value theorem

$$f(b) - f(a) = (b - a)\,f'(\xi), \qquad a < \xi < b$$

Whereas the value of $\xi$ is not known and in principle can be anywhere inside the interval $(a, b)$, it is reasonable to suspect that the choice of the midvalue is as good as any other value if nothing else is known about the function.

    Example:-

For $x$ small with respect to $a$,

$$\ln(a + x) - \ln a = \ln\left(1 + \frac{x}{a}\right)$$

Also, by using the mean-value theorem with the midvalue for $\xi$,

$$\ln(a + x) - \ln a = \frac{x}{a + \xi} \approx \frac{x}{a + \frac{x}{2}}$$

    The main sources of error are

    Gross errors

    Errors in original data

    Round-off errors

    Truncation errors

They all cause the same effect: diversion from the exact answer. Some errors are small and may be neglected, while others may be devastating if overlooked.


1.2.6 Gross Errors

When humans are involved in programming, operations, preparing the input, and interpreting the output, blunders or gross errors do occur rather more frequently than we like to admit. A few examples of these errors are:

Poor definition of the problem,
Choice of an inappropriate model,
Approximations made in representing physical processes by mathematical operations,
Misreading or misquoting the digits, particularly in the interchange of adjacent digits,
Use of an inaccurate formula (algorithm) to solve a particular problem, and
Use of inaccurate data.

These can be avoided by taking enough care, coupled with a careful examination of the results for reasonableness. Sometimes a test run with known results is worthwhile, but this is no guarantee of freedom from foolish error.

    1.2.7 Errors in Original Data

Real-world problems, in which an existing or proposed physical situation is modeled by a mathematical equation, will nearly always have coefficients that are imperfectly known. The reason is that the problems often depend on measurements of doubtful accuracy. Further, the model itself may not reflect the behavior of the situation perfectly. We can do nothing to overcome such errors by any choice of method, but we need to be aware of such uncertainties; in particular, we may need to perform tests to see how sensitive the results are to changes in the input information. Since the reason for performing the computation is to reach some decision with validity in the real world, sensitivity analysis is of extreme importance. As Hamming says, the purpose of computing is insight, not numbers.

There are errors which arise after a mathematical formulation is obtained. They include not only computational errors in the strict sense but also those errors which arise because we substitute finite mathematical processes for infinite mathematical processes. An example of this is the substitution of the sum of a finite series for the value of a function. These are the errors of mathematical approximation. Computational errors might more appropriately be named the errors of numerical methods.

The finite representation of numbers in the machine leads to round-off errors, whereas the finite representation of processes leads to truncation errors.


1.2.8 Truncation Error

The term truncation error refers to those errors caused by the method itself, e.g., caused by the approximations used in the mathematical formula of the scheme, when a more complicated mathematical expression is replaced with a more elementary formula. The error arising from this approximation is called the truncation error. This terminology originates from the technique of replacing a complicated function with a truncated Taylor/Maclaurin series, binomial expansion, infinite geometric progression, or any other approximation. For example, the infinite Taylor series

$$e^{x^2} = 1 + \frac{x^2}{1!} + \frac{x^4}{2!} + \frac{x^6}{3!} + \cdots + \frac{x^{2n}}{n!} + \cdots$$

might be replaced with just the five terms

$$e^{x^2} \approx 1 + \frac{x^2}{1!} + \frac{x^4}{2!} + \frac{x^6}{3!} + \frac{x^8}{4!}$$

This might be done when approximating an integral numerically.

Example:-

Given that

$$I = \int_0^{1/2} e^{x^2}\,dx = 0.544987104184\ldots,$$

determine the accuracy of the approximation obtained by replacing the integrand $f(x) = e^{x^2}$ with the truncated Taylor series

$$P_4(x) = 1 + \frac{x^2}{1!} + \frac{x^4}{2!} + \frac{x^6}{3!} + \frac{x^8}{4!}.$$

Solution:-

$$I = \int_0^{1/2} e^{x^2}\,dx \approx \int_0^{1/2} \left(1 + \frac{x^2}{1!} + \frac{x^4}{2!} + \frac{x^6}{3!} + \frac{x^8}{4!}\right) dx$$

$$= \left[x + \frac{x^3}{3} + \frac{x^5}{5 \cdot 2!} + \frac{x^7}{7 \cdot 3!} + \frac{x^9}{9 \cdot 4!}\right]_0^{1/2}$$

$$= \frac{1}{2} + \frac{1}{24} + \frac{1}{320} + \frac{1}{5376} + \frac{1}{110592} = \frac{2109491}{3870720} = 0.544986720817\ldots = I^*$$

$$E_r = \frac{|I - I^*|}{|I|} = 7.03442 \times 10^{-7}$$

The approximation $I^*$ agrees with the true answer $I$ to five significant digits.
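A sketch verifying these numbers, using composite Simpson's rule as a stand-in for the true integral (an assumption; any accurate quadrature would do):

```python
# Check the truncation error of the five-term series against quadrature.

import math

def simpson(f, a, b, n=1000):  # n must be even
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + i * h) for i in range(1, n, 2))
    s += 2 * sum(f(a + i * h) for i in range(2, n, 2))
    return s * h / 3

I_true = simpson(lambda x: math.exp(x * x), 0.0, 0.5)
I_series = 1/2 + 1/24 + 1/320 + 1/5376 + 1/110592
print(I_true)    # ~0.544987104184
print(I_series)  # 0.544986720817...
print(abs(I_true - I_series) / abs(I_true))  # ~7.03e-07
```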

    Exercise


1.2.9 Rounding Errors

This is the most basic source of errors in a computer. All computing devices represent numbers, except for integers, with some imprecision. Digital computers will nearly always use floating-point numbers of fixed word length; the true values are not expressed exactly by such representations. Round-off error occurs when a calculator or computer is used to perform real-number calculations. This error arises because the arithmetic performed in a machine involves numbers with only a finite number of digits, say, $n$ significant digits, obtained by rounding off the $(n+1)$th place and dropping all digits after the $n$th, with the result that calculations are performed with approximate representations of the actual numbers. That is, the error introduced by rounding off numbers to a limited number of decimal places is called the rounding error; equivalently, the error that results from replacing a number with its floating-point form is called the rounding error.

For example,

$$\pi = 0.314159265\ldots \times 10^1$$

The five-digit floating-point form of $\pi$ using chopping is

$$0.31415 \times 10^1 = 3.1415$$

and is called the chopped floating-point representation of $\pi$. Since the sixth digit of the decimal expansion of $\pi$ is a 9, the floating-point form of $\pi$ using five-digit rounding is

$$(0.31415 + 0.00001) \times 10^1 = 3.1416$$

and is called the rounded floating-point representation of $\pi$. The error that results from replacing a number with its floating-point form is called round-off error (regardless of whether the rounding or chopping method is used).
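A sketch of five-digit rounding versus chopping with Python's decimal module; the context setup is an assumption, not from the text:

```python
# Five-digit rounding versus chopping of pi.

from decimal import Decimal, Context, ROUND_HALF_UP, ROUND_DOWN

pi = Decimal("3.14159265")
print(Context(prec=5, rounding=ROUND_HALF_UP).plus(pi))  # 3.1416
print(Context(prec=5, rounding=ROUND_DOWN).plus(pi))     # 3.1415
```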

Consider another example: when two 3-digit numbers are multiplied together, their product has either five or six places.


  0.236 × 10^1
× 0.127 × 10^1
  ------------
   1652
    472
    236
  ------------
  0.0299|72 × 10^2

Rounding off the digits after the bar (the leading discarded digit, 7, is at least 5) and normalizing gives

Answer: $0.300 \times 10^1$, with $|\text{roundoff error}| = 0.28 \times 10^{-2}$

As above, the product is $0.0299|72 \times 10^2$; chopping the digits after the bar gives

Answer: $0.299 \times 10^1$, with $|\text{roundoff error}| = 0.72 \times 10^{-2}$

When the machine drops digits without rounding, which is called chopping, this can cause serious trouble.

    serious trouble.Round-off causes trouble mainly when two numbers of about the same size are

    subtracted. As a result of the cancellation of the leading digits, the number is shifted(normalized) to the left until the first digit is not zero. This shifting can bring theround-off errors that where in the extreme right part of the number well into themiddle, if not to the extreme left. In the later steps we shall think that we have anaccurate number when we do not.

A second, more insidious trouble with round-off, especially with chopping, is the presence of internal correlations between numbers in the computation, so that, step after step, the small error is always in the same direction, and one is therefore not under the protective umbrella of the statistical average behavior.

    Example:-
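A stand-in illustration (an assumption, not the author's example) of this one-directional drift: sum 1/3 repeatedly in four-digit decimal arithmetic. Chopping loses value in the same direction at every step, so its total falls far behind the (nearly unbiased) rounded total.

```python
# Correlated chopping errors accumulate; rounded errors largely cancel.

from decimal import Decimal, Context, ROUND_DOWN, ROUND_HALF_EVEN

chop = Context(prec=4, rounding=ROUND_DOWN)
rnd = Context(prec=4, rounding=ROUND_HALF_EVEN)

term = Decimal(1) / Decimal(3)
s_chop = s_rnd = Decimal(0)
for _ in range(3000):
    s_chop = chop.add(s_chop, term)
    s_rnd = rnd.add(s_rnd, term)

print(s_chop, s_rnd, 3000 / 3)  # chopped sum lags far below ~1000
```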

    1.2.9.1 Accumulated Round-off Error & Local Round-off Errors

The round-off error present in the final result of a numerical computation is usually termed the accumulated round-off error, while the errors resulting from individual rounding or truncating operations are called local round-off errors.

    1.2.10 Error Accumulation in Computations

To investigate how error might be accumulated in computations, we proceed as follows.

1. Error Accumulation in Addition

Consider the addition of two numbers $p$ and $q$ (the true values) with approximate values $p^*$ and $q^*$ and errors $\Delta p$ and $\Delta q$ respectively, i.e.,

$$p^* = p + \Delta p \quad \text{and} \quad q^* = q + \Delta q$$

Let $z = p + q$, with error $\Delta z$, so that $z^* = z + \Delta z = p^* + q^*$. Then the sum is

$$z + \Delta z = p + \Delta p + q + \Delta q = p + q + \Delta p + \Delta q$$

$$\Delta z = \Delta p + \Delta q$$

Hence, for addition, the error in the sum is the sum of the errors of the addends. The absolute error of the sum of two numbers is at most the sum of the absolute errors of the given numbers, i.e.,

$$E_a = |\Delta z| \le |\Delta p| + |\Delta q|$$

This formula can be extended to any number of terms:

$$E_a = |\Delta z| \le |\Delta_1| + |\Delta_2| + |\Delta_3| + \cdots + |\Delta_n|$$

The relative error is calculated as

$$E_r = \frac{\text{absolute error}}{\text{sum of the given numbers}} = \frac{E_a}{|z|}$$

2. Error Accumulation in Subtraction

Let $z = p - q$, where $p > q$, and $z^* = p^* - q^*$. Then

$$z^* = z + \Delta z = (p + \Delta p) - (q + \Delta q) = (p - q) + (\Delta p - \Delta q)$$

which implies

$$\Delta z = \Delta p - \Delta q, \qquad E_a = |\Delta z| \le |\Delta p| + |\Delta q|$$


which is the same bound as above. Hence the absolute error of a difference of two numbers is at most the sum of the absolute errors of the given numbers. This formula can be extended to any number of terms:

$$E_a = |\Delta z| \le |\Delta_1| + |\Delta_2| + |\Delta_3| + \cdots + |\Delta_n|$$

The relative error is calculated as

$$E_r = \frac{E_a}{|z|}$$
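A small numerical check (not from the text) that the absolute error of a sum or difference is bounded by the sum of the operands' absolute errors:

```python
# Verify the addition/subtraction error bounds on sample values.

p, dp = 2.71828, 0.004  # true value and error of the first operand
q, dq = 3.14159, 0.002  # true value and error of the second operand
p_star, q_star = p + dp, q + dq

for z, z_star in ((p + q, p_star + q_star), (p - q, p_star - q_star)):
    print(f"Ea = {abs(z_star - z):.6f} <= {abs(dp) + abs(dq):.6f}")
```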

3. Error Accumulation in Multiplication

The propagation of error in multiplication is more complicated. Let $z = p \cdot q$ and $z^* = p^* q^*$. Then the product is

$$z^* = p^* q^* = (p + \Delta p)(q + \Delta q) = pq + p\,\Delta q + q\,\Delta p + \Delta p\,\Delta q$$

Hence, if $p$ and $q$ are larger than 1 in absolute value, the terms $p\,\Delta q$ and $q\,\Delta p$ show that there is a possibility of magnification of the original errors $\Delta p$ and $\Delta q$. Insights are gained if we look at the relative error. Rearranging the terms above gives

$$\Delta z = p^* q^* - pq = p\,\Delta q + q\,\Delta p + \Delta p\,\Delta q$$

Suppose $p \neq 0$ and $q \neq 0$; then dividing this by $p \cdot q$, we obtain

$$\frac{\Delta z}{pq} = \frac{\Delta q}{q} + \frac{\Delta p}{p} + \frac{\Delta p}{p}\,\frac{\Delta q}{q}$$

Furthermore, suppose that

$$\left|\frac{\Delta p}{p}\right| \ll 1, \qquad \left|\frac{\Delta q}{q}\right| \ll 1, \qquad \text{so that} \quad \frac{\Delta p\,\Delta q}{pq} \approx 0$$

Then the relative error satisfies

$$\frac{|\Delta z|}{|pq|} \lesssim \left|\frac{\Delta q}{q}\right| + \left|\frac{\Delta p}{p}\right|$$

This shows that the relative error in the product $p \cdot q$ is approximately bounded by the sum of the relative errors in the approximations $p^*$ and $q^*$; that is, the relative error modulus of the product of two numbers does not exceed the sum of the relative error moduli of the given numbers. For the product of $n$ numbers,

$$E_r = \left|\frac{\Delta z}{z}\right| \le \left|\frac{\Delta_1}{x_1}\right| + \left|\frac{\Delta_2}{x_2}\right| + \cdots + \left|\frac{\Delta_n}{x_n}\right|$$


4. Error Accumulation in Division

Let $z = \frac{p}{q}$, $q \neq 0$, and $z^* = \frac{p^*}{q^*}$, $q^* \neq 0$. Then

$$z^* = \frac{p + \Delta p}{q + \Delta q} = \frac{p\left(1 + \frac{\Delta p}{p}\right)}{q\left(1 + \frac{\Delta q}{q}\right)} = \frac{p}{q}\left(1 + \frac{\Delta p}{p}\right)\left(1 + \frac{\Delta q}{q}\right)^{-1}$$

Expanding with the help of the binomial theorem and ignoring products of errors, which are small, we have

$$z^* = z + \Delta z = \frac{p}{q}\left(1 + \frac{\Delta p}{p}\right)\left(1 - \frac{\Delta q}{q}\right) \approx \frac{p}{q} + \frac{\Delta p}{q} - \frac{p\,\Delta q}{q^2}$$

$$\Delta z = \frac{\Delta p}{q} - \frac{p\,\Delta q}{q^2}$$

Then

$$\frac{\Delta z}{z} = \frac{q}{p}\left(\frac{\Delta p}{q} - \frac{p\,\Delta q}{q^2}\right) = \frac{\Delta p}{p} - \frac{\Delta q}{q}$$

We have already supposed that $\left|\frac{\Delta p}{p}\right| \ll 1$ and $\left|\frac{\Delta q}{q}\right| \ll 1$; using this,

$$E_r = \left|\frac{\Delta z}{z}\right| = \left|\frac{\Delta p}{p} - \frac{\Delta q}{q}\right| \le \left|\frac{\Delta p}{p}\right| + \left|\frac{\Delta q}{q}\right|$$


Thus, the relative error of a quotient of two numbers is bounded by the sum of the relative error moduli of the dividend and divisor, and for $n$ terms

$$E_r = \left|\frac{\Delta z}{z}\right| \le \left|\frac{\Delta_1}{x_1}\right| + \left|\frac{\Delta_2}{x_2}\right| + \left|\frac{\Delta_3}{x_3}\right| + \cdots + \left|\frac{\Delta_n}{x_n}\right|$$

5. Errors of Powers and Roots

Let $z = x^n$, where the power $n$ denotes an integral or a fractional quantity. Then

$$z^* = z + \Delta z = (x + \Delta x)^n = x^n\left(1 + \frac{\Delta x}{x}\right)^n$$

Expanding with the help of the binomial theorem and neglecting the higher powers of $\frac{\Delta x}{x}$, we get

$$z^* \approx x^n\left(1 + \frac{n\,\Delta x}{x}\right) = x^n + n\,\Delta x\,x^{n-1}$$

therefore $\Delta z = n\,\Delta x\,x^{n-1}$, and then

$$\frac{\Delta z}{z} = \frac{n\,\Delta x\,x^{n-1}}{x^n} = \frac{n\,\Delta x}{x}$$

$$E_r = \left|\frac{\Delta z}{z}\right| = \left|\frac{n\,\Delta x}{x}\right| = |n|\left|\frac{\Delta x}{x}\right|$$

Thus, the relative error modulus of a factor raised to a power is the product of the modulus of the power and the relative error of the factor.
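A sketch checking the product, quotient, and power rules numerically; the sample values are arbitrary:

```python
# Compare observed relative errors with the predicted propagation rules.

def rel(true, approx):
    return abs(approx - true) / abs(true)

p, q, dp, dq = 250.0, 40.0, 0.05, 0.02
p_s, q_s = p + dp, q + dq
print(rel(p * q, p_s * q_s), dp / p + dq / q)  # product <= sum of rel errors
print(rel(p / q, p_s / q_s), dp / p + dq / q)  # quotient <= sum as well

n, x, dx = 3, 7.0, 0.001
print(rel(x**n, (x + dx)**n), n * dx / x)      # power ~ |n| * rel error of x
```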

6. Error in Function Evaluation

Let $z = f(x)$; then

$$z + \Delta z = f(x + \epsilon)$$

Using the Taylor series expansion and neglecting the higher powers of $\epsilon$, which is small, we have

$$z + \Delta z = f(x) + \epsilon f'(x)$$

or

$$\Delta z = \epsilon f'(x)$$

Therefore,

$$E_a = |\Delta z| = |\epsilon f'(x)|, \qquad E_r = \left|\frac{\Delta z}{z}\right| = \left|\frac{\epsilon f'(x)}{f(x)}\right|$$

The formula can be extended to any number of terms; e.g., if

$$z = f(x_1) + f(x_2) + f(x_3) + \cdots + f(x_n),$$

then

$$z + \Delta z = f(x_1 + \epsilon_1) + f(x_2 + \epsilon_2) + \cdots + f(x_n + \epsilon_n)$$

$$\Delta z = \epsilon_1 f'(x_1) + \epsilon_2 f'(x_2) + \cdots + \epsilon_n f'(x_n)$$

$$E_a = |\Delta z| = |\epsilon_1 f'(x_1) + \epsilon_2 f'(x_2) + \cdots + \epsilon_n f'(x_n)| \le |\epsilon_1 f'(x_1)| + |\epsilon_2 f'(x_2)| + \cdots + |\epsilon_n f'(x_n)|$$

and

$$E_r = \left|\frac{\Delta z}{z}\right| = \frac{|\epsilon_1 f'(x_1) + \epsilon_2 f'(x_2) + \cdots + \epsilon_n f'(x_n)|}{|f(x_1) + f(x_2) + \cdots + f(x_n)|} \le \frac{|\epsilon_1 f'(x_1)| + |\epsilon_2 f'(x_2)| + \cdots + |\epsilon_n f'(x_n)|}{|f(x_1) + f(x_2) + \cdots + f(x_n)|}$$
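A sketch comparing the actual error in evaluating $f$ at a perturbed argument with the first-order estimate $\epsilon f'(x)$:

```python
# First-order error estimate for function evaluation.

import math

f, f_prime = math.sin, math.cos
x, eps = 1.0, 1e-4

actual = f(x + eps) - f(x)
estimate = eps * f_prime(x)
print(actual, estimate)  # agree to first order in eps
```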

    1.2.10.1 Propagated Error

The local error at any stage of the calculation is propagated throughout the remaining part of the computation, i.e., there is an error in the succeeding steps of the process due to the occurrence of an earlier error. Propagated error is more subtle than the other errors; such errors are in addition to the local errors. Propagated error is of critical importance. If errors are magnified continuously as the method continues, eventually they will overshadow the true value, destroying its validity; we call such a method unstable. For a stable method, the desirable kind, errors made at early points die out as the method continues. Whenever possible we shall choose methods that are stable. The following definition is used to describe the propagation of error.

Definition:-

Suppose that $E(n)$ represents the growth of error after $n$ steps. If $|E(n)| \approx n\epsilon$, the growth of error is said to be linear. If $|E(n)| \approx k^n \epsilon$, the growth of error is called exponential. If $k > 1$, the exponential error grows without bound as $n \to \infty$, and if $0 < k < 1$, the exponential error diminishes to zero as $n \to \infty$.

    1.2.11 Numerical Cancellation

Accuracy is lost when two nearly equal numbers are subtracted. For example, the two numbers 9.4157233 and 9.4157227 are each accurate to eight significant digits, yet their difference, 0.0000006, is accurate to only one significant digit. Thus care should be taken to avoid such subtractions where possible, because this is a major source of error in floating-point operations. This phenomenon is also called subtractive cancellation.

In multiplying two $n$-digit numbers on a computer, a product with $2n$ digits results. Internally, double-length registers are used. The result is truncated to the length of a single register.

    1.2.12 Errors in Converting Values

The numbers that are input to a computer are ordinarily base-10 values. Thus the input must be converted to the computer's internal number base, normally base 2. This conversion itself causes some errors.
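A short illustration: the decimal input 0.1 has no exact base-2 representation, so the stored value already differs from the intended one:

```python
# Conversion error: 0.1 is not exactly representable in binary.

from decimal import Decimal

print(Decimal(0.1))      # the exact binary value actually stored
print(0.1 + 0.2 == 0.3)  # False, because of the converted operands
```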

    1.2.13 Machine eps

One important measure in computer arithmetic is how small a difference between two values the computer can recognize. This quantity is termed the computer eps, where eps stands for the Greek letter epsilon. This measure of machine accuracy is standardized by finding the smallest floating-point number that, when added to floating-point 1.000, produces a result different from 1.000. Numbers smaller than eps are effectively zero in the computer.
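The standard search for machine eps halves a candidate until adding it to 1.0 no longer changes the result; a sketch:

```python
# Find machine epsilon by bisection from 1.0 downward.

import sys

eps = 1.0
while 1.0 + eps / 2 != 1.0:
    eps /= 2
print(eps)                     # 2.220446049250313e-16 for IEEE double
print(sys.float_info.epsilon)  # the library's value, for comparison
```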

Peculiar things happen in floating-point arithmetic. For example, adding 0.001 one thousand times may not equal 1.0 exactly. In some instances, multiplying a number by unity does not reproduce the number.

    In many computations, changing the order of calculations will produce differentresults.

1.2.14 Evaluation of Functions by Series Expansion and Estimation of Errors

Taylor's series is considered a foundation of numerical analysis. It is the most important tool for deriving numerical methods and analyzing errors.


If $f(x)$ is analytic about $x = x_0$, then $f(x)$ in the neighbourhood of $x = x_0$ can be exactly represented by the Taylor series, which is the power series given by

$$f(x) = f(x_0) + (x - x_0) f'(x_0) + \frac{(x - x_0)^2}{2!} f''(x_0) + \frac{(x - x_0)^3}{3!} f'''(x_0) + \frac{(x - x_0)^4}{4!} f^{(4)}(x_0) + \cdots$$

This series is unique; that is, there is no other power series in $(x - x_0)$ to represent $f(x)$.

In practical applications, the Taylor series has to be truncated after a certain order term because it is impossible to include an infinite number of terms. If the Taylor series is truncated after the $N$th term, it is expressed as

$$f(x) = f(x_0) + h f'(x_0) + \frac{h^2}{2!} f''(x_0) + \frac{h^3}{3!} f'''(x_0) + \cdots + \frac{h^m}{m!} f^{(m)}(x_0) + \cdots + \frac{h^N}{N!} f^{(N)}(x_0) + O(h^{N+1})$$

where $h = x - x_0$ and $O(h^{N+1})$ represents the error caused by truncating the terms of order $N + 1$ and higher. We can also write the above expression as

$$f(x) = P_N(x) + R_N(x)$$

where

$$P_N(x) = f(x_0) + h f'(x_0) + \frac{h^2}{2!} f''(x_0) + \frac{h^3}{3!} f'''(x_0) + \cdots + \frac{h^N}{N!} f^{(N)}(x_0) = \sum_{k=0}^{N} \frac{h^k}{k!} f^{(k)}(x_0)$$

and

$$R_N(x) = \frac{h^{N+1}}{(N+1)!} f^{(N+1)}(\xi(x)), \qquad \xi = x_0 + \theta h$$

Here $P_N(x)$ is called the $N$th Taylor polynomial for $f$ about $x_0$, and $R_N(x)$ is called the remainder term (or truncation error) associated with $P_N(x)$. The whole error can be expressed by

$$O(h^{N+1}) = \frac{h^{N+1}}{(N+1)!} f^{(N+1)}(x_0 + \theta h), \qquad 0 < \theta < 1$$


Since $\theta$ cannot be found exactly, the error term is often approximated by setting $\theta = 0$:

$$O(h^{N+1}) \simeq \frac{h^{N+1}}{(N+1)!} f^{(N+1)}(x_0)$$

which is the leading term of the truncation terms. If $N = 1$, for example, the truncated Taylor series is

$$f(x) \approx f(x_0) + h f'(x_0), \qquad h = x - x_0$$

Including the effect of the error, it can also be expressed as

$$f(x) = f(x_0) + h f'(x_0) + O(h^2)$$

where

$$O(h^2) \simeq \frac{h^2}{2!} f''(x_0 + \theta h), \qquad 0 < \theta < 1$$

Example 4:-

Determine (a) the second and (b) the third Taylor polynomials for $f(x) = \cos(x)$ about $x_0 = 0$, and use these polynomials to approximate $\cos(0.01)$.

(a) For $N = 2$ and $x_0 = 0$,

$$f(x) = f(0) + x f'(0) + \frac{x^2}{2!} f''(0) + \frac{x^3}{3!} f'''(\xi(x))$$

therefore

$$\cos(x) = \cos(0) - x \sin(0) - \frac{x^2}{2!} \cos(0) + \frac{x^3}{3!} \sin(\xi(x)) = 1 - \frac{x^2}{2!} + \frac{x^3}{3!} \sin(\xi(x)), \qquad \xi(x) \in (0, x)$$

With $x = 0.01$,

$$\cos(0.01) = 1 - \frac{(0.01)^2}{2!} + \frac{(0.01)^3}{3!} \sin(\xi(x)) = 1 - \frac{0.0001}{2} + \frac{0.000001}{6} \sin(\xi(x))$$

$$= 0.99995 + 0.166\ldots \times 10^{-6} \sin(\xi(x)), \qquad \xi(x) \in (0, 0.01)$$


Since $|\sin(\xi(x))| < 1$, we have

$$|\cos(0.01) - 0.99995| < 0.166 \times 10^{-6}.$$

From a table we get $\cos(0.01) = 0.99995000042$.

(b) The third Taylor polynomial about $x_0 = 0$ gives

$$f(x) = f(0) + x f'(0) + \frac{x^2}{2!} f''(0) + \frac{x^3}{3!} f'''(0) + \frac{x^4}{4!} f^{(iv)}(\xi(x))$$

therefore

$$\cos(x) = \cos(0) - x \sin(0) - \frac{x^2}{2!} \cos(0) + \frac{x^3}{3!} \sin(0) + \frac{x^4}{4!} \cos(\xi(x)) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} \cos(\xi(x)), \qquad \xi(x) \in (0, x)$$

With $x = 0.01$,

$$\cos(0.01) = 1 - \frac{(0.01)^2}{2!} + \frac{(0.01)^4}{4!} \cos(\xi(x)) = 1 - \frac{0.0001}{2} + \frac{0.00000001}{24} \cos(\xi(x))$$

$$= 0.99995 + 4.2 \times 10^{-10} \cos(\xi(x)), \qquad \xi(x) \in (0, 0.01)$$

Since $|\cos(\xi(x))| < 1$, we have

$$|\cos(0.01) - 0.99995| < 4.2 \times 10^{-10},$$

which is a better accuracy assurance.
