12
Physics 420/580: Assignment 1 Solutions 1. (a) Take the complement of each bit (i.e., exchange 0 1) and add 1. In this way 3 goes to -3 as follows: 00000011 11111100 + 1 = 11111101. (b) The most negative representable number -2 N -1 is its own two’s complement. When N = 8, the number -128 is invariant: 10000000 01111111 + 1 = 10000000. (c) For conventional a binary number, the leftmost bit is the most significant. Ad- dition of two such numbers overflows if the most significant carry digit is 1; that is, if the two most significant carry bits are 10 or 11. On the other hand, the leftmost bit in two’s complement encodes the sign information. Overflow in this case occurs if two positive numbers add to give a negative number or if two negative numbers add to give a positive number. This occurs if the two most significant carry bits are 10 or 01. An exhaustive example for N = 2 is given on the next page: unsigned binary 3 11 -1 two’s complement 2 10 -2 1 01 1 0 00 0

Physics 420/580: Assignment 1 Solutions - Page Not Found · Physics 420/580: Assignment 1 Solutions 1.(a) Take the complement of each bit (i.e., exchange 0 $1) and add 1. In this

Embed Size (px)

Citation preview

Physics 420/580: Assignment 1 Solutions

1. (a) Take the complement of each bit (i.e., exchange 0↔ 1) and add 1. In this way 3goes to −3 as follows: 00000011→ 11111100 + 1 = 11111101.

(b) The most negative representable number −2N−1 is its own two’s complement.When N = 8, the number −128 is invariant: 10000000→ 01111111 + 1 = 10000000.

(c) For conventional a binary number, the leftmost bit is the most significant. Ad-dition of two such numbers overflows if the most significant carry digit is 1; that is,if the two most significant carry bits are 10 or 11. On the other hand, the leftmostbit in two’s complement encodes the sign information. Overflow in this case occurs iftwo positive numbers add to give a negative number or if two negative numbers addto give a positive number. This occurs if the two most significant carry bits are 10or 01.

An exhaustive example for N = 2 is given on the next page:

unsigned binary 3 11 −1 two’s complement2 10 −21 01 10 00 0

0 1 2 3

0 0 1 2 3

1 1 2 3 !

2 2 3 ! !

3 3 ! ! !

0 1 "2 "1

0 0 1 "2 "1

1 1 ! "1 0

"2 "2 "1 ! !

"1 "1 0 ! "2

00 00 00 00

00 01 00 01

00 10 00 10

00 11 00 11

00 00 01 01

01 01 01 10

00 10 01 11

11 11 01 00

00 00 10 10

00 01 10 11

10 10 10 00

10 11 10 01

00 00 11 11

11 01 11 00

10 10 11 01

11 11 11 10

00

00

01

01

10

10

11

11

unsigned binary two’s complement

2. Octal digits are encoded by 3 binary digits and hexadecimal digits by 4. Regroupthe digits of the intermediate binary step and translate:

128 = 1 : 0102

= 10102 = a16

56558 = 101110 : 1011012

= 1011 : 1010 : 11012 = bad16

25502768 = 10 : 110111 : 000010 : 1111102

= 1011 : 0111 : 0000 : 1011 : 11102 = ad0be16

765453368 = 111110 : 101100 : 101011 : 0111102

= 1111 : 1010 : 1100 : 1010 : 1101 : 11102 = facade16

37267558 = 10 : 111010 : 110111 : 1011012

= 1011 : 1010 : 1101 : 1110 : 11012 = faded16

If you’re feeling lazy, have the computer do the conversion for you.

#include <iostream>using std::cout;using std::endl;

int main(){

const unsigned int i[5] = { 012, 05655, 02550276, 076545336, 03726755 };cout.setf(std::ios::showbase);for (int n = 0; n < 5; ++n)

cout << std::dec << i[n] << " = "<< std::oct << i[n] << " = "<< std::hex << i[n] << endl;

return 0;}

The output from the program above is

10 = 012 = 0xa2989 = 05655 = 0xbad708798 = 02550276 = 0xad0be16435934 = 076545336 = 0xfacade1027565 = 03726755 = 0xfaded

The C++ convention is that the prefix 0 implies octal and 0x implies hexadecimal.

3. (a) Start by breaking the number up into its contributions at each order of magnitude:

(· · · a3a2a1a0.a−1a−2 · · · )b = · · ·+ a2b2 + a1b + a0 + a−1b

−1 + · · ·= · · · (· · · 00 a3︸︷︷︸

3

00 · · · )b + (· · · 00 a2︸︷︷︸2

00 · · · )b

Since −akbk = bk+1 − bbk − akb

k = bk+1 − (|b| − ak)bk, negation can be representedby the sum

−(· · · a3a2a1a0.a−1a−2 · · · )b =∑

k

(· · · 00 1︸︷︷︸k−1

|b| − ak︸ ︷︷ ︸k

00 · · · )b

Some tedious computation leads to

1322−4 = −22 22 = 232−4

133.21−5 = 12.64 −12.64 = 33.44−5

2050.23−6 = −462.25 462.25 = 14111.53−6

6231.46−8 = −2967.40625 2967.40625 = 13750.52−8

(b) The representation is unique for all numbers expressible with a finite numberof digits. Uniqueness fails for numbers with infinite repeating digits after the radixpoint. For example, consider the two numbers

n1 = (0.a0a0a0a0 · · · )b =ab−1

1− b−2=

ab

b2 − 1.

n2 = (0.0a0a0a0a · · · )b =ab−2

1− b−2=

a

b2 − 1.

Suppose that n1 and n2 differ by unity. This is possible if

n2 − n1 =a

b2 − 1− ab

b2 − 1= −a(b− 1)

b2 − 1= − a

b + 1= 1.

In other words, there is an equivalence

(1.0a0a0a0a · · · )b = (0.a0a0a0a0 · · · )b if a = |b| − 1.

Some examples:

(1.020202 · · · )−3 = (0.202020 · · · )−3

(1.030303 · · · )−4 = (0.303030 · · · )−4

(1.040404 · · · )−5 = (0.404040 · · · )−5

(c) It is clear from our description of the negation procedure that the negative signcan always be absorbed by writing the number with its leading digit at one orderhigher.

4. (a) The largest number is 654321! = 6 · 6! + 5 · 5! + 4 · 4! + 3 · 3! + 2 · 2! + 1 = 5039.The smallest number is 000000! = 0.

(b) All intermediate numbers are represented. This follows immediately from therelation

(0000 k︸︷︷︸k

k − 1︸ ︷︷ ︸k−1

· · · 21)! + 1 = (000 1︸︷︷︸k+1

0︸︷︷︸k

0 · · · 00)!,

which can be proved as follows:

(k + 1)! = (k + 1)k! = k · k! + k!= k · k! + k(k − 1)! = k · k! + (k − 1 + 1)(k − 1)!= k · k! + (k − 1)(k − 1)! + (k − 1)!...

= 1 +k−1∑l=1

l · l!

5. (a) The requested additions are given below.

a + c = (+, 54, 15687) + (−, 46, 88280)= (+, 54, 15687) + (−, 56, 00000)= (+, 54, 15687)

b + c = (+, 56, 19300) + (−, 46, 88280)= (+, 56, 19300) + (−, 56, 00000)= (+, 56, 19300)

(b) The rule for multiplication has several parts. First, we must specify how the signbit is set:

++→ ++− → −−+→ −−− → +

Second, we must specify how the new exponent is computed. One possible rule is tosay that e = e1 + e2 + d[f1, f2] − 55, where d is the number of digits exceeding 5 inthe result of multiplying the fractional parts.

fractional parts: 15687× 19300 = 30275︸ ︷︷ ︸5

9100︸︷︷︸d=4

a× b = (+, 54, 15687)× (−, 56, 19300)= (+, 54 + 56 + 4− 55, 30275)= (+, 59, 30275)

fractional parts: 15687× 88280 = 13848︸ ︷︷ ︸5

48360︸ ︷︷ ︸d=5

a× c = (+, 54, 15687)× (−, 46, 88280)= (−, 54 + 46 + 5− 55, 13848)= (+, 50, 13848)

fractional parts: 19300× 88280 = 17038︸ ︷︷ ︸5

04000︸ ︷︷ ︸d=5

b× c = (+, 56, 19300)× (−, 46, 88280)= (−, 56 + 46 + 5− 55, 17038)= (+, 52, 17038)

(c) In the case of addition, all the significant digits of the smaller number are lostwhen adding two numbers whose orders of magnitude differ by more than 5. A full5 digits of significance is retain in all of the multiplication examples.

6. The summation in both decreasing and increasing order can be carried out with thisprogram:

#include <cstdlib>using std::atoi;

#include <cassert>

#include <iostream>using std::cout;using std::endl;

#include <iomanip>using std::setw;

#include <limits>using std::numeric_limits;

int main(){

const unsigned int cutoff = 100000000;float sum = 0.0;cout.precision(numeric_limits<float>::digits10);for (unsigned int n = 1; n <= cutoff; ++n)

sum += 1.0/n;cout << "Decreasing order sum = " << sum << endl;sum = 0.0;for (unsigned int n = cutoff; n >= 1; --n)

sum += 1.0/n;cout << "Increasing order sum = " << sum << endl;

return 0;}

The output shows quite a marked discrepancy between the two methods:

Decreasing order sum = 15.4037Increasing order sum = 18.8079

The increasing-order result is better. In that case, only numbers of the same order ofmagnitude are added. In decreasing order, however, there is a gross loss of significancebecause in the long tail small numbers are being added to very large ones.When the float variable is replaced with a double, the same program produces

Decreasing order sum = 18.9978964138526Increasing order sum = 18.9978964138534

For a long double, it gives

Decreasing order sum = 18.9978964138539051Increasing order sum = 18.9978964138538980

For many applications in computation physics, you will find that single-precisioncalculations are dangerous.

7. #include <iostream>using std::cout;using std::endl;

#include <iomanip>using std::setw;

#include <cmath>

using std::sqrt;using std::pow;

inline double scale(int i, double t) { return 6.0*pow(2.0,i)*t; }

int main(){

double tA, tB;tA = tB = sqrt(3.0)/3.0;cout.precision(16);cout << setw(8) << 0 << setw(25) << 6.0*tA

<< setw(25) << 6.0*tB << endl;for (int i = 1; i < 50; ++i){

tA = (sqrt(tA*tA+1.0)-1.0)/tA;tB = tB/(sqrt(tB*tB+1.0)+1.0);cout << setw(8) << i << setw(25) << scale(i,tA)

<< setw(25) << scale(i,tB) << endl;}cout << setw(8) << "inf" << setw(25) << "" << setw(25) << M_PI << endl;

return 0;}

0 3.464101615137754 3.4641016151377541 3.215390309173471 3.2153903091734722 3.159659942097494 3.1596599420975013 3.146086215131401 3.1460862151314354 3.142714599645314 3.1427145996453695 3.141873049980126 3.141873049979824

...

23 3.140006864691228 3.14159265358979924 3.224515243534553 3.14159265358979725 2.791117213058869 3.14159265358979626 0 3.14159265358979627 nan 3.14159265358979628 nan 3.141592653589796

...

45 nan 3.14159265358979646 nan 3.141592653589796

47 nan 3.14159265358979648 nan 3.14159265358979649 nan 3.141592653589796

inf 3.141592653589793

The sequence shown in the first column is calculated via repeated application oftA = (sqrt(tA*tA+1.0)-1.0)/tA. The value of tA roughly halves with each step.Eventually tA*tA becomes so insignificant in comparison to 1.0 that the numeratoreffectively becomes sqrt(1.0)-1.0. Thus tA vanishes and a divde-by-zero nan isproduced at the next step.

8. There are many ways to calculate the Mandelbrot set.You could use two doubles to represent the real and imaginary parts of z, store theescape counts in an array, and dump the escape counts to stdout at the end of thecomputation:

#include <cstddef>using std::size_t;

#include <iostream>using std::cout;using std::endl;

int main(){

const size_t M = 450;const size_t N = 300;unsigned int escape[M][N];const unsigned int nmax = 500;const double rmax_sqr = 9.0;

for (size_t i = 0; i < M; ++i)for (size_t j = 0; j < N; ++j){

const double real_c = 3.0*i/M-2.0;const double imag_c = 2.0*j/N-1.0;unsigned int n = 0;double real_z = real_c;double imag_z = imag_c;while (n < nmax and real_z*real_z + imag_z*imag_z < rmax_sqr){

const double real_z_updated = real_z*real_z-imag_z*imag_z + real_c;const double imag_z_updated = 2.0*real_z*imag_z + imag_c;real_z = real_z_updated;imag_z = imag_z_updated;

++n;}escape[i][j] = n;

}

for (size_t j = 0; j < N; ++j){

for (size_t i = 0; i < M; ++i)cout << escape[i][j] << " ";

cout << endl;}

return 0;}

You could instead take advantage of the complex class provided by C++:

#include <cstddef>using std::size_t;

#include <iostream>using std::cout;using std::endl;

#include <complex>using std::complex;

int main(){

const size_t M = 450;const size_t N = 300;unsigned int escape[M][N];const unsigned int nmax = 500;const double rmax_sqr = 9.0;

for (size_t i = 0; i < M; ++i)for (size_t j = 0; j < N; ++j){

complex<double> c( 3.0*i/M-2.0, 2.0*j/N-1.0 );unsigned int n = 0;complex<double> z(c);while (n < nmax and norm(z) < rmax_sqr){

z = z*z +c;++n;

}

escape[i][j] = n;}

for (size_t j = 0; j < N; ++j){

for (size_t i = 0; i < M; ++i)cout << escape[i][j] << " ";

cout << endl;}

return 0;}

You could also do away with the array storage entirely by outputting the escapecounts as you go:

#include <cstddef>using std::size_t;

#include <iostream>using std::cout;using std::endl;

int main(){

const size_t M = 450;const size_t N = 300;const unsigned int nmax = 500;const double rmax_sqr = 9.0;

for (size_t j = 0; j < N; ++j){

for (size_t i = 0; i < M; ++i){

const double real_c = 3.0*i/M-2.0;const double imag_c = 2.0*j/N-1.0;unsigned int n = 0;double real_z = real_c;double imag_z = imag_c;while (n < nmax and real_z*real_z + imag_z*imag_z < rmax_sqr){

const double real_z_updated = real_z*real_z-imag_z*imag_z + real_c;const double imag_z_updated = 2.0*real_z*imag_z + imag_c;real_z = real_z_updated;imag_z = imag_z_updated;++n;

}cout << n << " ";

}cout << endl;

}

return 0;}

The gnuplot output should look something like this.