
EECT615 Computer Architecture and Performance

Aminuzzaman Khan W1413863

Dr Sevket


Table of Contents

Introduction

Benchmark Code Development

Graphical Results

Analysis

Conclusion

References

Appendix


Introduction

This coursework assignment investigates how cache affects the performance of a CPU. The assignment has three parts: creating a simple benchmark program based on the "daxpy" computation y[k] = a * x[k] + y[k], where x[ ] and y[ ] are double-precision static arrays, a is a double-precision constant and k is the array index (unit stride); producing a graphical presentation of the results after running the benchmark; and finally writing a report that analyses those graphical results.

I was given a template called bench_template.c to help create my benchmark. Using this template, I had to write a program that measures the time for unit-stride access and for strided access with a stride of 5, which is a suitable value. The benchmark should print a table with five columns under the following headings: size of the array in bytes, time for unit stride, time for strided access, unit-stride MFLOP/s performance, and strided MFLOP/s performance. The array sizes should start from 1 and follow the sequence 1, 2, 4, 8, 16, 32, ... up to 2^19 = 524288.
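To make the computation concrete, the two access patterns the benchmark times look like this in C (an illustrative fragment only; a, x[ ], y[ ] and the vector length vlen are as described above, and k is an integer index):

    /* unit stride: every element is visited in order */
    for (k = 0; k < vlen; k++)
        y[k] = a * x[k] + y[k];

    /* strided access: only every 5th element is visited */
    for (k = 0; k < vlen; k += 5)
        y[k] = a * x[k] + y[k];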


Benchmark Code Development

Code

I used for loops bounded by MAXVLEN. This kept the results within range and stopped the benchmark from running past the end of the arrays, which would otherwise produce odd values such as negative results. I used the gettimeofday() function to record the start and end times of the daxpy computation, which gave the time taken and, from that, the MFLOP/s figures for both strided and unit-stride access.
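Condensed from the full source in the appendix, the overall structure is an outer loop that doubles the vector length up to MAXVLEN, with the repeated daxpy loop enclosed in a timing block:

    for (vlen = 1; vlen <= MAXVLEN; vlen *= 2)      /* sizes 1, 2, 4, ... 2^19 */
    {
        t_start = time_us();                        /* start of timing block */
        for (o = 0; o < NUMBEROFITERATIONS; o++)    /* repeat to improve precision */
            for (l = 0; l < vlen; l++)
                y[l] = a * x[l] + y[l];             /* unit-stride daxpy */
        t_end = time_us();                          /* end of timing block */
        /* ... the same pattern is repeated for the strided loop ... */
    }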

To find the MFLOP/s, the formula is number of operations / (execution time × 10^6). The number of operations is 2 because the daxpy computation involves one multiplication and one addition. The execution time is the elapsed time around the for loops divided by the number of iterations and then by vlen. Arrays are used for the memory assignment calculations; they are initialised to a MAXVLEN size of 2^19, which I expressed as 1024*512.

Compilation

When I compiled the benchmark I used gcc amin.c; however, when running the code with ./a.out the results were not optimised. I used gcc -O3 amin.c to optimise, and this makes a significant difference because the values I obtained were noticeably faster. It also affected the graph: without optimisation there would be hardly anything to see in the graph.

Timing Blocks

I used the gettimeofday() function. This function returns the current time (together with the timezone) expressed as seconds and microseconds (thetime.tv_sec and thetime.tv_usec). I then had to make sure my measurements used consistent units by converting seconds to nanoseconds (multiplying by 1.0e9) and microseconds to nanoseconds (multiplying by 1.0e3).

Repeated Measurements

The for loops I created for both strided access and unit stride are repeated within a timing block. The elapsed time is then divided by the number of iterations, which improves precision for short loops. The repeated measurements relate to gettimeofday() because the function is called before and after the for loops have run: this keeps the calculation simple and lets me measure the start and end time of the daxpy computation.

Memory Assignment

Memory for the arrays is allocated statically, so the arrays can be held in cache temporarily, which speeds up access. They are initialised with x[vlen] = 7.00; y[vlen] = 0.00;. One access pattern is sequential (unit stride) and the other is non-sequential (strided).
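The unit conversion and the per-row calculation boil down to the following, taken from the source in the appendix (note that the helper is named time_us() and the variable uSec, even though both actually hold nanoseconds):

    #include <sys/time.h>

    /* Returns the current time in nanoseconds: seconds are scaled by 1.0e9
       and microseconds by 1.0e3. */
    static double time_us(void)
    {
        struct timeval thetime;
        gettimeofday(&thetime, NULL);
        return (double)(thetime.tv_sec * 1.0e9 + thetime.tv_usec * 1.0e3);
    }

    /* Per-element time and MFLOP/s for the unit-stride loop:
         uSec   = (t_end - t_start) / NUMBEROFITERATIONS / vlen   (ns per element)
         uMFLOP = 2.0 / uSec * 1000
       e.g. 0.56 ns per element gives 2 / 0.56 * 1000 ≈ 3570 MFLOP/s. */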


Graphical Results

Here are the results for the daxpy operations on the Debian machine I used in G.105. The blue line is for unit stride and the orange line is for strided access.

Intel Core™2 Duo E7500

Frequency = 2933 MHz

Bus Speed = 1066 MHz

Clock Multiplier = 11

Number of Cores = 2

Number of Threads = 2

Floating point unit = Integrated

Level 1 cache size = 2 x 32 KB 8-way set associative instruction caches and 2 x 32 KB 8-way set associative data caches

Level 2 cache size = shared 3 MB 12-way set associative cache

Multiprocessing = Uniprocessor

[Figure: MFLOP/s against vector size (1 to 524288, doubling each step); vertical axis 0 to 4000 MFLOP/s. Blue line = uMFLOP/s (unit stride), orange line = sMFLOP/s (strided access).]


Analysis

1.) Estimate the peak performance for your code on the target machine; discuss the efficiency further by comparing the estimated peak performance with the best performance from your experiments.

My estimated peak performance was roughly 3000 MFLOP/s; however, I managed to get a peak performance of 3603.60 MFLOP/s. The reason I expected about 3000 MFLOP/s was that I thought it would be linked to the clock speed of the processor I used, which was 2.93 GHz. I also thought that, because the machine has 2 cores, it would give better performance.

2.) Estimate the size(s) of the cache(s)? Explain your method by identifying a feature you think is relevant on the graph and carefully arguing how and why this feature relates to the cache size.

On the graph above there are a few points where the performance drops significantly, one from 16 to 32 and another from 512 to 4096. These drops show at what point the performance falls because the working set switches from the L1 cache to the L2 cache. The Level 1 cache is 2 x 32 KB 8-way set associative instruction caches and 2 x 32 KB 8-way set associative data caches, and the Level 2 cache is a shared 3 MB 12-way set associative cache. [1] Since each double takes 8 bytes, a vector length of 4096 corresponds to a 32 KB array, so at roughly 32 KB and 3 MB the level of cache being used changes.
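As an illustrative cross-check (this calculation is not part of the original benchmark), the working set at a given vector length can be worked out directly from the element size:

    #include <stdio.h>

    /* Illustrative only: working-set size of the daxpy arrays at one example length. */
    int main(void)
    {
        int vlen = 4096;                                      /* example vector length */
        long per_array = (long)vlen * (long)sizeof(double);   /* 4096 * 8 = 32768 bytes */

        printf("one array (x or y): %ld KB\n", per_array / 1024);      /* 32 KB */
        printf("both arrays (x+y) : %ld KB\n", 2 * per_array / 1024);  /* 64 KB */
        return 0;
    }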

3.) Explain (not just describe) the way the graph changes between the unit-stride and strided measurements.

The unit-stride loop needs no separate element counter, whereas the strided loop counts the elements it actually visits. When the stride changes from 1 to 5 the memory accesses are less local, so the cache hit rate decreases and the performance is lower than for unit stride. The drops in performance for the small array sizes are also related to the cache.

4.) On the basis of parameters you look up or guess and include in your report, estimate the peak performance for daxpy on the G.105 machines. How does this compare with the best performance you measured?

The highest floating-point performance I measured is about 3.6 GFLOP/s for the Intel Core™2 Duo E7500. This is roughly 25% higher than the peak performance of 2.93 GFLOP/s computed from the 2.93 GHz clock rate, assuming one floating-point operation per cycle. This may be due to the processor's speed-up features, such as Enhanced Intel SpeedStep Technology, which allows the processor to change its operating frequency. [2]


Conclusion

Overall I found this coursework challenging; however, I did enjoy running benchmarks on various machines to compare their performance. At the start I had a problem getting my benchmark to work, and when it did run the values were extreme, with results in the region of -10000 MFLOP/s. I then fixed this problem and the results started to look more reasonable. This coursework has given me a visual understanding of how cache works on a CPU, and of the effect of running the benchmark with and without optimisation. I think I could have done better by understanding how the daxpy computation affects the CPU, as well as understanding more about the CPU I benchmarked and why it behaved the way it did in my tests. My prediction was that the CPU would have a peak performance of ~3000 MFLOP/s; however, in some tests I did with other CPUs running at ~3.5 GHz, I got a peak performance of around 1200 MFLOP/s for an AMD FX 8320 8-core CPU. This could be due to the different manufacturers, or to the architecture of the CPU in terms of the number of operations it can execute, or to the fact that the AMD CPU has 4 x 64 KB L1 cache and 4 x 2 MB cache, unlike the Intel CPU. This could mean that the benchmark stayed within the L1 cache, as it seems to be about the same size as the L1 and L2 caches of the Intel CPU.

References

[1] Gennadiy Shvets. (2009). Intel Core 2 Duo E7500 specifications. Available: http://www.cpu-world.com/CPUs/Core_2/Intel-Core%202%20Duo%20E7500%20AT80571PH0773M%20-%20AT80571PH0773ML%20(BX80571E7500).html. Last accessed 29th Mar 2015.

[2] Intel. (2009). Intel® Core™2 Duo Processor E7500 (3M Cache, 2.93 GHz, 1066 MHz FSB). Available: http://ark.intel.com/products/36503/Intel-Core2-Duo-Processor-E7500-3M-Cache-2_93-GHz-1066-MHz-FSB. Last accessed 30th Mar 2015.


Appendix

Print out of my commented source code.

//Aminuzzaman Khan W1413863
#include <sys/time.h>
#include <stdio.h>

#define MAXVLEN (1024*512)
#define NUMBEROFITERATIONS 200
#define BIGSTRIDE 5

//timeofday function
static double time_us()
{
    struct timeval thetime;
    gettimeofday(&thetime,NULL);    //using gettimeofday()
    return (double) (thetime.tv_sec *1.0e9 +thetime.tv_usec*1.0e3);
}

int main (void)
{
    //initialize variables
    int i,o,l,j,ministride;
    static double x[MAXVLEN], y[MAXVLEN];
    int vlen;
    double a = 2.0;
    double t_start, t_end, t1_start, t1_end, uMFLOP, sMFLOP, uSec, sSec;

    //headings
    //printf("%.11s \t %.9s \t %.10s \t %.10s \t %.10s \t %.10s\n","Vector SIZE","uTime[ns]", "uMFLOP/s","STRIDE","sTIME[ns]","sMFLOP/s");
    printf("%s","Vector SIZE uTime[ns] uMFLOP/s STRIDE sTIME[ns] sMFLOP/s\n");
    //used to make the table look aligned with the results
    printf("----------------------------------------------------------------------------\n");

    //initialise the arrays
    for (vlen = 0; vlen <MAXVLEN; vlen++)
    {
        x[vlen] = 7.00;
        y[vlen] = 0.00;
    }

    // Big loop which doubles the vector length until it reaches 2^19
    for(vlen=1;vlen<=MAXVLEN;vlen*=2)
    {
        t_start = time_us();    // uses timeofday function to get start time

        //unit stride
        for (o = 0; o < NUMBEROFITERATIONS; o++)
        {
            for (l = 0; l < vlen; l++ )
            {
                y[l] = a* x[l] + y[l];
            }
        }
        t_end = time_us();

        //calculation for unit stride
        uSec = (t_end - t_start) / NUMBEROFITERATIONS / vlen;
        uMFLOP = 2.0/uSec*1000;

        //stride is max 5
        if (vlen <= BIGSTRIDE)
        {
            ministride = vlen;
        }
        else
        {
            ministride = BIGSTRIDE;
        }

        t1_start = time_us();
        for (o = 0; o < NUMBEROFITERATIONS; o++)
        {
            j=0;    //count the elements actually visited by the strided loop
            for (l = 0; l < vlen; l += ministride)
            {
                y[l] = a* x[l] + y[l];
                j++;
            }
        }
        t1_end = time_us();

        //calculation for strided access
        sSec = (t1_end - t1_start) / NUMBEROFITERATIONS / j;
        sMFLOP = 2.0/sSec*1000;

        //printing results (newline added so each vector length prints on its own row)
        printf("%d\t %10.2f\t %10.2f\t %d\t %10.2f\t %10.2f\n", vlen, uSec, uMFLOP, ministride, sSec, sMFLOP);
    }//end of big loop

    return 0;
}// main

Printout of the output of my code

Vector Size uTime[ns] uMFLOP/s STRIDE sTIME[ns] sMFLOP/s

1 10.24 195.31 1 5.12 390.62

2 5.12 390.62 2 5.12 390.62

4 2.56 781.25 4 3.84 520.83

8 0.64 3125.00 5 5.12 390.62

16 0.64 3125.00 5 2.56 781.25

32 0.96 2083.33 5 2.19 911.46

64 0.62 3225.81 5 1.58 1269.53

128 0.63 3174.60 5 1.53 1310.48

256 0.56 3539.82 5 1.16 1728.72

512 0.56 3603.60 5 1.45 1375.53

1024 0.68 2946.59 5 2.54 787.01

2048 1.09 1828.57 5 3.27 611.87

4096 1.93 1038.96 5 2.98 672.22

8192 1.03 1943.52 5 2.13 937.73

16384 1.02 1968.63 5 3.40 588.54

32768 1.25 1597.25 5 2.97 673.42

65536 1.04 1923.44 5 5.07 394.37

131072 1.93 1038.26 5 8.07 247.71

262144 2.07 968.42 5 9.34 214.12

524288 3.04 657.57 5 13.77 145.20
