Cache Performance Measurement and Optimization
CS61, Lecture 14
Prof. Stephen Chong
October 18, 2011


Announcements

• HW 4: Malloc
  • Final deadline Thursday
  • You should have received comments on your design document
  • Please seek a meeting or feedback from course staff if needed
  • MMTMOH: Mammoth Multi-TF Malloc Office Hours
    • Wednesday evening. See website for details
• Midterm exam
  • Thursday Oct 27
  • Practice exams posted on iSites (both College and Extension)


Mid-course evaluation

• 43 responses (122 enrolled students)
• Pace seems about right
  • Some thought too slow, some thought too fast
• Making sections effective
  • Not compulsory (no in-section quizzes)
  • Section notes are available on website before section (generally on Friday)
  • Encouraged to look at notes before section, figure out where you need to focus
• Lecture videos available to all students
  • Link from the schedule page
• Feedback
  • HW1 feedback took time; assignments available for pickup in MD 143
  • HW2 and 3 (binary bomb and buffer bomb) feedback was automatic
  • Will endeavor to give you timely feedback on remaining assignments


Dennis Ritchie ’63

• Co-creator of C programming language
• Co-developer (with Ken Thompson) of UNIX operating system
  • C is the foundation of UNIX
• Undergrad ('63) and PhD ('68) at Harvard
• Worked at Bell Labs for 40 years
• Profound impact on computer science

1941-2011


Topics for today

• Cache performance metrics
• Discovering your cache's size and performance
• The “Memory Mountain”
• Matrix multiply, six ways
• Blocked matrix multiplication
• Exploiting locality in your programs


Cache Performance Metrics

• Miss Rate
  • Fraction of memory references not found in cache (# misses / # references)
  • Typical numbers:
    • 3-10% for L1
    • Can be quite small (e.g., < 1%) for L2, depending on size and locality
• Hit Time
  • Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache)
  • Typical numbers: 1-2 clock cycles for L1; 5-20 clock cycles for L2
• Miss Penalty
  • Additional time required because of a miss
  • Typically 50-200 cycles for main memory
• Average access time = hit time + (miss rate × miss penalty)
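As a quick illustration of the last formula, here is a minimal sketch (not from the slides; the function name and the example numbers, chosen from the typical ranges quoted above, are mine):

#include <stdio.h>

/* Average memory access time: hit time plus the expected miss cost. */
double avg_access_time(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

int main(void)
{
    /* Example: 1-cycle L1 hit time, 5% miss rate, 100-cycle miss penalty. */
    printf("%.2f cycles\n", avg_access_time(1.0, 0.05, 100.0));
    return 0;
}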


Wait, what do those numbers mean?

• Huge difference between a hit and a miss
  • Could be 100×, if just L1 and main memory
• Would you believe 99% hits is twice as good as 97%?
• Consider:
  • cache hit time of 1 cycle
  • miss penalty of 100 cycles
• Average access time:
  • 97% hits: 1 cycle + 0.03 × 100 cycles = 4 cycles
  • 99% hits: 1 cycle + 0.01 × 100 cycles = 2 cycles
• This is why “miss rate” is used instead of “hit rate”


Writing Cache Friendly Code

• Repeated references to variables are good (temporal locality)
• Stride-1 reference patterns are good (spatial locality)
• Examples: cold cache, 4-byte words, 4-word cache blocks

int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

Miss rate = 1/4 = 25%

int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Miss rate = 100%


Determining cache characteristics

• Say you have a machine but don't know its cache size or speeds.
• How would you figure these values out?
• Idea: Write a program to measure the cache's behavior and performance.
  • Program needs to perform memory accesses with different locality patterns.
• Simple approach:
  • Allocate array of size W words
  • Loop over the array repeatedly with stride S and measure memory access time
  • Vary W and S to estimate cache characteristics

[Diagram: an array of W = 32 words, accessed with stride S = 4 words]


Determining cache characteristics

• What happens as you vary W and S?
• Changing W varies the total amount of memory accessed by the program
  • As W gets larger than one cache level, performance of the program will drop
• Changing S varies the spatial locality of each access
  • If S is less than the size of a cache line, sequential accesses will be fast
  • If S is greater than the size of a cache line, sequential accesses will be slower
• See end of lecture notes for example C program to do this.

[Diagram: an array of W = 32 words, accessed with stride S = 4 words]


The Memory Mountain

• As stride increases, the spatial locality of the program gets worse.
• As working set size increases, the working set no longer fits entirely in a cache level and temporal locality worsens.

[Memory mountain plot: read throughput (MB/sec) as a function of working set size (bytes) and stride (words)]

Intel Core i7, 2.7 GHz, 32 KB L1 d-cache, 256 KB L2 cache, 8 MB L3 cache. CAVEAT: Tested on a VM.


Varying Working Set

• Keep stride constant at S = 16 words, and vary W from 1 KB to 64 MB
• Shows size and read throughputs of different cache levels and memory

[Plot: read throughput (MB/sec) vs. working set size, with distinct regions corresponding to L1, L2, L3, and main memory]


Varying stride

• Keep working set constant at W = 256 KB, vary stride from 1 to 32 words

[Plot: read throughput (MB/sec) vs. stride (s1 to s32); throughput drops as stride increases, until there is one access per cache line]


Core i7

[Memory mountain: read throughput (MB/s) as a function of size (2K to 64M bytes) and stride (s1 to s32, in units of 8 bytes)]

Core i7, 2.67 GHz, 32 KB L1 d-cache, 256 KB L2 cache, 8 MB L3 cache

• Ridges of temporal locality correspond to the working set fitting in L1, L2, L3, or main memory
• Slopes of spatial locality appear as stride increases


Pentium Xeon


Opteron Memory Mountain

[Memory mountain showing L1, L2, and main memory regions]


Topics for today

• Cache performance metrics
• Discovering your cache's size and performance
• The “Memory Mountain”
• Matrix multiply, six ways
• Blocked matrix multiplication
• Exploiting locality in your programs

Matrix Multiplication Example

• Matrix multiplication is heavily used in numeric and scientific applications.
• It's also a nice example of a program that is highly sensitive to cache effects.
• Multiply two N × N matrices
  • O(N³) total operations
  • Read N values for each source element
  • Sum up N values for each destination

void mmm(int n, double a[n][n], double b[n][n], double c[n][n])
{
    int i, j, k;
    double sum;
    /* ijk */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }
}

Variable sum held in register


Matrix Multiplication Example

/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Example: the top-left entry of the product

  [4 2 7]   [3 0 1]
  [1 8 2] × [2 4 5]
  [6 0 1]   [5 9 1]

is 4×3 + 2×2 + 7×5 = 51.


Miss Rate Analysis for Matrix Multiply

• Assume:
  • Line size = 32B (big enough for four 64-bit “double” values)
  • Matrix dimension N is very large
  • Cache is not big enough to hold multiple rows
• Analysis Method:
  • Look at access pattern of inner loop

[Diagram: C(i,j) is computed from row i of A and column j of B]


Layout of C Arrays in Memory (review)

• C arrays allocated in row-major order
  • Each row in contiguous memory locations
• Stepping through columns in one row:
  • for (i = 0; i < N; i++) sum += a[0][i];
  • Accesses successive elements
  • Compulsory miss rate: (8 bytes per double) / (block size of cache)
• Stepping through rows in one column:
  • for (i = 0; i < n; i++) sum += a[i][0];
  • Accesses distant elements; no spatial locality!
  • Compulsory miss rate = 100%
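To make the row-major rule concrete, here is a small check (not from the slides; the array size N = 16 and the indices are arbitrary) that element a[i][j] lives i*N + j elements past a[0][0]:

#include <assert.h>
#include <stddef.h>

#define N 16

int main(void)
{
    double a[N][N];
    int i = 3, j = 5;
    /* Row-major layout: a[i][j] is (i*N + j) doubles past &a[0][0]. */
    assert((char *) &a[i][j] - (char *) &a[0][0]
           == (ptrdiff_t) ((i * N + j) * sizeof(double)));
    return 0;
}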


Matrix Multiplication (ijk)

• 2 loads, 0 stores per iteration
• Assume cache line size of 32 bytes, so 4 doubles per line
• Misses per iteration: A = 0.25, B = 1, C = 0; Total = 1.25
• Inner loop access pattern: A(i,*) row-wise, B(*,j) column-wise, C(i,j) fixed
  • A misses once per 4-double line (0.25), every access to B misses (1), and the value for c[i][j] accumulates in sum, costing no misses in the inner loop (0)

/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}


Cache miss analysis

/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

• Inner loop access pattern: A(i,*) row-wise, B(*,j) column-wise, C(i,j) fixed

[Schematic: the elements of A, B, and C touched in the first iteration, and the lines (4 doubles wide) resident in cache after the first iteration]



Matrix Multiplication (jik)

• Same as ijk, just swapped order of outer loops
• 2 loads, 0 stores per iteration
• Assume cache line size of 32 bytes, so 4 doubles per line
• Misses per iteration: A = 0.25, B = 1, C = 0; Total = 1.25
• Inner loop access pattern: A(i,*) row-wise, B(*,j) column-wise, C(i,j) fixed

/* jik */
for (j = 0; j < n; j++) {
    for (i = 0; i < n; i++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}


Matrix Multiplication (kij)

• 2 loads, 1 store per iteration
• Assume cache line size of 32 bytes, so 4 doubles per line
• Misses per iteration: A = 0, B = 0.25, C = 0.25; Total = 0.5
• Inner loop access pattern: A(i,k) fixed, B(k,*) row-wise, C(i,*) row-wise

/* kij */
for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}


Matrix Multiplication (ikj)

• Same as kij, just swapped order of outer loops
• 2 loads, 1 store per iteration
• Assume cache line size of 32 bytes, so 4 doubles per line
• Misses per iteration: A = 0, B = 0.25, C = 0.25; Total = 0.5
• Inner loop access pattern: A(i,k) fixed, B(k,*) row-wise, C(i,*) row-wise

/* ikj */
for (i = 0; i < n; i++) {
    for (k = 0; k < n; k++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}


Matrix Multiplication (jki)

• 2 loads, 1 store per iteration
• Assume cache line size of 32 bytes, so 4 doubles per line
• Misses per iteration: A = 1, B = 0, C = 1; Total = 2
• Inner loop access pattern: A(*,k) column-wise, B(k,j) fixed, C(*,j) column-wise

/* jki */
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}


Matrix Multiplication (kji)

• Same as jki, just swapped order of outer loops
• 2 loads, 1 store per iteration
• Assume cache line size of 32 bytes, so 4 doubles per line
• Misses per iteration: A = 1, B = 0, C = 1; Total = 2
• Inner loop access pattern: A(*,k) column-wise, B(k,j) fixed, C(*,j) column-wise

/* kji */
for (k = 0; k < n; k++) {
    for (j = 0; j < n; j++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}


Summary of Matrix Multiplication

• ijk or jik: 2 loads, 0 stores, misses/iter = 1.25

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

• kij or ikj: 2 loads, 1 store, misses/iter = 0.5

for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}

• jki or kji: 2 loads, 1 store, misses/iter = 2.0

for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Matrix Multiply Performance

• Each implementation does the same number of arithmetic operations, but there is a ~20× performance difference!
• Pairs with the same number of memory references and misses per iteration perform almost identically.

[Plot: cycles per loop iteration (0 to 125) vs. n (50 to 1250) for ijk, jik, jki, kji, kij, and ikj]

Matrix Multiply Performance

• Miss rate is a better predictor of performance than the number of memory accesses!
• For large N, kij and ikj performance is almost constant, due to hardware prefetching that recognizes stride-1 access patterns.

• jki and kji: 2 loads, 1 store, misses/iter = 2.0
• ijk and jik: 2 loads, 0 stores, misses/iter = 1.25
• kij and ikj: 2 loads, 1 store, misses/iter = 0.50

[Plot: cycles per loop iteration (25 to 125) vs. n (50 to 1250), with the curves grouped according to misses/iter]


Topics for today

• Cache performance metrics
• Discovering your cache's size and performance
• The “Memory Mountain”
• Matrix multiply, six ways
• Blocked matrix multiplication
• Exploiting locality in your programs


Using blocking to improve locality

• Blocked matrix multiplication
  • Break matrix into smaller blocks and perform independent multiplications on each block
  • Improves locality by operating on one block at a time
  • Best if each block can fit in the cache!
• Example: Break each matrix into four sub-blocks

  [A11 A12]   [B11 B12]   [C11 C12]
  [A21 A22] × [B21 B22] = [C21 C22]

  C11 = A11B11 + A12B21    C12 = A11B12 + A12B22
  C21 = A21B11 + A22B21    C22 = A21B12 + A22B22

• Key idea: Sub-blocks (i.e., Axy) can be treated just like scalars.


Blocked Matrix Multiply

• Partition arrays into bsize × bsize chunks
• Innermost (i1, j1, k1) loops multiply an A chunk by a B chunk and accumulate the result in a C chunk

void bmmm(int n, double a[n][n], double b[n][n], double c[n][n])
{
    int i, j, k, i1, j1, k1;
    for (i = 0; i < n; i += B)
        for (j = 0; j < n; j += B)
            for (k = 0; k < n; k += B)
                /* B x B mini matrix multiplications */
                for (i1 = i; i1 < i + B; i1++)
                    for (j1 = j; j1 < j + B; j1++)
                        for (k1 = k; k1 < k + B; k1++)
                            c[i1][j1] += a[i1][k1] * b[k1][j1];
}

(B is the block size, bsize.)

Code becomes harder to read! Is it worth it?
Tradeoff between performance and maintainability...


Blocked matrix multiply

• Assume 3 chunks can fit into the cache, i.e., 3 × bsize² < C, where C is the cache size
• First block iteration
• After first iteration in cache (schematic)

[Schematic: A, B, and C each partitioned into n/B chunks; the chunks touched in the first block iteration and those resident in cache afterwards]
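As a rough worked example (assuming, as the slides do not state explicitly, that C here is measured in doubles): a 32 KB L1 d-cache holds 4096 doubles, so 3 × bsize² < 4096 permits block sizes up to about bsize = 36.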


Cache miss analysis

• Assume 3 chunks can fit into the cache
• Assume bsize is a multiple of 4
• bsize²/4 misses per chunk, so 3/4 × bsize² misses per chunk iteration
• (n/bsize)³ chunk iterations
• Total: (n/bsize)³ × 3/4 × bsize² misses = n³ × 3/(4 × bsize)
• Compare with n³ × 1/2 total misses for the kij algorithm
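For a concrete comparison (this arithmetic is not on the slide): with bsize = 25, the block size used in the measurements that follow, the blocked code incurs about n³ × 3/100 = 0.03 n³ misses, versus 0.5 n³ for kij, i.e., roughly 16× fewer misses.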


Blocked Matrix Multiply Performance

[Plot: cycles/iteration (0 to 60) vs. array size n (25 to 375) for kji, jki, kij, ikj, jik, ijk, bijk (bsize = 25), and bikj (bsize = 25), measured on a Pentium III Xeon, 550 MHz]

• Blocking (bijk and bikj) improves performance by a factor of two over the unblocked versions (ijk and jik)
• Relatively insensitive to array size


Blocked Matrix Multiply Performance

Intel Core i7, 2.7 GHz, 32 KB L1 d-cache, 256 KB L2 cache, 8 MB L3 cache. CAVEAT: Tested on a VM.

[Plot: cycles per loop iteration (0 to 125) vs. n (50 to 1250) for ijk, jik, jki, kji, kij, ikj, bijk, and bikj]


Blocked Matrix Multiply Performance

Intel Core i7, 2.7 GHz, 32 KB L1 d-cache, 256 KB L2 cache, 8 MB L3 cache. CAVEAT: Tested on a VM.

[Plot: cycles per loop iteration (0 to 10) vs. n (50 to 1250) for kij, ikj, bijk, and bikj]


Exploiting locality in your programs

• Focus attention on inner loops
  • This is where most of the computation and memory accesses in your program occur
• Try to maximize spatial locality
  • Read data objects sequentially, with stride 1, in the order they are stored in memory
• Try to maximize temporal locality
  • Use a data object as often as possible once it has been read from memory
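As one concrete illustration of these guidelines (a sketch, not from the slides; the function and variable names are made up for this example): the j loop below walks a row-major array in memory order with stride 1, and both the accumulator and the vector x are reused heavily.

/* Multiply each row of a row-major matrix a by a vector x. */
void row_times_vector(int n, double a[n][n], double x[n], double y[n])
{
    int i, j;
    for (i = 0; i < n; i++) {
        double sum = 0.0;          /* keep the accumulator in a register (temporal locality) */
        for (j = 0; j < n; j++)    /* walk row i in memory order, stride 1 (spatial locality) */
            sum += a[i][j] * x[j];
        y[i] = sum;
    }
}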


Next lecture

• Virtual memory
  • Using memory as a cache for disk


Cache performance test program

/* The test function */
void test(int elems, int stride)
{
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result; /* So compiler doesn't optimize away the loop */
}

/* Run test(elems, stride) and return read throughput (MB/s) */
double run(int size, int stride)
{
    uint64_t start_cycles, end_cycles, diff;
    int elems = size / sizeof(int);

    test(elems, stride);                        /* Warm up the cache */
    start_cycles = get_cpu_cycle_counter();     /* Read CPU cycle counter */
    test(elems, stride);                        /* Run test */
    end_cycles = get_cpu_cycle_counter();       /* Read CPU cycle counter again */
    diff = end_cycles - start_cycles;           /* Compute time */
    return (size / stride) / (diff / CPU_MHZ);  /* Convert cycles to MB/s */
}
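The helper get_cpu_cycle_counter() is not shown on the slides. A minimal sketch of one way it could be implemented, assuming an x86 machine and a compiler (GCC or Clang) that provides the __rdtsc intrinsic:

#include <stdint.h>
#include <x86intrin.h>

/* Read the x86 timestamp counter, which counts at (roughly) the nominal clock rate. */
uint64_t get_cpu_cycle_counter(void)
{
    return (uint64_t) __rdtsc();
}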


Cache performance main routine

#define CPU_MHZ (2.8 * 1024.0 * 1024.0)   /* e.g., 2.8 GHz */
#define MINBYTES (1 << 10)                /* Working set size ranges from 1 KB */
#define MAXBYTES (1 << 23)                /* ... up to 8 MB */
#define MAXSTRIDE 16                      /* Strides range from 1 to 16 */
#define MAXELEMS (MAXBYTES / sizeof(int))

int data[MAXELEMS]; /* The array we'll be traversing */

int main()
{
    int size;   /* Working set size (in bytes) */
    int stride; /* Stride (in array elements) */

    init_data(data, MAXELEMS); /* Initialize each element in data to 1 */
    for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
        for (stride = 1; stride <= MAXSTRIDE; stride++)
            printf("%.1f\t", run(size, stride));
        printf("\n");
    }
    exit(0);
}