View
88
Download
0
Category
Preview:
DESCRIPTION
Hitting for performance. Cache Performance Analysis. Standard Matrix Multiplication. for (i = 0; i
Citation preview
Faculty of Computer Science
CMPUT 229 © 2006
Cache Performance Analysis
Hitting for performance
© 2006
Department of Computing Science
CMPUT 229
Standard Matrix Multiplication
for (i = 0; i<n ; i++){
for(j = 0; j<n ; j++){
c[i,j] = 0.0;
for(k = 0; k<n ; k++){
c[i,j] = c[i,j] + a[i,k] * b[k,j];
}
}
}
Assume that:
Each matrix element is stored in 8 bytes;
The data cache has 32 Kbytes and 128-byte cache lines;
The data cache is direct associative;
n = 1024, Address(a[0,0]) = $8000000, Address(b[0,0]) = $80800000
Address(c[0,0]) = $8100000
What is the data cache hit ratio for this program?
for (i = 0; i<n ; i++){
for(j = 0; j<n ; j++){
sum = 0.0;
for(k = 0; k<n ; k++){
temp1 load(a[i,k]);
temp2 load(b[k,j]);
sum sum + temp1*temp2;
}
store(c[i,j]) sum;
}
}
© 2006
Department of Computing Science
CMPUT 229
Data Access PatternA B
© 2006
Department of Computing Science
CMPUT 229
Cache Access AnalysisAssume that:
Each matrix element is stored in 8 bytes;
The data cache has 32 Kbytes and 128-byte cache lines;
The data cache is direct associative;
n = 1024, Address(a[0,0]) = $8000000, Address(b[0,0]) = $80800000
Address(c[0,0]) = $8100000
What is the data cache hit ratio for this program?
32K-byte cache
128-byte cache line= 256 lines/cache
© 2006
Department of Computing Science
CMPUT 229
Cache Access AnalysisAssume that:
Each matrix element is stored in 8 bytes;
The data cache has 32 Kbytes and 128-byte cache lines;
The data cache is direct associative;
n = 1024, Address(a[0,0]) = $8000000, Address(b[0,0]) = $80800000
Address(c[0,0]) = $8100000
What is the data cache hit ratio for this program?
128-byte cache lines
8-byte element= 16 elements/line
32K-byte cache
128-byte cache line= 256 lines/cache
© 2006
Department of Computing Science
CMPUT 229
Cache Data Access Pattern
If we ignore conflict misses, then:
Every 16th access of A is a miss;
Every access to B is a miss;
How many hits and misses will occur to compute one element of C?
256 lines/cache
16 elements/line
In A there will be 1024/16 = 64 misses and 1024-64 = 960 hits.
In B there will be 1024 misses.
Thus, what is the hit ratio?
# hits
# of accessesHit ratio = =
960 hits
2048 accesses = 0.47 = 47%
© 2006
Department of Computing Science
CMPUT 229
Address anatomy
The data cache has 32 Kbytes and 128-byte cache lines;
128 = 27
256 = 28
256 lines/cache
16 elements/line
15 14 7 6 031
Tag Index Offset
7 bits8 bits17 bits
© 2006
Department of Computing Science
CMPUT 229
Conflict Misses 256 lines/cache
16 elements/lineCache
Access Address Index Outcome
A[0,0] $80000000 0 miss
B[0,0] $80800000 0 miss
A[0,1] $80000004 0 miss
B[1,0] $80801000 32 miss
A[0,2] $80000008 0 hit
B[2,0] $80802000 64 miss
A[0,3] $8000000C 0 hit
B[3,0] $80803000 96 miss
A[0,4] $80000010 0 hit
B[4,0] $80804000 128 miss
A[0,5] $80000014 0 hit
B[5,0] $80805000 160 miss
A[0,6] $80000018 0 hit
B[6,0] $80806000 192 miss
A[0,7] $8000001C 0 hit
B[7,0] $80807000 244 miss
A[0,8] $80000020 0 hit
B[8,0] $80808000 0 miss
A[0,9] $80000024 0 miss
B[9,0] $80809000 32 miss
0326496128160192244
In General:
A 1024-element row of A
Occupies 64 16-element cache lines.
There will be 2 conflict misses in two of these rows.
A total of 4 conflict misses per row.
Thus the accesses of A will result in 68 misses and
986 hits for each 1024 accesses.
The conflict misses are not significant and can be ignored.
© 2006
Department of Computing Science
CMPUT 229
Matrix Multiplication with Transpose
for (i = 0; i<n ; i++){
for(j = 0; j<n ; j++){
for(k = 0; k<n ; k++){
c[i,j] = c[i,j] + a[i,k] * b1[j,k];
}
}
}
Assume that:
Each matrix element is stored in 8 bytes;
The data cache has 32 Kbytes and 128-byte cache lines;
The data cache is direct associative;
n = 1024, Address(a[0,0]) = $8000000, Address(b[0,0]) = $80800000
Address(c[0,0]) = $8100000
What is the data cache hit ratio for this program?
for (i = 0; i<n ; i++){
for(j = 0; j<n ; j++){
b1[i,j] = b[j,i];
}
}
Where in memory
should we place
matrix b1 to reduce
conflict misses?
© 2006
Department of Computing Science
CMPUT 229
Where to place matrix b1?
15 14 7 6 031
Tag Index Offset
Intuitively the index of b1[0][0] should be away from the index of a[0][0].
The index of a[0][0] is 0.
Thus we could aim to place b1 at an address whose index is 128.
© 2006
Department of Computing Science
CMPUT 229
Cache Access Pattern for the Transpose
If we ignore conflict misses, then:
Every 16th access of b1 is a miss;
Every access to b is a miss;
The transpose’s inner loop yields:
2048 accesses
960 hits.
And the inner loop is repeated 1024 times:
1024 2048 accesses
1024 960 hitsThus, the hit ratio is:
# hits
# of accessesHit ratio = =
960 hits
2048 accesses = 0.47 = 47%
for (i = 0; i<n ; i++){
for(j = 0; j<n ; j++){
b1[i,j] = b[j,i];
}
}
© 2006
Department of Computing Science
CMPUT 229
Cache Access Pattern for the Multiplication
If we ignore conflict misses, then:
Every 16th access of a is a miss;
Every 16th access to b1 is a miss;
Thus the inner loop yields 2048 accesses
and 1920 hits.
for (i = 0; i<n ; i++){
for(j = 0; j<n ; j++){
sum = 0.0;
for(k = 0; k<n ; k++){
temp1 load(a[i,k]);
temp2 load(b1[j,k]);
sum sum + temp1*temp2;
}
store(c[i,j]) sum;
}
}
The inner loop is executed n2 times.
The total number of accesses (ignoring
accesses to c) in the multiplication is:
1024 1024 2048 accesses
1024 1024 1920 hits
© 2006
Department of Computing Science
CMPUT 229
Hit Ratio for Multiplication with Transpose
1024 960+ 1024 1024 1920 hits
2048 1024 + 1024 1024 2048 accesses Hit ratio =
The total number of accesses (ignoring
accesses to c) in the multiplication is:
1024 1024 2048 accesses
1024 1024 1920 hits
The transpose yields:
1024 2048 accesses
1024 960 hits.
960+ 1024 1920 hits
1025 2048 accesses Hit ratio = = 0.937 = 93.7%
© 2006
Department of Computing Science
CMPUT 229
Blocked Matrix Multiplication*for (i0 = 0; i0<n ; i0 = i0 + b){
for(j0 = 0; j0<n ; j0 = j0 + b){
for(k0 = 0; k0<n ; k0 = k0 + b){
for(i = i0; i< min(i0+b-1,n) ; i++){
for(j = j0; j< min(j0+b-1,n) ; j++){
for(k = k0; j< min(k0+b-1,n) ; j++){
c[i,j] = c[i,j] + a[i,k] * b[k,j];
}
}
}
}
}
•Code adapted from http://www.netlib.org/utk/papers/autoblock/node2.html
Assumes that all elements of matrix c were initialized to zero beforehand
© 2006
Department of Computing Science
CMPUT 229
Data Access PatternA B
miss
hit
2
0
© 2006
Department of Computing Science
CMPUT 229
Data Access PatternA B
miss
hit
3
1
© 2006
Department of Computing Science
CMPUT 229
Data Access PatternA B
miss
hit
4
2
© 2006
Department of Computing Science
CMPUT 229
Data Access PatternA B
miss
hit
4
4
© 2006
Department of Computing Science
CMPUT 229
Data Access PatternA B
miss
hit
4
6
© 2006
Department of Computing Science
CMPUT 229
Data Access PatternA B
miss
hit
4
8
© 2006
Department of Computing Science
CMPUT 229
Data Access PatternA B
miss
hit
4
10
© 2006
Department of Computing Science
CMPUT 229
Data Access PatternA B
miss
hit
4
12
© 2006
Department of Computing Science
CMPUT 229
Data Access PatternA B
miss
hit
4
14
Multiplying the first row of the block of A by the block of B required
18 accesses that resulted in 4 misses.
How many of the 18 accesses required to multiply the second row
of the block of A by the block of B will be misses?
© 2006
Department of Computing Science
CMPUT 229
Data Access PatternA B
miss
hit
4|1
14|1
© 2006
Department of Computing Science
CMPUT 229
Data Access PatternA B
miss
hit
4|1
14|17
© 2006
Department of Computing Science
CMPUT 229
Data Access PatternA B
miss
hit
4| 1 | 1 = 6
14|17|17 = 48
© 2006
Department of Computing Science
CMPUT 229
Data Access PatternA B
miss
hit
What is the hit ratio for the next block multiplication?
4| 1 | 1 = 6
14|17|17 = 48
3 hits and 48 references
In general, there are b misses and 2b3 accesses
2b3 - b
2b3Hit ratio =
© 2006
Department of Computing Science
CMPUT 229
Data Access PatternA B
miss
hit
What is the hit ratio for the next block multiplication?
4| 1 | 1 = 6
14|17|17 = 48
3 hits and 48 references
In general, there are b misses and 2b3 accesses
2b2 - 1
2b2Hit ratio =
© 2006
Department of Computing Science
CMPUT 229
Data Access PatternA B
Assume that:
Each matrix element is stored in 8 bytes;
The data cache has 32 Kbytes and 128-byte cache lines;
The data cache is direct associative;
What should be
the value of b?
Do the memory locations
of A and B matter?
miss
hit
© 2006
Department of Computing Science
CMPUT 229
Cache Usage for Blocked Matrix Multiplication
Assume that:
Each matrix element is stored in 8 bytes;
The data cache has 32 Kbytes and 128-byte cache lines;
The data cache is direct associative;
for (i0 = 0; i0<n ; i0 = i0 + b){
for(j0 = 0; j0<n ; j0 = j0 + b){
for(k0 = 0; k0<n ; k0 = k0 + b){
for(i = i0; i< min(i0+b-1,n) ; i++){
for(j = j0; j< min(j0+b-1,n) ; j++){
for(k = k0; j< min(k0+b-1,n) ; j++){
c[i,j] = c[i,j] + a[i,k] * b[k,j]; } } } } }
Ignore conflict misses.
Estimate the hit ratio for the block computation if b=16.
2b2 - 1
2b2Hit ratio =
2(16)2 - 1
2(16)2Hit ratio =
Hit ratio = 99.8%
Recommended