Apex-Map Status
Erich Strohmaier and Hongzhang Shan
Apex-Map generator
• Benchmark code will be generated based on the following performance parameters:
– PARALLEL: N/Y
– PARALLEL LANGUAGE: MPI / SHMEM / UPC / CAF
– ACCESS PATTERN: RANDOM / STRIDE
– SPATIAL LOCALITY (L): [1, M] Default: {1, 4, 16, …, 65536}
– CONCURRENCY (I): [1, X] Default: 1024
– TEMPORAL LOCALITY (a): [0, 1] Default: {1.0, 0.5, 0.25, 0.1, 0.05, 0.025, 0.01, 0.005, 0.0025, 0.001}
– MEMORY SIZE (M): Default: 67,108,864 words = 512 MB / process
– REGISTER PRESSURE (R): [1, X] Default: 1
– COMPUTATIONAL INTENSITY (CI): [1, X] Default: 1
– ACCESS MODE: FUSED / NESTED
– RESULTS: SCALAR / ARRAY (left-hand side of statement)
– REPEAT TIMES: 100
– WARMUP TIMES: 10
– CPU MHZ: 1900
– PLATFORM: BASSI
– VERSION: 1.6
– STRIDE: X
X: any positive integer
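To make the parameter list concrete, here is a minimal sketch of how the settings could be held in one C structure. The type and field names are hypothetical and are not taken from the Apex-Map generator; the field `alpha` stands for the temporal-locality parameter. The only arithmetic it checks is the stated memory default, assuming 8-byte words.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the generator parameters; names are illustrative. */
typedef enum { PATTERN_RANDOM, PATTERN_STRIDE } access_pattern;
typedef enum { MODE_FUSED, MODE_NESTED } access_mode;
typedef enum { RESULT_SCALAR, RESULT_ARRAY } result_kind;

typedef struct {
    access_pattern pattern;
    access_mode    mode;
    result_kind    result;
    long   L;       /* spatial locality: contiguous block length in words */
    long   I;       /* concurrency */
    double alpha;   /* temporal locality */
    long   M;       /* memory size in words per process */
    int    R;       /* register pressure */
    int    CI;      /* computational intensity */
    int    repeat;  /* timed repetitions */
    int    warmup;  /* untimed warm-up repetitions */
} apexmap_params;

/* Defaults quoted on the slide. Pattern, mode, result, L and alpha have no
   single default there, so one listed value is picked arbitrarily for each. */
static apexmap_params apexmap_defaults(void)
{
    apexmap_params p = {
        .pattern = PATTERN_RANDOM, .mode = MODE_NESTED, .result = RESULT_SCALAR,
        .L = 1, .I = 1024, .alpha = 1.0, .M = 67108864L,
        .R = 1, .CI = 1, .repeat = 100, .warmup = 10
    };
    return p;
}

/* Memory footprint in bytes, assuming 8-byte words. */
static size_t apexmap_bytes(const apexmap_params *p)
{
    assert(p->M > 0);
    return (size_t)p->M * 8u;
}
```

With the defaults, `apexmap_bytes` gives 536,870,912 bytes, which is the 512 MB per process quoted above.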
Apex-Map Meets Kernels
Kernels: HPCC Stream, HPCC GUPS, NAS CG, NBODY, MM-Stride, MM-Vector
Pattern: Random, Random, Random, Random, Stride, Random
Temp Locality: 1, 1, 0.01-0.03, 1, 0.02
Spatial Locality: N, 1, 1, 4, K
Mem: N, Table, Matrix + Vector, Mem, Matrix b
Reg. Pressure: 1, 1, 1, 1, 1
Comp. Intensity: 1, 1, 1, 16, 1
Concurrency: 1, Nupdate, Matrix / Vector, 1, M
Results: Array, Array, Scalar, Scalar, Array
Access Mode: Nested, Nested, Fused, Nested, Nested
Stride: N
NAS CG (one stream)
Source Code:
==========
    DO j = 1, lastrow-firstrow+1
      sum = 0.d0
      DO k = rowstr(j), rowstr(j+1)-1
        sum = sum + a(k)*p(colidx(k))
      ENDDO
      w(j) = sum
    ENDDO
Apex-Map Stream
Pattern Random
Temp Locality ???
Spatial Locality 1
Mem E + N
Reg. Pressure 1
Comp. Intensity 1
Concurrency E/N
Results SCALAR
Access Mode FUSED
One-Stream Approach: use one Apex-Map stream to simulate NAS CG performance behavior. The temporal locality currently has to be determined by experiment.
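For readers more used to C, the CG loop can be sketched as a 0-based sparse matrix-vector product. This is an illustrative translation, not code from the NAS benchmark; it assumes `rowstr` holds n+1 row offsets. It shows the two kinds of access the slides model: `a` and `colidx` are read contiguously, while `p` is gathered through `colidx`.

```c
#include <assert.h>
#include <stddef.h>

/* w = A*p for a sparse matrix stored by rows (0-based indices).
   a: nonzero values, colidx: column of each nonzero,
   rowstr: n+1 offsets marking where each row starts in a/colidx. */
static void csr_matvec(size_t n, const size_t *rowstr, const size_t *colidx,
                       const double *a, const double *p, double *w)
{
    for (size_t j = 0; j < n; j++) {
        assert(rowstr[j] <= rowstr[j + 1]);    /* offsets must not decrease */
        double sum = 0.0;
        for (size_t k = rowstr[j]; k < rowstr[j + 1]; k++) {
            sum += a[k] * p[colidx[k]];        /* a streamed, p gathered */
        }
        w[j] = sum;
    }
}
```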
Performance Prediction for CG (using one stream)
[Figure: two bar charts, one for Jacquard and one for Franklin. Y-axis: MF/s. X-axis: data sets S, W, A, B, C. Series: CG, Apex-Map 0.03, Apex-Map 0.02, Apex-Map 0.01.]
The results indicate that the performance of CG for different data sets can be simulated by Apex-Map using one stream with a temporal locality between 0.01 and 0.03 (exception: data set S on Jacquard).
NAS CG (two streams)
Source Code:
===========
    DO j = 1, lastrow-firstrow+1
      sum = 0.d0
      DO k = rowstr(j), rowstr(j+1)-1
        sum = sum + a(k)*p(colidx(k))
      ENDDO
      w(j) = sum
    ENDDO
Apex-Map Stream1 Stream2
Pattern Random Random
Temp Locality 1 1
Spatial Locality 1 E/N
Mem N_row E_matrix
Reg. Pressure 1 1
Comp. Intensity 1 1
Concurrency E/N N
Results SCALAR SCALAR
Access Mode FUSED NESTED
Two-Stream Approach: (a, p are treated differently)
Perf. of CG = 1 / (1/Perf_stream1 + 1/Perf_stream2)
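The combination rule can be written as a small helper. This is only a sketch of the formula on the slide; the function name is made up for illustration, and both rates are assumed to be in the same units (for example MF/s).

```c
#include <assert.h>

/* Combine two per-stream rates as on the slide:
   perf = 1 / (1/perf_stream1 + 1/perf_stream2). */
static double combine_two_streams(double perf_stream1, double perf_stream2)
{
    assert(perf_stream1 > 0.0 && perf_stream2 > 0.0);
    return 1.0 / (1.0 / perf_stream1 + 1.0 / perf_stream2);
}
```

Because the times per unit of work add up, the combined rate is always below the slower stream: two hypothetical streams of 512 MF/s each combine to 256 MF/s.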
Performance Prediction for CG (using two streams)
[Figure: bar chart. Y-axis: MF/s. X-axis: data sets S, W, A, B, C. Series: Franklin CG; Franklin Apex-Map; Jacquard CG; Jacquard Apex-Map.]
With the two-stream approach, performance matches very well on Jacquard. On Franklin, however, only the large data sets match well.
GUPS
Source Code:
==========
    for (i = 0; i < NUPDATE; i++) {
        ran = (ran << 1) ^ (((s64int) ran < 0) ? POLY : 0);
        Table[ran & (TableSize - 1)] ^= ran;
    }
Apex-Map Stream
Pattern Random
Temp Locality 1
Spatial Locality 1
Mem TableSize
Reg. Pressure 1
Comp. Intensity 1
Concurrency NUPDATE
Results ARRAY
Access Mode NESTED
[Figure: bar chart. Y-axis: MB/s. X-axis: 32, 64, 128. Series: Franklin GUPS; Franklin Apex-Map; Jacquard GUPS; Jacquard Apex-Map.]
Results Match Well!
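A self-contained sketch of the GUPS update loop follows. The slide gives neither the value of POLY nor the starting value of `ran`, so both are assumptions here, as is the function name; the table size must be a power of two for the mask to work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POLY 0x0000000000000007ULL   /* assumed value; not given on the slide */

/* Apply nupdate pseudo-random XOR updates to a table whose size is a
   power of two. The seed of 1 is an assumption made for this sketch. */
static void gups_update(uint64_t *table, size_t table_size, uint64_t nupdate)
{
    assert(table_size != 0 && (table_size & (table_size - 1)) == 0);
    uint64_t ran = 1;
    for (uint64_t i = 0; i < nupdate; i++) {
        /* shift left; fold in POLY when the top bit was set */
        ran = (ran << 1) ^ (((int64_t)ran < 0) ? POLY : 0);
        table[ran & (table_size - 1)] ^= ran;    /* one random read-modify-write */
    }
}
```

Since every update is an XOR, running the same update sequence twice restores the table, which gives a simple correctness check.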
Matrix-Mul (stride)
Source Code:
==========
    for (i = 0; i < N; i++) {
        for (j = 0; j < K; j++) {
            tmp = 0;
            for (k = 0; k < M; k++) {
                tmp += a[i*M+k] * b[k*K+j];
            }
            c[i*K+j] = tmp;
        }
    }
Apex-Map Stream Stream
Pattern Random Stride
Temp Locality 1 Step: K
Spatial Locality 1
Mem Matrix b Matrix b
Reg. Pressure 1
Comp. Intensity 1
Concurrency M
Results SCALAR
Access Mode NESTED
There are two choices for Apex-Map:
1. Use random stream
2. Use stride stream
Performance Prediction for Matrix-Mul (stride)
[Figure: bar chart. Y-axis: MF/s. X-axis: 2048, 4096. Series: Franklin MM; Franklin Apex-Map Random; Franklin Apex-Map Stride; Jacquard MM; Jacquard Apex-Map Random; Jacquard Apex-Map Stride.]
1. Stride stream matches well.
2. Big performance gap between MM and Apex-Map using random stream.
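The reason a stride stream is the natural fit can be seen from the index arithmetic of the inner loop. The sketch below assumes row-major storage and uses a hypothetical helper, `b_index`, to make the access to `b` explicit: consecutive values of `k` read elements of `b` that are K words apart.

```c
#include <assert.h>
#include <stddef.h>

/* Position of b[k][j] in a row-major M x K matrix (helper for illustration). */
static size_t b_index(size_t k, size_t j, size_t K)
{
    return k * K + j;
}

/* c (N x K) = a (N x M) * b (M x K), all row-major. */
static void matmul_stride(size_t N, size_t M, size_t K,
                          const double *a, const double *b, double *c)
{
    assert(N > 0 && M > 0 && K > 0);
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < K; j++) {
            double tmp = 0.0;
            for (size_t k = 0; k < M; k++) {
                /* a advances by 1 word per k; b advances by K words per k */
                tmp += a[i * M + k] * b[b_index(k, j, K)];
            }
            c[i * K + j] = tmp;
        }
    }
}
```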
Matrix-Mul (vector)
Source Code:
==========
    for (i = 0; i < N; i++)
        for (k = 0; k < M; k++)
            for (j = 0; j < K; j++)
                c[i*K+j] += a[i*M+k] * b[k*K+j];
Apex-Map Stream
Pattern Random
Temp Locality ???
Spatial Locality K
Mem Matrix b
Reg. Pressure 1
Comp. Intensity 1
Concurrency M
Results ARRAY
Access Mode NESTED
[Figure: bar chart. Y-axis: MF/s. X-axis: 2048, 4096. Series: Franklin MM; Franklin Apex-Map 1.0; Franklin Apex-Map 0.02; Jacquard MM; Jacquard Apex-Map 1.0; Jacquard Apex-Map 0.02.]
On Franklin, performance matches well when the temporal locality is 0.02. On Jacquard the match is not close (compiler inefficiency for Apex-Map kernels?).
NBODY
Source Code (Loop Body):
===================
    SUBVEC(p->position, bod->position, diff);
    DOTPROD(diff, diff, distSq);
    distSq += SOFTSQ;
    dist = sqrt(distSq);
    factor = p->mass/dist;
    bod->phi -= factor;
    factor = factor / distSq;
    MULTVEC(diff, factor, extraAcc);
    ADDVEC(bod->acc, extraAcc, bod->acc);
Apex-Map Stream
Pattern Random
Temp Locality 1
Spatial Locality 4
Mem Total MEM
Reg. Pressure 1
Comp. Intensity 15
Concurrency 1
Results SCALAR
Access Mode NESTED
FDIV and FSQRT are implemented differently across platforms and will affect the computation of MF/s and Computational Intensity (CI):
• use a test program to determine the ratio between fdiv, fsqrt and fadd to decide CI for Apex-Map
• use No. Loops/second executed as performance metric instead of MF/s
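One possible reading of the first bullet is to express a loop body's floating-point work in units of one fadd, weighting each fdiv and fsqrt by its measured cost ratio. The sketch below is an assumption about how such ratios could be folded into a CI setting, not the authors' procedure; the function name and the ratios used in any call are hypothetical.

```c
#include <assert.h>

/* Floating-point work of one loop body in fadd-equivalent operations.
   n_simple:   adds, subtracts and multiplies, counted as cost 1 each
   div_ratio:  measured cost of one fdiv relative to one fadd
   sqrt_ratio: measured cost of one fsqrt relative to one fadd */
static double fadd_equivalent_ops(int n_simple, int n_div, int n_sqrt,
                                  double div_ratio, double sqrt_ratio)
{
    assert(n_simple >= 0 && n_div >= 0 && n_sqrt >= 0);
    assert(div_ratio > 0.0 && sqrt_ratio > 0.0);
    return n_simple + n_div * div_ratio + n_sqrt * sqrt_ratio;
}
```

For example, a body with 10 simple operations, 2 divisions and 1 square root, on a machine where a division costs 4 adds and a square root 8, would count as 26 add-equivalents. These numbers are made up to show the arithmetic.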
Performance Prediction for Nbody
[Figure: bar chart. Y-axis: No. Loops/s. X-axis: 4M Bodies, 2M Bodies. Series: Franklin Nbody; Franklin Apex-Map; Jacquard Nbody; Jacquard Apex-Map.]
Apex-Map results match Nbody well on Franklin; there is a big difference on Jacquard.
STREAM
Source Code:
==========
    for (i = 0; i < N; i++) c[i] = a[i];            /* Copy */
    for (i = 0; i < N; i++) b[i] = s*c[i];          /* Scale */
    for (i = 0; i < N; i++) c[i] = a[i]+b[i];       /* Add */
    for (i = 0; i < N; i++) a[i] = b[i]+s*c[i];     /* Triad */
Stream
Pattern Random
Temp Locality 1
Spatial Locality N
Mem N * ???
Reg. Pressure 1
Comp. Intensity 1
Concurrency 1
Results ARRAY
Access Mode NESTED
[Figure: bar chart. Y-axis: MF/s. X-axis: Copy, Scale, Add, Triad. Series: Franklin Stream; Franklin Apex-Map; Jacquard Stream; Jacquard Apex-Map.]
Big performance difference due to:
1. Static vs. dynamic memory allocation
2. Kernel implementation details
STREAM: Static vs. Dynamic
Static:
        .text
        .align 16
        .globl tuned_STREAM_Copy
tuned_STREAM_Copy:
..Dcfb4:
        subq $8,%rsp
..Dcfi4:
## lineno: 0
..EN5:
## lineno: 395
        movl $c+0,%edi
        movl $a+0,%esi
        movl $1048576,%edx
        .p2align 4,,1
        call __c_mcopy8
## lineno: 396
        addq $8,%rsp
        ret

Dynamic:
        .text
        .align 16
        .globl tuned_STREAM_Copy
tuned_STREAM_Copy:
..Dcfb4:
## lineno: 0
..EN5:
## lineno: 402
        xorl %ecx,%ecx
        movl $524288,%edx
        movl $8,%eax
        .align 16
.LB2164:
## lineno: 402
        movq a(%rip),%rsi
        movq c(%rip),%r8
        decl %edx
        movq (%rsi,%rcx),%rdi
        movq %rdi,(%r8,%rcx)
        addq $16,%rcx
        movq (%rsi,%rax),%r9
        movq %r9,(%r8,%rax)
        addq $16,%rax
        testl %edx,%edx
        jg .LB2164
## lineno: 403
        ret
Different code is generated for the static and dynamic cases (this may cause a 50% performance difference).
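At source level the two cases might look like the sketch below. It is an illustration rather than the STREAM source: the array size is deliberately small, the names are made up, and the dynamic version takes its pointers as arguments instead of reading global pointers as the listing above does. With static arrays the addresses and length are compile-time constants, which is what allows a compiler to substitute a block-copy routine; with heap arrays it only sees pointers known at run time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define N_STATIC 1024   /* illustrative size, far smaller than STREAM arrays */

static double a_static[N_STATIC], c_static[N_STATIC];

/* Static arrays: base addresses and trip count are known at compile time. */
static void copy_static(void)
{
    for (size_t i = 0; i < N_STATIC; i++) {
        c_static[i] = a_static[i];
    }
}

/* Heap arrays: the compiler only sees run-time pointers and a run-time length. */
static void copy_dynamic(double *c, const double *a, size_t n)
{
    assert(a != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i];
    }
}
```

Both functions compute the same result; any difference lies in the machine code the compiler chooses to emit for them.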
Random Nested (R=1, CI=1)
• Scalar
    for (i = 0; i < times; i++) {
        index_length = B / L;
        initIndexArray(index_length);
        CLOCK(time1);
        for (j = 0; j < index_length; j++) {
            for (k = 0; k < L; k++) {
                W0 = W0 + c0 * data[ind0[j] + k];
            }
        }
        CLOCK(time2);
    }
• Array
    for (i = 0; i < times; i++) {
        index_length = B / L;
        initIndexArray(index_length);
        CLOCK(time1);
        for (j = 0; j < index_length; j++) {
            for (k = 0; k < L; k++) {
                W0[j*L + k] = W0[j*L + k] + c0 * data[ind0[j] + k];
            }
        }
        CLOCK(time2);
    }
• initIndexArray(length):
    for (i = 0; i < length; i++) {
        ind0[i] = getIndex(0) * L;
    }
How many loads/stores should be counted?
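The timed part of the nested kernels can be isolated as two small functions, shown here as a sketch with made-up function names. Timing and index generation are left out: the caller supplies the start offsets that `initIndexArray` would produce. Looking only at the source, the scalar form reads one data word per inner iteration, while the array form also reads and writes one result word.

```c
#include <assert.h>
#include <stddef.h>

/* Nested scalar kernel (R=1, CI=1): for each start offset, read L
   consecutive words of data and accumulate them into one scalar. */
static double nested_scalar(const double *data, const size_t *ind0,
                            size_t index_length, size_t L, double c0)
{
    assert(L > 0);
    double W0 = 0.0;
    for (size_t j = 0; j < index_length; j++) {
        for (size_t k = 0; k < L; k++) {
            W0 = W0 + c0 * data[ind0[j] + k];
        }
    }
    return W0;
}

/* Nested array kernel: same reads, but each result goes to its own
   element of W0, which must hold index_length * L words. */
static void nested_array(const double *data, const size_t *ind0,
                         size_t index_length, size_t L, double c0, double *W0)
{
    assert(L > 0);
    for (size_t j = 0; j < index_length; j++) {
        for (size_t k = 0; k < L; k++) {
            W0[j * L + k] = W0[j * L + k] + c0 * data[ind0[j] + k];
        }
    }
}
```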
Random Fused (R=1, CI=1)
• Scalar
    for (i = 0; i < times; i++) {
        index_length = B;
        initIndexArray(index_length);
        CLOCK(time1);
        for (j = 0; j < index_length; j++) {
            W0 = W0 + c0 * data[ind0[j]];
        }
        CLOCK(time2);
    }
• Array
    for (i = 0; i < times; i++) {
        index_length = B;
        initIndexArray(index_length);
        CLOCK(time1);
        for (j = 0; j < index_length; j++) {
            W0[j] = W0[j] + c0 * data[ind0[j]];
        }
        CLOCK(time2);
    }
• initIndexArray(length):
    for (i = 0; i < length; i += L) {
        ind0[i] = getIndex(0) * L;
        for (j = 1; (j < L) && (i+j < length); j++) {
            ind0[i+j] = ind0[i] + j;
        }
    }
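In the fused mode the index array holds one entry per word, so each block of L consecutive words is expanded when the indices are built. The sketch below reproduces that expansion with a hypothetical `block_ids` array standing in for successive `getIndex(0)` results, followed by the single-loop kernel that consumes it.

```c
#include <assert.h>
#include <stddef.h>

/* Fill ind0 with one entry per word: every group of L entries is a run of
   consecutive offsets starting at block_ids[b] * L. block_ids stands in
   for the values getIndex(0) would return. */
static void init_index_fused(size_t *ind0, size_t length, size_t L,
                             const size_t *block_ids)
{
    assert(L > 0);
    size_t b = 0;
    for (size_t i = 0; i < length; i += L) {
        ind0[i] = block_ids[b++] * L;
        for (size_t j = 1; j < L && i + j < length; j++) {
            ind0[i + j] = ind0[i] + j;    /* the last group may be truncated */
        }
    }
}

/* Fused scalar kernel (R=1, CI=1): a single loop over all indices. */
static double fused_scalar(const double *data, const size_t *ind0,
                           size_t index_length, double c0)
{
    double W0 = 0.0;
    for (size_t j = 0; j < index_length; j++) {
        W0 = W0 + c0 * data[ind0[j]];
    }
    return W0;
}
```

With length 6, L = 4 and block ids 2 and 0, the index array becomes 8, 9, 10, 11, 0, 1: one full block and one truncated block.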
Random Nested Scalar ( R )
• R=1
    for (j = 0; j < index_length; j++) {
        for (k = 0; k < L; k++) {
            W0 = W0 + c0 * data[ind0[j] + k];
        }
    }
  initIndexArray(length):
    for (i = 0; i < length; i++) {
        ind0[i] = getIndex(0) * L;
    }
• R=2
    for (j = 0; j < index_length; j++) {
        for (k = 0; k < L; k++) {
            W0 = W0 + c0 * data[ind0[j] + k];
            W1 = W1 + c1 * data[ind1[j] + k];
        }
    }
  initIndexArray(length):
    for (i = 0; i < length; i++) {
        ind0[i] = getIndex(0) * L;
        ind1[i] = getIndex(1) * L;
    }
Random Nested Scalar ( CI )
• R=1, CI = 1
    for (j = 0; j < index_length; j++) {
        for (k = 0; k < L; k++) {
            W0 = W0+c0*(data[ind0[j]+k]);
        }
    }
• R=1, CI = 2
    for (j = 0; j < index_length; j++) {
        for (k = 0; k < L; k++) {
            W0 = W0+c0*(data[ind0[j]+k]+c0*(data[ind0[j]+k]));
        }
    }
• R=2, CI = 4
    for (j = 0; j < index_length; j++) {
        for (k = 0; k < L; k++) {
            W0 = W0+c0*(data[ind0[j]+k]+c0*(data[ind1[j]+k]+c0*(data[ind0[j]+k]+c0*(data[ind1[j]+k]))));
            W1 = W1+c1*(data[ind1[j]+k]+c1*(data[ind0[j]+k]+c1*(data[ind1[j]+k]+c1*(data[ind0[j]+k]))));
        }
    }
Random Nested Scalar ( R, CI )
• R=3, CI = 3
    for (i = 0; i < times; i++) {
        index_length = B / L;
        initIndexArray(index_length);
        CLOCK(time1);
        for (j = 0; j < index_length; j++) {
            for (k = 0; k < L; k++) {
                W0 = W0+c0*(data[ind0[j]+k]+c0*(data[ind2[j]+k]+c0*(data[ind1[j]+k])));
                W1 = W1+c1*(data[ind1[j]+k]+c1*(data[ind0[j]+k]+c1*(data[ind2[j]+k])));
                W2 = W2+c2*(data[ind2[j]+k]+c2*(data[ind1[j]+k]+c2*(data[ind0[j]+k])));
            }
        }
        CLOCK(time2);
    }
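The unrolled statements follow a regular pattern that can be written as loops, which may help when checking other (R, CI) combinations by hand. In the examples shown, term t of the expression for result r reads from index stream (r - t) mod R, and the terms are nested in Horner form. The function below is a sketch of that pattern inferred from the listed cases, with a made-up name; the generator itself presumably emits the unrolled form.

```c
#include <assert.h>
#include <stddef.h>

/* Loop form of the unrolled nested scalar kernel for general R and CI.
   ind[s] is the index array of stream s, c[r] the coefficient and W[r]
   the accumulator of result r. */
static void nested_scalar_r_ci(const double *data, const size_t *const *ind,
                               size_t index_length, size_t L,
                               int R, int CI, const double *c, double *W)
{
    assert(R >= 1 && CI >= 1 && L > 0);
    for (size_t j = 0; j < index_length; j++) {
        for (size_t k = 0; k < L; k++) {
            for (int r = 0; r < R; r++) {
                double acc = 0.0;
                /* build c*(d_0 + c*(d_1 + ... + c*d_{CI-1})) inside out */
                for (int t = CI - 1; t >= 0; t--) {
                    int s = ((r - t) % R + R) % R;   /* stream read by term t */
                    acc = c[r] * (data[ind[s][j] + k] + acc);
                }
                W[r] += acc;
            }
        }
    }
}
```

For R=3, CI=3 this reads ind0, ind2, ind1 for W0; ind1, ind0, ind2 for W1; and ind2, ind1, ind0 for W2, matching the unrolled code above.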
Register Pressure ( R ) Effect
[Figure: two line charts. Left panel: L = 4096, MFlops/s vs. alpha (0.001 to 1, log scale). Right panel: alpha = 1, MFlops/s vs. L (1 to 100000, log scale). Series in both panels: 1, 2, 3, 4, 5, 8.]
Computational Intensity (CI) Effect
[Figure: two line charts. Left panel: alpha = 1, MFlops/s vs. L (1 to 100000, log scale). Right panel: L = 4096, MFlops/s vs. alpha (0.001 to 1, log scale). Series in both panels: 1, 2, 3, 4, 5, 8.]
% Peak for Random Nested Scalar (R=1, CI=1)
[Figure: surface plot "Nested Scalar 512 MB". Vertical axis: % Peak, 0.00 to 100.00, shaded in bands of 10. Horizontal axes: temporal locality (1, 0.5, 0.25, 0.1, 0.05, 0.025, 0.01, 0.005, 0.0025, 0.001) and L (1, 16, 256, 4096, 65536).]