Apex-Map Status Erich Strohmaier and Hongzhang Shan


Apex-Map generator

• Benchmark code will be generated based on the following performance parameters:
– PARALLEL: N/Y
– PARALLEL LANGUAGE: MPI / SHMEM / UPC / CAF
– ACCESS PATTERN: RANDOM / STRIDE
– SPATIAL LOCALITY (L): [1, M] Default: {1, 4, 16, …, 65536}
– CONCURRENCY (I): [1, X] Default: 1024
– TEMPORAL LOCALITY (a): [0, 1] Default: {1.0, 0.5, 0.25, 0.1, 0.05, 0.025, 0.01, 0.005, 0.0025, 0.001}
– MEMORY SIZE (M): Default: 67,108,864 Words = 512 MB / process
– REGISTER PRESSURE (R): [1, X] Default: 1
– COMPUTATIONAL INTENSITY (CI): [1, X] Default: 1
– ACCESS MODE: FUSED / NESTED
– RESULTS: SCALAR / ARRAY (left-hand side of statement)
– REPEAT TIMES: 100
– WARMUP TIMES: 10
– CPU MHZ: 1900
– PLATFORM: BASSI
– VERSION: 1.6
– STRIDE: X

Here X denotes any positive integer.


Apex-Map Meets Kernels

Parameter        | HPCC Stream | HPCC GUPS | NAS CG          | NBODY  | MM-Stride | MM-Vector
Pattern          | Random      | Random    | Random          | Random | Stride    | Random
Temp Locality    | 1           | 1         | 0.01-0.03       | 1      |           | 0.02
Spatial Locality | N           | 1         | 1               | 4      |           | K
Mem              | N           | Table     | Matrix + Vector | Mem    |           | Matrix b
Reg. Pressure    | 1           | 1         | 1               | 1      |           | 1
Comp. Intensity  | 1           | 1         | 1               | 16     |           | 1
Concurrency      | 1           | Nupdate   | Matrix / Vector | 1      |           | M
Results          | Array       | Array     | Scalar          | Scalar |           | Array
Access Mode      | Nested      | Nested    | Fused           | Nested |           | Nested
Stride           |             |           |                 |        | N         |


NAS CG (one stream)

Source Code:
==========
DO j = 1, lastrow-firstrow+1
   sum = 0.d0
   DO k = rowstr(j), rowstr(j+1)-1
      sum = sum + a(k)*p(colidx(k))
   ENDDO
   w(j) = sum
ENDDO

Apex-Map Stream

Pattern Random

Temp Locality ???

Spatial Locality 1

Mem E + N

Reg. Pressure 1

Comp. Intensity 1

Concurrency E/N

Results SCALAR

Access Mode FUSED

One-Stream Approach: use one Apex-Map stream to simulate the performance behavior of NAS CG. The temporal locality currently has to be determined experimentally.
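For readers more used to C than Fortran, the CG loop can be sketched as a compressed-row matrix-vector product. The sketch assumes 0-based indices, and it reads E as the number of stored matrix elements and N as the number of rows, so that E/N is the average inner-loop length. That reading of E and N, and the function names, are assumptions rather than something the slide states.

```c
#include <assert.h>
#include <stddef.h>

/* w = A * p for a matrix stored row by row (0-based indices). */
static void csr_matvec(size_t nrows, const size_t *rowstr, const size_t *colidx,
                       const double *a, const double *p, double *w) {
    for (size_t j = 0; j < nrows; j++) {
        double sum = 0.0;
        for (size_t k = rowstr[j]; k < rowstr[j + 1]; k++) {
            /* a is read contiguously, p is read indirectly through colidx */
            sum += a[k] * p[colidx[k]];
        }
        w[j] = sum;
    }
}

/* Average number of stored elements per row (E/N under the assumption above). */
static double avg_elements_per_row(size_t nrows, const size_t *rowstr) {
    assert(nrows > 0);
    return (double)(rowstr[nrows] - rowstr[0]) / (double)nrows;
}
```

The indirect read of `p` is what makes the access pattern look random, while the scalar accumulator `sum` corresponds to the SCALAR result type in the table.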


Performance Prediction for CG (using one stream)

[Figure: two bar charts, one for Jacquard and one for Franklin. x-axis: data set (S, W, A, B, C); y-axis: MF/s; series: CG, Apex-Map 0.03, Apex-Map 0.02, Apex-Map 0.01.]

The results indicate that the performance of CG for the different data sets can be simulated by Apex-Map using one stream with a temporal locality between 0.01 and 0.03 (exception: data set S on Jacquard).


NAS CG (two streams)

Source Code:
===========
DO j = 1, lastrow-firstrow+1
   sum = 0.d0
   DO k = rowstr(j), rowstr(j+1)-1
      sum = sum + a(k)*p(colidx(k))
   ENDDO
   w(j) = sum
ENDDO

Apex-Map Stream1 Stream2

Pattern Random Random

Temp Locality 1 1

Spatial Locality 1 E/N

Mem N_row E_matrix

Reg. Pressure 1 1

Comp. Intensity 1 1

Concurrency E/N N

Results SCALAR SCALAR

Access Mode FUSED NESTED

Two-Stream Approach (a and p are treated differently):
Perf. of CG = 1 / (1/Perf_stream1 + 1/Perf_stream2)
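One way to read this formula is that the time per unit of work of the two streams adds up, so the combined rate is the reciprocal of the summed reciprocals. A minimal sketch, with a hypothetical function name:

```c
#include <assert.h>

/* Combine two per-stream rates (for example in MF/s) as
 * Perf = 1 / (1/Perf_stream1 + 1/Perf_stream2). */
static double combine_two_streams(double perf_stream1, double perf_stream2) {
    assert(perf_stream1 > 0.0 && perf_stream2 > 0.0);
    return 1.0 / (1.0 / perf_stream1 + 1.0 / perf_stream2);
}
```

The combined value is always lower than the slower of the two streams; two streams of equal rate give half that rate.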


Performance Prediction for CG (using two streams)

[Figure: bar chart. x-axis: data set (S, W, A, B, C); y-axis: MF/s; series: Franklin CG, Franklin Apex-Map, Jacquard CG, Jacquard Apex-Map.]

Using the two-stream approach, the predicted performance matches very well on Jacquard. On Franklin, however, only the large data sets match well.


GUPS

Source Code:
==========
for (i = 0; i < NUPDATE; i++) {
    ran = (ran << 1) ^ (((s64int) ran < 0) ? POLY : 0);
    Table[ran & (TableSize - 1)] ^= ran;
}

Apex-Map Stream

Pattern Random

Temp Locality 1

Spatial Locality 1

Mem TableSize

Reg. Pressure 1

Comp. Intensity 1

Concurrency NUPDATE

Results ARRAY

Access Mode NESTED

[Figure: bar chart. x-axis values: 32, 64, 128; y-axis: MB/s; series: Franklin GUPS, Franklin Apex-Map, Jacquard GUPS, Jacquard Apex-Map.]

Results Match Well!
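The GUPS update loop can be made self-contained as in the sketch below. The value of `POLY` is the one used by the HPCC RandomAccess code; the starting value `ran = 1`, the fixed-width types, and the function names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POLY 0x0000000000000007ULL

/* Advance the shift-register random sequence by one step. */
static uint64_t gups_next(uint64_t ran) {
    return (ran << 1) ^ (((int64_t)ran < 0) ? POLY : 0);
}

/* Apply nupdate XOR updates to a table whose size is a power of two. */
static void gups_update(uint64_t *table, size_t table_size, uint64_t nupdate) {
    assert(table_size != 0 && (table_size & (table_size - 1)) == 0);
    uint64_t ran = 1;   /* assumed seed */
    for (uint64_t i = 0; i < nupdate; i++) {
        ran = gups_next(ran);
        table[ran & (table_size - 1)] ^= ran;   /* one random read-modify-write */
    }
}
```

Each update touches a single word at an effectively random location, which is why the table above uses spatial locality 1, temporal locality 1, and an ARRAY result. Because the update is an XOR, running the same update sequence twice restores the table, which makes for a convenient self-check.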


Matrix-Mul (stride)

Source Code:
==========
for (i = 0; i < N; i++) {
    for (j = 0; j < K; j++) {
        tmp = 0;
        for (k = 0; k < M; k++) {
            tmp += a[i*M+k] * b[k*K+j];
        }
        c[i*K+j] = tmp;
    }
}

Apex-Map Stream Stream

Pattern Random Stride

Temp Locality 1 Step: K

Spatial Locality 1

Mem Matrix b Matrix b

Reg. Pressure 1

Comp. Intensity 1

Concurrency M

Results SCALAR

Access Mode NESTED

There are two choices for Apex-Map:
1. Use random stream
2. Use stride stream


Performance Prediction for Matrix-Mul (stride)

[Figure: bar chart. x-axis values: 2048, 4096; y-axis: MF/s; series: Franklin MM, Franklin Apex-Map Random, Franklin Apex-Map Stride, Jacquard MM, Jacquard Apex-Map Random, Jacquard Apex-Map Stride.]

1. Stride stream matches well.
2. Big performance gap between MM and Apex-Map using random stream.


Matrix-Mul (vector)

Source Code:
==========
for (i = 0; i < N; i++)
    for (k = 0; k < M; k++)
        for (j = 0; j < K; j++)
            c[i*K+j] += a[i*M+k] * b[k*K+j];

Apex-Map Stream

Pattern Random

Temp Locality ???

Spatial Locality K

Mem Matrix b

Reg. Pressure 1

Comp. Intensity 1

Concurrency M

Results ARRAY

Access Mode NESTED

[Figure: bar chart. x-axis values: 2048, 4096; y-axis: MF/s; series: Franklin MM, Franklin Apex-Map 1.0, Franklin Apex-Map 0.02, Jacquard MM, Jacquard Apex-Map 1.0, Jacquard Apex-Map 0.02.]

On Franklin, performance matches well when the temporal locality is 0.02. On Jacquard, the match is not close (compiler inefficiency for Apex-Map kernels?).


NBODY

Source Code (Loop Body):
===================
SUBVEC(p->position, bod->position, diff);
DOTPROD(diff, diff, distSq);
distSq += SOFTSQ;
dist = sqrt(distSq);
factor = p->mass / dist;
bod->phi -= factor;
factor = factor / distSq;
MULTVEC(diff, factor, extraAcc);
ADDVEC(bod->acc, extraAcc, bod->acc);

Apex-Map Stream

Pattern Random

Temp Locality 1

Spatial Locality 4

Mem Total MEM

Reg. Pressure 1

Comp. Intensity 15

Concurrency 1

Results SCALAR

Access Mode NESTED

FDIV and FSQRT are implemented differently across platforms, which affects the computation of MF/s and Computational Intensity (CI):
• use a test program to determine the ratio between fdiv, fsqrt and fadd to decide CI for Apex-Map
• use No. Loops/second executed as performance metric instead of MF/s
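The two bullets can be sketched as simple arithmetic: weight each division and square root by its measured cost relative to one fadd, and report loop iterations per second rather than MF/s. The function names and the idea of summing weighted operation counts are assumptions for illustration. The cost ratios have to come from a measurement on the target platform and are not given on the slide.

```c
#include <assert.h>

/* Express a loop body's floating-point work in units of one fadd.
 * div_cost and sqrt_cost are measured cost ratios relative to fadd. */
static double fadd_equivalent_ops(unsigned n_simple, unsigned n_div, unsigned n_sqrt,
                                  double div_cost, double sqrt_cost) {
    assert(div_cost > 0.0 && sqrt_cost > 0.0);
    return (double)n_simple
         + div_cost * (double)n_div
         + sqrt_cost * (double)n_sqrt;
}

/* Platform-neutral metric: loop iterations executed per second. */
static double loops_per_second(unsigned long loops, double seconds) {
    assert(seconds > 0.0);
    return (double)loops / seconds;
}
```

For example, a body with 10 simple operations, 2 divisions and 1 square root, with placeholder ratios of 4 and 8, would count as 26 fadd-equivalent operations.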


Performance Prediction for Nbody

[Figure: bar chart. x-axis: 4M Bodies, 2M Bodies; y-axis: No. Loops/s; series: Franklin Nbody, Franklin Apex-Map, Jacquard Nbody, Jacquard Apex-Map.]

Apex-Map results match Nbody well on Franklin; there is a big difference on Jacquard.


STREAM

Source Code:
==========
for (i = 0; i < N; i++) c[i] = a[i];
for (i = 0; i < N; i++) b[i] = s * c[i];
for (i = 0; i < N; i++) c[i] = a[i] + b[i];
for (i = 0; i < N; i++) a[i] = b[i] + s * c[i];

Stream

Pattern Random

Temp Locality 1

Spatial Locality N

Mem N * ???

Reg. Pressure 1

Comp. Intensity 1

Concurrency 1

Results ARRAY

Access Mode NESTED

[Figure: bar chart. x-axis: Copy, Scale, Add, Triad; y-axis: MF/s; series: Franklin Stream, Franklin Apex-Map, Jacquard Stream, Jacquard Apex-Map.]

Big performance difference due to:
1. Static vs. dynamic memory allocation
2. Kernel implementation details


STREAM: Static vs. Dynamic

Static:
        .text
        .align 16
        .globl tuned_STREAM_Copy
tuned_STREAM_Copy:
..Dcfb4:
        subq $8,%rsp
..Dcfi4:
## lineno: 0
..EN5:
## lineno: 395
        movl $c+0,%edi
        movl $a+0,%esi
        movl $1048576,%edx
        .p2align 4,,1
        call __c_mcopy8
## lineno: 396
        addq $8,%rsp
        ret

Dynamic:
        .text
        .align 16
        .globl tuned_STREAM_Copy
tuned_STREAM_Copy:
..Dcfb4:
## lineno: 0
..EN5:
## lineno: 402
        xorl %ecx,%ecx
        movl $524288,%edx
        movl $8,%eax
        .align 16
.LB2164:
## lineno: 402
        movq a(%rip),%rsi
        movq c(%rip),%r8
        decl %edx
        movq (%rsi,%rcx),%rdi
        movq %rdi,(%r8,%rcx)
        addq $16,%rcx
        movq (%rsi,%rax),%r9
        movq %r9,(%r8,%rax)
        addq $16,%rax
        testl %edx,%edx
        jg .LB2164
## lineno: 403
        ret

The compiler generates different code for statically and dynamically allocated arrays, which may cause a 50% performance difference.
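At the source level, the two variants might look like the sketch below. The array and function names are hypothetical, and the element count is assumed to match the `$1048576` constant in the static listing. With static arrays the compiler knows both addresses and the length, so it can replace the loop with a single block-copy call. With global pointers it emits an explicit loop, and the dynamic listing shows `a` and `c` being reloaded from memory in every iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define STREAM_N 1048576   /* assumed element count */

/* Static allocation: addresses and length are compile-time constants. */
static double a_static[STREAM_N], c_static[STREAM_N];

static void copy_static(void) {
    for (size_t i = 0; i < STREAM_N; i++) {
        c_static[i] = a_static[i];
    }
}

/* Dynamic allocation: the compiler only sees pointers set at run time. */
static double *a_dyn, *c_dyn;

/* Returns 0 on success, -1 if either allocation fails. */
static int alloc_dynamic(size_t n) {
    a_dyn = calloc(n, sizeof *a_dyn);
    c_dyn = calloc(n, sizeof *c_dyn);
    if (a_dyn == NULL || c_dyn == NULL) {
        free(a_dyn);
        free(c_dyn);
        a_dyn = c_dyn = NULL;
        return -1;
    }
    return 0;
}

static void copy_dynamic(size_t n) {
    assert(a_dyn != NULL && c_dyn != NULL);
    for (size_t i = 0; i < n; i++) {
        c_dyn[i] = a_dyn[i];
    }
}

static void free_dynamic(void) {
    free(a_dyn);
    free(c_dyn);
    a_dyn = c_dyn = NULL;
}
```

Whether a given compiler actually produces different code for the two functions depends on the compiler and its flags; inspecting the generated assembly, as done above, is the way to check.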


Random Nested (R=1, CI=1)

• Scalar
for (i = 0; i < times; i++) {
    index_length = B / L;
    initIndexArray(index_length);
    CLOCK(time1);
    for (j = 0; j < index_length; j++) {
        for (k = 0; k < L; k++) {
            W0 = W0 + c0 * (data[ind0[j] + k]);
        }
    }
    CLOCK(time2);
}

• Array
for (i = 0; i < times; i++) {
    index_length = B / L;
    initIndexArray(index_length);
    CLOCK(time1);
    for (j = 0; j < index_length; j++) {
        for (k = 0; k < L; k++) {
            W0[j*L+k] = W0[j*L+k] + c0 * (data[ind0[j] + k]);
        }
    }
    CLOCK(time2);
}

• initIndexArray(length):
for (i = 0; i < length; i++) {
    ind0[i] = getIndex(0) * L;
}

How many loads/stores should be counted?


Random Fused(R=1, CI=1)

• Scalar
for (i = 0; i < times; i++) {
    index_length = B;
    initIndexArray(index_length);
    CLOCK(time1);
    for (j = 0; j < index_length; j++) {
        W0 = W0 + c0 * (data[ind0[j]]);
    }
    CLOCK(time2);
}

• Array
for (i = 0; i < times; i++) {
    index_length = B;
    initIndexArray(index_length);
    CLOCK(time1);
    for (j = 0; j < index_length; j++) {
        W0[j] = W0[j] + c0 * (data[ind0[j]]);
    }
    CLOCK(time2);
}

• initIndexArray(length):
for (i = 0; i < length; i += L) {
    ind0[i] = getIndex(0) * L;
    for (j = 1; (j < L) && (i+j < length); j++) {
        ind0[i+j] = ind0[i] + j;
    }
}
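To make the fused index layout concrete, the sketch below fills an index array the same way but replaces `getIndex()` with a hypothetical stand-in that cycles through the block numbers 3, 0, 5. Only the stand-in and the function names are assumptions; the loop structure follows the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for getIndex(): block numbers 3, 0, 5, 3, 0, 5, ... */
static size_t demo_get_index(size_t call) {
    static const size_t blocks[] = {3, 0, 5};
    return blocks[call % 3];
}

/* Fused-mode setup: every random block start is followed by its L-1
 * successors, so the timed loop needs no inner loop over k. */
static void init_index_array_fused(size_t *ind0, size_t length, size_t L) {
    assert(L >= 1);
    size_t call = 0;
    for (size_t i = 0; i < length; i += L) {
        ind0[i] = demo_get_index(call++) * L;
        for (size_t j = 1; j < L && i + j < length; j++) {
            ind0[i + j] = ind0[i] + j;
        }
    }
}
```

With `length = 10` and `L = 4` this produces 12, 13, 14, 15, 0, 1, 2, 3, 20, 21: two full blocks and one block truncated by the `i + j < length` guard. The nested variant stores only the block starts (12, 0, 20) and walks each block in the inner `k` loop instead.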


Random Nested Scalar ( R )

• R=1
for (j = 0; j < index_length; j++) {
    for (k = 0; k < L; k++) {
        W0 = W0 + c0 * (data[ind0[j] + k]);
    }
}

initIndexArray(length):
for (i = 0; i < length; i++) {
    ind0[i] = getIndex(0) * L;
}

• R=2
for (j = 0; j < index_length; j++) {
    for (k = 0; k < L; k++) {
        W0 = W0 + c0 * (data[ind0[j] + k]);
        W1 = W1 + c1 * (data[ind1[j] + k]);
    }
}

initIndexArray(length):
for (i = 0; i < length; i++) {
    ind0[i] = getIndex(0) * L;
    ind1[i] = getIndex(1) * L;
}


Random Nested Scalar ( CI )

• R=1, CI = 1
for (j = 0; j < index_length; j++) {
    for (k = 0; k < L; k++) {
        W0 = W0+c0*(data[ind0[j]+k]);
    }
}

• R=1, CI = 2
for (j = 0; j < index_length; j++) {
    for (k = 0; k < L; k++) {
        W0 = W0+c0*(data[ind0[j]+k]+c0*(data[ind0[j]+k]));
    }
}

• R=2, CI = 4
for (j = 0; j < index_length; j++) {
    for (k = 0; k < L; k++) {
        W0 = W0+c0*(data[ind0[j]+k]+c0*(data[ind1[j]+k]+c0*(data[ind0[j]+k]+c0*(data[ind1[j]+k]))));
        W1 = W1+c1*(data[ind1[j]+k]+c1*(data[ind0[j]+k]+c1*(data[ind1[j]+k]+c1*(data[ind0[j]+k]))));
    }
}


Random Nested Scalar ( R, CI )

• R=3, CI = 3
for (i = 0; i < times; i++) {
    index_length = B / L;
    initIndexArray(index_length);
    CLOCK(time1);
    for (j = 0; j < index_length; j++) {
        for (k = 0; k < L; k++) {
            W0 = W0+c0*(data[ind0[j]+k]+c0*(data[ind2[j]+k]+c0*(data[ind1[j]+k])));
            W1 = W1+c1*(data[ind1[j]+k]+c1*(data[ind0[j]+k]+c1*(data[ind2[j]+k])));
            W2 = W2+c2*(data[ind2[j]+k]+c2*(data[ind1[j]+k]+c2*(data[ind0[j]+k])));
        }
    }
    CLOCK(time2);
}
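The generated statements follow a regular pattern that can be written as one generic loop. In the sketch below, accumulator r reads its nesting level d through index stream (r - d) mod R, which reproduces the listings for R=1, R=2 and R=3. This generalization and the function name are inferred from the listings, not taken from the generator. The returned count treats every multiply and every add as one flop, giving 2 * CI flops per statement.

```c
#include <assert.h>
#include <stddef.h>

/* One timed pass of the nested scalar kernel for arbitrary R and CI.
 * ind[s] holds the block starts of index stream s; W[r] and c[r] are the
 * accumulators and coefficients. Returns the number of flops executed. */
static unsigned long nested_scalar_kernel(size_t R, size_t CI,
                                          size_t index_length, size_t L,
                                          const double *data,
                                          const size_t *const ind[],
                                          const double c[], double W[]) {
    assert(R >= 1 && CI >= 1);
    unsigned long flops = 0;
    for (size_t j = 0; j < index_length; j++) {
        for (size_t k = 0; k < L; k++) {
            for (size_t r = 0; r < R; r++) {
                /* Start at the innermost level, d = CI - 1. */
                size_t s = (r + R - (CI - 1) % R) % R;
                double acc = data[ind[s][j] + k];
                /* Work outwards: acc = x_d + c * acc for d = CI-2 .. 0. */
                for (size_t d = CI - 1; d-- > 0; ) {
                    s = (r + R - d % R) % R;
                    acc = data[ind[s][j] + k] + c[r] * acc;
                    flops += 2;
                }
                W[r] = W[r] + c[r] * acc;
                flops += 2;
            }
        }
    }
    return flops;
}
```

A pass therefore executes 2 * R * CI * index_length * L flops, while the number of distinct words loaded per inner iteration stays at R, which is how CI raises the ratio of computation to memory traffic.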


Register Pressure ( R ) Effect

[Figure: two line plots of MFlops/s, each with series 1, 2, 3, 4, 5, 8. First panel: L = 4096, x-axis alpha (0.001 to 1). Second panel: alpha = 1, x-axis L (1 to 100000).]


Computational Intensity (CI) Effect

[Figure: two line plots of MFlops/s, each with series 1, 2, 3, 4, 5, 8. First panel: alpha = 1, x-axis L (1 to 100000). Second panel: L = 4096, x-axis alpha (0.001 to 1).]


% Peak for Random Nested Scalar (R=1, CI=1)

[Figure: surface plot "Nested Scalar 512 MB". Axes: temporal locality (1, 0.5, 0.25, 0.1, 0.05, 0.025, 0.01, 0.005, 0.0025, 0.001), L (1, 16, 256, 4096, 65536), and % Peak (0 to 100, shaded in bands of 10).]