
SIAM Parallel Processing 2012

• Motivation
• Application Performance Characterization:
  ◦ Current approaches
  ◦ Our approach:
    • General Characteristics
    • Memory Characteristics
• Experimental Setup
  ◦ Benchmarks
  ◦ Tools
• Results
• Conclusion

• Mantevo MiniApps are relatively new
• Compare to well-known, widely used benchmark suites (e.g., SPEC CPU2006)
• Compare to the original apps they represent
• Low-level detailed characterization
  ◦ Provides insight into performance
  ◦ Reveals optimization opportunities if available
  ◦ Helps guide and/or validate the development of proxies (miniApps)
  ◦ Gives an idea of suitable platforms for the applications to run on
  ◦ Helps find suitable sets of benchmarks for an experiment
  ◦ …

• How is it usually done?
• Problems:
  ◦ No standard set of characteristics
  ◦ Most studies use microarchitecture/hardware-dependent characteristics
    • Execution time, CPI, miss rates, etc.
  ◦ Others suggest microarchitecture-independent characteristics
    • Instruction dependence distance, instruction mix
    • Spatial and/or temporal locality information, etc.
  ◦ Limited set of characteristics is usually used due to simulation cost

• Our approach
  ◦ Wide range of low-level detailed characteristics
    • Better ability to explain performance
  ◦ Hardware independent, but ISA dependent
    • Dynamic binary instrumentation (DBI) tools such as PIN
    • Most characteristics captured in terms of a frequency distribution (histogram)
  ◦ Hardware dependent
    • Hardware performance counters
    • Validation
  ◦ More efficient
    • No simulation

• Instruction Mix
  ◦ INT, FP, LD, ST, BR
  ◦ FP: FP, SIMD
  ◦ LD: INT/FP, E_LD: INT/FP
  ◦ ST: INT/FP, E_ST: INT/FP
  ◦ BR: INT/FP
• Instr-dependence distance
  ◦ Register-to-use distance histogram
• Instr-to-Instr distance histograms
  ◦ ld-to-ld, fp-to-fp, br-to-br, etc.
• Instr-to-Use distance histograms
  ◦ ld-to-use, fp-to-use, etc.
• Instruction size histogram
• Registers read per instruction
• Registers written per instruction
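A register-to-use distance histogram of the kind listed above can be sketched in a few lines of Python. This is only an illustration, not the authors' PIN tool: the trace format (a sequence of `(reads, writes)` register-name tuples) and the function name are hypothetical, and the sketch assumes every register read is measured against the most recent write to that register, which the slides do not specify.

```python
from collections import Counter


def dependence_distance_histogram(instructions):
    """Histogram of distances (in instructions) from a register write to its reads.

    `instructions` is a hypothetical trace format: an iterable of
    (reads, writes) pairs, each a tuple of register names.
    """
    last_write = {}   # register name -> index of the instruction that last wrote it
    hist = Counter()  # distance in instructions -> number of occurrences
    for idx, (reads, writes) in enumerate(instructions):
        # Sources are read before destinations are written within one instruction.
        for reg in reads:
            if reg in last_write:
                hist[idx - last_write[reg]] += 1
        for reg in writes:
            last_write[reg] = idx
    return hist


# Tiny hand-written trace (illustrative register names only).
trace = [
    ((), ("r1",)),          # 0: r1 <- ...
    (("r1",), ("r2",)),     # 1: r2 <- f(r1)
    (("r1", "r2"), ("r3",)),  # 2: r3 <- f(r1, r2)
    ((), ("r1",)),          # 3: r1 <- ...
    (("r1", "r3"), ()),     # 4: use r1, r3
]
hist = dependence_distance_histogram(trace)
```

The same loop structure extends to the ld-to-use or fp-to-use variants by recording only writes made by loads or FP instructions.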

• CPI (Cycles per Instruction)
• Cache miss rates (per 1k instructions)
  ◦ L1, L2, L3, etc.
• Branch misprediction rate
• Totals (for validation purposes)
  ◦ Total instructions
  ◦ Total loads, stores, FP, and branches
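The hardware-dependent metrics above are simple ratios of raw event totals. The following Python sketch shows the arithmetic only; the dictionary keys are hypothetical names chosen for this example and are not PAPI event names, and the numbers are made up.

```python
def derived_metrics(counters):
    """Turn raw event totals into CPI, misses per 1k instructions, and
    branch misprediction rate. Key names are hypothetical."""
    instr = counters["instructions"]
    metrics = {"cpi": counters["cycles"] / instr}
    for level in ("l1", "l2", "l3"):
        # Misses per 1k instructions = 1000 * misses / instructions.
        metrics[f"{level}_mpki"] = 1000.0 * counters[f"{level}_misses"] / instr
    metrics["br_mispredict_rate"] = (
        counters["branch_mispredicts"] / counters["branches"]
    )
    return metrics


# Made-up totals for illustration; not measurements from the study.
example = {
    "cycles": 2_000_000,
    "instructions": 1_000_000,
    "l1_misses": 30_000,
    "l2_misses": 10_000,
    "l3_misses": 2_000,
    "branches": 100_000,
    "branch_mispredicts": 5_000,
}
metrics = derived_metrics(example)
```

Normalizing misses per 1k instructions rather than per access makes the three cache levels directly comparable across benchmarks with different instruction mixes.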

• Characteristics obtained from DBI tools
  ◦ Spatial Locality histogram
    • Cache line access stride distribution
    • Stride is the minimum stride found between current access and the last N accesses (N currently set at 32)
    • Max stride one page (4KB)
    • 64-byte cache lines assumed
  ◦ Temporal Locality histogram
    • Memory-Reuse-Distance (MRD) histogram
    • MRD is # of unique memory references between two references to the same cache line
    • Or MRD is # of unique cache lines referenced between two references to the same cache line
    • Max distance currently set to cover 6MB
    • 64-byte cache lines assumed
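The two locality histograms defined above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' instrumentation code: it assumes strides are measured in 64-byte cache lines and capped at 64 lines (one 4KB page), it uses the second MRD definition (unique cache lines), and it simply forgets lines that fall beyond the 6MB window, so a later access to them is counted as a first-time ("cold") access. Function names are hypothetical.

```python
from collections import Counter, OrderedDict, deque

LINE_BYTES = 64                          # cache line size assumed in the slides
PAGE_BYTES = 4096                        # maximum stride: one 4KB page
WINDOW = 32                              # N: number of previous accesses compared
MAX_STRIDE = PAGE_BYTES // LINE_BYTES    # cap in cache lines (assumption)
MAX_LINES = (6 * 1024 * 1024) // LINE_BYTES  # lines covered by a 6MB window


def stride_histogram(addresses, window=WINDOW):
    """Histogram of the minimum cache-line stride to the last `window` accesses."""
    recent = deque(maxlen=window)
    hist = Counter()
    for addr in addresses:
        line = addr // LINE_BYTES
        if recent:  # the very first access has no stride
            stride = min(abs(line - prev) for prev in recent)
            hist[min(stride, MAX_STRIDE)] += 1
        recent.append(line)
    return hist


def reuse_distance_histogram(addresses, max_lines=MAX_LINES):
    """Histogram of unique cache lines touched between two accesses to a line."""
    stack = OrderedDict()  # lines ordered from least to most recently used
    hist = Counter()
    for addr in addresses:
        line = addr // LINE_BYTES
        if line in stack:
            keys = list(stack)  # linear scan: fine for a sketch, slow for real traces
            hist[len(keys) - 1 - keys.index(line)] += 1
            del stack[line]
        else:
            hist["cold"] += 1   # first access, or line aged out of the window
        stack[line] = None      # move to most-recently-used position
        if len(stack) > max_lines:
            stack.popitem(last=False)  # drop the least recently used line
    return hist


addresses = [0, 64, 128, 0, 8]  # byte addresses; lines 0, 1, 2, 0, 0
strides = stride_histogram(addresses)
reuse = reuse_distance_histogram(addresses)
```

A production tool would replace the linear scan with a tree-based structure to keep the per-access cost logarithmic.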

• Characteristics obtained from DBI tools
  ◦ Working Set size
    • Total unique bytes touched by application
    • Distribution of unique bytes touched by every 1 billion instructions
  ◦ Pattern of executed memory instructions
    • Distance defined in number of instructions between memory ops
  ◦ Distribution of memory size read/written
• Characteristics obtained from hardware performance counters:
  ◦ Cache miss rates (per 1k instructions)
    • L1, L2, L3, etc.
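The DBI-derived memory characteristics above (working set size, per-interval unique bytes, distance between memory ops, and access-size distribution) can all be gathered in a single pass over an instruction trace. The sketch below is a hedged Python illustration: the trace format (`None` for a non-memory instruction, `(address, size)` for a memory op) and the function name are hypothetical, and tracking individual bytes in a set is only practical for small examples.

```python
from collections import Counter


def memory_profile(trace, interval=1_000_000_000):
    """Single-pass memory profile of a hypothetical instruction trace.

    Each trace entry is None (non-memory instruction) or (address, size).
    """
    total = set()        # every byte touched by the whole run
    window = set()       # bytes touched in the current instruction interval
    per_interval = []    # unique bytes touched per interval
    gap_hist = Counter()   # instructions between consecutive memory ops
    size_hist = Counter()  # bytes read/written per memory op
    last_mem = None
    count = 0
    for count, op in enumerate(trace, start=1):
        if op is not None:
            addr, size = op
            touched = range(addr, addr + size)
            total.update(touched)
            window.update(touched)
            size_hist[size] += 1
            if last_mem is not None:
                gap_hist[count - last_mem] += 1
            last_mem = count
        if count % interval == 0:
            per_interval.append(len(window))
            window = set()
    if count % interval != 0:  # trailing partial interval
        per_interval.append(len(window))
    return {
        "working_set_bytes": len(total),
        "bytes_per_interval": per_interval,
        "mem_op_distance": gap_hist,
        "access_sizes": size_hist,
    }


# Seven-instruction toy trace with a 4-instruction interval for illustration.
trace = [(0, 8), None, (4, 8), (64, 4), None, None, (0, 8)]
profile = memory_profile(trace, interval=4)
```

The interval defaults to 1 billion instructions to match the slides; the toy call shrinks it so the per-interval list has more than one entry.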

• Mantevo MiniApps
  ◦ Explicit Finite Element MiniApps
    • PhdMesh
  ◦ Molecular Dynamics MiniApps
    • MiniMD
  ◦ Implicit Finite Element MiniApps
    • HPCCG, pHPCCG, MiniFE
• SPEC CPU2006
  ◦ 6 Floating-point benchmarks:
    • cactusADM, LBM, Povray, DealII, Leslie3d, Calculix
  ◦ 4 Integer benchmarks:
    • Perlbench, Astar, Libquantum, Xalancbmk
• Input sizes:
  ◦ Mantevo: adjusted for approximately the same instruction count as SPEC
  ◦ SPEC: reference input

• Platform:
  ◦ Experiments run on Xeon-E5504, Gainestown (based on Nehalem), 45nm, 4 core, 256KB L2/core, 4MB L3
• Tools:
  ◦ PAPI (papiex)
    • CPI, cache and branch statistics
  ◦ PIN (Dynamic Binary Instrumentation)
    • All general characteristics
    • Some memory characteristics
    • Benchmarks run to completion (~1 day each)
  ◦ PIN + PinPoints + SimPoint
    • Spatial & temporal locality characteristics
    • Simulation points of size 1 billion dynamic instructions covering 95% of execution
    • # of points ranges from 3 to 8 with different weights
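The slides do not say how the per-point locality histograms were merged. One plausible way to use the simulation-point weights is a weighted average of the per-point histograms, sketched below in Python. The function name, input layout, and sample numbers are hypothetical; weights are renormalized because the points cover only about 95% of execution.

```python
from collections import Counter


def combine_simulation_points(points):
    """Weighted average of per-simulation-point histograms.

    `points` is a list of (weight, histogram) pairs, one per simulation
    point. Weights are renormalized so they sum to 1.
    """
    total_weight = sum(weight for weight, _ in points)
    combined = Counter()
    for weight, hist in points:
        for bucket, count in hist.items():
            combined[bucket] += (weight / total_weight) * count
    return dict(combined)


# Made-up stride histograms for three simulation points.
points = [
    (0.5, {0: 10, 1: 2}),
    (0.25, {0: 4}),
    (0.25, {1: 8}),
]
combined = combine_simulation_points(points)
```

The result estimates the histogram of an average 1-billion-instruction slice; multiplying by the number of slices in the full run would give an estimate of whole-run frequencies.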


[Figure: two charts. % Stall Cycles (y-axis: %, 0 to 100); CPI (y-axis: CPI, 0 to 3).]


[Figure: three charts. Stacked percentage chart (0% to 100%) with legend Branches, Int Ops, FP Ops, FP Stores, FP Loads, Int Stores, Int Loads; FP-to-Use (y-axis: # of instructions, 0 to 10); FP-to-FP (y-axis: # of instructions, 0 to 5).]


[Figure: three charts. Instruction Dependence Distance (y-axis: # of instructions, 0 to 14); Basic Block Size (y-axis: # of instructions, 0 to 80); Working Set Size (y-axis: Mega Bytes, 0 to 4000).]


[Figure: four charts. L1 Misses/1K inst (0 to 60); L2 Misses/1K inst (0 to 35); BR Misprediction Rate (0% to 8%); L3 Misses/1K inst (0 to 16).]


[Figure: three charts. Distance Between Mem Ops (y-axis: # of instructions, 0 to 4.5); % Mem Ops (0% to 50%); LD-to-Use (y-axis: # of instructions, 0 to 5).]


[Figure: three charts.
  Cache Miss Rates (series: L1, L2, L3; 0% to 12%) for Calculix, DealII, Leslie3d, PHPCCG, MiniFE, MiniMD, PhdMesh, SPEC Avg, Mantevo Avg;
  Stride (y-axis: Frequency, 0 to 4E+11; bins: 0, 1, 64, Other);
  MemReuse Distance (y-axis: Frequency, 0 to 2.5E+11; x-axis: # unique cache lines referenced b/w 2 references to same line; bins: 0, <=10, <=512, <=4096, <=65536, >65536).]


[Figure: three charts.
  Stride (y-axis: Frequency, 0 to 9E+11; bins: 0, 1, Other);
  Cache Miss Rates (series: L1, L2, L3; 0% to 12%) for Calculix, DealII, Leslie3d, PHPCCG, MiniFE, MiniMD, PhdMesh, SPEC Avg, Mantevo Avg;
  MemReuse Distance (y-axis: Frequency, 0 to 8E+11; x-axis: # unique cache lines referenced b/w 2 references to same line; bins: 0, <=10, <=512, <=4096, <=65536, >65536).]


[Figure: three charts.
  Stride (y-axis: Frequency, 0 to 1.2E+12; bins: 0, 1, 64, Other);
  Cache Miss Rates (series: L1, L2, L3; 0% to 12%) for Calculix, DealII, Leslie3d, PHPCCG, MiniFE, MiniMD, PhdMesh, SPEC Avg, Mantevo Avg;
  MemReuse Distance (y-axis: Frequency, 0 to 1.8E+09; x-axis: # unique cache lines referenced b/w 2 references to same line; bins: 0, <=10, <=512, <=4096, <=65536, >65536).]


[Figure: six charts, y-axis: Frequency.
  Stride (Calculix) (0 to 2E+12; bins: 0, 1, 2, Other);
  Stride (DealII) (0 to 8E+11; bins: 0, 1, 64, Other);
  Stride (Leslie3D) (0 to 8E+11; bins: 0, 1, 15, 16, 64, Other);
  MemReuse Distance (Calculix) (0 to 1.5E+12);
  MemReuse Distance (DealII) (0 to 6E+11);
  MemReuse Distance (Leslie3D) (0 to 6E+11).
  MemReuse Distance x-axis: # unique cache lines referenced b/w 2 references to same line; bins: 0, <=10, <=512, <=4096, <=65536, >65536.]

• MiniApps are more memory-intensive than SPEC
  ◦ >100% more misses (L2 & L3) than SPEC!
  ◦ Much larger data working set (500% more)
  ◦ More memory ops per instruction (16% more)
  ◦ Memory ops are closer to each other (2.1 vs. 2.9 instructions apart)
  ◦ More prone to contention for memory resources
• MiniApps have much shorter (>100%) dependence distance than SPEC
  ◦ Suggests more dependence stalls
• MiniApps have much shorter basic blocks than SPEC
• MiniApps experience more stall time (>33%)
  ◦ Greater CPI
  ◦ Due to more cache misses & dependence stalls

• Compare performance of MiniApps and real full-size apps:
  ◦ Single node
  ◦ At scale
• Obtain memory performance characteristics using full runs instead of simulation points
  ◦ Compare findings to simulation points
• Study how sensitive performance is to problem size