CERE: LLVM based Codelet Extractor and REplayer for Piecewise Benchmarking and Optimization

Chadi Akel, P. de Oliveira Castro, M. Popov, E. Petit, W. Jalby
University of Versailles – Exascale Computing Research

EuroLLVM 2016



Motivation

• Finding the best application parameters is a costly, iterative process

• CERE: Codelet Extractor and REplayer
  • Break an application into standalone codelets
  • Make costly analyses affordable:
    • focus on single regions instead of whole applications
    • run a single representative by clustering similar codelets

Codelet Extraction

• Extract codelets as standalone microbenchmarks

A region of a C or Fortran application:

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] += b[i] * c[j];

The region and its IR are extracted into a codelet wrapper, which captures the machine state as a state dump. Codelets can then be recompiled and run independently.

CERE Workflow

Applications (LLVM IR)
  → Region outlining
  → Region capture: working set and cache capture; generate codelet
    wrappers and working-set memory dumps
  → Codelet replay: warmup + replay of the working sets
  → Fast performance prediction: retarget for different architectures and
    different optimizations; change the number of threads, affinity, and
    runtime parameters; invocation & codelet subsetting

CERE can extract codelets from

• Hot loops

• OpenMP non-nested parallel regions [Popov et al. 2015]

Outline

Extracting and Replaying Codelets
  Faithful
  Retargetable

Applications
  Architecture selection
  Compiler flags tuning
  Scalability prediction

Demo
  Capture and replay in NAS BT
  Simple flag replay for NAS FT

Conclusion

Capturing codelets at Intermediate Representation

• Faithful: behaves similarly to the original region

• Retargetable: modify runtime and compilation parameters

  Assembly-level capture               Source-level capture
  same assembly                        ≠ assembly [Akel et al. 2013]
  hard to retarget (compiler, ISA)     easy to retarget
  costly: support various ISAs         costly: support various languages

• LLVM Intermediate Representation is a good tradeoff

Faithful capture

• Required for a semantically accurate replay:
  • register state
  • memory state
  • OS state: locks, file descriptors, sockets

• No support for OS state except for locks. CERE captures fully from userland: no kernel modules are required.

• Required for a performance-accurate replay:
  • preserve code generation
  • cache state
  • NUMA ownership
  • other warmup state (e.g. branch predictor)

Faithful capture memory

Capture accesses at page granularity: coarse, but fast.

Before entering the region to capture:
  • protect the static and currently allocated process memory (/proc/self/maps)
  • intercept the memory allocation functions with LD_PRELOAD; on an allocation (e.g. a = malloc(256)):
    1. allocate the memory
    2. protect it and return to the user program

On a memory access to a protected page (e.g. a[i]++), the segmentation fault handler:
  1. dumps the accessed memory to disk
  2. unlocks the accessed page and returns to the user program

• Small dump footprint: only the touched pages are saved
• Warmup cache: replay the trace of the most recently touched pages
• NUMA: detect the first touch of each page


Selecting Representative Codelets

• Key idea: applications have redundancies
  • the same codelet is called multiple times
  • codelets share similar performance signatures

• Detect redundancies and keep only one representative

Figure: SPEC tonto make_ft (shell2.F90:1133) execution trace: cycles per invocation over 3000 invocations, with the replayed invocation marked. 90% of NAS codelets can be reduced to four or less representatives.

Performance Signature Clustering

Step A: Perform static and dynamic profiling (Maqao, Likwid) on a reference architecture to capture the codelets' feature vectors (vectorization ratio, FLOPS/s, cache misses, ...).

Step B: Using the proximity between feature vectors, cluster similar codelets and select one representative per cluster.

Step C: CERE extracts the representatives as standalone codelets. A model extrapolates the full benchmark results.


[Oliveira Castro et al 2014]


Codelet Based Architecture Selection

                     Atom    Core 2    Sandy Bridge
Real speedup         0.12    0.83      1.55
Predicted speedup    0.15    0.83      1.59

Figure: Benchmarking NAS serial on three architectures (geometric mean speedup; reference: Nehalem).

• real speedup: measured by benchmarking the original applications
• predicted speedup: predicted with the representative codelets
• CERE is 31× cheaper than running the full benchmarks

Autotuning LLVM middle-end optimizations

• The LLVM middle-end offers more than 50 optimization passes

• Codelet replay enables fast per-region optimization tuning

Figure: NAS SP ysolve codelet: 1000 schedules of random pass combinations explored, based on the O3 passes (cycles on Ivy Bridge 3.4 GHz; original, replay, and O3 runs marked).

CERE is 149× cheaper than running the full benchmark (≈27× cheaper when tuning codelets covering 75% of SP).

Fast Scalability Benchmarking with OpenMP Codelets

Figure: SP compute_rhs on Sandy Bridge: runtime (cycles) for 1 to 32 threads, real vs. predicted. Varying the thread number at replay in SP, and average results over the OpenMP NAS benchmarks [Popov et al. 2015]:

               Core2    Nehalem    Sandy Bridge    Ivy Bridge
Accuracy       98.2%    97.1%      92.6%           97.2%
Acceleration   ×25.2    ×27.4      ×23.7           ×23.7


Conclusion

• CERE breaks an application into faithful and retargetable codelets

• Piece-wise autotuning:
  • different architectures
  • compiler optimizations
  • scalability
  • other costly exploration analyses

• Limitations:
  • no support for codelets performing I/O (OS state is not captured)
  • cannot explore source-level optimizations
  • tied to LLVM

• Full accuracy reports on NAS and SPEC'06 FP are available at https://benchmark-subsetting.github.io/cereReports

Thanks for your attention!

https://benchmark-subsetting.github.io/cere

CERE is distributed under the LGPLv3.

Bibliography I

Akel, Chadi et al. (2013). "Is Source-code Isolation Viable for Performance Characterization?" In: 42nd International Conference on Parallel Processing Workshops. IEEE.

Gao, Xiaofeng et al. (2005). "Reducing overheads for acquiring dynamic memory traces". In: Workload Characterization Symposium, 2005. Proceedings of the IEEE International. IEEE, pp. 46–55.

Liao, Chunhua et al. (2010). "Effective source-to-source outlining to support whole program empirical optimization". In: Languages and Compilers for Parallel Computing. Springer, pp. 308–322.

Oliveira Castro, Pablo de et al. (2014). "Fine-grained Benchmark Subsetting for System Selection". In: Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization. ACM, p. 132.

Popov, Mihail et al. (2015). "PCERE: Fine-grained Parallel Benchmark Decomposition for Scalability Prediction". In: Proceedings of the 29th IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE.

Retargetable replay register state

• Issue: register state is not portable between architectures

• Solution: capture at a function call boundary
  • no shared state through registers except the function arguments
  • get the arguments directly through portable IR code

• Register-agnostic capture:
  • portable across Atom, Core 2, Haswell, Ivy Bridge, Nehalem, and Sandy Bridge
  • preliminary portability tests between x86 and 32-bit ARM

Retargetable replay outlining regions

Step 1: Outline the region to capture using the CodeExtractor pass.

The loop branch in the original function:

    %0 = load i32* %i, align 4
    %1 = load i32* %s.addr, align 4
    %cmp = icmp slt i32 %0, %1
    br i1 %cmp, label %for.body, label %for.exitStub

The outlined function:

    define internal void @outlined(i32* %i, i32* %s.addr, i32* %a.addr) {
      call void @start_capture(i32* %i, i32* %s.addr, i32* %a.addr)
      %0 = load i32* %i, align 4
      ...
      ret void
    }

The call in the original function:

    call void @outlined(i32* %i, i32* %s.addr, i32* %a.addr)

Step 2: Call start_capture just after the function call.

Step 3: At replay, re-inlining and variable cloning [Liao et al. 2010] steps ensure that the compilation context is close to the original.

Comparison to other Code Isolating tools

                  CERE       Code Isolator   Astex        Codelet Finder   SimPoint
Support
  Language        C(++),     Fortran         Fortran, C   Fortran, C(++)   assembly
                  Fortran
  Extraction      IR         source          source       source           assembly
  Indirections    yes        no              no           yes              yes
Replay
  Simulator       yes        yes             yes          yes              yes
  Hardware        yes        yes             yes          yes              no
Reduction
  Capture size    reduced    reduced         reduced      full             -
  Instances       yes        manual          manual       manual           yes
  Code sign.      yes        no              no           no               yes

Accuracy Summary NAS SER

Figure: Coverage and accurate replay per NAS A benchmark (lu, cg, mg, sp, ft, bt, is, ep) on Core2 and Haswell, as % of execution time. The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

Accuracy Summary SPEC FP 2006

Figure: Coverage and accurate replay over SPEC FP 2006 on Haswell (gamess, sphinx3, dealII, wrf, povray, calculix, soplex, leslie3d, tonto, gromacs, zeusmp, lbm, milc, cactusADM, namd, bwaves, GemsFDTD, specrand), as % of execution time. The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

• low coverage (sphinx3, wrf, povray): < 2000 cycles or I/O

• low matching (soplex, calculix, gamess): warmup "bugs"

CERE page capture dump size

Figure: Comparison between the page capture size and the full dump size (MB) on the NAS A benchmarks (bt, cg, ep, ft, is, lu, mg, sp). CERE's page-granularity dump only contains the pages accessed by a codelet; therefore it is much smaller than a full memory dump (Codelet Finder).

CERE page capture overhead

CERE page capture is coarser but faster

        CERE    ATOM 325   PIN 171   Dyninst 40
cg.A    1.94    98.82      222.67    896.86
ft.A    2.41    44.22      127.64    1054.70
lu.A    6.24    80.72      153.46    30.14
mg.A    0.89    107.69     168.61    989.53
sp.A    7.32    67.56      93.04     203.66

Slowdown of a full capture run against the original application run (taking into account the cost of writing the memory dumps and logs to disk, and of tracing the page accesses during the whole execution). We compare against the overhead of the memory tracing tools reported by [Gao et al. 2005].

CERE cache warmup

    for (i = 0; i < size; i++)
        a[i] += b[i];

During capture, CERE records a warmup page trace: a FIFO of the most recently unprotected pages (e.g. pages 21, 22, 23 of array a[] and 50, 51, 52 of array b[]). When a page leaves the FIFO it is reprotected, so a later touch faults again and extends the trace. At replay, the trace is replayed to warm up the cache.


Test architectures

                  Atom      Core 2    Nehalem    Sandy Bridge   Ivy Bridge   Haswell
CPU               D510      E7500     L5609      E3-1240        i7-3770      i7-4770
Frequency (GHz)   1.66      2.93      1.86       3.30           3.40         3.40
Cores             2         2         4          4              4            4
L1 cache (KB)     2×56      2×64      4×64       4×64           4×64         4×64
L2 cache          2×512 KB  3 MB      4×256 KB   4×256 KB       4×256 KB     4×256 KB
L3 cache (MB)     -         -         12         8              8            8
RAM (GB)          4         4         8          6              16           16

32-bit portability test: ARM1176JZF-S on a Raspberry Pi Model B+.

Clustering NR Codelets

(cut for K = 14)

Codelet       Computation pattern
toeplz_1      DP: 2 simultaneous reductions
rstrct_29     DP: MG Laplacian fine to coarse mesh transition
mprove_8      MP: dense matrix × vector product
toeplz_4      DP: vector multiply in asc/desc order
realft_4      DP: FFT butterfly computation
toeplz_3      DP: 3 simultaneous reductions
svbksb_3      SP: dense matrix × vector product
lop_13        DP: Laplacian finite difference, constant coefficients
toeplz_2      DP: vector multiply element wise in asc/desc order
four1_2       MP: first step FFT
tridag_2      DP: first order recurrence
tridag_1      DP: first order recurrence
ludcmp_4      SP: dot product over lower half square matrix
hqr_15        SP: addition on the diagonal elements of a matrix
relax2_26     DP: Red-Black sweeps, Laplacian operator
svdcmp_14     DP: vector divide element wise
svdcmp_13     DP: norm + vector divide
hqr_13        DP: sum of the absolute values of a matrix column
hqr_12_sq     SP: sum of a square matrix
jacobi_5      SP: sum of the upper half of a square matrix
hqr_12        SP: sum of the lower half of a square matrix
svdcmp_11     DP: multiplying a matrix row by a scalar
elmhes_11     DP: linear combination of matrix rows
mprove_9      DP: subtracting a vector from a vector
matadd_16     DP: sum of two square matrices element wise
svdcmp_6      DP: sum of the absolute values of a matrix row
elmhes_10     DP: linear combination of matrix columns
balanc_3      DP: vector multiply element wise

Clustering NR Codelets

(cut for K = 14; same dendrogram as the previous slide)

Similar computation patterns end up clustered together, for example:

Long latency operations:
  svdcmp_14   DP: vector divide element wise
  svdcmp_13   DP: norm + vector divide

Reduction sums:
  hqr_13      DP: sum of the absolute values of a matrix column
  hqr_12_sq   SP: sum of a square matrix
  jacobi_5    SP: sum of the upper half of a square matrix
  hqr_12      SP: sum of the lower half of a square matrix

Capturing Architecture Change

Figure: Four codelets (LU erhs.f:49, FT appft.f:45, BT rhs.f:266, SP rhs.f:275) captured on Nehalem (reference: 1.86 GHz, 12 MB LLC) and replayed on Core 2 (2.93 GHz, 3 MB) and Atom (1.66 GHz, 1 MB); runtimes (ms) are slower or faster than the reference depending on the cluster.

• Cluster A: triple-nested, high latency operations (div and exp)
• Cluster B: stencil on five planes (memory bound)

Same Cluster = Same Speedup

(Same experiment as the previous slide.) Codelets belonging to the same cluster exhibit the same speedup when the architecture changes.

Feature Selection

• Genetic Algorithm: train on Numerical Recipes + Atom + Sandy Bridge

• Validated on NAS + Core 2: the feature set is still among the best on NAS

Figure: Median error vs. number of clusters (0 to 25) on Atom, Core 2, and Sandy Bridge, comparing the GA-selected features against the worst, median, and best feature sets.

Reduction Factor Breakdown

               Total    Reduced invocations    Clustering
Atom           ×44.3    ×12                    ×3.7
Core 2         ×24.7    ×8.7                   ×2.8
Sandy Bridge   ×22.5    ×6.3                   ×3.6

Table: Benchmarking reduction factor breakdown with 18 representatives.

Profiling Features

SP codelets are profiled with Likwid (performance counters per codelet) and Maqao (static disassembly and analysis).

4 dynamic features:
  • FLOPS
  • L2 bandwidth
  • L3 miss rate
  • memory bandwidth

8 static features:
  • bytes stored per cycle
  • stalls
  • estimated IPC
  • number of DIV
  • number of SD
  • pressure in P1
  • ratio (ADD+SUB)/MUL
  • vectorization (FP, FP+INT, INT)

Clang OpenMP front-end

C code:

    void main() {
        #pragma omp parallel
        {
            int p = omp_get_thread_num();
            printf("%d", p);
        }
    }

The Clang OpenMP front end lowers the parallel region to a microtask, forked once per thread (simplified LLVM IR):

    define i32 @main() {
    entry:
      call void @__kmpc_fork_call(@omp_microtask)
    }

    define internal void @omp_microtask() {
    entry:
      %p = alloca i32, align 4
      %call = call i32 @omp_get_thread_num()
      store i32 %call, i32* %p, align 4
      %1 = load i32* %p, align 4
      call @printf(%1)
    }

Page 2: CERE: LLVM based Codelet Extractor and REplayer for

Motivation

I Finding best application parameters is a costly iterativeprocess

C E R E

I Codelet Extractor and REplayerI Break an application into standalone codeletsI Make costly analysis affordable

I Focus on single regions instead of whole applicationsI Run a single representative by clustering similar codelets

1 16

Codelet Extraction

I Extract codelets as standalone microbenchmarks

C or Fortran application

for (i = 0 i lt N i ++) for (j = 0 j lt N j ++) a[i][j] += b[i]c[j]

for (i = 0 i lt N i ++) for (j = 0 j lt N j ++) a[i][j] += b[i]c[j]

codelet wrapper

capture machine state state dump

codelets can be recompiledand run independently

Reg

ion

+IR

2 16

CERE Workflow

ApplicationsLLVM IR Region

outlining

RegionCapture

Fast performance

prediction

Retarget for different architecture different optimizations

Change number of threads affinityruntime parameters

Warmup+

Replay

Working set and cache

capture

Generatecodelets wrapper

Working setsmemory dump

CodeletReplay

Invocationamp

Codeletsubsetting

CERE can extract codelets from

I Hot Loops

I OpenMP non-nested parallel regions [Popov et al 2015]

3 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

4 16

Capturing codelets at Intermediate Representation

I Faithful behaves similarly to the original region

I Retargetable modify runtime and compilation parameters

same assembly 6= assembly [Akel et al 2013]hard to retarget (compiler ISA) easy to retargetcostly support various ISA costly support various languages

I LLVM Intermediate Representation is a good tradeoff

5 16

Faithful capture

I Required for semantically accurate replayI Register stateI Memory stateI OS state locks file descriptors sockets

I No support for OS state except for locks CERE captures fullyfrom userland no kernel modules required

I Required for performance accurate replayI Preserve code generationI Cache stateI NUMA ownershipI Other warmup state (eg branch predictor)

6 16

Faithful capture memory

Capture access at page granularity coarse but fast

regionto capture

protect static and currently allocatedprocess memory (procselfmaps)

intercept memory allocation functionswith LD_PRELOAD

1 allocate memory

2 protect memory and returnto user program

segmentationfault handler

1 dump accessed memory to disk

2 unlock accessed page and return to user program

a[i]++

memoryaccess

a = malloc(256)

memoryallocation

I Small dump footprint only touched pages are savedI Warmup cache replay trace of most recently touched pagesI NUMA detect first touch of each page

7 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

8 16

Selecting Representative Codelets

I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures

I Detect redundancies and keep only one representative

0e+00

2e+07

4e+07

6e+07

8e+07

0 1000 2000 3000invocation

Cyc

les

replay

Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives

9 16

Performance Signature Clustering

Maqao

Likwid

Static amp DynamicProfiling Vectorization ratio

FLOPSsCache Misses[ ] Clustering

f1f2f3

Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors

BT

SP

Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster

Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results

Model

Sandy Bridge

Atom

Core2

FullBenchmarks

Results

[Oliveira Castro et al 2014]

10 16

Codelet Based Architecture Selection

012 015

083 083

155 159

00

05

10

15

Atom Core 2 Sandy Bridge

Geo

met

ricm

ean

spee

dup

Real Speedup

Predicted Speedup

Refe

renc

e N

ehal

em

Figure Benchmarking NAS serial on three architectures

I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks

11 16

Autotuning LLVM middle-end optimizations

I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning

Id of LLVM middleminusend optimization passes combination

Cyc

les

(Ivy

brid

ge 3

4G

Hz)

60e+07

80e+07

10e+08

12e+08

0 200 400 600

originalreplayO3

Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes

CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16

Fast Scalability Benchmarking with OpenMP Codelets

1 2 4 8 16 32Threads

0

1

2

3

4

5

6

Runti

me c

ycl

es

1e8 SP compute rhs on Sandy Bridge

RealPredicted

Core2 Nehalem Sandy Bridge Ivy Bridge

Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237

Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]

13 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

14 16

Conclusion

I CERE breaks an application into faithful and retargetablecodelets

I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis

I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM

I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports

15 16

Thanks for your attention

C E R E

httpsbenchmark-subsettinggithubiocere

distributed under the LGPLv3

16 16

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 3: CERE: LLVM based Codelet Extractor and REplayer for

Codelet Extraction

- Extract codelets as standalone microbenchmarks

A region of a C or Fortran application:

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] += b[i] * c[j];

is extracted (region + IR) into a codelet wrapper that captures the machine state into a state dump; the codelets can then be recompiled and run independently.

2 / 16

CERE Workflow

Applications go through LLVM IR region outlining, then region capture (working set and cache capture, working-set memory dump). Replay (generate codelet wrappers, warmup + replay) gives fast performance prediction; replays can be retargeted to a different architecture or different optimizations, or changed in number of threads, affinity, and runtime parameters. Invocation and codelet subsetting reduce the number of runs.

CERE can extract codelets from:

- Hot loops
- OpenMP non-nested parallel regions [Popov et al. 2015]

3 / 16

Outline

- Extracting and Replaying Codelets
  - Faithful
  - Retargetable
- Applications
  - Architecture selection
  - Compiler flags tuning
  - Scalability prediction
- Demo
  - Capture and replay in NAS BT
  - Simple flag replay for NAS FT
- Conclusion

4 / 16

Capturing codelets at Intermediate Representation

- Faithful: behaves similarly to the original region
- Retargetable: modify runtime and compilation parameters

Assembly-level capture preserves the exact assembly, but is hard to retarget (tied to compiler and ISA) and costly: each ISA must be supported. Source-level capture yields different assembly [Akel et al. 2013], but is easy to retarget and costly in another way: each language must be supported.

- LLVM Intermediate Representation is a good tradeoff

5 / 16

Faithful capture

- Required for a semantically accurate replay:
  - Register state
  - Memory state
  - OS state: locks, file descriptors, sockets
- No support for OS state except for locks. CERE captures fully from userland: no kernel modules required.
- Required for a performance accurate replay:
  - Preserve code generation
  - Cache state
  - NUMA ownership
  - Other warmup state (e.g. branch predictor)

6 / 16

Faithful capture: memory

Capture accesses at page granularity: coarse but fast.

To capture a region, CERE protects the static and currently allocated process memory (/proc/self/maps) and intercepts memory allocation functions with LD_PRELOAD: on an allocation such as `a = malloc(256)`, it (1) allocates the memory, then (2) protects it and returns to the user program. On a memory access such as `a[i]++`, the segmentation fault handler (1) dumps the accessed memory to disk, then (2) unlocks the accessed page and returns to the user program.

- Small dump footprint: only touched pages are saved
- Warmup cache: replay the trace of most recently touched pages
- NUMA: detect the first touch of each page

7 / 16

Outline

- Extracting and Replaying Codelets
  - Faithful
  - Retargetable
- Applications
  - Architecture selection
  - Compiler flags tuning
  - Scalability prediction
- Demo
  - Capture and replay in NAS BT
  - Simple flag replay for NAS FT
- Conclusion

8 / 16

Selecting Representative Codelets

- Key idea: applications have redundancies
  - The same codelet is called multiple times
  - Codelets share similar performance signatures
- Detect redundancies and keep only one representative

Figure: SPEC tonto make_ft (shell2.F90:1133) execution trace: cycles per invocation over 3000 invocations, with the replayed representative overlaid. 90% of NAS codelets can be reduced to four or less representatives.

9 / 16

Performance Signature Clustering

Step A: Perform static and dynamic profiling (Maqao, Likwid) on a reference architecture to capture each codelet's feature vector (vectorization ratio, FLOPS/s, cache misses, ...).

Step B: Using the proximity between feature vectors, cluster similar codelets (e.g. from BT and SP) and select one representative per cluster.

Step C: CERE extracts the representatives as standalone codelets and replays them on the target architectures (Sandy Bridge, Atom, Core 2); a model extrapolates full benchmark results.

[Oliveira Castro et al. 2014]

10 / 16

Codelet Based Architecture Selection

Figure: Benchmarking NAS serial on three architectures (reference: Nehalem). Geometric mean speedup, real vs. predicted: Atom 0.12 vs. 0.15, Core 2 0.83 vs. 0.83, Sandy Bridge 1.55 vs. 1.59.

- Real speedup: measured by benchmarking the original applications
- Predicted speedup: predicted with the representative codelets
- CERE is 31× cheaper than running the full benchmarks

11 / 16

Autotuning LLVM middle-end optimizations

- The LLVM middle-end offers more than 50 optimization passes
- Codelet replay enables fast per-region optimization tuning

Figure: NAS SP ysolve codelet, cycles (Ivy Bridge 3.4 GHz) for 1000 schedules of random pass combinations explored, based on the O3 passes; original, replay, and O3 levels shown.

CERE is 149× cheaper than running the full benchmark (27× cheaper when tuning codelets covering 75% of SP).

12 / 16

Fast Scalability Benchmarking with OpenMP Codelets

Figure: SP compute_rhs on Sandy Bridge: runtime in cycles (×1e8) for 1 to 32 threads, real vs. predicted. Varying the thread number at replay in SP, and average results over the OpenMP NAS benchmarks [Popov et al. 2015]:

               Core 2   Nehalem   Sandy Bridge   Ivy Bridge
Accuracy        98.2%     97.1%          92.6%        97.2%
Acceleration    ×25.2     ×27.4          ×23.7        ×23.7

13 / 16

Outline

- Extracting and Replaying Codelets
  - Faithful
  - Retargetable
- Applications
  - Architecture selection
  - Compiler flags tuning
  - Scalability prediction
- Demo
  - Capture and replay in NAS BT
  - Simple flag replay for NAS FT
- Conclusion

14 / 16

Conclusion

- CERE breaks an application into faithful and retargetable codelets
- Piece-wise autotuning:
  - Different architectures
  - Compiler optimizations
  - Scalability
  - Other exploration of costly analyses
- Limitations:
  - No support for codelets performing I/O (OS state not captured)
  - Cannot explore source-level optimizations
  - Tied to LLVM
- Full accuracy reports on NAS and SPEC'06 FP available at benchmark-subsetting.github.io/cere (Reports)

15 / 16

Thanks for your attention

C E R E

https://benchmark-subsetting.github.io/cere

Distributed under the LGPLv3.

16 / 16

Bibliography I

Akel, Chadi et al. (2013). "Is Source-code Isolation Viable for Performance Characterization?" In: 42nd International Conference on Parallel Processing Workshops. IEEE.

Gao, Xiaofeng et al. (2005). "Reducing overheads for acquiring dynamic memory traces". In: Workload Characterization Symposium, 2005. Proceedings of the IEEE International. IEEE, pp. 46–55.

Liao, Chunhua et al. (2010). "Effective source-to-source outlining to support whole program empirical optimization". In: Languages and Compilers for Parallel Computing. Springer, pp. 308–322.

Oliveira Castro, Pablo de et al. (2014). "Fine-grained Benchmark Subsetting for System Selection". In: Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization. ACM, p. 132.

Popov, Mihail et al. (2015). "PCERE: Fine-grained Parallel Benchmark Decomposition for Scalability Prediction". In: Proceedings of the 29th IEEE International Parallel and Distributed Processing Symposium, IPDPS. IEEE.

17 / 16

Retargetable replay: register state

- Issue: register state is non-portable between architectures
- Solution: capture at a function call boundary
  - No shared state through registers except function arguments
  - Get the arguments directly through portable IR code
- Register-agnostic capture
- Portable across Atom, Core 2, Haswell, Ivy Bridge, Nehalem, Sandy Bridge
- Preliminary portability tests between x86 and 32-bit ARM

18 / 16

Retargetable replay: outlining regions

Step 1: Outline the region to capture using the CodeExtractor pass. The original loop header:

    %0 = load i32* %i, align 4
    %1 = load i32* %s.addr, align 4
    %cmp = icmp slt i32 %0, %1
    br i1 %cmp,              ; loop branch here
       label %for.body,
       label %for.exitStub

becomes an outlined function called from the original code:

    define internal void @outlined(
        i32* %i, i32* %s.addr, i32* %a.addr) {
      call void @start_capture(i32* %i,
          i32* %s.addr, i32* %a.addr)
      %0 = load i32* %i, align 4
      ...
      ret void
    }

    ; in the original:
    call void @outlined(i32* %i, i32* %s.addr, i32* %a.addr)

Step 2: Call start_capture just after the function call.
Step 3: At replay, re-inlining and variable cloning [Liao et al. 2010] steps ensure that the compilation context is close to the original.

19 / 16

Comparison to other Code Isolating tools

                   CERE      Code      Astex     Codelet   SimPoint
                             Isolator            Finder
Support
  Language         C(++),    Fortran   C,        C(++),    assembly
                   Fortran             Fortran   Fortran
  Extraction       IR        source    source    source    assembly
  Indirections     yes       no        no        yes       yes
Replay
  Simulator        yes       yes       yes       yes       yes
  Hardware         yes       yes       yes       yes       no
Reduction
  Capture size     reduced   reduced   reduced   full      -
  Instances        yes       manual    manual    manual    yes
  Code signature   yes       no        no        no        yes

20 / 16

Accuracy Summary: NAS SER

Figure: NAS A benchmarks (lu, cg, mg, sp, ft, bt, is, ep) on Core 2 and Haswell: codelet coverage and accurate replay as percentage of execution time. The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

21 / 16

Accuracy Summary SPEC FP 2006

Figure: SPEC FP 2006 on Haswell (gamess, sphinx3, dealII, wrf, povray, calculix, soplex, leslie3d, tonto, gromacs, zeusmp, lbm, milc, cactusADM, namd, bwaves, gemsfdtd, specrand): codelet coverage and accurate replay as percentage of execution time. The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

- Low coverage (sphinx3, wrf, povray): regions < 2000 cycles, or I/O
- Low matching (soplex, calculix, gamess): warmup "bugs"

22 / 16

CERE page capture dump size

Figure: Comparison between the page capture and the full dump size on NAS A benchmarks (bt, cg, ep, ft, is, lu, mg, sp): full dump (Codelet Finder) vs. page-granularity dump (CERE), in MB. CERE's page-granularity dump only contains the pages accessed by a codelet; therefore it is much smaller than a full memory dump.

23 / 16

CERE page capture overhead

CERE's page capture is coarser but faster:

         CERE   ATOM 3.25   PIN 1.71   Dyninst 4.0
cg.A     1.94       98.82     222.67        896.86
ft.A     2.41       44.22     127.64       1054.70
lu.A     6.24       80.72     153.46         30.14
mg.A     0.89      107.69     168.61        989.53
sp.A     7.32       67.56      93.04        203.66

Table: slowdown of a full capture run against the original application run (taking into account the cost of writing the memory dumps and logs to disk, and of tracing the page accesses during the whole execution). The other columns give the overhead of memory tracing tools as reported by [Gao et al. 2005].

24 / 16

CERE cache warmup

    for (i = 0; i < size; i++)
        a[i] += b[i];

As the loop touches the pages of a[] (e.g. pages 17–23) and b[] (e.g. pages 46–52), their addresses are appended to the warmup page trace. A FIFO keeps the most recently unprotected pages; when a page falls out of the FIFO, it is reprotected (e.g. reprotect page 20). At replay, the trace is replayed to warm up the cache.

25 / 16

Test architectures

                 Atom     Core 2   Nehalem   Sandy Bridge   Ivy Bridge   Haswell
CPU              D510     E7500    L5609     E3-1240        i7-3770      i7-4770
Frequency (GHz)  1.66     2.93     1.86      3.30           3.40         3.40
Cores            2        2        4         4              4            4
L1 cache (KB)    2×56     2×64     4×64      4×64           4×64         4×64
L2 cache         2×512 KB 3 MB     4×256 KB  4×256 KB       4×256 KB     4×256 KB
L3 cache (MB)    -        -        12        8              8            8
RAM (GB)         4        4        8         6              16           16

32-bit portability test: ARM1176JZF-S on a Raspberry Pi Model B+.

26 / 16

Clustering NR Codelets

cut for K = 14

Codelet      Computation pattern
toeplz_1     DP: 2 simultaneous reductions
rstrct_29    DP: MG Laplacian fine to coarse mesh transition
mprove_8     MP: Dense matrix × vector product
toeplz_4     DP: Vector multiply in asc/desc order
realft_4     DP: FFT butterfly computation
toeplz_3     DP: 3 simultaneous reductions
svbksb_3     SP: Dense matrix × vector product
lop_13       DP: Laplacian finite difference, constant coefficients
toeplz_2     DP: Vector multiply element-wise in asc/desc order
four1_2      MP: First step FFT
tridag_2     DP: First order recurrence
tridag_1     DP: First order recurrence
ludcmp_4     SP: Dot product over lower half square matrix
hqr_15       SP: Addition on the diagonal elements of a matrix
relax2_26    DP: Red-Black sweeps Laplacian operator
svdcmp_14    DP: Vector divide element-wise
svdcmp_13    DP: Norm + vector divide
hqr_13       DP: Sum of the absolute values of a matrix column
hqr_12_sq    SP: Sum of a square matrix
jacobi_5     SP: Sum of the upper half of a square matrix
hqr_12       SP: Sum of the lower half of a square matrix
svdcmp_11    DP: Multiplying a matrix row by a scalar
elmhes_11    DP: Linear combination of matrix rows
mprove_9     DP: Subtracting a vector from a vector
matadd_16    DP: Sum of two square matrices element-wise
svdcmp_6     DP: Sum of the absolute values of a matrix row
elmhes_10    DP: Linear combination of matrix columns
balanc_3     DP: Vector multiply element-wise

27 / 16

Clustering NR Codelets

cut for K = 14

Zooming on two clusters of the dendrogram above:

- Long latency operations: DP vector divide element-wise, DP norm + vector divide
- Reduction sums: DP sum of the absolute values of a matrix column, SP sum of a square matrix, SP sum of the upper half of a square matrix, SP sum of the lower half of a square matrix

Codelets with similar computation patterns end up in the same cluster.

27 / 16

Capturing Architecture Change

Codelets: LU erhs.f:49, FT appft.f:45, BT rhs.f:266, SP rhs.f:275.

- Cluster A: triple-nested, high latency operations (div and exp)
- Cluster B: stencil on five planes (memory bound)

Figure: per-invocation execution time (ms) of each codelet on the reference, Nehalem (1.86 GHz, 12 MB LLC), and on two targets: Core 2 (2.93 GHz, 3 MB) and Atom (1.66 GHz, 1 MB). Moving to a target makes some codelets slower and others faster.

28 / 16

Same Cluster = Same Speedup

Same codelets and clusters as the previous slide: LU erhs.f:49, FT appft.f:45, BT rhs.f:266, SP rhs.f:275; Cluster A, triple-nested high latency operations (div and exp); Cluster B, stencil on five planes (memory bound).

Figure: per-invocation execution time (ms) on the reference, Nehalem (1.86 GHz, 12 MB LLC), and on the targets Core 2 (2.93 GHz, 3 MB) and Atom (1.66 GHz, 1 MB): codelets in the same cluster slow down or speed up by similar factors when the architecture changes.

29 / 16

Feature Selection

- Genetic Algorithm: train on Numerical Recipes + Atom + Sandy Bridge
- Validated on NAS + Core 2
- The feature set is still among the best on NAS

Figure: median error (%) vs. number of clusters (0 to 25) on Atom, Core 2, and Sandy Bridge; the GA-selected features are compared against the worst, median, and best feature sets.

30 / 16

Reduction Factor Breakdown

               Total    Reduced invocations   Clustering
Atom           ×44.3    ×12                   ×3.7
Core 2         ×24.7    ×8.7                  ×2.8
Sandy Bridge   ×22.5    ×6.3                  ×3.6

Table: Benchmarking reduction factor breakdown with 18 representatives.

31 / 16

Profiling Features

SP codelets are profiled with Likwid (performance counters per codelet) and Maqao (static disassembly and analysis).

- 4 dynamic features: FLOPS, L2 bandwidth, L3 miss rate, memory bandwidth
- 8 static features: bytes stored per cycle, stalls, estimated IPC, number of DIV, number of SD, pressure in P1, ratio (ADD+SUB)/MUL, vectorization (FP, FP+INT, INT)

32 / 16

Clang OpenMP front-end

C code:

    void main() {
        #pragma omp parallel
        {
            int p = omp_get_thread_num();
            printf("%d", p);
        }
    }

The Clang OpenMP front end lowers the parallel region to a fork call, following the thread execution model (simplified LLVM IR):

    define i32 @main() {
    entry:
      call @__kmpc_fork_call(@omp_microtask, ...)
    }

    define internal void @omp_microtask() {
    entry:
      %p = alloca i32, align 4
      %call = call i32 @omp_get_thread_num()
      store i32 %call, i32* %p, align 4
      %1 = load i32* %p, align 4
      call @printf(%1)
    }

33 / 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 4: CERE: LLVM based Codelet Extractor and REplayer for

CERE Workflow

ApplicationsLLVM IR Region

outlining

RegionCapture

Fast performance

prediction

Retarget for different architecture different optimizations

Change number of threads affinityruntime parameters

Warmup+

Replay

Working set and cache

capture

Generatecodelets wrapper

Working setsmemory dump

CodeletReplay

Invocationamp

Codeletsubsetting

CERE can extract codelets from

I Hot Loops

I OpenMP non-nested parallel regions [Popov et al 2015]

3 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

4 16

Capturing codelets at Intermediate Representation

I Faithful behaves similarly to the original region

I Retargetable modify runtime and compilation parameters

same assembly 6= assembly [Akel et al 2013]hard to retarget (compiler ISA) easy to retargetcostly support various ISA costly support various languages

I LLVM Intermediate Representation is a good tradeoff

5 16

Faithful capture

I Required for semantically accurate replayI Register stateI Memory stateI OS state locks file descriptors sockets

I No support for OS state except for locks CERE captures fullyfrom userland no kernel modules required

I Required for performance accurate replayI Preserve code generationI Cache stateI NUMA ownershipI Other warmup state (eg branch predictor)

6 16

Faithful capture memory

Capture access at page granularity coarse but fast

regionto capture

protect static and currently allocatedprocess memory (procselfmaps)

intercept memory allocation functionswith LD_PRELOAD

1 allocate memory

2 protect memory and returnto user program

segmentationfault handler

1 dump accessed memory to disk

2 unlock accessed page and return to user program

a[i]++

memoryaccess

a = malloc(256)

memoryallocation

I Small dump footprint only touched pages are savedI Warmup cache replay trace of most recently touched pagesI NUMA detect first touch of each page

7 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

8 16

Selecting Representative Codelets

I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures

I Detect redundancies and keep only one representative

0e+00

2e+07

4e+07

6e+07

8e+07

0 1000 2000 3000invocation

Cyc

les

replay

Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives

9 16

Performance Signature Clustering

Maqao

Likwid

Static amp DynamicProfiling Vectorization ratio

FLOPSsCache Misses[ ] Clustering

f1f2f3

Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors

BT

SP

Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster

Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results

Model

Sandy Bridge

Atom

Core2

FullBenchmarks

Results

[Oliveira Castro et al 2014]

10 16

Codelet Based Architecture Selection

012 015

083 083

155 159

00

05

10

15

Atom Core 2 Sandy Bridge

Geo

met

ricm

ean

spee

dup

Real Speedup

Predicted Speedup

Refe

renc

e N

ehal

em

Figure Benchmarking NAS serial on three architectures

I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks

11 16

Autotuning LLVM middle-end optimizations

I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning

Id of LLVM middleminusend optimization passes combination

Cyc

les

(Ivy

brid

ge 3

4G

Hz)

60e+07

80e+07

10e+08

12e+08

0 200 400 600

originalreplayO3

Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes

CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16

Fast Scalability Benchmarking with OpenMP Codelets

1 2 4 8 16 32Threads

0

1

2

3

4

5

6

Runti

me c

ycl

es

1e8 SP compute rhs on Sandy Bridge

RealPredicted

Core2 Nehalem Sandy Bridge Ivy Bridge

Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237

Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]

13 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

14 16

Conclusion

I CERE breaks an application into faithful and retargetablecodelets

I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis

I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM

I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports

15 16

Thanks for your attention

C E R E

httpsbenchmark-subsettinggithubiocere

distributed under the LGPLv3

16 16

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 5: CERE: LLVM based Codelet Extractor and REplayer for

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

4 16

Capturing codelets at Intermediate Representation

I Faithful behaves similarly to the original region

I Retargetable modify runtime and compilation parameters

same assembly 6= assembly [Akel et al 2013]hard to retarget (compiler ISA) easy to retargetcostly support various ISA costly support various languages

I LLVM Intermediate Representation is a good tradeoff

5 16

Faithful capture

I Required for semantically accurate replayI Register stateI Memory stateI OS state locks file descriptors sockets

I No support for OS state except for locks CERE captures fullyfrom userland no kernel modules required

I Required for performance accurate replayI Preserve code generationI Cache stateI NUMA ownershipI Other warmup state (eg branch predictor)

6 16

Faithful capture memory

Capture access at page granularity coarse but fast

regionto capture

protect static and currently allocatedprocess memory (procselfmaps)

intercept memory allocation functionswith LD_PRELOAD

1 allocate memory

2 protect memory and returnto user program

segmentationfault handler

1 dump accessed memory to disk

2 unlock accessed page and return to user program

a[i]++

memoryaccess

a = malloc(256)

memoryallocation

I Small dump footprint only touched pages are savedI Warmup cache replay trace of most recently touched pagesI NUMA detect first touch of each page

7 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

8 16

Selecting Representative Codelets

- Key idea: applications have redundancies
  - The same codelet is called multiple times
  - Codelets share similar performance signatures
- Detect redundancies and keep only one representative

Figure: SPEC tonto make_ft.shell2.F90:1133 execution trace (cycles per invocation, with replayed invocations marked). 90% of NAS codelets can be reduced to four or fewer representatives.

Performance Signature Clustering

Static & dynamic profiling (Maqao, Likwid) produces a feature vector per codelet: vectorization ratio, FLOPS/s, cache misses, ...

Step A: Perform static and dynamic analysis on a reference architecture to capture codelets' feature vectors.
Step B: Using the proximity between feature vectors, cluster similar codelets (e.g. across BT and SP) and select one representative per cluster.
Step C: CERE extracts the representatives as standalone codelets. A model extrapolates full benchmark results on the target machines (Sandy Bridge, Atom, Core 2). [Oliveira Castro et al. 2014]

Codelet Based Architecture Selection

Figure: Benchmarking NAS serial on three architectures, geometric mean speedup vs. the Nehalem reference: Atom 0.12 (real) / 0.15 (predicted), Core 2 0.83 / 0.83, Sandy Bridge 1.55 / 1.59.

- real speedup: measured by benchmarking the original applications
- predicted speedup: predicted with representative codelets
- CERE is 31× cheaper than running the full benchmarks

Autotuning LLVM middle-end optimizations

- The LLVM middle-end offers more than 50 optimization passes
- Codelet replay enables fast per-region optimization tuning

Figure: NAS SP ysolve codelet, cycles (Ivy Bridge 3.4 GHz) against the id of the LLVM middle-end optimization pass combination: 1000 schedules of random pass combinations explored, based on the O3 passes, with the original, replay, and O3 levels marked.

CERE is 149× cheaper than running the full benchmark (27× cheaper when tuning codelets covering 75% of SP).

Fast Scalability Benchmarking with OpenMP Codelets

Figure: SP compute_rhs on Sandy Bridge, runtime (cycles, ×1e8) for 1, 2, 4, 8, 16, and 32 threads, real vs. predicted.

                Core 2   Nehalem   Sandy Bridge   Ivy Bridge
  Accuracy       98.2%    97.1%     92.6%          97.2%
  Acceleration   ×25.2    ×27.4     ×23.7          ×23.7

Figure: Varying thread number at replay in SP, and average results over the OMP NAS benchmarks [Popov et al. 2015].


Conclusion

- CERE breaks an application into faithful and retargetable codelets
- Piece-wise autotuning:
  - Different architectures
  - Compiler optimizations
  - Scalability
  - Other exploration-costly analyses
- Limitations:
  - No support for codelets performing I/O (OS state is not captured)
  - Cannot explore source-level optimizations
  - Tied to LLVM
- Full accuracy reports on NAS and SPEC'06 FP are available at https://benchmark-subsetting.github.io/cere (Reports)

Thanks for your attention

C E R E

https://benchmark-subsetting.github.io/cere

Distributed under the LGPLv3.

Bibliography I

Akel, Chadi et al. (2013). "Is Source-code Isolation Viable for Performance Characterization?" In: 42nd International Conference on Parallel Processing Workshops. IEEE.

Gao, Xiaofeng et al. (2005). "Reducing overheads for acquiring dynamic memory traces". In: Workload Characterization Symposium, 2005. Proceedings of the IEEE International. IEEE, pp. 46–55.

Liao, Chunhua et al. (2010). "Effective source-to-source outlining to support whole program empirical optimization". In: Languages and Compilers for Parallel Computing. Springer, pp. 308–322.

Oliveira Castro, Pablo de et al. (2014). "Fine-grained Benchmark Subsetting for System Selection". In: Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization. ACM, p. 132.

Popov, Mihail et al. (2015). "PCERE: Fine-grained Parallel Benchmark Decomposition for Scalability Prediction". In: Proceedings of the 29th IEEE International Parallel and Distributed Processing Symposium, IPDPS. IEEE.

Retargetable replay register state

- Issue: register state is non-portable between architectures
- Solution: capture at a function call boundary
  - No shared state through registers except function arguments
  - Arguments are obtained directly through portable IR code
- Register-agnostic capture
- Portable across Atom, Core 2, Haswell, Ivy Bridge, Nehalem, and Sandy Bridge
- Preliminary portability tests between x86 and 32-bit ARM

Retargetable replay outlining regions

Step 1: Outline the region to capture using the CodeExtractor pass.

original:
    %0 = load i32* %i, align 4
    %1 = load i32* %s.addr, align 4
    %cmp = icmp slt i32 %0, %1
    br i1 %cmp,              ; loop branch here
       label %for.body,
       label %for.exitStub

outlined:
    define internal void @outlined(i32* %i, i32* %s.addr, i32* %a.addr) {
      call void @start_capture(i32* %i, i32* %s.addr, i32* %a.addr)
      %0 = load i32* %i, align 4
      ...
      ret void
    }

original call site:
    call void @outlined(i32* %i, i32* %s.addr, i32* %a.addr)

Step 2: Call start_capture just after the function call.
Step 3: At replay, re-inlining and variable cloning [Liao et al. 2010] steps ensure that the compilation context is close to the original.

Comparison to other Code Isolating tools

                 CERE             Code Isolator   Astex        Codelet Finder   SimPoint
Support
  Language       C(++), Fortran   Fortran         C, Fortran   C(++), Fortran   assembly
  Extraction     IR               source          source       source           assembly
  Indirections   yes              no              no           yes              yes
Replay
  Simulator      yes              yes             yes          yes              yes
  Hardware       yes              yes             yes          yes              no
Reduction
  Capture size   reduced          reduced         reduced      full             -
  Instances      yes              manual          manual       manual           yes
  Code sign.     yes              no              no           no               yes

Accuracy Summary NAS SER

Figure: NAS A serial benchmarks (lu, cg, mg, sp, ft, bt, is, ep) on Core 2 and Haswell, % of exec time per benchmark. The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

Accuracy Summary SPEC FP 2006

Figure: SPEC FP 2006 on Haswell (gamess, sphinx3, dealII, wrf, povray, calculix, soplex, leslie3d, tonto, gromacs, zeusmp, lbm, milc, cactusADM, namd, bwaves, gemsfdtd, specrand), % of exec time per benchmark. The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

- low coverage (sphinx3, wrf, povray): regions shorter than 2000 cycles or performing I/O
- low matching (soplex, calculix, gamess): warmup "bugs"

CERE page capture dump size

Figure: Comparison between the page capture and full dump sizes (MB) on the NAS A benchmarks (bt, cg, ep, ft, is, lu, mg, sp): full dump (Codelet Finder) vs. page granularity dump (CERE). CERE's page granularity dump only contains the pages accessed by a codelet; it is therefore much smaller than a full memory dump.

CERE page capture overhead

CERE page capture is coarser but faster

         CERE   ATOM 325   PIN 171   Dyninst 40
  cg.A   194    9882       22267     89686
  ft.A   241    4422       12764     105470
  lu.A   624    8072       15346     3014
  mg.A   89     10769      16861     98953
  sp.A   732    6756       9304      20366

Slowdown of a full capture run against the original application run (takes into account the cost of writing the memory dumps and logs to disk, and of tracing the page accesses during the whole execution). We compare to the overhead of memory tracing tools as reported by [Gao et al. 2005].

CERE cache warmup

for (i = 0; i < size; i++) a[i] += b[i];

Figure: As the loop above touches pages of arrays a[] (21, 22, 23, ...) and b[] (50, 51, 52, ...), each newly unprotected page enters a FIFO of the most recently unprotected pages, while pages leaving the FIFO (e.g. page 20) are reprotected. This warmup page trace is replayed before the codelet runs to warm the cache.

Test architectures

                   Atom     Core 2   Nehalem    Sandy Bridge   Ivy Bridge   Haswell
  CPU              D510     E7500    L5609      E3-1240        i7-3770      i7-4770
  Frequency (GHz)  1.66     2.93     1.86       3.30           3.40         3.40
  Cores            2        2        4          4              4            4
  L1 cache (KB)    2×56     2×64     4×64       4×64           4×64         4×64
  L2 cache         2×512 KB 3 MB     4×256 KB   4×256 KB       4×256 KB     4×256 KB
  L3 cache (MB)    -        -        12         8              8            8
  RAM (GB)         4        4        8          6              16           16

32-bit portability test: ARM1176JZF-S on a Raspberry Pi Model B+.

Clustering NR Codelets

Figure: Hierarchical clustering of Numerical Recipes codelets, cut for K = 14. The codelets (toeplz_1, rstrct_29, mprove_8, toeplz_4, realft_4, toeplz_3, svbksb_3, lop_13, toeplz_2, four1_2, tridag_2, tridag_1, ludcmp_4, hqr_15, relax2_26, svdcmp_14, svdcmp_13, hqr_13, hqr_12_sq, jacobi_5, hqr_12, svdcmp_11, elmhes_11, mprove_9, matadd_16, svdcmp_6, elmhes_10, balanc_3) are annotated with their computation pattern, e.g. DP simultaneous reductions, MG Laplacian fine-to-coarse mesh transition, dense matrix × vector products, FFT butterfly computation, first order recurrences, element-wise vector operations, and row/column sums and linear combinations of matrices.

Clustering NR Codelets

Figure: Same clustering (cut for K = 14), highlighting that clusters group similar computation patterns: reduction sums (sums over full, upper-half, or lower-half square matrices; sums of absolute values of a matrix row or column) and long latency operations (element-wise vector divide, norm + vector divide).

Capturing Architecture Change

Figure: Replay times (ms) of four codelets (LU/erhs.f:49, FT/appft.f:45, BT/rhs.f:266, SP/rhs.f:275) captured on the reference Nehalem (1.86 GHz, 12 MB LLC) and replayed on Core 2 (2.93 GHz, 3 MB) and Atom (1.66 GHz, 1 MB). Cluster A: triple-nested, high latency operations (div and exp). Cluster B: stencil on five planes (memory bound).

Same Cluster = Same Speedup

Figure: Same data as the previous figure; codelets falling in the same cluster exhibit the same relative speedup (or slowdown) when moving from the reference to each target architecture.

Feature Selection

- Genetic Algorithm: train on Numerical Recipes + Atom + Sandy Bridge
- Validated on NAS + Core 2
- The feature set is still among the best on NAS

Figure: Median error vs. number of clusters (0 to 25) on Atom, Core 2, and Sandy Bridge, comparing the worst, median, and best feature sets against the GA-selected features.

Reduction Factor Breakdown

  Reduction      Total    Reduced invocations   Clustering
  Atom           ×44.3    ×12                   ×3.7
  Core 2         ×24.7    ×8.7                  ×2.8
  Sandy Bridge   ×22.5    ×6.3                  ×3.6

Table: Benchmarking reduction factor breakdown with 18 representatives.

Profiling Features

SP codelets are profiled with Likwid (performance counters per codelet) and Maqao (static disassembly and analysis).

4 dynamic features: FLOPS, L2 bandwidth, L3 miss rate, memory bandwidth.

8 static features: bytes stored per cycle, stalls, estimated IPC, number of DIV, number of SD, pressure in P1, ratio (ADD+SUB)/MUL, vectorization (FP, FP+INT, INT).

Clang OpenMP front-end

C code:

    void main() {
      #pragma omp parallel
      {
        int p = omp_get_thread_num();
        printf("%d", p);
      }
    }

The Clang OpenMP front end lowers the parallel region to a microtask following the thread execution model (LLVM simplified IR):

    define i32 @main() {
    entry:
      call @__kmpc_fork_call(@omp_microtask, ...)
    }

    define internal void @omp_microtask(...) {
    entry:
      %p = alloca i32, align 4
      %call = call i32 @omp_get_thread_num()
      store i32 %call, i32* %p, align 4
      %1 = load i32* %p, align 4
      call @printf(%1)
    }

Page 6: CERE: LLVM based Codelet Extractor and REplayer for

Capturing codelets at Intermediate Representation

I Faithful behaves similarly to the original region

I Retargetable modify runtime and compilation parameters

same assembly 6= assembly [Akel et al 2013]hard to retarget (compiler ISA) easy to retargetcostly support various ISA costly support various languages

I LLVM Intermediate Representation is a good tradeoff

5 16

Faithful capture

I Required for semantically accurate replayI Register stateI Memory stateI OS state locks file descriptors sockets

I No support for OS state except for locks CERE captures fullyfrom userland no kernel modules required

I Required for performance accurate replayI Preserve code generationI Cache stateI NUMA ownershipI Other warmup state (eg branch predictor)

6 16

Faithful capture memory

Capture access at page granularity coarse but fast

regionto capture

protect static and currently allocatedprocess memory (procselfmaps)

intercept memory allocation functionswith LD_PRELOAD

1 allocate memory

2 protect memory and returnto user program

segmentationfault handler

1 dump accessed memory to disk

2 unlock accessed page and return to user program

a[i]++

memoryaccess

a = malloc(256)

memoryallocation

I Small dump footprint only touched pages are savedI Warmup cache replay trace of most recently touched pagesI NUMA detect first touch of each page

7 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

8 16

Selecting Representative Codelets

I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures

I Detect redundancies and keep only one representative

0e+00

2e+07

4e+07

6e+07

8e+07

0 1000 2000 3000invocation

Cyc

les

replay

Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives

9 16

Performance Signature Clustering

Maqao

Likwid

Static amp DynamicProfiling Vectorization ratio

FLOPSsCache Misses[ ] Clustering

f1f2f3

Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors

BT

SP

Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster

Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results

Model

Sandy Bridge

Atom

Core2

FullBenchmarks

Results

[Oliveira Castro et al 2014]

10 16

Codelet Based Architecture Selection

012 015

083 083

155 159

00

05

10

15

Atom Core 2 Sandy Bridge

Geo

met

ricm

ean

spee

dup

Real Speedup

Predicted Speedup

Refe

renc

e N

ehal

em

Figure Benchmarking NAS serial on three architectures

I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks

11 16

Autotuning LLVM middle-end optimizations

I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning

Id of LLVM middleminusend optimization passes combination

Cyc

les

(Ivy

brid

ge 3

4G

Hz)

60e+07

80e+07

10e+08

12e+08

0 200 400 600

originalreplayO3

Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes

CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16

Fast Scalability Benchmarking with OpenMP Codelets

1 2 4 8 16 32Threads

0

1

2

3

4

5

6

Runti

me c

ycl

es

1e8 SP compute rhs on Sandy Bridge

RealPredicted

Core2 Nehalem Sandy Bridge Ivy Bridge

Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237

Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]

13 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

14 16

Conclusion

I CERE breaks an application into faithful and retargetablecodelets

I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis

I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM

I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports

15 16

Thanks for your attention

C E R E

httpsbenchmark-subsettinggithubiocere

distributed under the LGPLv3

16 16

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 7: CERE: LLVM based Codelet Extractor and REplayer for

Faithful capture

I Required for semantically accurate replayI Register stateI Memory stateI OS state locks file descriptors sockets

I No support for OS state except for locks CERE captures fullyfrom userland no kernel modules required

I Required for performance accurate replayI Preserve code generationI Cache stateI NUMA ownershipI Other warmup state (eg branch predictor)

6 16

Faithful capture memory

Capture access at page granularity coarse but fast

regionto capture

protect static and currently allocatedprocess memory (procselfmaps)

intercept memory allocation functionswith LD_PRELOAD

1 allocate memory

2 protect memory and returnto user program

segmentationfault handler

1 dump accessed memory to disk

2 unlock accessed page and return to user program

a[i]++

memoryaccess

a = malloc(256)

memoryallocation

I Small dump footprint only touched pages are savedI Warmup cache replay trace of most recently touched pagesI NUMA detect first touch of each page

7 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

8 16

Selecting Representative Codelets

I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures

I Detect redundancies and keep only one representative

0e+00

2e+07

4e+07

6e+07

8e+07

0 1000 2000 3000invocation

Cyc

les

replay

Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives

9 16

Performance Signature Clustering

Maqao

Likwid

Static amp DynamicProfiling Vectorization ratio

FLOPSsCache Misses[ ] Clustering

f1f2f3

Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors

BT

SP

Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster

Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results

Model

Sandy Bridge

Atom

Core2

FullBenchmarks

Results

[Oliveira Castro et al 2014]

10 16

Codelet Based Architecture Selection

012 015

083 083

155 159

00

05

10

15

Atom Core 2 Sandy Bridge

Geo

met

ricm

ean

spee

dup

Real Speedup

Predicted Speedup

Refe

renc

e N

ehal

em

Figure Benchmarking NAS serial on three architectures

I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks

11 16

Autotuning LLVM middle-end optimizations

I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning

Id of LLVM middleminusend optimization passes combination

Cyc

les

(Ivy

brid

ge 3

4G

Hz)

60e+07

80e+07

10e+08

12e+08

0 200 400 600

originalreplayO3

Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes

CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16

Fast Scalability Benchmarking with OpenMP Codelets

1 2 4 8 16 32Threads

0

1

2

3

4

5

6

Runti

me c

ycl

es

1e8 SP compute rhs on Sandy Bridge

RealPredicted

Core2 Nehalem Sandy Bridge Ivy Bridge

Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237

Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]

13 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

14 16

Conclusion

I CERE breaks an application into faithful and retargetablecodelets

I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis

I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM

I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports

15 16

Thanks for your attention

C E R E

httpsbenchmark-subsettinggithubiocere

distributed under the LGPLv3

16 16

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution: capture at a function call boundary
  I No shared state through registers except function arguments
  I Get arguments directly through portable IR code

I Register-agnostic capture

I Portable across Atom, Core 2, Haswell, Ivy Bridge, Nehalem, Sandy Bridge

I Preliminary portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1: Outline the region to capture using the CodeExtractor pass.

original:

  %0 = load i32* %i, align 4
  %1 = load i32* %s.addr, align 4
  %cmp = icmp slt i32 %0, %1
  ; loop branch here
  br i1 %cmp, label %for.body, label %for.exitStub

outlined:

  define internal void @outlined(i32* %i, i32* %s.addr, i32* %a.addr) {
    call void @start_capture(i32* %i, i32* %s.addr, i32* %a.addr)
    %0 = load i32* %i, align 4
    ...
    ret void
  }

original (call site):

  call void @outlined(i32* %i, i32* %s.addr, i32* %a.addr)

Step 2: Call start_capture just after the function call.
Step 3: At replay, re-inlining and variable cloning [Liao et al. 2010] steps ensure that the compilation context is close to the original.

19 16

Comparison to other Code Isolating tools

                    CERE            Code Isolator   Astex        Codelet Finder   SimPoint
Language            C(++), Fortran  Fortran         C, Fortran   C(++), Fortran   assembly
Extraction          IR              source          source       source           assembly
Indirections        yes             no              no           yes              yes
Replay: simulator   yes             yes             yes          yes              yes
Replay: hardware    yes             yes             yes          yes              no
Capture size        reduced         reduced         reduced      full             -
Instances           yes             manual          manual       manual           yes
Code signature      yes             no              no           no               yes

20 16

Accuracy Summary NAS SER

(bar chart: % of exec time, 0–100, on Core 2 and Haswell for lu, cg, mg, sp, ft, bt, is, ep; two series: accurate replay and codelet coverage)

Figure: The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

21 16

Accuracy Summary SPEC FP 2006

(bar chart: % of exec time, 0–100, on Haswell for gamess, sphinx3, dealII, wrf, povray, calculix, soplex, leslie3d, tonto, gromacs, zeusmp, lbm, milc, cactusADM, namd, bwaves, gemsfdtd, specrand)

Figure: The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

I low coverage (sphinx3, wrf, povray): regions < 2000 cycles, or IO

I low matching (soplex, calculix, gamess): warmup "bugs"

22 16

CERE page capture dump size

(bar chart: size in MB, 0–200, for bt, cg, ep, ft, is, lu, mg, sp; two series: full dump (Codelet Finder) vs. page-granularity dump (CERE))

Figure: Comparison between the page capture and the full dump size on NAS A benchmarks. CERE's page-granularity dump only contains the pages accessed by a codelet. Therefore it is much smaller than a full memory dump.

23 16

CERE page capture overhead

CERE page capture is coarser but faster

        CERE   ATOM 3.25   PIN 1.71   Dyninst 4.0
cg.a    1.94    98.82       222.67      896.86
ft.a    2.41    44.22       127.64     1054.70
lu.a    6.24    80.72       153.46       30.14
mg.a    0.89   107.69       168.61      989.53
sp.a    7.32    67.56        93.04      203.66

Slowdown of a full capture run against the original application run (takes into account the cost of writing the memory dumps and logs to disk, and of tracing the page accesses during the whole execution). We compare to the overhead of memory tracing tools as reported by [Gao et al. 2005].

24 16

CERE cache warmup

for (i = 0; i < size; i++) a[i] += b[i];

(diagram: array a[] spans pages 17–23, array b[] spans pages 46–52; touched page addresses are appended to the warmup page trace; a FIFO holds the most recently unprotected pages, and the oldest entry, e.g. page 20, is reprotected when a new page is unlocked)

25 16

Test architectures

                 Atom    Core 2   Nehalem   Sandy Bridge   Ivy Bridge   Haswell
CPU              D510    E7500    L5609     E3-1240        i7-3770      i7-4770
Frequency (GHz)  1.66    2.93     1.86      3.30           3.40         3.40
Cores            2       2        4         4              4            4
L1 cache (KB)    2×56    2×64     4×64      4×64           4×64         4×64
L2 cache         2×512 KB  3 MB   4×256 KB  4×256 KB       4×256 KB     4×256 KB
L3 cache (MB)    -       -        12        8              8            8
RAM (GB)         4       4        8         6              16           16

32-bit portability test: ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelet      Computation pattern
toeplz_1     DP: 2 simultaneous reductions
rstrct_29    DP: MG Laplacian fine-to-coarse mesh transition
mprove_8     MP: Dense matrix × vector product
toeplz_4     DP: Vector multiply in asc/desc order
realft_4     DP: FFT butterfly computation
toeplz_3     DP: 3 simultaneous reductions
svbksb_3     SP: Dense matrix × vector product
lop_13       DP: Laplacian finite difference, constant coefficients
toeplz_2     DP: Vector multiply element-wise in asc/desc order
four1_2      MP: First step FFT
tridag_2     DP: First-order recurrence
tridag_1     DP: First-order recurrence
ludcmp_4     SP: Dot product over lower half square matrix
hqr_15       SP: Addition on the diagonal elements of a matrix
relax2_26    DP: Red-Black sweeps, Laplacian operator
svdcmp_14    DP: Vector divide element-wise
svdcmp_13    DP: Norm + vector divide
hqr_13       DP: Sum of the absolute values of a matrix column
hqr_12_sq    SP: Sum of a square matrix
jacobi_5     SP: Sum of the upper half of a square matrix
hqr_12       SP: Sum of the lower half of a square matrix
svdcmp_11    DP: Multiplying a matrix row by a scalar
elmhes_11    DP: Linear combination of matrix rows
mprove_9     DP: Subtracting a vector from a vector
matadd_16    DP: Sum of two square matrices element-wise
svdcmp_6     DP: Sum of the absolute values of a matrix row
elmhes_10    DP: Linear combination of matrix columns
balanc_3     DP: Vector multiply element-wise

27 16

Clustering NR Codelets

cut for K = 14

(same dendrogram as above, with two clusters annotated)

Long latency operations:
  svdcmp_14   DP: Vector divide element-wise
  svdcmp_13   DP: Norm + vector divide

Reduction sums:
  hqr_13      DP: Sum of the absolute values of a matrix column
  hqr_12_sq   SP: Sum of a square matrix
  jacobi_5    SP: Sum of the upper half of a square matrix
  hqr_12      SP: Sum of the lower half of a square matrix

Similar computation patterns end up in the same cluster.

27 16

Capturing Architecture Change

Codelets: LU/erhs.f:49, FT/appft.f:45, BT/rhs.f:266, SP/rhs.f:275

Cluster A: triple-nested, high-latency operations (div and exp)
Cluster B: stencil on five planes (memory bound)

Reference: Nehalem (freq. 1.86 GHz, LLC 12 MB). Targets: Core 2 (2.93 GHz, 3 MB) and Atom (1.66 GHz, 1 MB).

(plot: runtime in ms per codelet, reference vs. target, log scale roughly 40–3000 ms; each codelet is marked slower or faster on the target)

28 16

Same Cluster = Same Speedup

(same plot as the previous slide: LU/erhs.f:49, FT/appft.f:45, BT/rhs.f:266, SP/rhs.f:275 on Core 2 and Atom against the Nehalem reference; codelets in the same cluster, A (triple-nested high-latency div/exp) or B (memory-bound five-plane stencil), exhibit the same speedup when changing architecture)

29 16

Feature Selection

I Genetic Algorithm: train on Numerical Recipes + Atom + Sandy Bridge

I Validated on NAS + Core 2

I The feature set is still among the best on NAS

(plot: median error, in %, vs. number of clusters, 0–25, on Atom, Core 2, and Sandy Bridge; the GA-selected features are compared against the worst, median, and best feature sets)

30 16

Reduction Factor Breakdown

Reduction       Total   Reduced invocations   Clustering

Atom            44.3    ×12                   ×3.7
Core 2          24.7    ×8.7                  ×2.8
Sandy Bridge    22.5    ×6.3                  ×3.6

Table: Benchmarking reduction factor breakdown with 18 representatives.

31 16

Profiling Features

SP codelets are profiled with two tools:

Likwid (performance counters per codelet) — 4 dynamic features:
  FLOPS, L2 bandwidth, L3 miss rate, memory bandwidth

Maqao (static disassembly and analysis) — 8 static features:
  bytes stored per cycle, stalls, estimated IPC, number of DIV, number of SD, pressure in P1, ratio (ADD+SUB)/MUL, vectorization (FP, FP+INT, INT)

32 16

Clang OpenMP front-end

C code:

  void main() {
    #pragma omp parallel
    {
      int p = omp_get_thread_num();
      printf("%d", p);
    }
  }

The Clang OpenMP front end lowers this to (simplified LLVM IR):

  define i32 @main() {
  entry:
    call @__kmpc_fork_call(@omp_microtask, ...)
  }

  define internal void @omp_microtask() {
  entry:
    %p = alloca i32, align 4
    %call = call i32 @omp_get_thread_num()
    store i32 %call, i32* %p, align 4
    %1 = load i32* %p, align 4
    call @printf(%1)
  }

Thread execution model: the runtime forks the microtask across the OpenMP threads.

33 16

Page 9: CERE: LLVM based Codelet Extractor and REplayer for

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

8 16

Selecting Representative Codelets

I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures

I Detect redundancies and keep only one representative

0e+00

2e+07

4e+07

6e+07

8e+07

0 1000 2000 3000invocation

Cyc

les

replay

Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives

9 16

Performance Signature Clustering

Maqao

Likwid

Static amp DynamicProfiling Vectorization ratio

FLOPSsCache Misses[ ] Clustering

f1f2f3

Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors

BT

SP

Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster

Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results

Model

Sandy Bridge

Atom

Core2

FullBenchmarks

Results

[Oliveira Castro et al 2014]

10 16

Codelet Based Architecture Selection

012 015

083 083

155 159

00

05

10

15

Atom Core 2 Sandy Bridge

Geo

met

ricm

ean

spee

dup

Real Speedup

Predicted Speedup

Refe

renc

e N

ehal

em

Figure Benchmarking NAS serial on three architectures

I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks

11 16

Autotuning LLVM middle-end optimizations

I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning

Id of LLVM middleminusend optimization passes combination

Cyc

les

(Ivy

brid

ge 3

4G

Hz)

60e+07

80e+07

10e+08

12e+08

0 200 400 600

originalreplayO3

Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes

CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16

Fast Scalability Benchmarking with OpenMP Codelets

1 2 4 8 16 32Threads

0

1

2

3

4

5

6

Runti

me c

ycl

es

1e8 SP compute rhs on Sandy Bridge

RealPredicted

Core2 Nehalem Sandy Bridge Ivy Bridge

Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237

Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]

13 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

14 16

Conclusion

I CERE breaks an application into faithful and retargetablecodelets

I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis

I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM

I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports

15 16

Thanks for your attention

C E R E

httpsbenchmark-subsettinggithubiocere

distributed under the LGPLv3

16 16

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets (detail, cut for K = 14)

The clusters group codelets with similar computation patterns, for example:

  Long latency operations:
    svdcmp_14    DP: Vector divide element wise
    svdcmp_13    DP: Norm + Vector divide

  Reduction sums:
    hqr_13       DP: Sum of the absolute values of a matrix column
    hqr_12_sq    SP: Sum of a square matrix
    jacobi_5     SP: Sum of the upper half of a square matrix
    hqr_12       SP: Sum of the lower half of a square matrix

Capturing Architecture Change

Figure: replay times (ms) of four codelets on a reference machine, Nehalem
(1.86 GHz, 12 MB LLC), and two targets: Core 2 (2.93 GHz, 3 MB) and Atom
(1.66 GHz, 1 MB).

Codelets and coverage: LU erhs.f (4.9%), FT appft.f (4.5%), BT rhs.f (26.6%),
SP rhs.f (27.5%).

Cluster A: triple-nested, high latency operations (div and exp).
Cluster B: stencil on five planes (memory bound).

Same Cluster = Same Speedup

Figure (same data as the previous slide): codelets belonging to the same
cluster exhibit the same slowdown or speedup when moving from the reference
architecture (Nehalem) to a target (Core 2 or Atom).

Feature Selection

- Genetic Algorithm: train on Numerical Recipes + Atom + Sandy Bridge
- Validated on NAS + Core 2
- The feature set is still among the best on NAS

Figure: median error versus number of clusters (0 to 25) on Atom, Core 2,
and Sandy Bridge; the GA-selected features are compared against the worst,
median, and best feature sets.

Reduction Factor Breakdown

                Total    Reduced invocations   Clustering
  Atom          x44.3    x12                   x3.7
  Core 2        x24.7    x8.7                  x2.8
  Sandy Bridge  x22.5    x6.3                  x3.6

Table: benchmarking reduction factor breakdown with 18 representatives. The
total is the product of the invocation reduction and the clustering
reduction (e.g. 12 x 3.7 ≈ 44.3 on Atom).

Profiling Features

SP codelets are profiled with two tools:

- Likwid (performance counters per codelet) → 4 dynamic features:
  FLOPS, L2 Bandwidth, L3 Miss Rate, Mem Bandwidth
- Maqao (static disassembly and analysis) → 8 static features:
  Bytes Stored / cycle, Stalls, Estimated IPC, Number of DIV, Number of SD,
  Pressure in P1, Ratio (ADD+SUB)/MUL, Vectorization (FP, FP+INT, INT)

Clang OpenMP front-end

C code:

    #include <omp.h>
    #include <stdio.h>

    int main() {
    #pragma omp parallel
      {
        int p = omp_get_thread_num();
        printf("%d", p);
      }
    }

The Clang OpenMP front end lowers the parallel region to an outlined
microtask invoked through the OpenMP runtime (LLVM simplified IR, thread
execution model):

    define i32 @main() {
    entry:
      call @__kmpc_fork_call(... @omp_microtask ...)
    }

    define internal void @omp_microtask(...) {
    entry:
      %p = alloca i32, align 4
      %call = call i32 @omp_get_thread_num()
      store i32 %call, i32* %p, align 4
      %1 = load i32* %p, align 4
      call @printf(... %1 ...)
    }


Selecting Representative Codelets

- Key idea: applications have redundancies
  - Same codelet called multiple times
  - Codelets sharing similar performance signatures
- Detect redundancies and keep only one representative

Figure: SPEC tonto make_ft shell2.F90:1133 execution trace (cycles per
invocation over ~3000 invocations, replayed representative marked). 90% of
NAS codelets can be reduced to four or less representatives.

Performance Signature Clustering

Step A: Perform static and dynamic analysis (Maqao, Likwid) on a reference
architecture to capture each codelet's feature vector f = (f1, f2, f3, ...):
vectorization ratio, FLOPS/s, cache misses, etc.

Step B: Using the proximity between feature vectors, cluster similar
codelets (e.g. across BT and SP) and select one representative per cluster.

Step C: CERE extracts the representatives as standalone codelets and replays
them on the targets (Sandy Bridge, Atom, Core 2). A model extrapolates full
benchmark results. [Oliveira Castro et al 2014]

Codelet Based Architecture Selection

Figure: benchmarking NAS serial on three architectures (reference:
Nehalem). Geometric mean speedup, real vs. predicted: Atom 0.12 vs. 0.15,
Core 2 0.83 vs. 0.83, Sandy Bridge 1.55 vs. 1.59.

- real speedup: measured by benchmarking the original applications
- predicted speedup: predicted with the representative codelets
- CERE is 31x cheaper than running the full benchmarks

Autotuning LLVM middle-end optimizations

- The LLVM middle-end offers more than 50 optimization passes
- Codelet replay enables fast per-region optimization tuning

Figure: NAS SP ysolve codelet, cycles (Ivy Bridge 3.4 GHz) against the id of
the LLVM middle-end optimization pass combination; 1000 schedules of random
pass combinations explored, based on the O3 passes, with the original,
replay, and O3 levels marked.

CERE is 149x cheaper than running the full benchmark (~27x cheaper when
tuning codelets covering 75% of SP).

Fast Scalability Benchmarking with OpenMP Codelets

Figure: runtime (cycles, 1e8) of SP compute_rhs on Sandy Bridge for 1, 2, 4,
8, 16, and 32 threads, real vs. predicted, and average results over the OMP
NAS benchmarks [Popov et al 2015]:

                Core 2   Nehalem  Sandy Bridge  Ivy Bridge
  Accuracy      98.2%    97.1%    92.6%         97.2%
  Acceleration  x25.2    x27.4    x23.7         x23.7

Outline

Extracting and Replaying Codelets: Faithful, Retargetable
Applications: Architecture selection, Compiler flags tuning, Scalability prediction
Demo: Capture and replay in NAS BT, Simple flag replay for NAS FT
Conclusion

Conclusion

- CERE breaks an application into faithful and retargetable codelets
- Piece-wise autotuning:
  - Different architectures
  - Compiler optimizations
  - Scalability
  - Other costly exploration analyses
- Limitations:
  - No support for codelets performing I/O (OS state not captured)
  - Cannot explore source-level optimizations
  - Tied to LLVM
- Full accuracy reports on NAS and SPEC'06 FP available at
  https://benchmark-subsetting.github.io/cere/Reports

Thanks for your attention!

CERE: https://benchmark-subsetting.github.io/cere

Distributed under the LGPLv3.

Bibliography

Akel, Chadi et al. (2013). "Is Source-code Isolation Viable for Performance
Characterization?" In: 42nd International Conference on Parallel Processing
Workshops. IEEE.

Gao, Xiaofeng et al. (2005). "Reducing overheads for acquiring dynamic
memory traces". In: Workload Characterization Symposium, 2005. Proceedings
of the IEEE International. IEEE, pp. 46-55.

Liao, Chunhua et al. (2010). "Effective source-to-source outlining to
support whole program empirical optimization". In: Languages and Compilers
for Parallel Computing. Springer, pp. 308-322.

Oliveira Castro, Pablo de et al. (2014). "Fine-grained Benchmark Subsetting
for System Selection". In: Proceedings of the Annual IEEE/ACM International
Symposium on Code Generation and Optimization. ACM, p. 132.

Popov, Mihail et al. (2015). "PCERE: Fine-grained Parallel Benchmark
Decomposition for Scalability Prediction". In: Proceedings of the 29th IEEE
International Parallel and Distributed Processing Symposium (IPDPS). IEEE.

Retargetable replay: register state

- Issue: register state is non-portable between architectures
- Solution: capture at a function call boundary
  - No shared state through registers except function arguments
  - Get arguments directly through portable IR code
- Register-agnostic capture:
  - Portable across Atom, Core 2, Haswell, Ivy Bridge, Nehalem, Sandy Bridge
  - Preliminary portability tests between x86 and ARM 32 bits

Retargetable replay: outlining regions

Step 1: Outline the region to capture using the CodeExtractor pass.

Before outlining (@original):

    %0 = load i32* %i, align 4
    %1 = load i32* %s.addr, align 4
    %cmp = icmp slt i32 %0, %1
    br i1 %cmp, label %for.body, label %for.exitStub   ; loop branch here

After outlining, @original calls the extracted function:

    call void @outlined(i32* %i, i32* %s.addr, i32* %a.addr)

    define internal void @outlined(i32* %i, i32* %s.addr, i32* %a.addr) {
      call void @start_capture(i32* %i, i32* %s.addr, i32* %a.addr)
      %0 = load i32* %i, align 4
      ...
      ret void
    }

Step 2: Call start_capture just after the function call.
Step 3: At replay, re-inlining and variable cloning [Liao et al 2010] steps
ensure that the compilation context is close to the original.

Comparison to other Code Isolating tools

                   CERE            Code Isolator  Astex       Codelet Finder  SimPoint
  Support
    Language       C(++), Fortran  Fortran        C, Fortran  C(++), Fortran  assembly
    Extraction     IR              source         source      source          assembly
    Indirections   yes             no             no          yes             yes
  Replay
    Simulator      yes             yes            yes         yes             yes
    Hardware       yes             yes            yes         yes             no
  Reduction
    Capture size   reduced         reduced        reduced     full            -
    Instances      yes             manual         manual      manual          yes
    Code sign.     yes             no             no          no              yes

Accuracy Summary: NAS SER

Figure: coverage and accurate replay per NAS A benchmark (lu, cg, mg, sp,
ft, bt, is, ep) on Core 2 and Haswell, as % of execution time. The Coverage
is the percentage of the execution time captured by codelets. The Accurate
Replay is the percentage of execution time replayed with an error less
than 15%.

Accuracy Summary: SPEC FP 2006

Figure: coverage and accurate replay per SPEC FP 2006 benchmark (gamess,
sphinx3, dealII, wrf, povray, calculix, soplex, leslie3d, tonto, gromacs,
zeusmp, lbm, milc, cactusADM, namd, bwaves, gemsfdtd, specrand) on Haswell,
as % of execution time. The Coverage is the percentage of the execution
time captured by codelets. The Accurate Replay is the percentage of
execution time replayed with an error less than 15%.

- low coverage (sphinx3, wrf, povray): regions < 2000 cycles or I/O
- low matching (soplex, calculix, gamess): warmup "bugs"

CERE page capture: dump size

Figure: comparison between the page capture and the full dump size (MB) on
the NAS A benchmarks (bt, cg, ep, ft, is, lu, mg, sp): full dump (Codelet
Finder) vs. page granularity dump (CERE). CERE's page granularity dump only
contains the pages accessed by a codelet; therefore it is much smaller than
a full memory dump.

CERE page capture overhead

CERE page capture is coarser but faster:

          CERE    ATOM 325   PIN 171   Dyninst 40
  cg.A    1.94    98.82      222.67    896.86
  ft.A    2.41    44.22      127.64    1054.70
  lu.A    6.24    80.72      153.46    30.14
  mg.A    0.89    107.69     168.61    989.53
  sp.A    7.32    67.56      93.04     203.66

Slowdown of a full capture run against the original application run (takes
into account the cost of writing the memory dumps and logs to disk, and of
tracing the page accesses during the whole execution). We compare to the
overhead of memory tracing tools as reported by [Gao et al 2005].

CERE cache warmup

    for (i = 0; i < size; i++)
        a[i] += b[i];

Figure: the loop walks the pages of array a[] (pages 21, 22, 23, ...) and
array b[] (pages 50, 51, 52, ...). A FIFO keeps the most recently
unprotected pages; pages evicted from the FIFO are reprotected, so repeated
accesses fault again and appear in the warmup page trace used to rebuild
the cache state at replay.

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 11: CERE: LLVM based Codelet Extractor and REplayer for

Performance Signature Clustering

Maqao

Likwid

Static amp DynamicProfiling Vectorization ratio

FLOPSsCache Misses[ ] Clustering

f1f2f3

Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors

BT

SP

Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster

Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results

Model

Sandy Bridge

Atom

Core2

FullBenchmarks

Results

[Oliveira Castro et al 2014]

10 16

Codelet Based Architecture Selection

012 015

083 083

155 159

00

05

10

15

Atom Core 2 Sandy Bridge

Geo

met

ricm

ean

spee

dup

Real Speedup

Predicted Speedup

Refe

renc

e N

ehal

em

Figure Benchmarking NAS serial on three architectures

I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks

11 16

Autotuning LLVM middle-end optimizations

I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning

Id of LLVM middleminusend optimization passes combination

Cyc

les

(Ivy

brid

ge 3

4G

Hz)

60e+07

80e+07

10e+08

12e+08

0 200 400 600

originalreplayO3

Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes

CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16

Fast Scalability Benchmarking with OpenMP Codelets

1 2 4 8 16 32Threads

0

1

2

3

4

5

6

Runti

me c

ycl

es

1e8 SP compute rhs on Sandy Bridge

RealPredicted

Core2 Nehalem Sandy Bridge Ivy Bridge

Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237

Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]

13 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

14 16

Conclusion

I CERE breaks an application into faithful and retargetablecodelets

I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis

I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM

I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports

15 16

Thanks for your attention

C E R E

httpsbenchmark-subsettinggithubiocere

distributed under the LGPLv3

16 16

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 12: CERE: LLVM based Codelet Extractor and REplayer for

Codelet Based Architecture Selection

012 015

083 083

155 159

00

05

10

15

Atom Core 2 Sandy Bridge

Geo

met

ricm

ean

spee

dup

Real Speedup

Predicted Speedup

Refe

renc

e N

ehal

em

Figure Benchmarking NAS serial on three architectures

I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks

11 16

Autotuning LLVM middle-end optimizations

I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning

Id of LLVM middleminusend optimization passes combination

Cyc

les

(Ivy

brid

ge 3

4G

Hz)

60e+07

80e+07

10e+08

12e+08

0 200 400 600

originalreplayO3

Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes

CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16

Fast Scalability Benchmarking with OpenMP Codelets

1 2 4 8 16 32Threads

0

1

2

3

4

5

6

Runti

me c

ycl

es

1e8 SP compute rhs on Sandy Bridge

RealPredicted

Core2 Nehalem Sandy Bridge Ivy Bridge

Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237

Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]

13 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

14 16

Conclusion

I CERE breaks an application into faithful and retargetablecodelets

I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis

I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM

I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports

15 16

Thanks for your attention

C E R E

httpsbenchmark-subsettinggithubiocere

distributed under the LGPLv3

16 16

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other code-isolating tools

                 CERE      Code Isolator   Astex       Codelet Finder   SimPoint
Support
  Language       C(++),    Fortran         C, Fortran  C(++), Fortran   assembly
                 Fortran
  Extraction     IR        source          source      source           assembly
  Indirections   yes       no              no          yes              yes
Replay
  Simulator      yes       yes             yes         yes              yes
  Hardware       yes       yes             yes         yes              no
Reduction
  Capture size   reduced   reduced         reduced     full             -
  Instances      yes       manual          manual      manual           yes
  Code sign.     yes       no              no          no               yes

Accuracy Summary: NAS SER

[Bar chart: codelet coverage and accurate replay (% of execution time) for lu, cg, mg, sp, ft, bt, is, ep on Core 2 and Haswell, NAS class A.]

Figure: The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.
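The two bar-chart metrics can be reproduced from per-codelet profiles. A minimal sketch, with illustrative (not measured) shares and replay errors:

```python
# Sketch: computing Coverage and Accurate Replay from per-codelet
# profiles. The data below is illustrative, not measured; "accurate"
# means the replay error is below the 15% threshold used in the figure.
codelets = [
    # (share of total execution time, replay error)
    (0.40, 0.05),
    (0.25, 0.08),
    (0.15, 0.30),   # badly replayed codelet: counted in coverage only
]

coverage = sum(share for share, _ in codelets)
accurate_replay = sum(share for share, err in codelets if err < 0.15)

print(f"coverage = {coverage:.0%}")                # 80% of exec time captured
print(f"accurate replay = {accurate_replay:.0%}")  # 65% replayed within 15%
```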

Accuracy Summary: SPEC FP 2006

[Bar chart: codelet coverage and accurate replay (% of execution time) on Haswell for gamess, sphinx3, dealII, wrf, povray, calculix, soplex, leslie3d, tonto, gromacs, zeusmp, lbm, milc, cactusADM, namd, bwaves, GemsFDTD, specrand.]

Figure: The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

- Low coverage (sphinx3, wrf, povray): regions < 2000 cycles or with I/O.
- Low matching (soplex, calculix, gamess): warmup "bugs".

CERE page capture: dump size

[Bar chart: dump size (MB, 0-200) for bt, cg, ep, ft, is, lu, mg, sp, comparing the full dump (Codelet Finder) against the page-granularity dump (CERE).]

Figure: Comparison between the page capture and full dump size on NAS A benchmarks. CERE's page-granularity dump only contains the pages accessed by a codelet; therefore it is much smaller than a full memory dump.
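The idea behind the page-granularity dump can be sketched in a few lines; the page size, addresses, and "address space" below are illustrative, not CERE's actual capture code:

```python
# Sketch: a page-granularity dump saves only the pages the codelet
# touches, instead of the whole address space.
PAGE = 4096

def touched_pages(accesses):
    """Set of page numbers covered by a list of byte addresses."""
    return {addr // PAGE for addr in accesses}

# A codelet streaming over two small arrays touches only a few pages.
accesses = list(range(0, 3 * PAGE)) + list(range(50 * PAGE, 52 * PAGE))
dump_pages = touched_pages(accesses)

full_dump_bytes = 200 * PAGE          # tiny illustrative address space
page_dump_bytes = len(dump_pages) * PAGE
print(len(dump_pages), "pages dumped instead of", 200)
```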

CERE page capture: overhead

CERE page capture is coarser, but faster:

        CERE   ATOM 3.25   PIN 1.71   Dyninst 4.0
cg.a    1.94   98.82       222.67     896.86
ft.a    2.41   44.22       127.64     1054.70
lu.a    6.24   80.72       153.46     30.14
mg.a    0.89   107.69      168.61     989.53
sp.a    7.32   67.56       93.04      203.66

Slowdown of a full capture run against the original application run (this takes into account the cost of writing the memory dumps and logs to disk, and of tracing the page accesses during the whole execution). We compare to the overhead of memory tracing tools as reported by [Gao et al. 2005].

CERE cache warmup

```c
for (i = 0; i < size; i++)
    a[i] += b[i];
```

[Diagram: array a[] spans pages 21, 22, 23, ...; array b[] spans pages 50, 51, 52, ... A FIFO holds the most recently unprotected pages; when a page leaves the FIFO it is re-protected. The resulting sequence of page faults forms the warmup page trace.]
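The warmup tracing logic can be simulated in a few lines. A sketch of the FIFO mechanism only (the real implementation works through memory protection and page-fault handling, not Python data structures):

```python
from collections import deque

# Sketch of CERE's warmup page tracing: every page starts protected;
# the first access to a protected page is logged and the page joins a
# FIFO of the most recently unprotected pages. When the FIFO is full,
# the oldest page is re-protected, so a later reuse is logged again.
def trace_pages(page_accesses, fifo_size=4):
    fifo = deque()                   # most recently unprotected pages
    protected = set(page_accesses)   # everything starts protected
    trace = []
    for page in page_accesses:
        if page in protected:
            trace.append(page)        # access fault -> log the page
            protected.discard(page)   # unprotect it
            fifo.append(page)
            if len(fifo) > fifo_size:
                protected.add(fifo.popleft())  # re-protect the oldest
    return trace

# a[] pages 21-23 and b[] pages 50-52 accessed alternately,
# as in the a[i] += b[i] loop of the slide
print(trace_pages([21, 50, 22, 51, 23, 52]))
```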

Test architectures

                  Atom    Core 2   Nehalem   Sandy Bridge   Ivy Bridge   Haswell
CPU               D510    E7500    L5609     E3-1240        i7-3770      i7-4770
Frequency (GHz)   1.66    2.93     1.86      3.30           3.40         3.40
Cores             2       2        4         4              4            4
L1 cache (KB)     2x56    2x64     4x64      4x64           4x64         4x64
L2 cache (KB)     2x512   3 MB     4x256     4x256          4x256        4x256
L3 cache (MB)     -       -        12        8              8            8
RAM (GB)          4       4        8         6              16           16

32-bit portability test: ARM1176JZF-S on a Raspberry Pi Model B+.

Clustering NR Codelets (cut for K = 14)

Codelet      Computation pattern
toeplz_1     DP: 2 simultaneous reductions
rstrct_29    DP: MG Laplacian fine to coarse mesh transition
mprove_8     MP: Dense matrix x vector product
toeplz_4     DP: Vector multiply in asc/desc order
realft_4     DP: FFT butterfly computation
toeplz_3     DP: 3 simultaneous reductions
svbksb_3     SP: Dense matrix x vector product
lop_13       DP: Laplacian finite difference, constant coefficients
toeplz_2     DP: Vector multiply element-wise in asc/desc order
four1_2      MP: First step FFT
tridag_2     DP: First-order recurrence
tridag_1     DP: First-order recurrence
ludcmp_4     SP: Dot product over lower half square matrix
hqr_15       SP: Addition on the diagonal elements of a matrix
relax2_26    DP: Red-Black sweeps, Laplacian operator
svdcmp_14    DP: Vector divide element-wise
svdcmp_13    DP: Norm + vector divide
hqr_13       DP: Sum of the absolute values of a matrix column
hqr_12_sq    SP: Sum of a square matrix
jacobi_5     SP: Sum of the upper half of a square matrix
hqr_12       SP: Sum of the lower half of a square matrix
svdcmp_11    DP: Multiplying a matrix row by a scalar
elmhes_11    DP: Linear combination of matrix rows
mprove_9     DP: Subtracting a vector from a vector
matadd_16    DP: Sum of two square matrices element-wise
svdcmp_6     DP: Sum of the absolute values of a matrix row
elmhes_10    DP: Linear combination of matrix columns
balanc_3     DP: Vector multiply element-wise
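Picking one representative per cluster can be sketched with a toy greedy clustering. The feature vectors below are illustrative, not CERE's real profiling features, and CERE's actual clustering algorithm may differ:

```python
import math

# Sketch: greedy clustering of codelets by feature vectors. A codelet
# joins the first cluster whose representative lies within `radius`
# (Euclidean distance); otherwise it founds a new cluster and becomes
# its representative.
def cluster(codelets, radius):
    reps = []          # (name, features) of cluster representatives
    members = {}       # representative name -> list of member names
    for name, feat in codelets:
        for rep_name, rep_feat in reps:
            if math.dist(feat, rep_feat) <= radius:
                members[rep_name].append(name)
                break
        else:
            reps.append((name, feat))
            members[name] = [name]
    return members

codelets = [
    ("svdcmp_14", (1.0, 9.0)),  # vector divide     -> long-latency ops
    ("svdcmp_13", (1.2, 8.8)),  # norm + divide     -> long-latency ops
    ("hqr_12_sq", (6.0, 1.0)),  # sum of a matrix   -> reduction sums
    ("jacobi_5",  (6.1, 1.2)),  # sum of upper half -> reduction sums
]
print(cluster(codelets, radius=1.0))
```

Only the representatives need to be benchmarked; results then extend to every member of their cluster.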

Clustering NR Codelets (cut for K = 14): clusters group similar computation patterns, for example:

- Long-latency operations: svdcmp_14 (DP: vector divide element-wise), svdcmp_13 (DP: norm + vector divide).
- Reduction sums: hqr_13 (DP: sum of the absolute values of a matrix column), hqr_12_sq (SP: sum of a square matrix), jacobi_5 (SP: sum of the upper half of a square matrix), hqr_12 (SP: sum of the lower half of a square matrix).

Capturing Architecture Change

[Plot: replay time (ms) on the reference, Nehalem (1.86 GHz, 12 MB LLC), versus the targets Core 2 (2.93 GHz, 3 MB) and Atom (1.66 GHz, 1 MB), for four codelets: LU/erhs.f:49, FT/appft.f:45, BT/rhs.f:266, SP/rhs.f:275. Cluster A: triple-nested, high-latency operations (div and exp). Cluster B: stencil on five planes (memory bound).]

Same Cluster = Same Speedup

[Same plot as the previous slide: codelets belonging to the same cluster exhibit the same speedup or slowdown when moving from the reference architecture to a target.]

Feature Selection

- Genetic algorithm: trained on Numerical Recipes + Atom + Sandy Bridge.
- Validated on NAS + Core 2.
- The feature set is still among the best on NAS.

[Plot: median prediction error versus number of clusters (0-25) on Atom, Core 2, and Sandy Bridge, comparing the GA-selected features against the worst, median, and best feature sets.]

Reduction Factor Breakdown

               Total   Reduced invocations   Clustering
Atom           x44.3   x12                   x3.7
Core 2         x24.7   x8.7                  x2.8
Sandy Bridge   x22.5   x6.3                  x3.6

Table: Benchmarking reduction factor breakdown with 18 representatives.
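As a sanity check, the total reduction is the product of the two factors in the table (decimal placement inferred from the garbled extraction):

```python
# Total benchmarking reduction = invocation reduction x clustering
# reduction; verify the table values agree up to rounding.
for arch, invoc, clust, total in [
    ("Atom",         12.0, 3.7, 44.3),
    ("Core 2",        8.7, 2.8, 24.7),
    ("Sandy Bridge",  6.3, 3.6, 22.5),
]:
    assert abs(invoc * clust - total) < 0.5, arch
    print(f"{arch}: x{invoc} * x{clust} ~ x{invoc * clust:.1f}")
```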

Profiling Features

SP codelets are profiled with two tools:

- Likwid: performance counters per codelet; 4 dynamic features: FLOPS, L2 bandwidth, L3 miss rate, memory bandwidth.
- Maqao: static disassembly and analysis; 8 static features: bytes stored per cycle, stalls, estimated IPC, number of DIV, number of SD, pressure in P1, ratio (ADD+SUB)/MUL, vectorization (FP, FP+INT, INT).

Clang OpenMP front-end

C code:

```c
void main() {
#pragma omp parallel
  {
    int p = omp_get_thread_num();
    printf("%d", p);
  }
}
```

LLVM simplified IR:

```llvm
define i32 @main() {
entry:
  call @__kmpc_fork_call(@omp_microtask, ...)
}

define internal void @omp_microtask() {
entry:
  %p = alloca i32, align 4
  %call = call i32 @omp_get_thread_num()
  store i32 %call, i32* %p, align 4
  %1 = load i32* %p, align 4
  call @printf(%1)
}
```

The fork call runs the microtask under the OpenMP thread execution model.


Autotuning LLVM middle-end optimizations

- The LLVM middle-end offers more than 50 optimization passes.
- Codelet replay enables fast per-region optimization tuning.

[Plot: cycles on Ivy Bridge (3.4 GHz) against the id of each LLVM middle-end pass combination, compared to the original run and to O3.]

Figure: NAS SP ysolve codelet: 1000 schedules of random pass combinations explored, based on the O3 passes.

CERE is x149 cheaper than running the full benchmark (x27 cheaper when tuning codelets covering 75% of SP).
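The schedule exploration can be sketched as a random-schedule generator. The pass pool is a small illustrative subset of the legacy `opt` flags of that era, and the driver that feeds each schedule to `opt` and times the codelet replay is omitted:

```python
import random

# Illustrative subset of LLVM middle-end pass flags.
PASS_POOL = ["-licm", "-gvn", "-instcombine", "-loop-unroll",
             "-sroa", "-simplifycfg", "-indvars", "-loop-rotate"]

def random_schedule(rng, max_len=6):
    """One random combination of passes, as an opt-style flag string."""
    n = rng.randint(1, max_len)
    return " ".join(rng.choice(PASS_POOL) for _ in range(n))

rng = random.Random(0)            # deterministic exploration for demo
schedules = [random_schedule(rng) for _ in range(1000)]
print(schedules[0])
```

Each schedule would then be applied to the extracted codelet and timed at replay, instead of recompiling and rerunning the whole benchmark.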

Fast Scalability Benchmarking with OpenMP Codelets

[Plot: runtime (cycles, x10^8) of SP compute_rhs on Sandy Bridge for 1, 2, 4, 8, 16, and 32 threads: real versus predicted.]

               Core 2   Nehalem   Sandy Bridge   Ivy Bridge
Accuracy       98.2%    97.1%     92.6%          97.2%
Acceleration   x25.2    x27.4     x23.7          x23.7

Figure: Varying thread number at replay in SP, and average results over the OpenMP NAS benchmarks [Popov et al. 2015].

Outline

- Extracting and Replaying Codelets: Faithful; Retargetable
- Applications: Architecture selection; Compiler flags tuning; Scalability prediction
- Demo: Capture and replay in NAS BT; Simple flag replay for NAS FT
- Conclusion

Conclusion

- CERE breaks an application into faithful and retargetable codelets.
- Piece-wise autotuning:
  - Different architectures
  - Compiler optimizations
  - Scalability
  - Other costly exploration analyses
- Limitations:
  - No support for codelets performing I/O (OS state not captured)
  - Cannot explore source-level optimizations
  - Tied to LLVM
- Full accuracy reports on NAS and SPEC'06 FP are available at benchmark-subsetting.github.io/cere/Reports

Thanks for your attention!

CERE: https://benchmark-subsetting.github.io/cere
Distributed under the LGPLv3.

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 14: CERE: LLVM based Codelet Extractor and REplayer for

Fast Scalability Benchmarking with OpenMP Codelets

1 2 4 8 16 32Threads

0

1

2

3

4

5

6

Runti

me c

ycl

es

1e8 SP compute rhs on Sandy Bridge

RealPredicted

Core2 Nehalem Sandy Bridge Ivy Bridge

Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237

Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]

13 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

14 16

Conclusion

I CERE breaks an application into faithful and retargetablecodelets

I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis

I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM

I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports

15 16

Thanks for your attention

C E R E

httpsbenchmark-subsettinggithubiocere

distributed under the LGPLv3

16 16

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 15: CERE: LLVM based Codelet Extractor and REplayer for

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

14 16

Conclusion

I CERE breaks an application into faithful and retargetablecodelets

I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis

I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM

I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports

15 16

Thanks for your attention

C E R E

httpsbenchmark-subsettinggithubiocere

distributed under the LGPLv3

16 16

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16
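The effect of the three steps can be sketched at the C level. This is an illustrative stand-in, not CERE's actual runtime: `start_capture` is a stub (the real hook dumps the touched memory pages), and the loop body and variable names are invented for the example.

```c
/* Stub standing in for CERE's capture hook, which would dump the
 * working set (touched memory pages) at region entry. */
static void start_capture(int *i, int *s_addr, int *a_addr) {
    (void)i; (void)s_addr; (void)a_addr;
}

/* Step 1 at the C level: the hot loop is outlined into a function
 * whose only inputs are the addresses of its live-in variables, so
 * no state reaches the region through registers. */
static void outlined(int *i, int *s_addr, int *a_addr) {
    start_capture(i, s_addr, a_addr);   /* Step 2: capture on entry */
    for (*i = 0; *i < *s_addr; (*i)++)  /* illustrative region body */
        a_addr[*i] += *i;
}
```

At the original call site the region is replaced by `outlined(&i, &s, a)`; since every live-in arrives as an explicit argument, the capture is register-agnostic.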

Comparison to other Code Isolating tools

                  CERE     Code Isolator  Astex       Codelet Finder  SimPoint
Support
  Language        C(++),   Fortran        C, Fortran  C(++), Fortran  assembly
                  Fortran
  Extraction      IR       source         source      source          assembly
  Indirections    yes      no             no          yes             yes
Replay
  Simulator       yes      yes            yes         yes             yes
  Hardware        yes      yes            yes         yes             no
Reduction
  Capture size    reduced  reduced        reduced     full            -
  Instances       yes      manual         manual      manual          yes
  Code sign.      yes      no             no          no              yes

20 16

Accuracy Summary: NAS SER

[Figure: bar chart, NAS A benchmarks (lu, cg, mg, sp, ft, bt, is, ep) on Core 2 and Haswell; bars show the % of exec time for accurate replay and codelet coverage.]

Figure: The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

21 16

Accuracy Summary: SPEC FP 2006

[Figure: bar chart, SPEC FP06 benchmarks (gamess, sphinx3, dealII, wrf, povray, calculix, soplex, leslie3d, tonto, gromacs, zeusmp, lbm, milc, cactusADM, namd, bwaves, GemsFDTD, specrand) on Haswell; bars show the % of exec time for accurate replay and codelet coverage.]

Figure: The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

I low coverage (sphinx3, wrf, povray): < 2000 cycles or I/O
I low matching (soplex, calculix, gamess): warmup "bugs"

22 16

CERE page capture dump size

[Figure: bar chart, dump size (MB) for bt, cg, ep, ft, is, lu, mg, sp: full dump (Codelet Finder) vs. page granularity dump (CERE).]

Figure: Comparison between the page capture and full dump size on NAS A benchmarks. CERE's page granularity dump only contains the pages accessed by a codelet. Therefore it is much smaller than a full memory dump.

23 16

CERE page capture overhead

CERE page capture is coarser but faster:

       CERE   ATOM 3.25   PIN 1.71   Dyninst 4.0
cg.a   1.94       98.82     222.67        896.86
ft.a   2.41       44.22     127.64       1054.70
lu.a   6.24       80.72     153.46         30.14
mg.a   0.89      107.69     168.61        989.53
sp.a   7.32       67.56      93.04        203.66

Slowdown of a full capture run against the original application run (takes into account the cost of writing the memory dumps and logs to disk, and of tracing the page accesses during the whole execution). We compare to the overhead of memory tracing tools as reported by [Gao et al. 2005].

24 16

CERE cache warmup

for (i = 0; i < size; i++)
    a[i] += b[i];

[Figure: warmup page trace for the loop above. Pages of array a[] (…, 17, 18, 19, 20, 21, 22, 23, …) and array b[] (…, 46, 47, 48, 49, 50, 51, 52, …) are touched alternately; each access unprotects a page and pushes it into a FIFO of the most recently unprotected pages, while the oldest FIFO entry (page 20 in the example) is reprotected.]

25 16
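The warmup mechanism can be sketched as a small model. This is a sketch under assumptions: the FIFO depth, page numbers, and function names are illustrative, not CERE's actual parameters or API.

```c
#define FIFO_DEPTH 4     /* illustrative capacity of the warmup FIFO */

static int fifo[FIFO_DEPTH];  /* most recently unprotected pages */
static int fifo_len = 0;
static int reprotected = -1;  /* last page evicted and reprotected */

/* Touching a protected page faults: the page is logged, unprotected,
 * and pushed on the FIFO; the oldest page falls out and is
 * reprotected, so a later access to it will fault (and be logged)
 * again -- this is what lets the trace model cache warmup. */
static void touch_page(int page) {
    for (int k = 0; k < fifo_len; k++)
        if (fifo[k] == page) return;      /* already unprotected */
    if (fifo_len == FIFO_DEPTH) {
        reprotected = fifo[0];            /* evict the oldest entry */
        for (int k = 1; k < FIFO_DEPTH; k++)
            fifo[k - 1] = fifo[k];
        fifo_len--;
    }
    fifo[fifo_len++] = page;              /* newest at the tail */
}
```

Replaying the interleaved trace 46, 17, 47, 18, 48 with depth 4 evicts page 46 on the fifth touch, so page 46 ends up reprotected while the four most recent pages stay accessible.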

Test architectures

                 Atom      Core 2    Nehalem   Sandy Bridge  Ivy Bridge  Haswell
CPU              D510      E7500     L5609     E3-1240       i7-3770     i7-4770
Frequency (GHz)  1.66      2.93      1.86      3.30          3.40        3.40
Cores            2         2         4         4             4           4
L1 cache (KB)    2×56      2×64      4×64      4×64          4×64        4×64
L2 cache         2×512 KB  3 MB      4×256 KB  4×256 KB      4×256 KB    4×256 KB
L3 cache (MB)    -         -         12        8             8           8
RAM (GB)         4         4         8         6             16          16

32-bit portability test: ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelet      Computation Pattern
toeplz_1     DP: 2 simultaneous reductions
rstrct_29    DP: MG Laplacian fine to coarse mesh transition
mprove_8     MP: Dense Matrix x vector product
toeplz_4     DP: Vector multiply in asc/desc order
realft_4     DP: FFT butterfly computation
toeplz_3     DP: 3 simultaneous reductions
svbksb_3     SP: Dense Matrix x vector product
lop_13       DP: Laplacian finite difference constant coefficients
toeplz_2     DP: Vector multiply element wise in asc/desc order
four1_2      MP: First step FFT
tridag_2     DP: First order recurrence
tridag_1     DP: First order recurrence
ludcmp_4     SP: Dot product over lower half square matrix
hqr_15       SP: Addition on the diagonal elements of a matrix
relax2_26    DP: Red Black Sweeps Laplacian operator
svdcmp_14    DP: Vector divide element wise
svdcmp_13    DP: Norm + Vector divide
hqr_13       DP: Sum of the absolute values of a matrix column
hqr_12_sq    SP: Sum of a square matrix
jacobi_5     SP: Sum of the upper half of a square matrix
hqr_12       SP: Sum of the lower half of a square matrix
svdcmp_11    DP: Multiplying a matrix row by a scalar
elmhes_11    DP: Linear combination of matrix rows
mprove_9     DP: Subtracting a vector with a vector
matadd_16    DP: Sum of two square matrices element wise
svdcmp_6     DP: Sum of the absolute values of a matrix row
elmhes_10    DP: Linear combination of matrix columns
balanc_3     DP: Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14 — similar computation patterns fall in the same cluster:

Long Latency Operations
  svdcmp_14   DP: Vector divide element wise
  svdcmp_13   DP: Norm + Vector divide

Reduction Sums
  hqr_13      DP: Sum of the absolute values of a matrix column
  hqr_12_sq   SP: Sum of a square matrix
  jacobi_5    SP: Sum of the upper half of a square matrix
  hqr_12      SP: Sum of the lower half of a square matrix

27 16

Capturing Architecture Change

[Figure: per-invocation replay times (ms) of four codelets — LU erhs.f:49 and FT appft.f:45 (Cluster A), BT rhs.f:266 and SP rhs.f:275 (Cluster B) — captured on the Nehalem reference (1.86 GHz, 12 MB LLC) and replayed on Core 2 (2.93 GHz, 3 MB LLC) and Atom (1.66 GHz, 1 MB LLC).]

Cluster A: triple-nested, high-latency operations (div and exp)
Cluster B: stencil on five planes (memory bound)

28 16

Same Cluster = Same Speedup

[Figure: same plot as the previous slide — codelets of the same cluster (A: LU erhs.f:49 and FT appft.f:45; B: BT rhs.f:266 and SP rhs.f:275) exhibit the same slowdown or speedup when moving from the Nehalem reference to Core 2 or Atom.]

29 16

Feature Selection

I Genetic Algorithm: trained on Numerical Recipes + Atom + Sandy Bridge
I Validated on NAS + Core 2
I The feature set is still among the best on NAS

[Figure: median prediction error vs. number of clusters (0 to 25) on Atom, Core 2, and Sandy Bridge, comparing the worst, median, and best feature sets against the GA-selected features.]

30 16

Reduction Factor Breakdown

              Total reduction  Reduced invocations  Clustering
Atom          ×44.3            ×12                  ×3.7
Core 2        ×24.7            ×8.7                 ×2.8
Sandy Bridge  ×22.5            ×6.3                 ×3.6

Table: Benchmarking reduction factor breakdown with 18 representatives.

31 16
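Reading the extracted digits as ×12 · ×3.7 ≈ ×44.3 (Atom), ×8.7 · ×2.8 ≈ ×24.7 (Core 2), and ×6.3 · ×3.6 ≈ ×22.5 (Sandy Bridge), the total reduction is the product of the two stages. A quick consistency check of that reading (the 0.6 tolerance is an arbitrary rounding bound chosen for this sketch):

```c
#include <math.h>

/* A table row is consistent if the total factor equals the product of
 * the invocation-reduction and clustering factors, within rounding. */
static int row_consistent(double invocations, double clustering,
                          double total) {
    return fabs(invocations * clustering - total) < 0.6;
}
```

For example, `row_consistent(12.0, 3.7, 44.3)` holds for the Atom row (12 × 3.7 = 44.4).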

Profiling Features

SP codelets are profiled with two tools:

I Likwid — performance counters per codelet → 4 dynamic features:
  FLOPS, L2 Bandwidth, L3 Miss Rate, Mem Bandwidth

I Maqao — static disassembly and analysis → 8 static features:
  Bytes Stored / cycle, Stalls, Estimated IPC, Number of DIV, Number of SD,
  Pressure in P1, Ratio ADD+SUB/MUL, Vectorization (FP, FP+INT, INT)

32 16

Clang OpenMP front-end

C code:

void main() {
  #pragma omp parallel
  {
    int p = omp_get_thread_num();
    printf("%d", p);
  }
}

LLVM simplified IR:

define i32 @main() {
entry:
  call @__kmpc_fork_call(@omp_microtask, ...)
}

define internal void @omp_microtask(...) {
entry:
  %p = alloca i32, align 4
  %call = call i32 @omp_get_thread_num()
  store i32 %call, i32* %p, align 4
  %1 = load i32* %p, align 4
  call @printf(..., %1)
}

Thread execution model: the front end outlines the parallel region into a microtask function, and __kmpc_fork_call hands it to the worker threads.

33 16
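The thread execution model can be sketched without an OpenMP runtime. Here `fork_call` and `my_thread_num` are hypothetical stand-ins for `__kmpc_fork_call` and `omp_get_thread_num`, and the workers run sequentially rather than concurrently:

```c
#define NUM_THREADS 4

static int current_tid;            /* simulated thread id */
static int executed[NUM_THREADS];  /* which "threads" ran the microtask */

/* Stand-in for omp_get_thread_num(). */
static int my_thread_num(void) { return current_tid; }

/* The microtask the front end outlines from the parallel region body. */
static void omp_microtask(void) {
    int p = my_thread_num();
    executed[p] = 1;               /* stands in for printf("%d", p) */
}

/* Stand-in for __kmpc_fork_call: runs the microtask once per thread
 * (sequentially here; the real runtime does this concurrently). */
static void fork_call(void (*microtask)(void)) {
    for (current_tid = 0; current_tid < NUM_THREADS; current_tid++)
        microtask();
}
```

Because the parallel region becomes an ordinary function taking its context as arguments, CERE can capture and replay it like any other outlined region.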


Conclusion

I CERE breaks an application into faithful and retargetable codelets

I Piece-wise autotuning:
  I Different architectures
  I Compiler optimizations
  I Scalability
  I Other costly exploration analyses

I Limitations:
  I No support for codelets performing I/O (OS state not captured)
  I Cannot explore source-level optimizations
  I Tied to LLVM

I Full accuracy reports on NAS and SPEC'06 FP available at
  https://benchmark-subsetting.github.io/cereReports

15 16

Thanks for your attention

C E R E

https://benchmark-subsetting.github.io/cere

distributed under the LGPLv3

16 16

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 17: CERE: LLVM based Codelet Extractor and REplayer for

Thanks for your attention

C E R E

httpsbenchmark-subsettinggithubiocere

distributed under the LGPLv3

16 16

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 18: CERE: LLVM based Codelet Extractor and REplayer for

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

                 Atom      Core 2    Nehalem    Sandy Bridge   Ivy Bridge   Haswell
CPU              D510      E7500     L5609      E3-1240        i7-3770      i7-4770
Frequency (GHz)  1.66      2.93      1.86       3.30           3.40         3.40
Cores            2         2         4          4              4            4
L1 cache (KB)    2×56      2×64      4×64       4×64           4×64         4×64
L2 cache         2×512 KB  3 MB      4×256 KB   4×256 KB       4×256 KB     4×256 KB
L3 cache (MB)    -         -         12         8              8            8
Ram (GB)         4         4         8          6              16           16

32-bit portability test: ARM1176JZF-S on a Raspberry Pi Model B+

Clustering NR Codelets (cut for K = 14)

Codelet      Computation Pattern
toeplz_1     DP: 2 simultaneous reductions
rstrct_29    DP: MG Laplacian fine to coarse mesh transition
mprove_8     MP: Dense Matrix x vector product
toeplz_4     DP: Vector multiply in asc/desc order
realft_4     DP: FFT butterfly computation
toeplz_3     DP: 3 simultaneous reductions
svbksb_3     SP: Dense Matrix x vector product
lop_13       DP: Laplacian finite difference, constant coefficients
toeplz_2     DP: Vector multiply element wise in asc/desc order
four1_2      MP: First step FFT
tridag_2     DP: First order recurrence
tridag_1     DP: First order recurrence
ludcmp_4     SP: Dot product over lower half square matrix
hqr_15       SP: Addition on the diagonal elements of a matrix
relax2_26    DP: Red Black Sweeps Laplacian operator
svdcmp_14    DP: Vector divide element wise
svdcmp_13    DP: Norm + Vector divide
hqr_13       DP: Sum of the absolute values of a matrix column
hqr_12_sq    SP: Sum of a square matrix
jacobi_5     SP: Sum of the upper half of a square matrix
hqr_12       SP: Sum of the lower half of a square matrix
svdcmp_11    DP: Multiplying a matrix row by a scalar
elmhes_11    DP: Linear combination of matrix rows
mprove_9     DP: Subtracting a vector from a vector
matadd_16    DP: Sum of two square matrices element wise
svdcmp_6     DP: Sum of the absolute values of a matrix row
elmhes_10    DP: Linear combination of matrix columns
balanc_3     DP: Vector multiply element wise

Clustering NR Codelets (cut for K = 14): clusters group similar computation patterns

Long latency operations:
- svdcmp_14 (DP: Vector divide element wise)
- svdcmp_13 (DP: Norm + Vector divide)

Reduction sums:
- hqr_13 (DP: Sum of the absolute values of a matrix column)
- hqr_12_sq (SP: Sum of a square matrix)
- jacobi_5 (SP: Sum of the upper half of a square matrix)
- hqr_12 (SP: Sum of the lower half of a square matrix)
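The dendrogram is obtained by hierarchically clustering codelet feature vectors and cutting at K clusters. A pure-Python single-linkage sketch on toy 1-D features (the real clustering uses the static and dynamic features of the Profiling Features slide; the values here are made up):

```python
def single_linkage(points, k):
    """Agglomerative single-linkage clustering of 1-D points,
    merging the closest pair of clusters until k remain."""
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):  # single linkage: closest pair across two clusters
        return min(abs(points[i] - points[j]) for i in a for j in b)

    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)   # merge the closest pair
    return clusters

# Toy features: three "reduction sum" codelets near 0.2,
# two "long latency" codelets near 10.
groups = single_linkage([0.1, 0.3, 0.2, 10.0, 10.4], k=2)
```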

Capturing Architecture Change

[Figure: reference vs. target replay times (ms) for four codelets: LU/erhs.f:49 and FT/appft.f:45 (Cluster A: triple-nested, high latency operations (div and exp)), and BT/rhs.f:266 and SP/rhs.f:275 (Cluster B: stencil on five planes, memory bound). Captured on Nehalem (Ref: Freq 1.86 GHz, LLC 12 MB) and replayed on Core 2 (2.93 GHz, 3 MB) and Atom (1.66 GHz, 1 MB); bars annotated slower/faster]

Same Cluster = Same Speedup

[Figure: the same reference vs. target replay times (ms) for LU/erhs.f:49 and FT/appft.f:45 (Cluster A) and BT/rhs.f:266 and SP/rhs.f:275 (Cluster B) on Core 2 and Atom: codelets that belong to the same cluster show the same speedup when moving between architectures]
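Because codelets in a cluster share a speedup, measuring one representative per cluster on the target machine is enough to predict the others. A sketch with hypothetical timings (the codelet names mirror the figure; all numbers are invented):

```python
def predict_times(ref_times, clusters, measured_target):
    """Predict per-codelet times on a target machine.

    ref_times: codelet -> time (ms) on the reference machine
    clusters: representative -> members of its cluster
    measured_target: representative -> measured time (ms) on the target
    """
    predicted = {}
    for rep, members in clusters.items():
        speedup = measured_target[rep] / ref_times[rep]
        for codelet in members:          # same cluster => same speedup
            predicted[codelet] = ref_times[codelet] * speedup
    return predicted

ref = {"LU/erhs": 50.0, "FT/appft": 100.0, "BT/rhs": 80.0, "SP/rhs": 40.0}
clusters = {"LU/erhs": ["LU/erhs", "FT/appft"],   # cluster A
            "BT/rhs": ["BT/rhs", "SP/rhs"]}       # cluster B
measured = {"LU/erhs": 150.0, "BT/rhs": 40.0}     # one run per cluster
pred = predict_times(ref, clusters, measured)
# pred["FT/appft"] == 300.0, pred["SP/rhs"] == 20.0
```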

Feature Selection

- Genetic Algorithm: train on Numerical Recipes + Atom + Sandy Bridge
- Validated on NAS + Core 2
- The feature set is still among the best on NAS

[Figure: median error (%) vs. number of clusters (0 to 25) on Atom, Core 2, and Sandy Bridge; the GA feature set is compared against the worst, median, and best feature sets]
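The feature selection can be sketched as a standard genetic algorithm over feature bit-masks. Everything below is a toy: the fitness simply rewards a planted "useful" set of features, whereas the real fitness scores the clustering prediction error on Numerical Recipes.

```python
import random

def ga_select(n_features, fitness, pop_size=20, generations=40, seed=0):
    """Evolve feature bit-masks; keep the fitter half each generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[:pop_size // 2]         # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_features)
            child = a[:cut] + b[cut:]                  # one-point crossover
            child[rng.randrange(n_features)] ^= 1      # one-bit mutation
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)

# Toy fitness: features 0, 3 and 5 help, any other selected feature hurts.
useful = {0, 3, 5}
fitness = lambda mask: sum(+1 if i in useful else -1
                           for i, bit in enumerate(mask) if bit)
best = ga_select(8, fitness)
```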

Reduction Factor Breakdown

              Total reduction   Reduced invocations   Clustering
Atom          ×44.3             ×12                   ×3.7
Core 2        ×24.7             ×8.7                  ×2.8
Sandy Bridge  ×22.5             ×6.3                  ×3.6

Table: Benchmarking reduction factor breakdown with 18 representatives.
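The total factors decompose as the product of the two stages: replaying a reduced number of invocations per codelet, then benchmarking one representative per cluster. A quick check (the decimal placement of the extracted factors is an assumption):

```python
# Atom row: total reduction should equal the product of both stages.
invocation_reduction = 12.0    # replay a few invocations instead of all
clustering_reduction = 3.7     # benchmark one representative per cluster
total = invocation_reduction * clustering_reduction
assert abs(total - 44.3) < 0.2   # reported total for Atom (x44.3)

# Core 2 and Sandy Bridge rows factor the same way:
assert abs(8.7 * 2.8 - 24.7) < 0.4
assert abs(6.3 * 3.6 - 22.5) < 0.2
```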

Profiling Features

[Diagram: SP codelets are profiled with Likwid (performance counters per codelet, giving 4 dynamic features) and Maqao (static disassembly and analysis, giving 8 static features)]

8 static features (Maqao): Bytes Stored / cycle, Stalls, Estimated IPC, Number of DIV, Number of SD, Pressure in P1, Ratio ADD+SUB/MUL, Vectorization (FP, FP+INT, INT)

4 dynamic features (Likwid): FLOPS, L2 Bandwidth, L3 Miss Rate, Mem Bandwidth

Clang OpenMP front-end

C code:

  int main() {
    #pragma omp parallel
    {
      int p = omp_get_thread_num();
      printf("%d", p);
    }
  }

The Clang OpenMP front end lowers the parallel region to a microtask call (LLVM simplified IR, thread execution model):

  define i32 @main() {
  entry:
    call @__kmpc_fork_call(@omp_microtask, ...)
    ...
  }

  define internal void @omp_microtask() {
  entry:
    %p = alloca i32, align 4
    %call = call i32 @omp_get_thread_num()
    store i32 %call, i32* %p, align 4
    %1 = load i32* %p, align 4
    call @printf(%1)
  }


Retargetable replay: register state

- Issue: register state is non-portable between architectures
- Solution: capture at a function call boundary
  - No shared state through registers except function arguments
  - Get arguments directly through portable IR code
- Register agnostic capture:
  - Portable across Atom, Core 2, Haswell, Ivybridge, Nehalem, Sandybridge
  - Preliminary portability tests between x86 and ARM 32 bits

Retargetable replay: outlining regions

Step 1: Outline the region to capture using the CodeExtractor pass.

original:

  %0 = load i32* %i, align 4
  %1 = load i32* %s.addr, align 4
  %cmp = icmp slt i32 %0, %1
  ; loop branch here
  br i1 %cmp, label %for.body, label %for.exitStub

outlined:

  define internal void @outlined(i32* %i, i32* %s.addr, i32* %a.addr) {
    call void @start_capture(i32* %i, i32* %s.addr, i32* %a.addr)
    %0 = load i32* %i, align 4
    ...
    ret void
  }

original (after outlining):

  call void @outlined(i32* %i, i32* %s.addr, i32* %a.addr)

Step 2: Call start_capture just after the function call.
Step 3: At replay, re-inlining and variable cloning [Liao et al. 2010] steps ensure that the compilation context is close to the original.

Comparison to other Code Isolating tools

                  CERE       Code Isolator  Astex       Codelet Finder  SimPoint
Support
  Language        C(++),     Fortran        C,          C(++),          assembly
                  Fortran                   Fortran     Fortran
  Extraction      IR         source         source      source          assembly
  Indirections    yes        no             no          yes             yes
Replay
  Simulator       yes        yes            yes         yes             yes
  Hardware        yes        yes            yes         yes             no
Reduction
  Capture size    reduced    reduced        reduced     full            -
  Instances       yes        manual         manual      manual          yes
  Code sign.      yes        no             no          no              yes

Accuracy Summary NAS SER

[Figure: accurate replay and codelet coverage (% of Exec Time, 0 to 100) for the NAS A benchmarks lu, cg, mg, sp, ft, bt, is, ep, on Core2 and Haswell]

Figure: The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 20: CERE: LLVM based Codelet Extractor and REplayer for

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 21: CERE: LLVM based Codelet Extractor and REplayer for

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 22: CERE: LLVM based Codelet Extractor and REplayer for

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets (cut for K = 14)

Codelet      Computation Pattern
toeplz_1     DP: 2 simultaneous reductions
rstrct_29    DP: MG Laplacian fine to coarse mesh transition
mprove_8     MP: Dense Matrix x vector product
toeplz_4     DP: Vector multiply in asc/desc order
realft_4     DP: FFT butterfly computation
toeplz_3     DP: 3 simultaneous reductions
svbksb_3     SP: Dense Matrix x vector product
lop_13       DP: Laplacian finite difference, constant coefficients
toeplz_2     DP: Vector multiply element wise in asc/desc order
four1_2      MP: First step FFT
tridag_2     DP: First order recurrence
tridag_1     DP: First order recurrence
ludcmp_4     SP: Dot product over lower half square matrix
hqr_15       SP: Addition on the diagonal elements of a matrix
relax2_26    DP: Red Black Sweeps Laplacian operator
svdcmp_14    DP: Vector divide element wise
svdcmp_13    DP: Norm + Vector divide
hqr_13       DP: Sum of the absolute values of a matrix column
hqr_12_sq    SP: Sum of a square matrix
jacobi_5     SP: Sum of the upper half of a square matrix
hqr_12       SP: Sum of the lower half of a square matrix
svdcmp_11    DP: Multiplying a matrix row by a scalar
elmhes_11    DP: Linear combination of matrix rows
mprove_9     DP: Subtracting a vector with a vector
matadd_16    DP: Sum of two square matrices element wise
svdcmp_6     DP: Sum of the absolute values of a matrix row
elmhes_10    DP: Linear combination of matrix columns
balanc_3     DP: Vector multiply element wise

Clustering NR Codelets (cut for K = 14)

Similar computation patterns end up in the same cluster. Two example clusters:

Long Latency Operations: DP Vector divide element wise; DP Norm + Vector divide

Reduction Sums: DP Sum of the absolute values of a matrix column; SP Sum of a square matrix; SP Sum of the upper half of a square matrix; SP Sum of the lower half of a square matrix

Capturing Architecture Change

Codelets: LU/erhs.f:49 and FT/appft.f:45 (Cluster A: triple-nested loops with high-latency operations, div and exp); BT/rhs.f:266 and SP/rhs.f:275 (Cluster B: stencil on five planes, memory bound).

Architectures: Nehalem (reference: 1.86 GHz, 12 MB LLC); Core 2 (2.93 GHz, 3 MB); Atom (1.66 GHz, 1 MB).

Figure: execution time (ms) of each codelet on the reference versus each target, annotated per target with how much slower or faster the codelet runs.

Same Cluster = Same Speedup

Same codelets and architectures as the previous slide: codelets belonging to the same cluster experience the same speedup (or slowdown) when moving from the reference to a target architecture.

Figure: the same reference-versus-target execution-time plot (ms), grouped by cluster.

Feature Selection

- Genetic Algorithm: trained on Numerical Recipes + Atom + Sandy Bridge
- Validated on NAS + Core 2
- The feature set is still among the best on NAS

Figure: median prediction error against the number of clusters on Atom, Core 2, and Sandy Bridge; the GA-selected features are compared with the worst, median, and best feature sets.

Reduction Factor Breakdown

               Total   Reduced invocations   Clustering
Atom           ×443    ×12                   ×37
Core 2         ×247    ×8.7                  ×28
Sandy Bridge   ×225    ×6.3                  ×36

Table: Benchmarking reduction factor breakdown with 18 representatives.

Profiling Features

SP codelets are profiled with two tools:

Likwid (performance counters per codelet) — 4 dynamic features: FLOPS, L2 Bandwidth, L3 Miss Rate, Mem Bandwidth

Maqao (static disassembly and analysis) — 8 static features: Bytes Stored / cycle, Stalls, Estimated IPC, Number of DIV, Number of SD, Pressure in P1, Ratio (ADD+SUB)/MUL, Vectorization ratio (FP, FP+INT, INT)

Clang OpenMP front-end

C code:

    int main() {
    #pragma omp parallel
      {
        int p = omp_get_thread_num();
        printf("%d", p);
      }
    }

The Clang OpenMP front end outlines the parallel region into a microtask handed to the runtime's fork call (LLVM simplified IR, thread execution model):

    define i32 @main() {
    entry:
      call @__kmpc_fork_call @omp_microtask()
    }

    define internal void @omp_microtask() {
    entry:
      %p = alloca i32, align 4
      %call = call i32 @omp_get_thread_num()
      store i32 %call, i32* %p, align 4
      %1 = load i32* %p, align 4
      call @printf(%1)
    }


Accuracy Summary SPEC FP 2006

Figure: per-benchmark bar chart on Haswell (gamess, sphinx3, dealII, wrf, povray, calculix, soplex, leslie3d, tonto, gromacs, zeusmp, lbm, milc, cactusADM, namd, bwaves, gemsfdtd, specrand), in % of execution time. The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

- low coverage (sphinx3, wrf, povray): regions of < 2000 cycles, or IO
- low matching (soplex, calculix, gamess): warmup "bugs"

CERE page capture dump size

Figure: dump size (MB) per NAS benchmark (bt, cg, ep, ft, is, lu, mg, sp), comparing the full dump (Codelet Finder) with the page-granularity dump (CERE). The CERE page-granularity dump only contains the pages accessed by a codelet; therefore it is much smaller than a full memory dump.

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 24: CERE: LLVM based Codelet Extractor and REplayer for

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 25: CERE: LLVM based Codelet Extractor and REplayer for

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 26: CERE: LLVM based Codelet Extractor and REplayer for

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 27: CERE: LLVM based Codelet Extractor and REplayer for

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Clustering NR Codelets

cut for K = 14

Codelet      Computation Pattern
toeplz_1     DP: 2 simultaneous reductions
rstrct_29    DP: MG Laplacian fine-to-coarse mesh transition
mprove_8     MP: Dense matrix x vector product
toeplz_4     DP: Vector multiply in asc/desc order
realft_4     DP: FFT butterfly computation
toeplz_3     DP: 3 simultaneous reductions
svbksb_3     SP: Dense matrix x vector product
lop_13       DP: Laplacian finite difference, constant coefficients
toeplz_2     DP: Vector multiply element-wise in asc/desc order
four1_2      MP: First step FFT
tridag_2     DP: First-order recurrence
tridag_1     DP: First-order recurrence
ludcmp_4     SP: Dot product over lower half of a square matrix
hqr_15       SP: Addition on the diagonal elements of a matrix
relax2_26    DP: Red-Black sweeps Laplacian operator
svdcmp_14    DP: Vector divide element-wise
svdcmp_13    DP: Norm + vector divide
hqr_13       DP: Sum of the absolute values of a matrix column
hqr_12_sq    SP: Sum of a square matrix
jacobi_5     SP: Sum of the upper half of a square matrix
hqr_12       SP: Sum of the lower half of a square matrix
svdcmp_11    DP: Multiplying a matrix row by a scalar
elmhes_11    DP: Linear combination of matrix rows
mprove_9     DP: Subtracting a vector from a vector
matadd_16    DP: Sum of two square matrices element-wise
svdcmp_6     DP: Sum of the absolute values of a matrix row
elmhes_10    DP: Linear combination of matrix columns
balanc_3     DP: Vector multiply element-wise

27 / 16

Clustering NR Codelets

cut for K = 14

Codelets with similar computation patterns fall in the same cluster, for example:

Long latency operations: DP vector divide element-wise; DP norm + vector divide

Reduction sums: DP sum of the absolute values of a matrix column; SP sum of a square matrix; SP sum of the upper half of a square matrix; SP sum of the lower half of a square matrix

27 / 16

Capturing Architecture Change

Codelets: LU erhs.f:49, FT appft.f:45, BT rhs.f:266, SP rhs.f:275

Cluster A: triple-nested loops, high-latency operations (div and exp)
Cluster B: stencil on five planes (memory bound)

Architectures: Nehalem (reference): 1.86 GHz, 12 MB LLC; Core 2: 2.93 GHz, 3 MB LLC; Atom: 1.66 GHz, 1 MB LLC

[Figure: per-codelet replay times (ms) on the reference versus each target, with slower/faster annotations showing how each codelet's behavior shifts with the architecture change]

28 / 16

Same Cluster = Same Speedup

Codelets: LU erhs.f:49, FT appft.f:45, BT rhs.f:266, SP rhs.f:275

Cluster A: triple-nested loops, high-latency operations (div and exp)
Cluster B: stencil on five planes (memory bound)

Architectures: Nehalem (reference): 1.86 GHz, 12 MB LLC; Core 2: 2.93 GHz, 3 MB LLC; Atom: 1.66 GHz, 1 MB LLC

[Figure: same data as the previous slide; codelets belonging to the same cluster experience the same speedup or slowdown when moving from the reference to a target architecture]

29 / 16

Feature Selection

I Genetic Algorithm: trained on Numerical Recipes + Atom + Sandy Bridge
I Validated on NAS + Core 2
I The feature set is still among the best on NAS

[Figure: median prediction error versus number of clusters (0 to 25) on Atom, Core 2, and Sandy Bridge, comparing the worst, median, and best feature sets against the GA-selected features]

30 / 16

Reduction Factor Breakdown

Reduction       Total   Reduced invocations   Clustering
Atom            443     ×12                   ×37
Core 2          247     ×8.7                  ×28
Sandy Bridge    225     ×6.3                  ×36

Table: Benchmarking reduction factor breakdown with 18 representatives

31 / 16

Profiling Features

SP codelets are profiled with two tools:

I Likwid (performance counters per codelet): 4 dynamic features: FLOPS, L2 bandwidth, L3 miss rate, memory bandwidth
I Maqao (static disassembly and analysis): 8 static features: bytes stored per cycle, stalls, estimated IPC, number of DIV, number of SD, pressure in P1, ratio (ADD+SUB)/MUL, vectorization (FP, FP+INT, INT)

32 / 16

Clang OpenMP front-end

C code:

    void main() {
        #pragma omp parallel
        {
            int p = omp_get_thread_num();
            printf("%d", p);
        }
    }

The Clang OpenMP front end lowers this to a simplified LLVM IR that encodes the thread execution model:

    define i32 @main() {
    entry:
      call @__kmpc_fork_call(@omp_microtask, ...)
    }

    define internal void @omp_microtask() {
    entry:
      %p = alloca i32, align 4
      %call = call i32 @omp_get_thread_num()
      store i32 %call, i32* %p, align 4
      %1 = load i32* %p, align 4
      call @printf(%1)
    }

33 / 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
  • Applications
    • Architecture selection
    • Compiler flags tuning
    • Scalability prediction
  • Demo
    • Capture and replay in NAS BT
    • Simple flag replay for NAS FT
  • Conclusion
  • Appendix