CERE: LLVM based Codelet Extractor and REplayer for Piecewise Benchmarking and Optimization

Chadi Akel, P. de Oliveira Castro, M. Popov, E. Petit, W. Jalby
University of Versailles – Exascale Computing Research

EuroLLVM 2016



Motivation

• Finding the best application parameters is a costly, iterative process

• CERE: Codelet Extractor and REplayer
  • Break an application into standalone codelets
  • Make costly analyses affordable:
    • focus on single regions instead of whole applications
    • run a single representative by clustering similar codelets

Codelet Extraction

• Extract codelets as standalone microbenchmarks

A region of a C or Fortran application:

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] += b[i] * c[j];

The region and its IR are extracted into a codelet wrapper, which captures the machine state as a state dump. Codelets can then be recompiled and run independently.

CERE Workflow

Applications (LLVM IR)
  → Region outlining
  → Region capture: working set and cache capture; generate codelet
    wrappers and working-set memory dumps
  → Codelet replay: warmup + replay of the working sets
  → Fast performance prediction: retarget for different architectures and
    different optimizations; change the number of threads, affinity, and
    runtime parameters; invocation & codelet subsetting

CERE can extract codelets from

• Hot loops

• OpenMP non-nested parallel regions [Popov et al. 2015]

Outline

Extracting and Replaying Codelets
  Faithful
  Retargetable

Applications
  Architecture selection
  Compiler flags tuning
  Scalability prediction

Demo
  Capture and replay in NAS BT
  Simple flag replay for NAS FT

Conclusion

Capturing codelets at Intermediate Representation

• Faithful: behaves similarly to the original region

• Retargetable: modify runtime and compilation parameters

  Assembly-level capture               Source-level capture
  same assembly                        ≠ assembly [Akel et al. 2013]
  hard to retarget (compiler, ISA)     easy to retarget
  costly: support various ISAs         costly: support various languages

• LLVM Intermediate Representation is a good tradeoff

Faithful capture

• Required for a semantically accurate replay:
  • register state
  • memory state
  • OS state: locks, file descriptors, sockets

• No support for OS state except for locks. CERE captures fully from userland: no kernel modules are required.

• Required for a performance-accurate replay:
  • preserve code generation
  • cache state
  • NUMA ownership
  • other warmup state (e.g. branch predictor)

Faithful capture memory

Capture accesses at page granularity: coarse, but fast.

Before entering the region to capture:
  • protect the static and currently allocated process memory (/proc/self/maps)
  • intercept the memory allocation functions with LD_PRELOAD; on an allocation (e.g. a = malloc(256)):
    1. allocate the memory
    2. protect it and return to the user program

On a memory access to a protected page (e.g. a[i]++), the segmentation fault handler:
  1. dumps the accessed memory to disk
  2. unlocks the accessed page and returns to the user program

• Small dump footprint: only the touched pages are saved
• Warmup cache: replay the trace of the most recently touched pages
• NUMA: detect the first touch of each page


Selecting Representative Codelets

• Key idea: applications have redundancies
  • the same codelet is called multiple times
  • codelets share similar performance signatures

• Detect redundancies and keep only one representative

Figure: SPEC tonto make_ft (shell2.F90:1133) execution trace: cycles per invocation over 3000 invocations, with the replayed invocation marked. 90% of NAS codelets can be reduced to four or less representatives.

Performance Signature Clustering

Step A: Perform static and dynamic profiling (Maqao, Likwid) on a reference architecture to capture the codelets' feature vectors (vectorization ratio, FLOPS/s, cache misses, ...).

Step B: Using the proximity between feature vectors, cluster similar codelets and select one representative per cluster.

Step C: CERE extracts the representatives as standalone codelets. A model extrapolates the full benchmark results.


[Oliveira Castro et al 2014]


Codelet Based Architecture Selection

                     Atom    Core 2    Sandy Bridge
Real speedup         0.12    0.83      1.55
Predicted speedup    0.15    0.83      1.59

Figure: Benchmarking NAS serial on three architectures (geometric mean speedup; reference: Nehalem).

• real speedup: measured by benchmarking the original applications
• predicted speedup: predicted with the representative codelets
• CERE is 31× cheaper than running the full benchmarks

Autotuning LLVM middle-end optimizations

• The LLVM middle-end offers more than 50 optimization passes

• Codelet replay enables fast per-region optimization tuning

Figure: NAS SP ysolve codelet: 1000 schedules of random pass combinations explored, based on the O3 passes (cycles on Ivy Bridge 3.4 GHz; original, replay, and O3 runs marked).

CERE is 149× cheaper than running the full benchmark (≈27× cheaper when tuning codelets covering 75% of SP).

Fast Scalability Benchmarking with OpenMP Codelets

Figure: SP compute_rhs on Sandy Bridge: runtime (cycles) for 1 to 32 threads, real vs. predicted. Varying the thread number at replay in SP, and average results over the OpenMP NAS benchmarks [Popov et al. 2015]:

               Core2    Nehalem    Sandy Bridge    Ivy Bridge
Accuracy       98.2%    97.1%      92.6%           97.2%
Acceleration   ×25.2    ×27.4      ×23.7           ×23.7


Conclusion

• CERE breaks an application into faithful and retargetable codelets

• Piece-wise autotuning:
  • different architectures
  • compiler optimizations
  • scalability
  • other costly exploration analyses

• Limitations:
  • no support for codelets performing I/O (OS state is not captured)
  • cannot explore source-level optimizations
  • tied to LLVM

• Full accuracy reports on NAS and SPEC'06 FP are available at https://benchmark-subsetting.github.io/cereReports

Thanks for your attention!

https://benchmark-subsetting.github.io/cere

CERE is distributed under the LGPLv3.

Bibliography I

Akel, Chadi et al. (2013). "Is Source-code Isolation Viable for Performance Characterization?" In: 42nd International Conference on Parallel Processing Workshops. IEEE.

Gao, Xiaofeng et al. (2005). "Reducing overheads for acquiring dynamic memory traces". In: Workload Characterization Symposium, 2005. Proceedings of the IEEE International. IEEE, pp. 46–55.

Liao, Chunhua et al. (2010). "Effective source-to-source outlining to support whole program empirical optimization". In: Languages and Compilers for Parallel Computing. Springer, pp. 308–322.

Oliveira Castro, Pablo de et al. (2014). "Fine-grained Benchmark Subsetting for System Selection". In: Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization. ACM, p. 132.

Popov, Mihail et al. (2015). "PCERE: Fine-grained Parallel Benchmark Decomposition for Scalability Prediction". In: Proceedings of the 29th IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE.

Retargetable replay register state

• Issue: register state is not portable between architectures

• Solution: capture at a function call boundary
  • no shared state through registers except the function arguments
  • get the arguments directly through portable IR code

• Register-agnostic capture:
  • portable across Atom, Core 2, Haswell, Ivy Bridge, Nehalem, and Sandy Bridge
  • preliminary portability tests between x86 and 32-bit ARM

Retargetable replay outlining regions

Step 1: Outline the region to capture using the CodeExtractor pass.

The loop branch in the original function:

    %0 = load i32* %i, align 4
    %1 = load i32* %s.addr, align 4
    %cmp = icmp slt i32 %0, %1
    br i1 %cmp, label %for.body, label %for.exitStub

The outlined function:

    define internal void @outlined(i32* %i, i32* %s.addr, i32* %a.addr) {
      call void @start_capture(i32* %i, i32* %s.addr, i32* %a.addr)
      %0 = load i32* %i, align 4
      ...
      ret void
    }

The call in the original function:

    call void @outlined(i32* %i, i32* %s.addr, i32* %a.addr)

Step 2: Call start_capture just after the function call.

Step 3: At replay, re-inlining and variable cloning [Liao et al. 2010] steps ensure that the compilation context is close to the original.

Comparison to other Code Isolating tools

                  CERE       Code Isolator   Astex        Codelet Finder   SimPoint
Support
  Language        C(++),     Fortran         Fortran, C   Fortran, C(++)   assembly
                  Fortran
  Extraction      IR         source          source       source           assembly
  Indirections    yes        no              no           yes              yes
Replay
  Simulator       yes        yes             yes          yes              yes
  Hardware        yes        yes             yes          yes              no
Reduction
  Capture size    reduced    reduced         reduced      full             -
  Instances       yes        manual          manual       manual           yes
  Code sign.      yes        no              no           no               yes

Accuracy Summary NAS SER

Figure: Coverage and accurate replay per NAS A benchmark (lu, cg, mg, sp, ft, bt, is, ep) on Core2 and Haswell, as % of execution time. The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

Accuracy Summary SPEC FP 2006

Figure: Coverage and accurate replay over SPEC FP 2006 on Haswell (gamess, sphinx3, dealII, wrf, povray, calculix, soplex, leslie3d, tonto, gromacs, zeusmp, lbm, milc, cactusADM, namd, bwaves, GemsFDTD, specrand), as % of execution time. The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

• low coverage (sphinx3, wrf, povray): < 2000 cycles or I/O

• low matching (soplex, calculix, gamess): warmup "bugs"

CERE page capture dump size

Figure: Comparison between the page capture size and the full dump size (MB) on the NAS A benchmarks (bt, cg, ep, ft, is, lu, mg, sp). CERE's page-granularity dump only contains the pages accessed by a codelet; therefore it is much smaller than a full memory dump (Codelet Finder).

CERE page capture overhead

CERE page capture is coarser but faster

        CERE    ATOM 325   PIN 171   Dyninst 40
cg.A    1.94    98.82      222.67    896.86
ft.A    2.41    44.22      127.64    1054.70
lu.A    6.24    80.72      153.46    30.14
mg.A    0.89    107.69     168.61    989.53
sp.A    7.32    67.56      93.04     203.66

Slowdown of a full capture run against the original application run (taking into account the cost of writing the memory dumps and logs to disk, and of tracing the page accesses during the whole execution). We compare against the overhead of the memory tracing tools reported by [Gao et al. 2005].

CERE cache warmup

    for (i = 0; i < size; i++)
        a[i] += b[i];

During capture, CERE records a warmup page trace: a FIFO of the most recently unprotected pages (e.g. pages 21, 22, 23 of array a[] and 50, 51, 52 of array b[]). When a page leaves the FIFO it is reprotected, so a later touch faults again and extends the trace. At replay, the trace is replayed to warm up the cache.


Test architectures

                  Atom      Core 2    Nehalem    Sandy Bridge   Ivy Bridge   Haswell
CPU               D510      E7500     L5609      E3-1240        i7-3770      i7-4770
Frequency (GHz)   1.66      2.93      1.86       3.30           3.40         3.40
Cores             2         2         4          4              4            4
L1 cache (KB)     2×56      2×64      4×64       4×64           4×64         4×64
L2 cache          2×512 KB  3 MB      4×256 KB   4×256 KB       4×256 KB     4×256 KB
L3 cache (MB)     -         -         12         8              8            8
RAM (GB)          4         4         8          6              16           16

32-bit portability test: ARM1176JZF-S on a Raspberry Pi Model B+.

Clustering NR Codelets

(cut for K = 14)

Codelet       Computation pattern
toeplz_1      DP: 2 simultaneous reductions
rstrct_29     DP: MG Laplacian fine to coarse mesh transition
mprove_8      MP: dense matrix × vector product
toeplz_4      DP: vector multiply in asc/desc order
realft_4      DP: FFT butterfly computation
toeplz_3      DP: 3 simultaneous reductions
svbksb_3      SP: dense matrix × vector product
lop_13        DP: Laplacian finite difference, constant coefficients
toeplz_2      DP: vector multiply element wise in asc/desc order
four1_2       MP: first step FFT
tridag_2      DP: first order recurrence
tridag_1      DP: first order recurrence
ludcmp_4      SP: dot product over lower half square matrix
hqr_15        SP: addition on the diagonal elements of a matrix
relax2_26     DP: Red-Black sweeps, Laplacian operator
svdcmp_14     DP: vector divide element wise
svdcmp_13     DP: norm + vector divide
hqr_13        DP: sum of the absolute values of a matrix column
hqr_12_sq     SP: sum of a square matrix
jacobi_5      SP: sum of the upper half of a square matrix
hqr_12        SP: sum of the lower half of a square matrix
svdcmp_11     DP: multiplying a matrix row by a scalar
elmhes_11     DP: linear combination of matrix rows
mprove_9      DP: subtracting a vector from a vector
matadd_16     DP: sum of two square matrices element wise
svdcmp_6      DP: sum of the absolute values of a matrix row
elmhes_10     DP: linear combination of matrix columns
balanc_3      DP: vector multiply element wise

Clustering NR Codelets

(cut for K = 14; same dendrogram as the previous slide)

Similar computation patterns end up clustered together, for example:

Long latency operations:
  svdcmp_14   DP: vector divide element wise
  svdcmp_13   DP: norm + vector divide

Reduction sums:
  hqr_13      DP: sum of the absolute values of a matrix column
  hqr_12_sq   SP: sum of a square matrix
  jacobi_5    SP: sum of the upper half of a square matrix
  hqr_12      SP: sum of the lower half of a square matrix

Capturing Architecture Change

Figure: Four codelets (LU erhs.f:49, FT appft.f:45, BT rhs.f:266, SP rhs.f:275) captured on Nehalem (reference: 1.86 GHz, 12 MB LLC) and replayed on Core 2 (2.93 GHz, 3 MB) and Atom (1.66 GHz, 1 MB); runtimes (ms) are slower or faster than the reference depending on the cluster.

• Cluster A: triple-nested, high latency operations (div and exp)
• Cluster B: stencil on five planes (memory bound)

Same Cluster = Same Speedup

(Same experiment as the previous slide.) Codelets belonging to the same cluster exhibit the same speedup when the architecture changes.

Feature Selection

• Genetic Algorithm: train on Numerical Recipes + Atom + Sandy Bridge

• Validated on NAS + Core 2: the feature set is still among the best on NAS

Figure: Median error vs. number of clusters (0 to 25) on Atom, Core 2, and Sandy Bridge, comparing the GA-selected features against the worst, median, and best feature sets.

Reduction Factor Breakdown

               Total    Reduced invocations    Clustering
Atom           ×44.3    ×12                    ×3.7
Core 2         ×24.7    ×8.7                   ×2.8
Sandy Bridge   ×22.5    ×6.3                   ×3.6

Table: Benchmarking reduction factor breakdown with 18 representatives.

Profiling Features

SP codelets are profiled with Likwid (performance counters per codelet) and Maqao (static disassembly and analysis).

4 dynamic features:
  • FLOPS
  • L2 bandwidth
  • L3 miss rate
  • memory bandwidth

8 static features:
  • bytes stored per cycle
  • stalls
  • estimated IPC
  • number of DIV
  • number of SD
  • pressure in P1
  • ratio (ADD+SUB)/MUL
  • vectorization (FP, FP+INT, INT)

Clang OpenMP front-end

C code:

    void main() {
        #pragma omp parallel
        {
            int p = omp_get_thread_num();
            printf("%d", p);
        }
    }

The Clang OpenMP front end lowers the parallel region to a microtask, forked once per thread (simplified LLVM IR):

    define i32 @main() {
    entry:
      call void @__kmpc_fork_call(@omp_microtask)
    }

    define internal void @omp_microtask() {
    entry:
      %p = alloca i32, align 4
      %call = call i32 @omp_get_thread_num()
      store i32 %call, i32* %p, align 4
      %1 = load i32* %p, align 4
      call @printf(%1)
    }

Page 2: CERE: LLVM based Codelet Extractor and REplayer for

Motivation

I Finding best application parameters is a costly iterativeprocess

C E R E

I Codelet Extractor and REplayerI Break an application into standalone codeletsI Make costly analysis affordable

I Focus on single regions instead of whole applicationsI Run a single representative by clustering similar codelets

1 16

Codelet Extraction

I Extract codelets as standalone microbenchmarks

C or Fortran application

for (i = 0 i lt N i ++) for (j = 0 j lt N j ++) a[i][j] += b[i]c[j]

for (i = 0 i lt N i ++) for (j = 0 j lt N j ++) a[i][j] += b[i]c[j]

codelet wrapper

capture machine state state dump

codelets can be recompiledand run independently

Reg

ion

+IR

2 16

CERE Workflow

ApplicationsLLVM IR Region

outlining

RegionCapture

Fast performance

prediction

Retarget for different architecture different optimizations

Change number of threads affinityruntime parameters

Warmup+

Replay

Working set and cache

capture

Generatecodelets wrapper

Working setsmemory dump

CodeletReplay

Invocationamp

Codeletsubsetting

CERE can extract codelets from

I Hot Loops

I OpenMP non-nested parallel regions [Popov et al 2015]

3 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

4 16

Capturing codelets at Intermediate Representation

I Faithful behaves similarly to the original region

I Retargetable modify runtime and compilation parameters

same assembly 6= assembly [Akel et al 2013]hard to retarget (compiler ISA) easy to retargetcostly support various ISA costly support various languages

I LLVM Intermediate Representation is a good tradeoff

5 16

Faithful capture

I Required for semantically accurate replayI Register stateI Memory stateI OS state locks file descriptors sockets

I No support for OS state except for locks CERE captures fullyfrom userland no kernel modules required

I Required for performance accurate replayI Preserve code generationI Cache stateI NUMA ownershipI Other warmup state (eg branch predictor)

6 16

Faithful capture memory

Capture access at page granularity coarse but fast

regionto capture

protect static and currently allocatedprocess memory (procselfmaps)

intercept memory allocation functionswith LD_PRELOAD

1 allocate memory

2 protect memory and returnto user program

segmentationfault handler

1 dump accessed memory to disk

2 unlock accessed page and return to user program

a[i]++

memoryaccess

a = malloc(256)

memoryallocation

I Small dump footprint only touched pages are savedI Warmup cache replay trace of most recently touched pagesI NUMA detect first touch of each page

7 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

8 16

Selecting Representative Codelets

I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures

I Detect redundancies and keep only one representative

0e+00

2e+07

4e+07

6e+07

8e+07

0 1000 2000 3000invocation

Cyc

les

replay

Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives

9 16

Performance Signature Clustering

Maqao

Likwid

Static amp DynamicProfiling Vectorization ratio

FLOPSsCache Misses[ ] Clustering

f1f2f3

Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors

BT

SP

Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster

Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results

Model

Sandy Bridge

Atom

Core2

FullBenchmarks

Results

[Oliveira Castro et al 2014]

10 16

Codelet Based Architecture Selection

012 015

083 083

155 159

00

05

10

15

Atom Core 2 Sandy Bridge

Geo

met

ricm

ean

spee

dup

Real Speedup

Predicted Speedup

Refe

renc

e N

ehal

em

Figure Benchmarking NAS serial on three architectures

I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks

11 16

Autotuning LLVM middle-end optimizations

I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning

Id of LLVM middleminusend optimization passes combination

Cyc

les

(Ivy

brid

ge 3

4G

Hz)

60e+07

80e+07

10e+08

12e+08

0 200 400 600

originalreplayO3

Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes

CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16

Fast Scalability Benchmarking with OpenMP Codelets

1 2 4 8 16 32Threads

0

1

2

3

4

5

6

Runti

me c

ycl

es

1e8 SP compute rhs on Sandy Bridge

RealPredicted

Core2 Nehalem Sandy Bridge Ivy Bridge

Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237

Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]

13 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

14 16

Conclusion

I CERE breaks an application into faithful and retargetablecodelets

I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis

I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM

I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports

15 16

Thanks for your attention

C E R E

httpsbenchmark-subsettinggithubiocere

distributed under the LGPLv3

16 16

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 3: CERE: LLVM based Codelet Extractor and REplayer for

Codelet Extraction

- Extract codelets as standalone microbenchmarks

A region of a C or Fortran application:

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] += b[i] * c[j];

is extracted (region + IR) into a codelet wrapper that captures the machine state into a state dump; the codelets can then be recompiled and run independently.

2 / 16

CERE Workflow

Applications go through LLVM IR region outlining, then region capture (working set and cache capture, working-set memory dump). Replay (generate codelet wrappers, warmup + replay) gives fast performance prediction; replays can be retargeted to a different architecture or different optimizations, or changed in number of threads, affinity, and runtime parameters. Invocation and codelet subsetting reduce the number of runs.

CERE can extract codelets from:

- Hot loops
- OpenMP non-nested parallel regions [Popov et al. 2015]

3 / 16

Outline

- Extracting and Replaying Codelets
  - Faithful
  - Retargetable
- Applications
  - Architecture selection
  - Compiler flags tuning
  - Scalability prediction
- Demo
  - Capture and replay in NAS BT
  - Simple flag replay for NAS FT
- Conclusion

4 / 16

Capturing codelets at Intermediate Representation

- Faithful: behaves similarly to the original region
- Retargetable: modify runtime and compilation parameters

Assembly-level capture preserves the exact assembly, but is hard to retarget (tied to compiler and ISA) and costly: each ISA must be supported. Source-level capture yields different assembly [Akel et al. 2013], but is easy to retarget and costly in another way: each language must be supported.

- LLVM Intermediate Representation is a good tradeoff

5 / 16

Faithful capture

- Required for a semantically accurate replay:
  - Register state
  - Memory state
  - OS state: locks, file descriptors, sockets
- No support for OS state except for locks. CERE captures fully from userland: no kernel modules required.
- Required for a performance accurate replay:
  - Preserve code generation
  - Cache state
  - NUMA ownership
  - Other warmup state (e.g. branch predictor)

6 / 16

Faithful capture: memory

Capture accesses at page granularity: coarse but fast.

To capture a region, CERE protects the static and currently allocated process memory (/proc/self/maps) and intercepts memory allocation functions with LD_PRELOAD: on an allocation such as `a = malloc(256)`, it (1) allocates the memory, then (2) protects it and returns to the user program. On a memory access such as `a[i]++`, the segmentation fault handler (1) dumps the accessed memory to disk, then (2) unlocks the accessed page and returns to the user program.

- Small dump footprint: only touched pages are saved
- Warmup cache: replay the trace of most recently touched pages
- NUMA: detect the first touch of each page

7 / 16

Outline

- Extracting and Replaying Codelets
  - Faithful
  - Retargetable
- Applications
  - Architecture selection
  - Compiler flags tuning
  - Scalability prediction
- Demo
  - Capture and replay in NAS BT
  - Simple flag replay for NAS FT
- Conclusion

8 / 16

Selecting Representative Codelets

- Key idea: applications have redundancies
  - The same codelet is called multiple times
  - Codelets share similar performance signatures
- Detect redundancies and keep only one representative

Figure: SPEC tonto make_ft (shell2.F90:1133) execution trace: cycles per invocation over 3000 invocations, with the replayed representative overlaid. 90% of NAS codelets can be reduced to four or less representatives.

9 / 16

Performance Signature Clustering

Step A: Perform static and dynamic profiling (Maqao, Likwid) on a reference architecture to capture each codelet's feature vector (vectorization ratio, FLOPS/s, cache misses, ...).

Step B: Using the proximity between feature vectors, cluster similar codelets (e.g. from BT and SP) and select one representative per cluster.

Step C: CERE extracts the representatives as standalone codelets and replays them on the target architectures (Sandy Bridge, Atom, Core 2); a model extrapolates full benchmark results.

[Oliveira Castro et al. 2014]

10 / 16

Codelet Based Architecture Selection

Figure: Benchmarking NAS serial on three architectures (reference: Nehalem). Geometric mean speedup, real vs. predicted: Atom 0.12 vs. 0.15, Core 2 0.83 vs. 0.83, Sandy Bridge 1.55 vs. 1.59.

- Real speedup: measured by benchmarking the original applications
- Predicted speedup: predicted with the representative codelets
- CERE is 31× cheaper than running the full benchmarks

11 / 16

Autotuning LLVM middle-end optimizations

- The LLVM middle-end offers more than 50 optimization passes
- Codelet replay enables fast per-region optimization tuning

Figure: NAS SP ysolve codelet, cycles (Ivy Bridge 3.4 GHz) for 1000 schedules of random pass combinations explored, based on the O3 passes; original, replay, and O3 levels shown.

CERE is 149× cheaper than running the full benchmark (27× cheaper when tuning codelets covering 75% of SP).

12 / 16

Fast Scalability Benchmarking with OpenMP Codelets

Figure: SP compute_rhs on Sandy Bridge: runtime in cycles (×1e8) for 1 to 32 threads, real vs. predicted. Varying the thread number at replay in SP, and average results over the OpenMP NAS benchmarks [Popov et al. 2015]:

               Core 2   Nehalem   Sandy Bridge   Ivy Bridge
Accuracy        98.2%     97.1%          92.6%        97.2%
Acceleration    ×25.2     ×27.4          ×23.7        ×23.7

13 / 16

Outline

- Extracting and Replaying Codelets
  - Faithful
  - Retargetable
- Applications
  - Architecture selection
  - Compiler flags tuning
  - Scalability prediction
- Demo
  - Capture and replay in NAS BT
  - Simple flag replay for NAS FT
- Conclusion

14 / 16

Conclusion

- CERE breaks an application into faithful and retargetable codelets
- Piece-wise autotuning:
  - Different architectures
  - Compiler optimizations
  - Scalability
  - Other exploration of costly analyses
- Limitations:
  - No support for codelets performing I/O (OS state not captured)
  - Cannot explore source-level optimizations
  - Tied to LLVM
- Full accuracy reports on NAS and SPEC'06 FP available at benchmark-subsetting.github.io/cere (Reports)

15 / 16

Thanks for your attention

C E R E

https://benchmark-subsetting.github.io/cere

Distributed under the LGPLv3.

16 / 16

Bibliography I

Akel, Chadi et al. (2013). "Is Source-code Isolation Viable for Performance Characterization?" In: 42nd International Conference on Parallel Processing Workshops. IEEE.

Gao, Xiaofeng et al. (2005). "Reducing overheads for acquiring dynamic memory traces". In: Workload Characterization Symposium, 2005. Proceedings of the IEEE International. IEEE, pp. 46–55.

Liao, Chunhua et al. (2010). "Effective source-to-source outlining to support whole program empirical optimization". In: Languages and Compilers for Parallel Computing. Springer, pp. 308–322.

Oliveira Castro, Pablo de et al. (2014). "Fine-grained Benchmark Subsetting for System Selection". In: Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization. ACM, p. 132.

Popov, Mihail et al. (2015). "PCERE: Fine-grained Parallel Benchmark Decomposition for Scalability Prediction". In: Proceedings of the 29th IEEE International Parallel and Distributed Processing Symposium, IPDPS. IEEE.

17 / 16

Retargetable replay: register state

- Issue: register state is non-portable between architectures
- Solution: capture at a function call boundary
  - No shared state through registers except function arguments
  - Get the arguments directly through portable IR code
- Register-agnostic capture
- Portable across Atom, Core 2, Haswell, Ivy Bridge, Nehalem, Sandy Bridge
- Preliminary portability tests between x86 and 32-bit ARM

18 / 16

Retargetable replay: outlining regions

Step 1: Outline the region to capture using the CodeExtractor pass. The original loop header:

    %0 = load i32* %i, align 4
    %1 = load i32* %s.addr, align 4
    %cmp = icmp slt i32 %0, %1
    br i1 %cmp,              ; loop branch here
       label %for.body,
       label %for.exitStub

becomes an outlined function called from the original code:

    define internal void @outlined(
        i32* %i, i32* %s.addr, i32* %a.addr) {
      call void @start_capture(i32* %i,
          i32* %s.addr, i32* %a.addr)
      %0 = load i32* %i, align 4
      ...
      ret void
    }

    ; in the original:
    call void @outlined(i32* %i, i32* %s.addr, i32* %a.addr)

Step 2: Call start_capture just after the function call.
Step 3: At replay, re-inlining and variable cloning [Liao et al. 2010] steps ensure that the compilation context is close to the original.

19 / 16

Comparison to other Code Isolating tools

                   CERE      Code      Astex     Codelet   SimPoint
                             Isolator            Finder
Support
  Language         C(++),    Fortran   C,        C(++),    assembly
                   Fortran             Fortran   Fortran
  Extraction       IR        source    source    source    assembly
  Indirections     yes       no        no        yes       yes
Replay
  Simulator        yes       yes       yes       yes       yes
  Hardware         yes       yes       yes       yes       no
Reduction
  Capture size     reduced   reduced   reduced   full      -
  Instances        yes       manual    manual    manual    yes
  Code signature   yes       no        no        no        yes

20 / 16

Accuracy Summary: NAS SER

Figure: NAS A benchmarks (lu, cg, mg, sp, ft, bt, is, ep) on Core 2 and Haswell: codelet coverage and accurate replay as percentage of execution time. The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

21 / 16

Accuracy Summary SPEC FP 2006

Figure: SPEC FP 2006 on Haswell (gamess, sphinx3, dealII, wrf, povray, calculix, soplex, leslie3d, tonto, gromacs, zeusmp, lbm, milc, cactusADM, namd, bwaves, gemsfdtd, specrand): codelet coverage and accurate replay as percentage of execution time. The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

- Low coverage (sphinx3, wrf, povray): regions < 2000 cycles, or I/O
- Low matching (soplex, calculix, gamess): warmup "bugs"

22 / 16

CERE page capture dump size

Figure: Comparison between the page capture and the full dump size on NAS A benchmarks (bt, cg, ep, ft, is, lu, mg, sp): full dump (Codelet Finder) vs. page-granularity dump (CERE), in MB. CERE's page-granularity dump only contains the pages accessed by a codelet; therefore it is much smaller than a full memory dump.

23 / 16

CERE page capture overhead

CERE's page capture is coarser but faster:

         CERE   ATOM 3.25   PIN 1.71   Dyninst 4.0
cg.A     1.94       98.82     222.67        896.86
ft.A     2.41       44.22     127.64       1054.70
lu.A     6.24       80.72     153.46         30.14
mg.A     0.89      107.69     168.61        989.53
sp.A     7.32       67.56      93.04        203.66

Table: slowdown of a full capture run against the original application run (taking into account the cost of writing the memory dumps and logs to disk, and of tracing the page accesses during the whole execution). The other columns give the overhead of memory tracing tools as reported by [Gao et al. 2005].

24 / 16

CERE cache warmup

    for (i = 0; i < size; i++)
        a[i] += b[i];

As the loop touches the pages of a[] (e.g. pages 17–23) and b[] (e.g. pages 46–52), their addresses are appended to the warmup page trace. A FIFO keeps the most recently unprotected pages; when a page falls out of the FIFO, it is reprotected (e.g. reprotect page 20). At replay, the trace is replayed to warm up the cache.

25 / 16

Test architectures

                 Atom     Core 2   Nehalem   Sandy Bridge   Ivy Bridge   Haswell
CPU              D510     E7500    L5609     E3-1240        i7-3770      i7-4770
Frequency (GHz)  1.66     2.93     1.86      3.30           3.40         3.40
Cores            2        2        4         4              4            4
L1 cache (KB)    2×56     2×64     4×64      4×64           4×64         4×64
L2 cache         2×512 KB 3 MB     4×256 KB  4×256 KB       4×256 KB     4×256 KB
L3 cache (MB)    -        -        12        8              8            8
RAM (GB)         4        4        8         6              16           16

32-bit portability test: ARM1176JZF-S on a Raspberry Pi Model B+.

26 / 16

Clustering NR Codelets

cut for K = 14

Codelet      Computation pattern
toeplz_1     DP: 2 simultaneous reductions
rstrct_29    DP: MG Laplacian fine to coarse mesh transition
mprove_8     MP: Dense matrix × vector product
toeplz_4     DP: Vector multiply in asc/desc order
realft_4     DP: FFT butterfly computation
toeplz_3     DP: 3 simultaneous reductions
svbksb_3     SP: Dense matrix × vector product
lop_13       DP: Laplacian finite difference, constant coefficients
toeplz_2     DP: Vector multiply element-wise in asc/desc order
four1_2      MP: First step FFT
tridag_2     DP: First order recurrence
tridag_1     DP: First order recurrence
ludcmp_4     SP: Dot product over lower half square matrix
hqr_15       SP: Addition on the diagonal elements of a matrix
relax2_26    DP: Red-Black sweeps Laplacian operator
svdcmp_14    DP: Vector divide element-wise
svdcmp_13    DP: Norm + vector divide
hqr_13       DP: Sum of the absolute values of a matrix column
hqr_12_sq    SP: Sum of a square matrix
jacobi_5     SP: Sum of the upper half of a square matrix
hqr_12       SP: Sum of the lower half of a square matrix
svdcmp_11    DP: Multiplying a matrix row by a scalar
elmhes_11    DP: Linear combination of matrix rows
mprove_9     DP: Subtracting a vector from a vector
matadd_16    DP: Sum of two square matrices element-wise
svdcmp_6     DP: Sum of the absolute values of a matrix row
elmhes_10    DP: Linear combination of matrix columns
balanc_3     DP: Vector multiply element-wise

27 / 16

Clustering NR Codelets

cut for K = 14

Zooming on two clusters of the dendrogram above:

- Long latency operations: DP vector divide element-wise, DP norm + vector divide
- Reduction sums: DP sum of the absolute values of a matrix column, SP sum of a square matrix, SP sum of the upper half of a square matrix, SP sum of the lower half of a square matrix

Codelets with similar computation patterns end up in the same cluster.

27 / 16

Capturing Architecture Change

Codelets: LU erhs.f:49, FT appft.f:45, BT rhs.f:266, SP rhs.f:275.

- Cluster A: triple-nested, high latency operations (div and exp)
- Cluster B: stencil on five planes (memory bound)

Figure: per-invocation execution time (ms) of each codelet on the reference, Nehalem (1.86 GHz, 12 MB LLC), and on two targets: Core 2 (2.93 GHz, 3 MB) and Atom (1.66 GHz, 1 MB). Moving to a target makes some codelets slower and others faster.

28 / 16

Same Cluster = Same Speedup

Same codelets and clusters as the previous slide: LU erhs.f:49, FT appft.f:45, BT rhs.f:266, SP rhs.f:275; Cluster A, triple-nested high latency operations (div and exp); Cluster B, stencil on five planes (memory bound).

Figure: per-invocation execution time (ms) on the reference, Nehalem (1.86 GHz, 12 MB LLC), and on the targets Core 2 (2.93 GHz, 3 MB) and Atom (1.66 GHz, 1 MB): codelets in the same cluster slow down or speed up by similar factors when the architecture changes.

29 / 16

Feature Selection

- Genetic Algorithm: train on Numerical Recipes + Atom + Sandy Bridge
- Validated on NAS + Core 2
- The feature set is still among the best on NAS

Figure: median error (%) vs. number of clusters (0 to 25) on Atom, Core 2, and Sandy Bridge; the GA-selected features are compared against the worst, median, and best feature sets.

30 / 16

Reduction Factor Breakdown

               Total    Reduced invocations   Clustering
Atom           ×44.3    ×12                   ×3.7
Core 2         ×24.7    ×8.7                  ×2.8
Sandy Bridge   ×22.5    ×6.3                  ×3.6

Table: Benchmarking reduction factor breakdown with 18 representatives.

31 / 16

Profiling Features

SP codelets are profiled with Likwid (performance counters per codelet) and Maqao (static disassembly and analysis).

- 4 dynamic features: FLOPS, L2 bandwidth, L3 miss rate, memory bandwidth
- 8 static features: bytes stored per cycle, stalls, estimated IPC, number of DIV, number of SD, pressure in P1, ratio (ADD+SUB)/MUL, vectorization (FP, FP+INT, INT)

32 / 16

Clang OpenMP front-end

C code:

    void main() {
        #pragma omp parallel
        {
            int p = omp_get_thread_num();
            printf("%d", p);
        }
    }

The Clang OpenMP front end lowers the parallel region to a fork call, following the thread execution model (simplified LLVM IR):

    define i32 @main() {
    entry:
      call @__kmpc_fork_call(@omp_microtask, ...)
    }

    define internal void @omp_microtask() {
    entry:
      %p = alloca i32, align 4
      %call = call i32 @omp_get_thread_num()
      store i32 %call, i32* %p, align 4
      %1 = load i32* %p, align 4
      call @printf(%1)
    }

33 / 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 4: CERE: LLVM based Codelet Extractor and REplayer for

CERE Workflow

ApplicationsLLVM IR Region

outlining

RegionCapture

Fast performance

prediction

Retarget for different architecture different optimizations

Change number of threads affinityruntime parameters

Warmup+

Replay

Working set and cache

capture

Generatecodelets wrapper

Working setsmemory dump

CodeletReplay

Invocationamp

Codeletsubsetting

CERE can extract codelets from

I Hot Loops

I OpenMP non-nested parallel regions [Popov et al 2015]

3 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

4 16

Capturing codelets at Intermediate Representation

I Faithful behaves similarly to the original region

I Retargetable modify runtime and compilation parameters

same assembly 6= assembly [Akel et al 2013]hard to retarget (compiler ISA) easy to retargetcostly support various ISA costly support various languages

I LLVM Intermediate Representation is a good tradeoff

5 16

Faithful capture

I Required for semantically accurate replayI Register stateI Memory stateI OS state locks file descriptors sockets

I No support for OS state except for locks CERE captures fullyfrom userland no kernel modules required

I Required for performance accurate replayI Preserve code generationI Cache stateI NUMA ownershipI Other warmup state (eg branch predictor)

6 16

Faithful capture memory

Capture access at page granularity coarse but fast

regionto capture

protect static and currently allocatedprocess memory (procselfmaps)

intercept memory allocation functionswith LD_PRELOAD

1 allocate memory

2 protect memory and returnto user program

segmentationfault handler

1 dump accessed memory to disk

2 unlock accessed page and return to user program

a[i]++

memoryaccess

a = malloc(256)

memoryallocation

I Small dump footprint only touched pages are savedI Warmup cache replay trace of most recently touched pagesI NUMA detect first touch of each page

7 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

8 16

Selecting Representative Codelets

I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures

I Detect redundancies and keep only one representative

0e+00

2e+07

4e+07

6e+07

8e+07

0 1000 2000 3000invocation

Cyc

les

replay

Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives

9 16

Performance Signature Clustering

Maqao

Likwid

Static amp DynamicProfiling Vectorization ratio

FLOPSsCache Misses[ ] Clustering

f1f2f3

Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors

BT

SP

Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster

Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results

Model

Sandy Bridge

Atom

Core2

FullBenchmarks

Results

[Oliveira Castro et al 2014]

10 16

Codelet Based Architecture Selection

012 015

083 083

155 159

00

05

10

15

Atom Core 2 Sandy Bridge

Geo

met

ricm

ean

spee

dup

Real Speedup

Predicted Speedup

Refe

renc

e N

ehal

em

Figure Benchmarking NAS serial on three architectures

I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks

11 16

Autotuning LLVM middle-end optimizations

I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning

Id of LLVM middleminusend optimization passes combination

Cyc

les

(Ivy

brid

ge 3

4G

Hz)

60e+07

80e+07

10e+08

12e+08

0 200 400 600

originalreplayO3

Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes

CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16

Fast Scalability Benchmarking with OpenMP Codelets

1 2 4 8 16 32Threads

0

1

2

3

4

5

6

Runti

me c

ycl

es

1e8 SP compute rhs on Sandy Bridge

RealPredicted

Core2 Nehalem Sandy Bridge Ivy Bridge

Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237

Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]

13 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

14 16

Conclusion

I CERE breaks an application into faithful and retargetablecodelets

I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis

I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM

I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports

15 16

Thanks for your attention

C E R E

httpsbenchmark-subsettinggithubiocere

distributed under the LGPLv3

16 16

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 5: CERE: LLVM based Codelet Extractor and REplayer for

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

4 16

Capturing codelets at Intermediate Representation

I Faithful behaves similarly to the original region

I Retargetable modify runtime and compilation parameters

same assembly 6= assembly [Akel et al 2013]hard to retarget (compiler ISA) easy to retargetcostly support various ISA costly support various languages

I LLVM Intermediate Representation is a good tradeoff

5 16

Faithful capture

I Required for semantically accurate replayI Register stateI Memory stateI OS state locks file descriptors sockets

I No support for OS state except for locks CERE captures fullyfrom userland no kernel modules required

I Required for performance accurate replayI Preserve code generationI Cache stateI NUMA ownershipI Other warmup state (eg branch predictor)

6 16

Faithful capture memory

Capture access at page granularity coarse but fast

regionto capture

protect static and currently allocatedprocess memory (procselfmaps)

intercept memory allocation functionswith LD_PRELOAD

1 allocate memory

2 protect memory and returnto user program

segmentationfault handler

1 dump accessed memory to disk

2 unlock accessed page and return to user program

a[i]++

memoryaccess

a = malloc(256)

memoryallocation

I Small dump footprint only touched pages are savedI Warmup cache replay trace of most recently touched pagesI NUMA detect first touch of each page

7 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

8 16

Selecting Representative Codelets

- Key idea: applications have redundancies
  - The same codelet is called multiple times
  - Codelets share similar performance signatures
- Detect redundancies and keep only one representative

Figure: SPEC tonto make_ft.shell2.F90:1133 execution trace (cycles per invocation, with replayed invocations marked). 90% of NAS codelets can be reduced to four or fewer representatives.

Performance Signature Clustering

Static & dynamic profiling (Maqao, Likwid) produces a feature vector per codelet: vectorization ratio, FLOPS/s, cache misses, ...

Step A: Perform static and dynamic analysis on a reference architecture to capture codelets' feature vectors.
Step B: Using the proximity between feature vectors, cluster similar codelets (e.g. across BT and SP) and select one representative per cluster.
Step C: CERE extracts the representatives as standalone codelets. A model extrapolates full benchmark results on the target machines (Sandy Bridge, Atom, Core 2). [Oliveira Castro et al. 2014]

Codelet Based Architecture Selection

Figure: Benchmarking NAS serial on three architectures, geometric mean speedup vs. the Nehalem reference: Atom 0.12 (real) / 0.15 (predicted), Core 2 0.83 / 0.83, Sandy Bridge 1.55 / 1.59.

- real speedup: measured by benchmarking the original applications
- predicted speedup: predicted with representative codelets
- CERE is 31× cheaper than running the full benchmarks

Autotuning LLVM middle-end optimizations

- The LLVM middle-end offers more than 50 optimization passes
- Codelet replay enables fast per-region optimization tuning

Figure: NAS SP ysolve codelet, cycles (Ivy Bridge 3.4 GHz) against the id of the LLVM middle-end optimization pass combination: 1000 schedules of random pass combinations explored, based on the O3 passes, with the original, replay, and O3 levels marked.

CERE is 149× cheaper than running the full benchmark (27× cheaper when tuning codelets covering 75% of SP).

Fast Scalability Benchmarking with OpenMP Codelets

Figure: SP compute_rhs on Sandy Bridge, runtime (cycles, ×1e8) for 1, 2, 4, 8, 16, and 32 threads, real vs. predicted.

                Core 2   Nehalem   Sandy Bridge   Ivy Bridge
  Accuracy       98.2%    97.1%     92.6%          97.2%
  Acceleration   ×25.2    ×27.4     ×23.7          ×23.7

Figure: Varying thread number at replay in SP, and average results over the OMP NAS benchmarks [Popov et al. 2015].


Conclusion

- CERE breaks an application into faithful and retargetable codelets
- Piece-wise autotuning:
  - Different architectures
  - Compiler optimizations
  - Scalability
  - Other exploration-costly analyses
- Limitations:
  - No support for codelets performing I/O (OS state is not captured)
  - Cannot explore source-level optimizations
  - Tied to LLVM
- Full accuracy reports on NAS and SPEC'06 FP are available at https://benchmark-subsetting.github.io/cere (Reports)

Thanks for your attention

C E R E

https://benchmark-subsetting.github.io/cere

Distributed under the LGPLv3.

Bibliography I

Akel, Chadi et al. (2013). "Is Source-code Isolation Viable for Performance Characterization?" In: 42nd International Conference on Parallel Processing Workshops. IEEE.

Gao, Xiaofeng et al. (2005). "Reducing overheads for acquiring dynamic memory traces". In: Workload Characterization Symposium, 2005. Proceedings of the IEEE International. IEEE, pp. 46–55.

Liao, Chunhua et al. (2010). "Effective source-to-source outlining to support whole program empirical optimization". In: Languages and Compilers for Parallel Computing. Springer, pp. 308–322.

Oliveira Castro, Pablo de et al. (2014). "Fine-grained Benchmark Subsetting for System Selection". In: Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization. ACM, p. 132.

Popov, Mihail et al. (2015). "PCERE: Fine-grained Parallel Benchmark Decomposition for Scalability Prediction". In: Proceedings of the 29th IEEE International Parallel and Distributed Processing Symposium, IPDPS. IEEE.

Retargetable replay register state

- Issue: register state is non-portable between architectures
- Solution: capture at a function call boundary
  - No shared state through registers except function arguments
  - Arguments are obtained directly through portable IR code
- Register-agnostic capture
- Portable across Atom, Core 2, Haswell, Ivy Bridge, Nehalem, and Sandy Bridge
- Preliminary portability tests between x86 and 32-bit ARM

Retargetable replay outlining regions

Step 1: Outline the region to capture using the CodeExtractor pass.

original:
    %0 = load i32* %i, align 4
    %1 = load i32* %s.addr, align 4
    %cmp = icmp slt i32 %0, %1
    br i1 %cmp,              ; loop branch here
       label %for.body,
       label %for.exitStub

outlined:
    define internal void @outlined(i32* %i, i32* %s.addr, i32* %a.addr) {
      call void @start_capture(i32* %i, i32* %s.addr, i32* %a.addr)
      %0 = load i32* %i, align 4
      ...
      ret void
    }

original call site:
    call void @outlined(i32* %i, i32* %s.addr, i32* %a.addr)

Step 2: Call start_capture just after the function call.
Step 3: At replay, re-inlining and variable cloning [Liao et al. 2010] steps ensure that the compilation context is close to the original.

Comparison to other Code Isolating tools

                 CERE             Code Isolator   Astex        Codelet Finder   SimPoint
Support
  Language       C(++), Fortran   Fortran         C, Fortran   C(++), Fortran   assembly
  Extraction     IR               source          source       source           assembly
  Indirections   yes              no              no           yes              yes
Replay
  Simulator      yes              yes             yes          yes              yes
  Hardware       yes              yes             yes          yes              no
Reduction
  Capture size   reduced          reduced         reduced      full             -
  Instances      yes              manual          manual       manual           yes
  Code sign.     yes              no              no           no               yes

Accuracy Summary NAS SER

Figure: NAS A serial benchmarks (lu, cg, mg, sp, ft, bt, is, ep) on Core 2 and Haswell, % of exec time per benchmark. The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

Accuracy Summary SPEC FP 2006

Figure: SPEC FP 2006 on Haswell (gamess, sphinx3, dealII, wrf, povray, calculix, soplex, leslie3d, tonto, gromacs, zeusmp, lbm, milc, cactusADM, namd, bwaves, gemsfdtd, specrand), % of exec time per benchmark. The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

- low coverage (sphinx3, wrf, povray): regions shorter than 2000 cycles or performing I/O
- low matching (soplex, calculix, gamess): warmup "bugs"

CERE page capture dump size

Figure: Comparison between the page capture and full dump sizes (MB) on the NAS A benchmarks (bt, cg, ep, ft, is, lu, mg, sp): full dump (Codelet Finder) vs. page granularity dump (CERE). CERE's page granularity dump only contains the pages accessed by a codelet; it is therefore much smaller than a full memory dump.

CERE page capture overhead

CERE page capture is coarser but faster

         CERE   ATOM 325   PIN 171   Dyninst 40
  cg.A   194    9882       22267     89686
  ft.A   241    4422       12764     105470
  lu.A   624    8072       15346     3014
  mg.A   89     10769      16861     98953
  sp.A   732    6756       9304      20366

Slowdown of a full capture run against the original application run (takes into account the cost of writing the memory dumps and logs to disk, and of tracing the page accesses during the whole execution). We compare to the overhead of memory tracing tools as reported by [Gao et al. 2005].

CERE cache warmup

for (i = 0; i < size; i++) a[i] += b[i];

Figure: As the loop above touches pages of arrays a[] (21, 22, 23, ...) and b[] (50, 51, 52, ...), each newly unprotected page enters a FIFO of the most recently unprotected pages, while pages leaving the FIFO (e.g. page 20) are reprotected. This warmup page trace is replayed before the codelet runs to warm the cache.

Test architectures

                   Atom     Core 2   Nehalem    Sandy Bridge   Ivy Bridge   Haswell
  CPU              D510     E7500    L5609      E3-1240        i7-3770      i7-4770
  Frequency (GHz)  1.66     2.93     1.86       3.30           3.40         3.40
  Cores            2        2        4          4              4            4
  L1 cache (KB)    2×56     2×64     4×64       4×64           4×64         4×64
  L2 cache         2×512 KB 3 MB     4×256 KB   4×256 KB       4×256 KB     4×256 KB
  L3 cache (MB)    -        -        12         8              8            8
  RAM (GB)         4        4        8          6              16           16

32-bit portability test: ARM1176JZF-S on a Raspberry Pi Model B+.

Clustering NR Codelets

Figure: Hierarchical clustering of Numerical Recipes codelets, cut for K = 14. The codelets (toeplz_1, rstrct_29, mprove_8, toeplz_4, realft_4, toeplz_3, svbksb_3, lop_13, toeplz_2, four1_2, tridag_2, tridag_1, ludcmp_4, hqr_15, relax2_26, svdcmp_14, svdcmp_13, hqr_13, hqr_12_sq, jacobi_5, hqr_12, svdcmp_11, elmhes_11, mprove_9, matadd_16, svdcmp_6, elmhes_10, balanc_3) are annotated with their computation pattern, e.g. DP simultaneous reductions, MG Laplacian fine-to-coarse mesh transition, dense matrix × vector products, FFT butterfly computation, first order recurrences, element-wise vector operations, and row/column sums and linear combinations of matrices.

Clustering NR Codelets

Figure: Same clustering (cut for K = 14), highlighting that clusters group similar computation patterns: reduction sums (sums over full, upper-half, or lower-half square matrices; sums of absolute values of a matrix row or column) and long latency operations (element-wise vector divide, norm + vector divide).

Capturing Architecture Change

Figure: Replay times (ms) of four codelets (LU/erhs.f:49, FT/appft.f:45, BT/rhs.f:266, SP/rhs.f:275) captured on the reference Nehalem (1.86 GHz, 12 MB LLC) and replayed on Core 2 (2.93 GHz, 3 MB) and Atom (1.66 GHz, 1 MB). Cluster A: triple-nested, high latency operations (div and exp). Cluster B: stencil on five planes (memory bound).

Same Cluster = Same Speedup

Figure: Same data as the previous figure; codelets falling in the same cluster exhibit the same relative speedup (or slowdown) when moving from the reference to each target architecture.

Feature Selection

- Genetic Algorithm: train on Numerical Recipes + Atom + Sandy Bridge
- Validated on NAS + Core 2
- The feature set is still among the best on NAS

Figure: Median error vs. number of clusters (0 to 25) on Atom, Core 2, and Sandy Bridge, comparing the worst, median, and best feature sets against the GA-selected features.

Reduction Factor Breakdown

  Reduction      Total    Reduced invocations   Clustering
  Atom           ×44.3    ×12                   ×3.7
  Core 2         ×24.7    ×8.7                  ×2.8
  Sandy Bridge   ×22.5    ×6.3                  ×3.6

Table: Benchmarking reduction factor breakdown with 18 representatives.

Profiling Features

SP codelets are profiled with Likwid (performance counters per codelet) and Maqao (static disassembly and analysis).

4 dynamic features: FLOPS, L2 bandwidth, L3 miss rate, memory bandwidth.

8 static features: bytes stored per cycle, stalls, estimated IPC, number of DIV, number of SD, pressure in P1, ratio (ADD+SUB)/MUL, vectorization (FP, FP+INT, INT).

Clang OpenMP front-end

C code:

    void main() {
      #pragma omp parallel
      {
        int p = omp_get_thread_num();
        printf("%d", p);
      }
    }

The Clang OpenMP front end lowers the parallel region to a microtask following the thread execution model (LLVM simplified IR):

    define i32 @main() {
    entry:
      call @__kmpc_fork_call(@omp_microtask, ...)
    }

    define internal void @omp_microtask(...) {
    entry:
      %p = alloca i32, align 4
      %call = call i32 @omp_get_thread_num()
      store i32 %call, i32* %p, align 4
      %1 = load i32* %p, align 4
      call @printf(%1)
    }

Page 6: CERE: LLVM based Codelet Extractor and REplayer for

Capturing codelets at Intermediate Representation

I Faithful behaves similarly to the original region

I Retargetable modify runtime and compilation parameters

same assembly 6= assembly [Akel et al 2013]hard to retarget (compiler ISA) easy to retargetcostly support various ISA costly support various languages

I LLVM Intermediate Representation is a good tradeoff

5 16

Faithful capture

I Required for semantically accurate replayI Register stateI Memory stateI OS state locks file descriptors sockets

I No support for OS state except for locks CERE captures fullyfrom userland no kernel modules required

I Required for performance accurate replayI Preserve code generationI Cache stateI NUMA ownershipI Other warmup state (eg branch predictor)

6 16

Faithful capture memory

Capture access at page granularity coarse but fast

regionto capture

protect static and currently allocatedprocess memory (procselfmaps)

intercept memory allocation functionswith LD_PRELOAD

1 allocate memory

2 protect memory and returnto user program

segmentationfault handler

1 dump accessed memory to disk

2 unlock accessed page and return to user program

a[i]++

memoryaccess

a = malloc(256)

memoryallocation

I Small dump footprint only touched pages are savedI Warmup cache replay trace of most recently touched pagesI NUMA detect first touch of each page

7 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

8 16

Selecting Representative Codelets

I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures

I Detect redundancies and keep only one representative

0e+00

2e+07

4e+07

6e+07

8e+07

0 1000 2000 3000invocation

Cyc

les

replay

Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives

9 16

Performance Signature Clustering

Maqao

Likwid

Static amp DynamicProfiling Vectorization ratio

FLOPSsCache Misses[ ] Clustering

f1f2f3

Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors

BT

SP

Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster

Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results

Model

Sandy Bridge

Atom

Core2

FullBenchmarks

Results

[Oliveira Castro et al 2014]

10 16

Codelet Based Architecture Selection

012 015

083 083

155 159

00

05

10

15

Atom Core 2 Sandy Bridge

Geo

met

ricm

ean

spee

dup

Real Speedup

Predicted Speedup

Refe

renc

e N

ehal

em

Figure Benchmarking NAS serial on three architectures

I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks

11 16

Autotuning LLVM middle-end optimizations

I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning

Id of LLVM middleminusend optimization passes combination

Cyc

les

(Ivy

brid

ge 3

4G

Hz)

60e+07

80e+07

10e+08

12e+08

0 200 400 600

originalreplayO3

Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes

CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16

Fast Scalability Benchmarking with OpenMP Codelets

1 2 4 8 16 32Threads

0

1

2

3

4

5

6

Runti

me c

ycl

es

1e8 SP compute rhs on Sandy Bridge

RealPredicted

Core2 Nehalem Sandy Bridge Ivy Bridge

Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237

Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]

13 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

14 16

Conclusion

I CERE breaks an application into faithful and retargetablecodelets

I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis

I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM

I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports

15 16

Thanks for your attention

C E R E

httpsbenchmark-subsettinggithubiocere

distributed under the LGPLv3

16 16

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 7: CERE: LLVM based Codelet Extractor and REplayer for

Faithful capture

I Required for semantically accurate replayI Register stateI Memory stateI OS state locks file descriptors sockets

I No support for OS state except for locks CERE captures fullyfrom userland no kernel modules required

I Required for performance accurate replayI Preserve code generationI Cache stateI NUMA ownershipI Other warmup state (eg branch predictor)

6 16

Faithful capture memory

Capture access at page granularity coarse but fast

regionto capture

protect static and currently allocatedprocess memory (procselfmaps)

intercept memory allocation functionswith LD_PRELOAD

1 allocate memory

2 protect memory and returnto user program

segmentationfault handler

1 dump accessed memory to disk

2 unlock accessed page and return to user program

a[i]++

memoryaccess

a = malloc(256)

memoryallocation

I Small dump footprint only touched pages are savedI Warmup cache replay trace of most recently touched pagesI NUMA detect first touch of each page

7 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

8 16

Selecting Representative Codelets

I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures

I Detect redundancies and keep only one representative

0e+00

2e+07

4e+07

6e+07

8e+07

0 1000 2000 3000invocation

Cyc

les

replay

Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives

9 16

Performance Signature Clustering

Maqao

Likwid

Static amp DynamicProfiling Vectorization ratio

FLOPSsCache Misses[ ] Clustering

f1f2f3

Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors

BT

SP

Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster

Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results

Model

Sandy Bridge

Atom

Core2

FullBenchmarks

Results

[Oliveira Castro et al 2014]

10 16

Codelet Based Architecture Selection

012 015

083 083

155 159

00

05

10

15

Atom Core 2 Sandy Bridge

Geo

met

ricm

ean

spee

dup

Real Speedup

Predicted Speedup

Refe

renc

e N

ehal

em

Figure Benchmarking NAS serial on three architectures

I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks

11 16

Autotuning LLVM middle-end optimizations

I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning

Id of LLVM middleminusend optimization passes combination

Cyc

les

(Ivy

brid

ge 3

4G

Hz)

60e+07

80e+07

10e+08

12e+08

0 200 400 600

originalreplayO3

Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes

CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16

Fast Scalability Benchmarking with OpenMP Codelets

1 2 4 8 16 32Threads

0

1

2

3

4

5

6

Runti

me c

ycl

es

1e8 SP compute rhs on Sandy Bridge

RealPredicted

Core2 Nehalem Sandy Bridge Ivy Bridge

Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237

Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]

13 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

14 16

Conclusion

I CERE breaks an application into faithful and retargetablecodelets

I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis

I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM

I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports

15 16

Thanks for your attention

C E R E

httpsbenchmark-subsettinggithubiocere

distributed under the LGPLv3

16 16

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution: capture at a function call boundary
  I No shared state through registers except function arguments
  I Get arguments directly through portable IR code

I Register-agnostic capture

I Portable across Atom, Core 2, Haswell, Ivy Bridge, Nehalem, Sandy Bridge

I Preliminary portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1: Outline the region to capture using the CodeExtractor pass.

original:

  %0 = load i32* %i, align 4
  %1 = load i32* %s.addr, align 4
  %cmp = icmp slt i32 %0, %1
  ; loop branch here
  br i1 %cmp, label %for.body, label %for.exitStub

outlined:

  define internal void @outlined(i32* %i, i32* %s.addr, i32* %a.addr) {
    call void @start_capture(i32* %i, i32* %s.addr, i32* %a.addr)
    %0 = load i32* %i, align 4
    ...
    ret void
  }

original (call site):

  call void @outlined(i32* %i, i32* %s.addr, i32* %a.addr)

Step 2: Call start_capture just after the function call.
Step 3: At replay, re-inlining and variable cloning [Liao et al. 2010] steps ensure that the compilation context is close to the original.

19 16

Comparison to other Code Isolating tools

                    CERE            Code Isolator   Astex        Codelet Finder   SimPoint
Language            C(++), Fortran  Fortran         C, Fortran   C(++), Fortran   assembly
Extraction          IR              source          source       source           assembly
Indirections        yes             no              no           yes              yes
Replay: simulator   yes             yes             yes          yes              yes
Replay: hardware    yes             yes             yes          yes              no
Capture size        reduced         reduced         reduced      full             -
Instances           yes             manual          manual       manual           yes
Code signature      yes             no              no           no               yes

20 16

Accuracy Summary NAS SER

(bar chart: % of exec time, 0–100, on Core 2 and Haswell for lu, cg, mg, sp, ft, bt, is, ep; two series: accurate replay and codelet coverage)

Figure: The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

21 16

Accuracy Summary SPEC FP 2006

(bar chart: % of exec time, 0–100, on Haswell for gamess, sphinx3, dealII, wrf, povray, calculix, soplex, leslie3d, tonto, gromacs, zeusmp, lbm, milc, cactusADM, namd, bwaves, gemsfdtd, specrand)

Figure: The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

I low coverage (sphinx3, wrf, povray): regions < 2000 cycles, or IO

I low matching (soplex, calculix, gamess): warmup "bugs"

22 16

CERE page capture dump size

(bar chart: size in MB, 0–200, for bt, cg, ep, ft, is, lu, mg, sp; two series: full dump (Codelet Finder) vs. page-granularity dump (CERE))

Figure: Comparison between the page capture and the full dump size on NAS A benchmarks. CERE's page-granularity dump only contains the pages accessed by a codelet. Therefore it is much smaller than a full memory dump.

23 16

CERE page capture overhead

CERE page capture is coarser but faster

        CERE   ATOM 3.25   PIN 1.71   Dyninst 4.0
cg.a    1.94    98.82       222.67      896.86
ft.a    2.41    44.22       127.64     1054.70
lu.a    6.24    80.72       153.46       30.14
mg.a    0.89   107.69       168.61      989.53
sp.a    7.32    67.56        93.04      203.66

Slowdown of a full capture run against the original application run (takes into account the cost of writing the memory dumps and logs to disk, and of tracing the page accesses during the whole execution). We compare to the overhead of memory tracing tools as reported by [Gao et al. 2005].

24 16

CERE cache warmup

for (i = 0; i < size; i++) a[i] += b[i];

(diagram: array a[] spans pages 17–23, array b[] spans pages 46–52; touched page addresses are appended to the warmup page trace; a FIFO holds the most recently unprotected pages, and the oldest entry, e.g. page 20, is reprotected when a new page is unlocked)

25 16

Test architectures

                 Atom    Core 2   Nehalem   Sandy Bridge   Ivy Bridge   Haswell
CPU              D510    E7500    L5609     E3-1240        i7-3770      i7-4770
Frequency (GHz)  1.66    2.93     1.86      3.30           3.40         3.40
Cores            2       2        4         4              4            4
L1 cache (KB)    2×56    2×64     4×64      4×64           4×64         4×64
L2 cache         2×512 KB  3 MB   4×256 KB  4×256 KB       4×256 KB     4×256 KB
L3 cache (MB)    -       -        12        8              8            8
RAM (GB)         4       4        8         6              16           16

32-bit portability test: ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelet      Computation pattern
toeplz_1     DP: 2 simultaneous reductions
rstrct_29    DP: MG Laplacian fine-to-coarse mesh transition
mprove_8     MP: Dense matrix × vector product
toeplz_4     DP: Vector multiply in asc/desc order
realft_4     DP: FFT butterfly computation
toeplz_3     DP: 3 simultaneous reductions
svbksb_3     SP: Dense matrix × vector product
lop_13       DP: Laplacian finite difference, constant coefficients
toeplz_2     DP: Vector multiply element-wise in asc/desc order
four1_2      MP: First step FFT
tridag_2     DP: First-order recurrence
tridag_1     DP: First-order recurrence
ludcmp_4     SP: Dot product over lower half square matrix
hqr_15       SP: Addition on the diagonal elements of a matrix
relax2_26    DP: Red-Black sweeps, Laplacian operator
svdcmp_14    DP: Vector divide element-wise
svdcmp_13    DP: Norm + vector divide
hqr_13       DP: Sum of the absolute values of a matrix column
hqr_12_sq    SP: Sum of a square matrix
jacobi_5     SP: Sum of the upper half of a square matrix
hqr_12       SP: Sum of the lower half of a square matrix
svdcmp_11    DP: Multiplying a matrix row by a scalar
elmhes_11    DP: Linear combination of matrix rows
mprove_9     DP: Subtracting a vector from a vector
matadd_16    DP: Sum of two square matrices element-wise
svdcmp_6     DP: Sum of the absolute values of a matrix row
elmhes_10    DP: Linear combination of matrix columns
balanc_3     DP: Vector multiply element-wise

27 16

Clustering NR Codelets

cut for K = 14

(same dendrogram as above, with two clusters annotated)

Long latency operations:
  svdcmp_14   DP: Vector divide element-wise
  svdcmp_13   DP: Norm + vector divide

Reduction sums:
  hqr_13      DP: Sum of the absolute values of a matrix column
  hqr_12_sq   SP: Sum of a square matrix
  jacobi_5    SP: Sum of the upper half of a square matrix
  hqr_12      SP: Sum of the lower half of a square matrix

Similar computation patterns end up in the same cluster.

27 16

Capturing Architecture Change

Codelets: LU/erhs.f:49, FT/appft.f:45, BT/rhs.f:266, SP/rhs.f:275

Cluster A: triple-nested, high-latency operations (div and exp)
Cluster B: stencil on five planes (memory bound)

Reference: Nehalem (freq. 1.86 GHz, LLC 12 MB). Targets: Core 2 (2.93 GHz, 3 MB) and Atom (1.66 GHz, 1 MB).

(plot: runtime in ms per codelet, reference vs. target, log scale roughly 40–3000 ms; each codelet is marked slower or faster on the target)

28 16

Same Cluster = Same Speedup

(same plot as the previous slide: LU/erhs.f:49, FT/appft.f:45, BT/rhs.f:266, SP/rhs.f:275 on Core 2 and Atom against the Nehalem reference; codelets in the same cluster, A (triple-nested high-latency div/exp) or B (memory-bound five-plane stencil), exhibit the same speedup when changing architecture)

29 16

Feature Selection

I Genetic Algorithm: train on Numerical Recipes + Atom + Sandy Bridge

I Validated on NAS + Core 2

I The feature set is still among the best on NAS

(plot: median error, in %, vs. number of clusters, 0–25, on Atom, Core 2, and Sandy Bridge; the GA-selected features are compared against the worst, median, and best feature sets)

30 16

Reduction Factor Breakdown

Reduction       Total   Reduced invocations   Clustering

Atom            44.3    ×12                   ×3.7
Core 2          24.7    ×8.7                  ×2.8
Sandy Bridge    22.5    ×6.3                  ×3.6

Table: Benchmarking reduction factor breakdown with 18 representatives.

31 16

Profiling Features

SP codelets are profiled with two tools:

Likwid (performance counters per codelet) — 4 dynamic features:
  FLOPS, L2 bandwidth, L3 miss rate, memory bandwidth

Maqao (static disassembly and analysis) — 8 static features:
  bytes stored per cycle, stalls, estimated IPC, number of DIV, number of SD, pressure in P1, ratio (ADD+SUB)/MUL, vectorization (FP, FP+INT, INT)

32 16

Clang OpenMP front-end

C code:

  void main() {
    #pragma omp parallel
    {
      int p = omp_get_thread_num();
      printf("%d", p);
    }
  }

The Clang OpenMP front end lowers this to (simplified LLVM IR):

  define i32 @main() {
  entry:
    call @__kmpc_fork_call(@omp_microtask, ...)
  }

  define internal void @omp_microtask() {
  entry:
    %p = alloca i32, align 4
    %call = call i32 @omp_get_thread_num()
    store i32 %call, i32* %p, align 4
    %1 = load i32* %p, align 4
    call @printf(%1)
  }

Thread execution model: the runtime forks the microtask across the OpenMP threads.

33 16

Page 9: CERE: LLVM based Codelet Extractor and REplayer for

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

8 16

Selecting Representative Codelets

I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures

I Detect redundancies and keep only one representative

0e+00

2e+07

4e+07

6e+07

8e+07

0 1000 2000 3000invocation

Cyc

les

replay

Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives

9 16

Performance Signature Clustering

Maqao

Likwid

Static amp DynamicProfiling Vectorization ratio

FLOPSsCache Misses[ ] Clustering

f1f2f3

Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors

BT

SP

Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster

Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results

Model

Sandy Bridge

Atom

Core2

FullBenchmarks

Results

[Oliveira Castro et al 2014]

10 16

Codelet Based Architecture Selection

012 015

083 083

155 159

00

05

10

15

Atom Core 2 Sandy Bridge

Geo

met

ricm

ean

spee

dup

Real Speedup

Predicted Speedup

Refe

renc

e N

ehal

em

Figure Benchmarking NAS serial on three architectures

I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks

11 16

Autotuning LLVM middle-end optimizations

I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning

Id of LLVM middleminusend optimization passes combination

Cyc

les

(Ivy

brid

ge 3

4G

Hz)

60e+07

80e+07

10e+08

12e+08

0 200 400 600

originalreplayO3

Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes

CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16

Fast Scalability Benchmarking with OpenMP Codelets

1 2 4 8 16 32Threads

0

1

2

3

4

5

6

Runti

me c

ycl

es

1e8 SP compute rhs on Sandy Bridge

RealPredicted

Core2 Nehalem Sandy Bridge Ivy Bridge

Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237

Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]

13 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

14 16

Conclusion

I CERE breaks an application into faithful and retargetablecodelets

I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis

I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM

I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports

15 16

Thanks for your attention

C E R E

httpsbenchmark-subsettinggithubiocere

distributed under the LGPLv3

16 16

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets (detail, cut for K = 14)

The clusters group codelets with similar computation patterns, for example:

  Long latency operations:
    svdcmp_14    DP: Vector divide element wise
    svdcmp_13    DP: Norm + Vector divide

  Reduction sums:
    hqr_13       DP: Sum of the absolute values of a matrix column
    hqr_12_sq    SP: Sum of a square matrix
    jacobi_5     SP: Sum of the upper half of a square matrix
    hqr_12       SP: Sum of the lower half of a square matrix

Capturing Architecture Change

Figure: replay times (ms) of four codelets on a reference machine, Nehalem
(1.86 GHz, 12 MB LLC), and two targets: Core 2 (2.93 GHz, 3 MB) and Atom
(1.66 GHz, 1 MB).

Codelets and coverage: LU erhs.f (4.9%), FT appft.f (4.5%), BT rhs.f (26.6%),
SP rhs.f (27.5%).

Cluster A: triple-nested, high latency operations (div and exp).
Cluster B: stencil on five planes (memory bound).

Same Cluster = Same Speedup

Figure (same data as the previous slide): codelets belonging to the same
cluster exhibit the same slowdown or speedup when moving from the reference
architecture (Nehalem) to a target (Core 2 or Atom).

Feature Selection

- Genetic Algorithm: train on Numerical Recipes + Atom + Sandy Bridge
- Validated on NAS + Core 2
- The feature set is still among the best on NAS

Figure: median error versus number of clusters (0 to 25) on Atom, Core 2,
and Sandy Bridge; the GA-selected features are compared against the worst,
median, and best feature sets.

Reduction Factor Breakdown

                Total    Reduced invocations   Clustering
  Atom          x44.3    x12                   x3.7
  Core 2        x24.7    x8.7                  x2.8
  Sandy Bridge  x22.5    x6.3                  x3.6

Table: benchmarking reduction factor breakdown with 18 representatives. The
total is the product of the invocation reduction and the clustering
reduction (e.g. 12 x 3.7 ≈ 44.3 on Atom).

Profiling Features

SP codelets are profiled with two tools:

- Likwid (performance counters per codelet) → 4 dynamic features:
  FLOPS, L2 Bandwidth, L3 Miss Rate, Mem Bandwidth
- Maqao (static disassembly and analysis) → 8 static features:
  Bytes Stored / cycle, Stalls, Estimated IPC, Number of DIV, Number of SD,
  Pressure in P1, Ratio (ADD+SUB)/MUL, Vectorization (FP, FP+INT, INT)

Clang OpenMP front-end

C code:

    #include <omp.h>
    #include <stdio.h>

    int main() {
    #pragma omp parallel
      {
        int p = omp_get_thread_num();
        printf("%d", p);
      }
    }

The Clang OpenMP front end lowers the parallel region to an outlined
microtask invoked through the OpenMP runtime (LLVM simplified IR, thread
execution model):

    define i32 @main() {
    entry:
      call @__kmpc_fork_call(... @omp_microtask ...)
    }

    define internal void @omp_microtask(...) {
    entry:
      %p = alloca i32, align 4
      %call = call i32 @omp_get_thread_num()
      store i32 %call, i32* %p, align 4
      %1 = load i32* %p, align 4
      call @printf(... %1 ...)
    }


Selecting Representative Codelets

- Key idea: applications have redundancies
  - Same codelet called multiple times
  - Codelets sharing similar performance signatures
- Detect redundancies and keep only one representative

Figure: SPEC tonto make_ft shell2.F90:1133 execution trace (cycles per
invocation over ~3000 invocations, replayed representative marked). 90% of
NAS codelets can be reduced to four or less representatives.

Performance Signature Clustering

Step A: Perform static and dynamic analysis (Maqao, Likwid) on a reference
architecture to capture each codelet's feature vector f = (f1, f2, f3, ...):
vectorization ratio, FLOPS/s, cache misses, etc.

Step B: Using the proximity between feature vectors, cluster similar
codelets (e.g. across BT and SP) and select one representative per cluster.

Step C: CERE extracts the representatives as standalone codelets and replays
them on the targets (Sandy Bridge, Atom, Core 2). A model extrapolates full
benchmark results. [Oliveira Castro et al 2014]

Codelet Based Architecture Selection

Figure: benchmarking NAS serial on three architectures (reference:
Nehalem). Geometric mean speedup, real vs. predicted: Atom 0.12 vs. 0.15,
Core 2 0.83 vs. 0.83, Sandy Bridge 1.55 vs. 1.59.

- real speedup: measured by benchmarking the original applications
- predicted speedup: predicted with the representative codelets
- CERE is 31x cheaper than running the full benchmarks

Autotuning LLVM middle-end optimizations

- The LLVM middle-end offers more than 50 optimization passes
- Codelet replay enables fast per-region optimization tuning

Figure: NAS SP ysolve codelet, cycles (Ivy Bridge 3.4 GHz) against the id of
the LLVM middle-end optimization pass combination; 1000 schedules of random
pass combinations explored, based on the O3 passes, with the original,
replay, and O3 levels marked.

CERE is 149x cheaper than running the full benchmark (~27x cheaper when
tuning codelets covering 75% of SP).

Fast Scalability Benchmarking with OpenMP Codelets

Figure: runtime (cycles, 1e8) of SP compute_rhs on Sandy Bridge for 1, 2, 4,
8, 16, and 32 threads, real vs. predicted, and average results over the OMP
NAS benchmarks [Popov et al 2015]:

                Core 2   Nehalem  Sandy Bridge  Ivy Bridge
  Accuracy      98.2%    97.1%    92.6%         97.2%
  Acceleration  x25.2    x27.4    x23.7         x23.7

Outline

Extracting and Replaying Codelets: Faithful, Retargetable
Applications: Architecture selection, Compiler flags tuning, Scalability prediction
Demo: Capture and replay in NAS BT, Simple flag replay for NAS FT
Conclusion

Conclusion

- CERE breaks an application into faithful and retargetable codelets
- Piece-wise autotuning:
  - Different architectures
  - Compiler optimizations
  - Scalability
  - Other costly exploration analyses
- Limitations:
  - No support for codelets performing I/O (OS state not captured)
  - Cannot explore source-level optimizations
  - Tied to LLVM
- Full accuracy reports on NAS and SPEC'06 FP available at
  https://benchmark-subsetting.github.io/cere/Reports

Thanks for your attention!

CERE: https://benchmark-subsetting.github.io/cere

Distributed under the LGPLv3.

Bibliography

Akel, Chadi et al. (2013). "Is Source-code Isolation Viable for Performance
Characterization?" In: 42nd International Conference on Parallel Processing
Workshops. IEEE.

Gao, Xiaofeng et al. (2005). "Reducing overheads for acquiring dynamic
memory traces". In: Workload Characterization Symposium, 2005. Proceedings
of the IEEE International. IEEE, pp. 46-55.

Liao, Chunhua et al. (2010). "Effective source-to-source outlining to
support whole program empirical optimization". In: Languages and Compilers
for Parallel Computing. Springer, pp. 308-322.

Oliveira Castro, Pablo de et al. (2014). "Fine-grained Benchmark Subsetting
for System Selection". In: Proceedings of the Annual IEEE/ACM International
Symposium on Code Generation and Optimization. ACM, p. 132.

Popov, Mihail et al. (2015). "PCERE: Fine-grained Parallel Benchmark
Decomposition for Scalability Prediction". In: Proceedings of the 29th IEEE
International Parallel and Distributed Processing Symposium (IPDPS). IEEE.

Retargetable replay: register state

- Issue: register state is non-portable between architectures
- Solution: capture at a function call boundary
  - No shared state through registers except function arguments
  - Get arguments directly through portable IR code
- Register-agnostic capture:
  - Portable across Atom, Core 2, Haswell, Ivy Bridge, Nehalem, Sandy Bridge
  - Preliminary portability tests between x86 and ARM 32 bits

Retargetable replay: outlining regions

Step 1: Outline the region to capture using the CodeExtractor pass.

Before outlining (@original):

    %0 = load i32* %i, align 4
    %1 = load i32* %s.addr, align 4
    %cmp = icmp slt i32 %0, %1
    br i1 %cmp, label %for.body, label %for.exitStub   ; loop branch here

After outlining, @original calls the extracted function:

    call void @outlined(i32* %i, i32* %s.addr, i32* %a.addr)

    define internal void @outlined(i32* %i, i32* %s.addr, i32* %a.addr) {
      call void @start_capture(i32* %i, i32* %s.addr, i32* %a.addr)
      %0 = load i32* %i, align 4
      ...
      ret void
    }

Step 2: Call start_capture just after the function call.
Step 3: At replay, re-inlining and variable cloning [Liao et al 2010] steps
ensure that the compilation context is close to the original.

Comparison to other Code Isolating tools

                   CERE            Code Isolator  Astex       Codelet Finder  SimPoint
  Support
    Language       C(++), Fortran  Fortran        C, Fortran  C(++), Fortran  assembly
    Extraction     IR              source         source      source          assembly
    Indirections   yes             no             no          yes             yes
  Replay
    Simulator      yes             yes            yes         yes             yes
    Hardware       yes             yes            yes         yes             no
  Reduction
    Capture size   reduced         reduced        reduced     full            -
    Instances      yes             manual         manual      manual          yes
    Code sign.     yes             no             no          no              yes

Accuracy Summary: NAS SER

Figure: coverage and accurate replay per NAS A benchmark (lu, cg, mg, sp,
ft, bt, is, ep) on Core 2 and Haswell, as % of execution time. The Coverage
is the percentage of the execution time captured by codelets. The Accurate
Replay is the percentage of execution time replayed with an error less
than 15%.

Accuracy Summary: SPEC FP 2006

Figure: coverage and accurate replay per SPEC FP 2006 benchmark (gamess,
sphinx3, dealII, wrf, povray, calculix, soplex, leslie3d, tonto, gromacs,
zeusmp, lbm, milc, cactusADM, namd, bwaves, gemsfdtd, specrand) on Haswell,
as % of execution time. The Coverage is the percentage of the execution
time captured by codelets. The Accurate Replay is the percentage of
execution time replayed with an error less than 15%.

- low coverage (sphinx3, wrf, povray): regions < 2000 cycles or I/O
- low matching (soplex, calculix, gamess): warmup "bugs"

CERE page capture: dump size

Figure: comparison between the page capture and the full dump size (MB) on
the NAS A benchmarks (bt, cg, ep, ft, is, lu, mg, sp): full dump (Codelet
Finder) vs. page granularity dump (CERE). CERE's page granularity dump only
contains the pages accessed by a codelet; therefore it is much smaller than
a full memory dump.

CERE page capture overhead

CERE page capture is coarser but faster:

          CERE    ATOM 325   PIN 171   Dyninst 40
  cg.A    1.94    98.82      222.67    896.86
  ft.A    2.41    44.22      127.64    1054.70
  lu.A    6.24    80.72      153.46    30.14
  mg.A    0.89    107.69     168.61    989.53
  sp.A    7.32    67.56      93.04     203.66

Slowdown of a full capture run against the original application run (takes
into account the cost of writing the memory dumps and logs to disk, and of
tracing the page accesses during the whole execution). We compare to the
overhead of memory tracing tools as reported by [Gao et al 2005].

CERE cache warmup

    for (i = 0; i < size; i++)
        a[i] += b[i];

Figure: the loop walks the pages of array a[] (pages 21, 22, 23, ...) and
array b[] (pages 50, 51, 52, ...). A FIFO keeps the most recently
unprotected pages; pages evicted from the FIFO are reprotected, so repeated
accesses fault again and appear in the warmup page trace used to rebuild
the cache state at replay.

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 11: CERE: LLVM based Codelet Extractor and REplayer for

Performance Signature Clustering

Maqao

Likwid

Static amp DynamicProfiling Vectorization ratio

FLOPSsCache Misses[ ] Clustering

f1f2f3

Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors

BT

SP

Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster

Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results

Model

Sandy Bridge

Atom

Core2

FullBenchmarks

Results

[Oliveira Castro et al 2014]

10 16

Codelet Based Architecture Selection

012 015

083 083

155 159

00

05

10

15

Atom Core 2 Sandy Bridge

Geo

met

ricm

ean

spee

dup

Real Speedup

Predicted Speedup

Refe

renc

e N

ehal

em

Figure Benchmarking NAS serial on three architectures

I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks

11 16

Autotuning LLVM middle-end optimizations

I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning

Id of LLVM middleminusend optimization passes combination

Cyc

les

(Ivy

brid

ge 3

4G

Hz)

60e+07

80e+07

10e+08

12e+08

0 200 400 600

originalreplayO3

Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes

CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16

Fast Scalability Benchmarking with OpenMP Codelets

1 2 4 8 16 32Threads

0

1

2

3

4

5

6

Runti

me c

ycl

es

1e8 SP compute rhs on Sandy Bridge

RealPredicted

Core2 Nehalem Sandy Bridge Ivy Bridge

Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237

Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]

13 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

14 16

Conclusion

I CERE breaks an application into faithful and retargetablecodelets

I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis

I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM

I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports

15 16

Thanks for your attention

C E R E

httpsbenchmark-subsettinggithubiocere

distributed under the LGPLv3

16 16

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 12: CERE: LLVM based Codelet Extractor and REplayer for

Codelet Based Architecture Selection

012 015

083 083

155 159

00

05

10

15

Atom Core 2 Sandy Bridge

Geo

met

ricm

ean

spee

dup

Real Speedup

Predicted Speedup

Refe

renc

e N

ehal

em

Figure Benchmarking NAS serial on three architectures

I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks

11 16

Autotuning LLVM middle-end optimizations

I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning

Id of LLVM middleminusend optimization passes combination

Cyc

les

(Ivy

brid

ge 3

4G

Hz)

60e+07

80e+07

10e+08

12e+08

0 200 400 600

originalreplayO3

Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes

CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16

Fast Scalability Benchmarking with OpenMP Codelets

1 2 4 8 16 32Threads

0

1

2

3

4

5

6

Runti

me c

ycl

es

1e8 SP compute rhs on Sandy Bridge

RealPredicted

Core2 Nehalem Sandy Bridge Ivy Bridge

Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237

Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]

13 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

14 16

Conclusion

I CERE breaks an application into faithful and retargetablecodelets

I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis

I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM

I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports

15 16

Thanks for your attention

C E R E

httpsbenchmark-subsettinggithubiocere

distributed under the LGPLv3

16 16

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other code-isolating tools

                 CERE      Code Isolator   Astex       Codelet Finder   SimPoint
Support
  Language       C(++),    Fortran         C, Fortran  C(++), Fortran   assembly
                 Fortran
  Extraction     IR        source          source      source           assembly
  Indirections   yes       no              no          yes              yes
Replay
  Simulator      yes       yes             yes         yes              yes
  Hardware       yes       yes             yes         yes              no
Reduction
  Capture size   reduced   reduced         reduced     full             -
  Instances      yes       manual          manual      manual           yes
  Code sign.     yes       no              no          no               yes

Accuracy Summary: NAS SER

[Bar chart: codelet coverage and accurate replay (% of execution time) for lu, cg, mg, sp, ft, bt, is, ep on Core 2 and Haswell, NAS class A.]

Figure: The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.
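The two bar-chart metrics can be reproduced from per-codelet profiles. A minimal sketch, with illustrative (not measured) shares and replay errors:

```python
# Sketch: computing Coverage and Accurate Replay from per-codelet
# profiles. The data below is illustrative, not measured; "accurate"
# means the replay error is below the 15% threshold used in the figure.
codelets = [
    # (share of total execution time, replay error)
    (0.40, 0.05),
    (0.25, 0.08),
    (0.15, 0.30),   # badly replayed codelet: counted in coverage only
]

coverage = sum(share for share, _ in codelets)
accurate_replay = sum(share for share, err in codelets if err < 0.15)

print(f"coverage = {coverage:.0%}")                # 80% of exec time captured
print(f"accurate replay = {accurate_replay:.0%}")  # 65% replayed within 15%
```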

Accuracy Summary: SPEC FP 2006

[Bar chart: codelet coverage and accurate replay (% of execution time) on Haswell for gamess, sphinx3, dealII, wrf, povray, calculix, soplex, leslie3d, tonto, gromacs, zeusmp, lbm, milc, cactusADM, namd, bwaves, GemsFDTD, specrand.]

Figure: The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

- Low coverage (sphinx3, wrf, povray): regions < 2000 cycles or with I/O.
- Low matching (soplex, calculix, gamess): warmup "bugs".

CERE page capture: dump size

[Bar chart: dump size (MB, 0-200) for bt, cg, ep, ft, is, lu, mg, sp, comparing the full dump (Codelet Finder) against the page-granularity dump (CERE).]

Figure: Comparison between the page capture and full dump size on NAS A benchmarks. CERE's page-granularity dump only contains the pages accessed by a codelet; therefore it is much smaller than a full memory dump.
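The idea behind the page-granularity dump can be sketched in a few lines; the page size, addresses, and "address space" below are illustrative, not CERE's actual capture code:

```python
# Sketch: a page-granularity dump saves only the pages the codelet
# touches, instead of the whole address space.
PAGE = 4096

def touched_pages(accesses):
    """Set of page numbers covered by a list of byte addresses."""
    return {addr // PAGE for addr in accesses}

# A codelet streaming over two small arrays touches only a few pages.
accesses = list(range(0, 3 * PAGE)) + list(range(50 * PAGE, 52 * PAGE))
dump_pages = touched_pages(accesses)

full_dump_bytes = 200 * PAGE          # tiny illustrative address space
page_dump_bytes = len(dump_pages) * PAGE
print(len(dump_pages), "pages dumped instead of", 200)
```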

CERE page capture: overhead

CERE page capture is coarser, but faster:

        CERE   ATOM 3.25   PIN 1.71   Dyninst 4.0
cg.a    1.94   98.82       222.67     896.86
ft.a    2.41   44.22       127.64     1054.70
lu.a    6.24   80.72       153.46     30.14
mg.a    0.89   107.69      168.61     989.53
sp.a    7.32   67.56       93.04      203.66

Slowdown of a full capture run against the original application run (this takes into account the cost of writing the memory dumps and logs to disk, and of tracing the page accesses during the whole execution). We compare to the overhead of memory tracing tools as reported by [Gao et al. 2005].

CERE cache warmup

```c
for (i = 0; i < size; i++)
    a[i] += b[i];
```

[Diagram: array a[] spans pages 21, 22, 23, ...; array b[] spans pages 50, 51, 52, ... A FIFO holds the most recently unprotected pages; when a page leaves the FIFO it is re-protected. The resulting sequence of page faults forms the warmup page trace.]
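The warmup tracing logic can be simulated in a few lines. A sketch of the FIFO mechanism only (the real implementation works through memory protection and page-fault handling, not Python data structures):

```python
from collections import deque

# Sketch of CERE's warmup page tracing: every page starts protected;
# the first access to a protected page is logged and the page joins a
# FIFO of the most recently unprotected pages. When the FIFO is full,
# the oldest page is re-protected, so a later reuse is logged again.
def trace_pages(page_accesses, fifo_size=4):
    fifo = deque()                   # most recently unprotected pages
    protected = set(page_accesses)   # everything starts protected
    trace = []
    for page in page_accesses:
        if page in protected:
            trace.append(page)        # access fault -> log the page
            protected.discard(page)   # unprotect it
            fifo.append(page)
            if len(fifo) > fifo_size:
                protected.add(fifo.popleft())  # re-protect the oldest
    return trace

# a[] pages 21-23 and b[] pages 50-52 accessed alternately,
# as in the a[i] += b[i] loop of the slide
print(trace_pages([21, 50, 22, 51, 23, 52]))
```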

Test architectures

                  Atom    Core 2   Nehalem   Sandy Bridge   Ivy Bridge   Haswell
CPU               D510    E7500    L5609     E3-1240        i7-3770      i7-4770
Frequency (GHz)   1.66    2.93     1.86      3.30           3.40         3.40
Cores             2       2        4         4              4            4
L1 cache (KB)     2x56    2x64     4x64      4x64           4x64         4x64
L2 cache (KB)     2x512   3 MB     4x256     4x256          4x256        4x256
L3 cache (MB)     -       -        12        8              8            8
RAM (GB)          4       4        8         6              16           16

32-bit portability test: ARM1176JZF-S on a Raspberry Pi Model B+.

Clustering NR Codelets (cut for K = 14)

Codelet      Computation pattern
toeplz_1     DP: 2 simultaneous reductions
rstrct_29    DP: MG Laplacian fine to coarse mesh transition
mprove_8     MP: Dense matrix x vector product
toeplz_4     DP: Vector multiply in asc/desc order
realft_4     DP: FFT butterfly computation
toeplz_3     DP: 3 simultaneous reductions
svbksb_3     SP: Dense matrix x vector product
lop_13       DP: Laplacian finite difference, constant coefficients
toeplz_2     DP: Vector multiply element-wise in asc/desc order
four1_2      MP: First step FFT
tridag_2     DP: First-order recurrence
tridag_1     DP: First-order recurrence
ludcmp_4     SP: Dot product over lower half square matrix
hqr_15       SP: Addition on the diagonal elements of a matrix
relax2_26    DP: Red-Black sweeps, Laplacian operator
svdcmp_14    DP: Vector divide element-wise
svdcmp_13    DP: Norm + vector divide
hqr_13       DP: Sum of the absolute values of a matrix column
hqr_12_sq    SP: Sum of a square matrix
jacobi_5     SP: Sum of the upper half of a square matrix
hqr_12       SP: Sum of the lower half of a square matrix
svdcmp_11    DP: Multiplying a matrix row by a scalar
elmhes_11    DP: Linear combination of matrix rows
mprove_9     DP: Subtracting a vector from a vector
matadd_16    DP: Sum of two square matrices element-wise
svdcmp_6     DP: Sum of the absolute values of a matrix row
elmhes_10    DP: Linear combination of matrix columns
balanc_3     DP: Vector multiply element-wise
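Picking one representative per cluster can be sketched with a toy greedy clustering. The feature vectors below are illustrative, not CERE's real profiling features, and CERE's actual clustering algorithm may differ:

```python
import math

# Sketch: greedy clustering of codelets by feature vectors. A codelet
# joins the first cluster whose representative lies within `radius`
# (Euclidean distance); otherwise it founds a new cluster and becomes
# its representative.
def cluster(codelets, radius):
    reps = []          # (name, features) of cluster representatives
    members = {}       # representative name -> list of member names
    for name, feat in codelets:
        for rep_name, rep_feat in reps:
            if math.dist(feat, rep_feat) <= radius:
                members[rep_name].append(name)
                break
        else:
            reps.append((name, feat))
            members[name] = [name]
    return members

codelets = [
    ("svdcmp_14", (1.0, 9.0)),  # vector divide     -> long-latency ops
    ("svdcmp_13", (1.2, 8.8)),  # norm + divide     -> long-latency ops
    ("hqr_12_sq", (6.0, 1.0)),  # sum of a matrix   -> reduction sums
    ("jacobi_5",  (6.1, 1.2)),  # sum of upper half -> reduction sums
]
print(cluster(codelets, radius=1.0))
```

Only the representatives need to be benchmarked; results then extend to every member of their cluster.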

Clustering NR Codelets (cut for K = 14): clusters group similar computation patterns, for example:

- Long-latency operations: svdcmp_14 (DP: vector divide element-wise), svdcmp_13 (DP: norm + vector divide).
- Reduction sums: hqr_13 (DP: sum of the absolute values of a matrix column), hqr_12_sq (SP: sum of a square matrix), jacobi_5 (SP: sum of the upper half of a square matrix), hqr_12 (SP: sum of the lower half of a square matrix).

Capturing Architecture Change

[Plot: replay time (ms) on the reference, Nehalem (1.86 GHz, 12 MB LLC), versus the targets Core 2 (2.93 GHz, 3 MB) and Atom (1.66 GHz, 1 MB), for four codelets: LU/erhs.f:49, FT/appft.f:45, BT/rhs.f:266, SP/rhs.f:275. Cluster A: triple-nested, high-latency operations (div and exp). Cluster B: stencil on five planes (memory bound).]

Same Cluster = Same Speedup

[Same plot as the previous slide: codelets belonging to the same cluster exhibit the same speedup or slowdown when moving from the reference architecture to a target.]

Feature Selection

- Genetic algorithm: trained on Numerical Recipes + Atom + Sandy Bridge.
- Validated on NAS + Core 2.
- The feature set is still among the best on NAS.

[Plot: median prediction error versus number of clusters (0-25) on Atom, Core 2, and Sandy Bridge, comparing the GA-selected features against the worst, median, and best feature sets.]

Reduction Factor Breakdown

               Total   Reduced invocations   Clustering
Atom           x44.3   x12                   x3.7
Core 2         x24.7   x8.7                  x2.8
Sandy Bridge   x22.5   x6.3                  x3.6

Table: Benchmarking reduction factor breakdown with 18 representatives.
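As a sanity check, the total reduction is the product of the two factors in the table (decimal placement inferred from the garbled extraction):

```python
# Total benchmarking reduction = invocation reduction x clustering
# reduction; verify the table values agree up to rounding.
for arch, invoc, clust, total in [
    ("Atom",         12.0, 3.7, 44.3),
    ("Core 2",        8.7, 2.8, 24.7),
    ("Sandy Bridge",  6.3, 3.6, 22.5),
]:
    assert abs(invoc * clust - total) < 0.5, arch
    print(f"{arch}: x{invoc} * x{clust} ~ x{invoc * clust:.1f}")
```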

Profiling Features

SP codelets are profiled with two tools:

- Likwid: performance counters per codelet; 4 dynamic features: FLOPS, L2 bandwidth, L3 miss rate, memory bandwidth.
- Maqao: static disassembly and analysis; 8 static features: bytes stored per cycle, stalls, estimated IPC, number of DIV, number of SD, pressure in P1, ratio (ADD+SUB)/MUL, vectorization (FP, FP+INT, INT).

Clang OpenMP front-end

C code:

```c
void main() {
#pragma omp parallel
  {
    int p = omp_get_thread_num();
    printf("%d", p);
  }
}
```

LLVM simplified IR:

```llvm
define i32 @main() {
entry:
  call @__kmpc_fork_call(@omp_microtask, ...)
}

define internal void @omp_microtask() {
entry:
  %p = alloca i32, align 4
  %call = call i32 @omp_get_thread_num()
  store i32 %call, i32* %p, align 4
  %1 = load i32* %p, align 4
  call @printf(%1)
}
```

The fork call runs the microtask under the OpenMP thread execution model.


Autotuning LLVM middle-end optimizations

- The LLVM middle-end offers more than 50 optimization passes.
- Codelet replay enables fast per-region optimization tuning.

[Plot: cycles on Ivy Bridge (3.4 GHz) against the id of each LLVM middle-end pass combination, compared to the original run and to O3.]

Figure: NAS SP ysolve codelet: 1000 schedules of random pass combinations explored, based on the O3 passes.

CERE is x149 cheaper than running the full benchmark (x27 cheaper when tuning codelets covering 75% of SP).
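The schedule exploration can be sketched as a random-schedule generator. The pass pool is a small illustrative subset of the legacy `opt` flags of that era, and the driver that feeds each schedule to `opt` and times the codelet replay is omitted:

```python
import random

# Illustrative subset of LLVM middle-end pass flags.
PASS_POOL = ["-licm", "-gvn", "-instcombine", "-loop-unroll",
             "-sroa", "-simplifycfg", "-indvars", "-loop-rotate"]

def random_schedule(rng, max_len=6):
    """One random combination of passes, as an opt-style flag string."""
    n = rng.randint(1, max_len)
    return " ".join(rng.choice(PASS_POOL) for _ in range(n))

rng = random.Random(0)            # deterministic exploration for demo
schedules = [random_schedule(rng) for _ in range(1000)]
print(schedules[0])
```

Each schedule would then be applied to the extracted codelet and timed at replay, instead of recompiling and rerunning the whole benchmark.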

Fast Scalability Benchmarking with OpenMP Codelets

[Plot: runtime (cycles, x10^8) of SP compute_rhs on Sandy Bridge for 1, 2, 4, 8, 16, and 32 threads: real versus predicted.]

               Core 2   Nehalem   Sandy Bridge   Ivy Bridge
Accuracy       98.2%    97.1%     92.6%          97.2%
Acceleration   x25.2    x27.4     x23.7          x23.7

Figure: Varying thread number at replay in SP, and average results over the OpenMP NAS benchmarks [Popov et al. 2015].

Outline

- Extracting and Replaying Codelets: Faithful; Retargetable
- Applications: Architecture selection; Compiler flags tuning; Scalability prediction
- Demo: Capture and replay in NAS BT; Simple flag replay for NAS FT
- Conclusion

Conclusion

- CERE breaks an application into faithful and retargetable codelets.
- Piece-wise autotuning:
  - Different architectures
  - Compiler optimizations
  - Scalability
  - Other costly exploration analyses
- Limitations:
  - No support for codelets performing I/O (OS state not captured)
  - Cannot explore source-level optimizations
  - Tied to LLVM
- Full accuracy reports on NAS and SPEC'06 FP are available at benchmark-subsetting.github.io/cere/Reports

Thanks for your attention!

CERE: https://benchmark-subsetting.github.io/cere
Distributed under the LGPLv3.

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 14: CERE: LLVM based Codelet Extractor and REplayer for

Fast Scalability Benchmarking with OpenMP Codelets

1 2 4 8 16 32Threads

0

1

2

3

4

5

6

Runti

me c

ycl

es

1e8 SP compute rhs on Sandy Bridge

RealPredicted

Core2 Nehalem Sandy Bridge Ivy Bridge

Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237

Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]

13 16

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

14 16

Conclusion

I CERE breaks an application into faithful and retargetablecodelets

I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis

I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM

I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports

15 16

Thanks for your attention

C E R E

httpsbenchmark-subsettinggithubiocere

distributed under the LGPLv3

16 16

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 15: CERE: LLVM based Codelet Extractor and REplayer for

Outline

Extracting and Replaying CodeletsFaithfulRetargetable

ApplicationsArchitecture selectionCompiler flags tuningScalability prediction

DemoCapture and replay in NAS BTSimple flag replay for NAS FT

Conclusion

14 16

Conclusion

I CERE breaks an application into faithful and retargetablecodelets

I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis

I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM

I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports

15 16

Thanks for your attention

C E R E

httpsbenchmark-subsettinggithubiocere

distributed under the LGPLv3

16 16

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16
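The effect of the three steps can be sketched at the C level. This is an illustrative stand-in, not CERE's actual runtime: `start_capture` is a stub (the real hook dumps the touched memory pages), and the loop body and variable names are invented for the example.

```c
/* Stub standing in for CERE's capture hook, which would dump the
 * working set (touched memory pages) at region entry. */
static void start_capture(int *i, int *s_addr, int *a_addr) {
    (void)i; (void)s_addr; (void)a_addr;
}

/* Step 1 at the C level: the hot loop is outlined into a function
 * whose only inputs are the addresses of its live-in variables, so
 * no state reaches the region through registers. */
static void outlined(int *i, int *s_addr, int *a_addr) {
    start_capture(i, s_addr, a_addr);   /* Step 2: capture on entry */
    for (*i = 0; *i < *s_addr; (*i)++)  /* illustrative region body */
        a_addr[*i] += *i;
}
```

At the original call site the region is replaced by `outlined(&i, &s, a)`; since every live-in arrives as an explicit argument, the capture is register-agnostic.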

Comparison to other Code Isolating tools

                  CERE     Code Isolator  Astex       Codelet Finder  SimPoint
Support
  Language        C(++),   Fortran        C, Fortran  C(++), Fortran  assembly
                  Fortran
  Extraction      IR       source         source      source          assembly
  Indirections    yes      no             no          yes             yes
Replay
  Simulator       yes      yes            yes         yes             yes
  Hardware        yes      yes            yes         yes             no
Reduction
  Capture size    reduced  reduced        reduced     full            -
  Instances       yes      manual         manual      manual          yes
  Code sign.      yes      no             no          no              yes

20 16

Accuracy Summary: NAS SER

[Figure: bar chart, NAS A benchmarks (lu, cg, mg, sp, ft, bt, is, ep) on Core 2 and Haswell; bars show the % of exec time for accurate replay and codelet coverage.]

Figure: The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

21 16

Accuracy Summary: SPEC FP 2006

[Figure: bar chart, SPEC FP06 benchmarks (gamess, sphinx3, dealII, wrf, povray, calculix, soplex, leslie3d, tonto, gromacs, zeusmp, lbm, milc, cactusADM, namd, bwaves, GemsFDTD, specrand) on Haswell; bars show the % of exec time for accurate replay and codelet coverage.]

Figure: The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

I low coverage (sphinx3, wrf, povray): < 2000 cycles or I/O
I low matching (soplex, calculix, gamess): warmup "bugs"

22 16

CERE page capture dump size

[Figure: bar chart, dump size (MB) for bt, cg, ep, ft, is, lu, mg, sp: full dump (Codelet Finder) vs. page granularity dump (CERE).]

Figure: Comparison between the page capture and full dump size on NAS A benchmarks. CERE's page granularity dump only contains the pages accessed by a codelet. Therefore it is much smaller than a full memory dump.

23 16

CERE page capture overhead

CERE page capture is coarser but faster:

       CERE   ATOM 3.25   PIN 1.71   Dyninst 4.0
cg.a   1.94       98.82     222.67        896.86
ft.a   2.41       44.22     127.64       1054.70
lu.a   6.24       80.72     153.46         30.14
mg.a   0.89      107.69     168.61        989.53
sp.a   7.32       67.56      93.04        203.66

Slowdown of a full capture run against the original application run (takes into account the cost of writing the memory dumps and logs to disk, and of tracing the page accesses during the whole execution). We compare to the overhead of memory tracing tools as reported by [Gao et al. 2005].

24 16

CERE cache warmup

for (i = 0; i < size; i++)
    a[i] += b[i];

[Figure: warmup page trace for the loop above. Pages of array a[] (…, 17, 18, 19, 20, 21, 22, 23, …) and array b[] (…, 46, 47, 48, 49, 50, 51, 52, …) are touched alternately; each access unprotects a page and pushes it into a FIFO of the most recently unprotected pages, while the oldest FIFO entry (page 20 in the example) is reprotected.]

25 16
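The warmup mechanism can be sketched as a small model. This is a sketch under assumptions: the FIFO depth, page numbers, and function names are illustrative, not CERE's actual parameters or API.

```c
#define FIFO_DEPTH 4     /* illustrative capacity of the warmup FIFO */

static int fifo[FIFO_DEPTH];  /* most recently unprotected pages */
static int fifo_len = 0;
static int reprotected = -1;  /* last page evicted and reprotected */

/* Touching a protected page faults: the page is logged, unprotected,
 * and pushed on the FIFO; the oldest page falls out and is
 * reprotected, so a later access to it will fault (and be logged)
 * again -- this is what lets the trace model cache warmup. */
static void touch_page(int page) {
    for (int k = 0; k < fifo_len; k++)
        if (fifo[k] == page) return;      /* already unprotected */
    if (fifo_len == FIFO_DEPTH) {
        reprotected = fifo[0];            /* evict the oldest entry */
        for (int k = 1; k < FIFO_DEPTH; k++)
            fifo[k - 1] = fifo[k];
        fifo_len--;
    }
    fifo[fifo_len++] = page;              /* newest at the tail */
}
```

Replaying the interleaved trace 46, 17, 47, 18, 48 with depth 4 evicts page 46 on the fifth touch, so page 46 ends up reprotected while the four most recent pages stay accessible.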

Test architectures

                 Atom      Core 2    Nehalem   Sandy Bridge  Ivy Bridge  Haswell
CPU              D510      E7500     L5609     E3-1240       i7-3770     i7-4770
Frequency (GHz)  1.66      2.93      1.86      3.30          3.40        3.40
Cores            2         2         4         4             4           4
L1 cache (KB)    2×56      2×64      4×64      4×64          4×64        4×64
L2 cache         2×512 KB  3 MB      4×256 KB  4×256 KB      4×256 KB    4×256 KB
L3 cache (MB)    -         -         12        8             8           8
RAM (GB)         4         4         8         6             16          16

32-bit portability test: ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelet      Computation Pattern
toeplz_1     DP: 2 simultaneous reductions
rstrct_29    DP: MG Laplacian fine to coarse mesh transition
mprove_8     MP: Dense Matrix x vector product
toeplz_4     DP: Vector multiply in asc/desc order
realft_4     DP: FFT butterfly computation
toeplz_3     DP: 3 simultaneous reductions
svbksb_3     SP: Dense Matrix x vector product
lop_13       DP: Laplacian finite difference constant coefficients
toeplz_2     DP: Vector multiply element wise in asc/desc order
four1_2      MP: First step FFT
tridag_2     DP: First order recurrence
tridag_1     DP: First order recurrence
ludcmp_4     SP: Dot product over lower half square matrix
hqr_15       SP: Addition on the diagonal elements of a matrix
relax2_26    DP: Red Black Sweeps Laplacian operator
svdcmp_14    DP: Vector divide element wise
svdcmp_13    DP: Norm + Vector divide
hqr_13       DP: Sum of the absolute values of a matrix column
hqr_12_sq    SP: Sum of a square matrix
jacobi_5     SP: Sum of the upper half of a square matrix
hqr_12       SP: Sum of the lower half of a square matrix
svdcmp_11    DP: Multiplying a matrix row by a scalar
elmhes_11    DP: Linear combination of matrix rows
mprove_9     DP: Subtracting a vector with a vector
matadd_16    DP: Sum of two square matrices element wise
svdcmp_6     DP: Sum of the absolute values of a matrix row
elmhes_10    DP: Linear combination of matrix columns
balanc_3     DP: Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14 — similar computation patterns fall in the same cluster:

Long Latency Operations
  svdcmp_14   DP: Vector divide element wise
  svdcmp_13   DP: Norm + Vector divide

Reduction Sums
  hqr_13      DP: Sum of the absolute values of a matrix column
  hqr_12_sq   SP: Sum of a square matrix
  jacobi_5    SP: Sum of the upper half of a square matrix
  hqr_12      SP: Sum of the lower half of a square matrix

27 16

Capturing Architecture Change

[Figure: per-invocation replay times (ms) of four codelets — LU erhs.f:49 and FT appft.f:45 (Cluster A), BT rhs.f:266 and SP rhs.f:275 (Cluster B) — captured on the Nehalem reference (1.86 GHz, 12 MB LLC) and replayed on Core 2 (2.93 GHz, 3 MB LLC) and Atom (1.66 GHz, 1 MB LLC).]

Cluster A: triple-nested, high-latency operations (div and exp)
Cluster B: stencil on five planes (memory bound)

28 16

Same Cluster = Same Speedup

[Figure: same plot as the previous slide — codelets of the same cluster (A: LU erhs.f:49 and FT appft.f:45; B: BT rhs.f:266 and SP rhs.f:275) exhibit the same slowdown or speedup when moving from the Nehalem reference to Core 2 or Atom.]

29 16

Feature Selection

I Genetic Algorithm: trained on Numerical Recipes + Atom + Sandy Bridge
I Validated on NAS + Core 2
I The feature set is still among the best on NAS

[Figure: median prediction error vs. number of clusters (0 to 25) on Atom, Core 2, and Sandy Bridge, comparing the worst, median, and best feature sets against the GA-selected features.]

30 16

Reduction Factor Breakdown

              Total reduction  Reduced invocations  Clustering
Atom          ×44.3            ×12                  ×3.7
Core 2        ×24.7            ×8.7                 ×2.8
Sandy Bridge  ×22.5            ×6.3                 ×3.6

Table: Benchmarking reduction factor breakdown with 18 representatives.

31 16
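Reading the extracted digits as ×12 · ×3.7 ≈ ×44.3 (Atom), ×8.7 · ×2.8 ≈ ×24.7 (Core 2), and ×6.3 · ×3.6 ≈ ×22.5 (Sandy Bridge), the total reduction is the product of the two stages. A quick consistency check of that reading (the 0.6 tolerance is an arbitrary rounding bound chosen for this sketch):

```c
#include <math.h>

/* A table row is consistent if the total factor equals the product of
 * the invocation-reduction and clustering factors, within rounding. */
static int row_consistent(double invocations, double clustering,
                          double total) {
    return fabs(invocations * clustering - total) < 0.6;
}
```

For example, `row_consistent(12.0, 3.7, 44.3)` holds for the Atom row (12 × 3.7 = 44.4).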

Profiling Features

SP codelets are profiled with two tools:

I Likwid — performance counters per codelet → 4 dynamic features:
  FLOPS, L2 Bandwidth, L3 Miss Rate, Mem Bandwidth

I Maqao — static disassembly and analysis → 8 static features:
  Bytes Stored / cycle, Stalls, Estimated IPC, Number of DIV, Number of SD,
  Pressure in P1, Ratio ADD+SUB/MUL, Vectorization (FP, FP+INT, INT)

32 16

Clang OpenMP front-end

C code:

void main() {
  #pragma omp parallel
  {
    int p = omp_get_thread_num();
    printf("%d", p);
  }
}

LLVM simplified IR:

define i32 @main() {
entry:
  call @__kmpc_fork_call(@omp_microtask, ...)
}

define internal void @omp_microtask(...) {
entry:
  %p = alloca i32, align 4
  %call = call i32 @omp_get_thread_num()
  store i32 %call, i32* %p, align 4
  %1 = load i32* %p, align 4
  call @printf(..., %1)
}

Thread execution model: the front end outlines the parallel region into a microtask function, and __kmpc_fork_call hands it to the worker threads.

33 16
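The thread execution model can be sketched without an OpenMP runtime. Here `fork_call` and `my_thread_num` are hypothetical stand-ins for `__kmpc_fork_call` and `omp_get_thread_num`, and the workers run sequentially rather than concurrently:

```c
#define NUM_THREADS 4

static int current_tid;            /* simulated thread id */
static int executed[NUM_THREADS];  /* which "threads" ran the microtask */

/* Stand-in for omp_get_thread_num(). */
static int my_thread_num(void) { return current_tid; }

/* The microtask the front end outlines from the parallel region body. */
static void omp_microtask(void) {
    int p = my_thread_num();
    executed[p] = 1;               /* stands in for printf("%d", p) */
}

/* Stand-in for __kmpc_fork_call: runs the microtask once per thread
 * (sequentially here; the real runtime does this concurrently). */
static void fork_call(void (*microtask)(void)) {
    for (current_tid = 0; current_tid < NUM_THREADS; current_tid++)
        microtask();
}
```

Because the parallel region becomes an ordinary function taking its context as arguments, CERE can capture and replay it like any other outlined region.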


Conclusion

I CERE breaks an application into faithful and retargetable codelets

I Piece-wise autotuning:
  I Different architectures
  I Compiler optimizations
  I Scalability
  I Other costly exploration analyses

I Limitations:
  I No support for codelets performing I/O (OS state not captured)
  I Cannot explore source-level optimizations
  I Tied to LLVM

I Full accuracy reports on NAS and SPEC'06 FP available at
  https://benchmark-subsetting.github.io/cereReports

15 16

Thanks for your attention

C E R E

https://benchmark-subsetting.github.io/cere

distributed under the LGPLv3

16 16

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 17: CERE: LLVM based Codelet Extractor and REplayer for

Thanks for your attention

C E R E

httpsbenchmark-subsettinggithubiocere

distributed under the LGPLv3

16 16

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 18: CERE: LLVM based Codelet Extractor and REplayer for

Bibliography I

Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE

Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55

Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322

Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132

Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE

17 16

Retargetable replay register state

I Issue Register state is non-portable between architectures

I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code

I Register agnostic capture

I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge

I Preliminar portability tests between x86 and ARM 32 bits

18 16

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

                 Atom      Core 2    Nehalem    Sandy Bridge   Ivy Bridge   Haswell
CPU              D510      E7500     L5609      E3-1240        i7-3770      i7-4770
Frequency (GHz)  1.66      2.93      1.86       3.30           3.40         3.40
Cores            2         2         4          4              4            4
L1 cache (KB)    2×56      2×64      4×64       4×64           4×64         4×64
L2 cache         2×512 KB  3 MB      4×256 KB   4×256 KB       4×256 KB     4×256 KB
L3 cache (MB)    -         -         12         8              8            8
Ram (GB)         4         4         8          6              16           16

32-bit portability test: ARM1176JZF-S on a Raspberry Pi Model B+

Clustering NR Codelets (cut for K = 14)

Codelet      Computation Pattern
toeplz_1     DP: 2 simultaneous reductions
rstrct_29    DP: MG Laplacian fine to coarse mesh transition
mprove_8     MP: Dense Matrix x vector product
toeplz_4     DP: Vector multiply in asc/desc order
realft_4     DP: FFT butterfly computation
toeplz_3     DP: 3 simultaneous reductions
svbksb_3     SP: Dense Matrix x vector product
lop_13       DP: Laplacian finite difference, constant coefficients
toeplz_2     DP: Vector multiply element wise in asc/desc order
four1_2      MP: First step FFT
tridag_2     DP: First order recurrence
tridag_1     DP: First order recurrence
ludcmp_4     SP: Dot product over lower half square matrix
hqr_15       SP: Addition on the diagonal elements of a matrix
relax2_26    DP: Red Black Sweeps Laplacian operator
svdcmp_14    DP: Vector divide element wise
svdcmp_13    DP: Norm + Vector divide
hqr_13       DP: Sum of the absolute values of a matrix column
hqr_12_sq    SP: Sum of a square matrix
jacobi_5     SP: Sum of the upper half of a square matrix
hqr_12       SP: Sum of the lower half of a square matrix
svdcmp_11    DP: Multiplying a matrix row by a scalar
elmhes_11    DP: Linear combination of matrix rows
mprove_9     DP: Subtracting a vector from a vector
matadd_16    DP: Sum of two square matrices element wise
svdcmp_6     DP: Sum of the absolute values of a matrix row
elmhes_10    DP: Linear combination of matrix columns
balanc_3     DP: Vector multiply element wise

Clustering NR Codelets (cut for K = 14): clusters group similar computation patterns

Long latency operations:
- svdcmp_14 (DP: Vector divide element wise)
- svdcmp_13 (DP: Norm + Vector divide)

Reduction sums:
- hqr_13 (DP: Sum of the absolute values of a matrix column)
- hqr_12_sq (SP: Sum of a square matrix)
- jacobi_5 (SP: Sum of the upper half of a square matrix)
- hqr_12 (SP: Sum of the lower half of a square matrix)
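The dendrogram is obtained by hierarchically clustering codelet feature vectors and cutting at K clusters. A pure-Python single-linkage sketch on toy 1-D features (the real clustering uses the static and dynamic features of the Profiling Features slide; the values here are made up):

```python
def single_linkage(points, k):
    """Agglomerative single-linkage clustering of 1-D points,
    merging the closest pair of clusters until k remain."""
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):  # single linkage: closest pair across two clusters
        return min(abs(points[i] - points[j]) for i in a for j in b)

    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)   # merge the closest pair
    return clusters

# Toy features: three "reduction sum" codelets near 0.2,
# two "long latency" codelets near 10.
groups = single_linkage([0.1, 0.3, 0.2, 10.0, 10.4], k=2)
```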

Capturing Architecture Change

[Figure: reference vs. target replay times (ms) for four codelets: LU/erhs.f:49 and FT/appft.f:45 (Cluster A: triple-nested, high latency operations (div and exp)), and BT/rhs.f:266 and SP/rhs.f:275 (Cluster B: stencil on five planes, memory bound). Captured on Nehalem (Ref: Freq 1.86 GHz, LLC 12 MB) and replayed on Core 2 (2.93 GHz, 3 MB) and Atom (1.66 GHz, 1 MB); bars annotated slower/faster]

Same Cluster = Same Speedup

[Figure: the same reference vs. target replay times (ms) for LU/erhs.f:49 and FT/appft.f:45 (Cluster A) and BT/rhs.f:266 and SP/rhs.f:275 (Cluster B) on Core 2 and Atom: codelets that belong to the same cluster show the same speedup when moving between architectures]
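Because codelets in a cluster share a speedup, measuring one representative per cluster on the target machine is enough to predict the others. A sketch with hypothetical timings (the codelet names mirror the figure; all numbers are invented):

```python
def predict_times(ref_times, clusters, measured_target):
    """Predict per-codelet times on a target machine.

    ref_times: codelet -> time (ms) on the reference machine
    clusters: representative -> members of its cluster
    measured_target: representative -> measured time (ms) on the target
    """
    predicted = {}
    for rep, members in clusters.items():
        speedup = measured_target[rep] / ref_times[rep]
        for codelet in members:          # same cluster => same speedup
            predicted[codelet] = ref_times[codelet] * speedup
    return predicted

ref = {"LU/erhs": 50.0, "FT/appft": 100.0, "BT/rhs": 80.0, "SP/rhs": 40.0}
clusters = {"LU/erhs": ["LU/erhs", "FT/appft"],   # cluster A
            "BT/rhs": ["BT/rhs", "SP/rhs"]}       # cluster B
measured = {"LU/erhs": 150.0, "BT/rhs": 40.0}     # one run per cluster
pred = predict_times(ref, clusters, measured)
# pred["FT/appft"] == 300.0, pred["SP/rhs"] == 20.0
```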

Feature Selection

- Genetic Algorithm: train on Numerical Recipes + Atom + Sandy Bridge
- Validated on NAS + Core 2
- The feature set is still among the best on NAS

[Figure: median error (%) vs. number of clusters (0 to 25) on Atom, Core 2, and Sandy Bridge; the GA feature set is compared against the worst, median, and best feature sets]
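The feature selection can be sketched as a standard genetic algorithm over feature bit-masks. Everything below is a toy: the fitness simply rewards a planted "useful" set of features, whereas the real fitness scores the clustering prediction error on Numerical Recipes.

```python
import random

def ga_select(n_features, fitness, pop_size=20, generations=40, seed=0):
    """Evolve feature bit-masks; keep the fitter half each generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[:pop_size // 2]         # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_features)
            child = a[:cut] + b[cut:]                  # one-point crossover
            child[rng.randrange(n_features)] ^= 1      # one-bit mutation
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)

# Toy fitness: features 0, 3 and 5 help, any other selected feature hurts.
useful = {0, 3, 5}
fitness = lambda mask: sum(+1 if i in useful else -1
                           for i, bit in enumerate(mask) if bit)
best = ga_select(8, fitness)
```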

Reduction Factor Breakdown

              Total reduction   Reduced invocations   Clustering
Atom          ×44.3             ×12                   ×3.7
Core 2        ×24.7             ×8.7                  ×2.8
Sandy Bridge  ×22.5             ×6.3                  ×3.6

Table: Benchmarking reduction factor breakdown with 18 representatives.
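The total factors decompose as the product of the two stages: replaying a reduced number of invocations per codelet, then benchmarking one representative per cluster. A quick check (the decimal placement of the extracted factors is an assumption):

```python
# Atom row: total reduction should equal the product of both stages.
invocation_reduction = 12.0    # replay a few invocations instead of all
clustering_reduction = 3.7     # benchmark one representative per cluster
total = invocation_reduction * clustering_reduction
assert abs(total - 44.3) < 0.2   # reported total for Atom (x44.3)

# Core 2 and Sandy Bridge rows factor the same way:
assert abs(8.7 * 2.8 - 24.7) < 0.4
assert abs(6.3 * 3.6 - 22.5) < 0.2
```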

Profiling Features

[Diagram: SP codelets are profiled with Likwid (performance counters per codelet, giving 4 dynamic features) and Maqao (static disassembly and analysis, giving 8 static features)]

8 static features (Maqao): Bytes Stored / cycle, Stalls, Estimated IPC, Number of DIV, Number of SD, Pressure in P1, Ratio ADD+SUB/MUL, Vectorization (FP, FP+INT, INT)

4 dynamic features (Likwid): FLOPS, L2 Bandwidth, L3 Miss Rate, Mem Bandwidth

Clang OpenMP front-end

C code:

  int main() {
    #pragma omp parallel
    {
      int p = omp_get_thread_num();
      printf("%d", p);
    }
  }

The Clang OpenMP front end lowers the parallel region to a microtask call (LLVM simplified IR, thread execution model):

  define i32 @main() {
  entry:
    call @__kmpc_fork_call(@omp_microtask, ...)
    ...
  }

  define internal void @omp_microtask() {
  entry:
    %p = alloca i32, align 4
    %call = call i32 @omp_get_thread_num()
    store i32 %call, i32* %p, align 4
    %1 = load i32* %p, align 4
    call @printf(%1)
  }


Retargetable replay: register state

- Issue: register state is non-portable between architectures
- Solution: capture at a function call boundary
  - No shared state through registers except function arguments
  - Get arguments directly through portable IR code
- Register agnostic capture:
  - Portable across Atom, Core 2, Haswell, Ivybridge, Nehalem, Sandybridge
  - Preliminary portability tests between x86 and ARM 32 bits

Retargetable replay: outlining regions

Step 1: Outline the region to capture using the CodeExtractor pass.

original:

  %0 = load i32* %i, align 4
  %1 = load i32* %s.addr, align 4
  %cmp = icmp slt i32 %0, %1
  ; loop branch here
  br i1 %cmp, label %for.body, label %for.exitStub

outlined:

  define internal void @outlined(i32* %i, i32* %s.addr, i32* %a.addr) {
    call void @start_capture(i32* %i, i32* %s.addr, i32* %a.addr)
    %0 = load i32* %i, align 4
    ...
    ret void
  }

original (after outlining):

  call void @outlined(i32* %i, i32* %s.addr, i32* %a.addr)

Step 2: Call start_capture just after the function call.
Step 3: At replay, re-inlining and variable cloning [Liao et al. 2010] steps ensure that the compilation context is close to the original.

Comparison to other Code Isolating tools

                  CERE       Code Isolator  Astex       Codelet Finder  SimPoint
Support
  Language        C(++),     Fortran        C,          C(++),          assembly
                  Fortran                   Fortran     Fortran
  Extraction      IR         source         source      source          assembly
  Indirections    yes        no             no          yes             yes
Replay
  Simulator       yes        yes            yes         yes             yes
  Hardware        yes        yes            yes         yes             no
Reduction
  Capture size    reduced    reduced        reduced     full            -
  Instances       yes        manual         manual      manual          yes
  Code sign.      yes        no             no          no              yes

Accuracy Summary NAS SER

[Figure: accurate replay and codelet coverage (% of Exec Time, 0 to 100) for the NAS A benchmarks lu, cg, mg, sp, ft, bt, is, ep, on Core2 and Haswell]

Figure: The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 20: CERE: LLVM based Codelet Extractor and REplayer for

Retargetable replay outlining regions

Step 1 Outline the region to capture using CodeExtractor pass

original

0 = load i32 i align 4

1 = load i32 saddr align 4

cmp = icmp slt i32 0 1

br i1 cmp loop branch here

label forbody

label forexitStub

define internal void outlined(

i32 i i32 saddr

i32 aaddr)

call void start_capture(i32 i

i32 saddr i32 aaddr)

0 = load i32 i align 4

ret void

original

call void outlined(i32 i

i32 saddr i32 aaddr)

Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original

19 16

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 21: CERE: LLVM based Codelet Extractor and REplayer for

Comparison to other Code Isolating tools

CERECodeIsolator

Astex CodeletFinder

SimPoint

SupportLanguage C(++)

Fortran Fortran C Fortran C(++)

Fortranassembly

Extraction IR source source source assemblyIndirections yes no no yes yes

ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no

ReductionCapture size reduced reduced reduced full -

Instances yes manual manual manual yesCode sign yes no no no yes

20 16

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 22: CERE: LLVM based Codelet Extractor and REplayer for

Accuracy Summary NAS SER

Core2 Haswell

0

25

50

75

100

NASA

lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep

o

f Exe

c T

ime

accurate replay codelet coverage

Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

21 16

Accuracy Summary SPEC FP 2006

Haswell

0

25

50

75

100SPEC

FP06

games

s

sphin

x3de

alII

wrf

povra

y

calcu

lixso

plex

leslie3

dton

to

gromac

s

zeus

mplbm milc

cactu

sADM

namd

bwave

s

gemsfd

td

spec

rand

o

f Exe

c T

ime

Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15

I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO

I low matching (soplex calculix gamess) warmup ldquobugsrdquo

22 16

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets (cut for K = 14)

Codelet      Computation Pattern
toeplz_1     DP: 2 simultaneous reductions
rstrct_29    DP: MG Laplacian fine to coarse mesh transition
mprove_8     MP: Dense Matrix x vector product
toeplz_4     DP: Vector multiply in asc/desc order
realft_4     DP: FFT butterfly computation
toeplz_3     DP: 3 simultaneous reductions
svbksb_3     SP: Dense Matrix x vector product
lop_13       DP: Laplacian finite difference, constant coefficients
toeplz_2     DP: Vector multiply element wise in asc/desc order
four1_2      MP: First step FFT
tridag_2     DP: First order recurrence
tridag_1     DP: First order recurrence
ludcmp_4     SP: Dot product over lower half square matrix
hqr_15       SP: Addition on the diagonal elements of a matrix
relax2_26    DP: Red Black Sweeps Laplacian operator
svdcmp_14    DP: Vector divide element wise
svdcmp_13    DP: Norm + Vector divide
hqr_13       DP: Sum of the absolute values of a matrix column
hqr_12_sq    SP: Sum of a square matrix
jacobi_5     SP: Sum of the upper half of a square matrix
hqr_12       SP: Sum of the lower half of a square matrix
svdcmp_11    DP: Multiplying a matrix row by a scalar
elmhes_11    DP: Linear combination of matrix rows
mprove_9     DP: Subtracting a vector with a vector
matadd_16    DP: Sum of two square matrices element wise
svdcmp_6     DP: Sum of the absolute values of a matrix row
elmhes_10    DP: Linear combination of matrix columns
balanc_3     DP: Vector multiply element wise

Clustering NR Codelets (cut for K = 14)

Similar computation patterns end up in the same cluster. Two example clusters:

Long Latency Operations: DP Vector divide element wise; DP Norm + Vector divide

Reduction Sums: DP Sum of the absolute values of a matrix column; SP Sum of a square matrix; SP Sum of the upper half of a square matrix; SP Sum of the lower half of a square matrix

Capturing Architecture Change

Codelets: LU/erhs.f:49 and FT/appft.f:45 (Cluster A: triple-nested loops with high-latency operations, div and exp); BT/rhs.f:266 and SP/rhs.f:275 (Cluster B: stencil on five planes, memory bound).

Architectures: Nehalem (reference: 1.86 GHz, 12 MB LLC); Core 2 (2.93 GHz, 3 MB); Atom (1.66 GHz, 1 MB).

Figure: execution time (ms) of each codelet on the reference versus each target, annotated per target with how much slower or faster the codelet runs.

Same Cluster = Same Speedup

Same codelets and architectures as the previous slide: codelets belonging to the same cluster experience the same speedup (or slowdown) when moving from the reference to a target architecture.

Figure: the same reference-versus-target execution-time plot (ms), grouped by cluster.

Feature Selection

- Genetic Algorithm: trained on Numerical Recipes + Atom + Sandy Bridge
- Validated on NAS + Core 2
- The feature set is still among the best on NAS

Figure: median prediction error against the number of clusters on Atom, Core 2, and Sandy Bridge; the GA-selected features are compared with the worst, median, and best feature sets.

Reduction Factor Breakdown

               Total   Reduced invocations   Clustering
Atom           ×443    ×12                   ×37
Core 2         ×247    ×8.7                  ×28
Sandy Bridge   ×225    ×6.3                  ×36

Table: Benchmarking reduction factor breakdown with 18 representatives.

Profiling Features

SP codelets are profiled with two tools:

Likwid (performance counters per codelet) — 4 dynamic features: FLOPS, L2 Bandwidth, L3 Miss Rate, Mem Bandwidth

Maqao (static disassembly and analysis) — 8 static features: Bytes Stored / cycle, Stalls, Estimated IPC, Number of DIV, Number of SD, Pressure in P1, Ratio (ADD+SUB)/MUL, Vectorization ratio (FP, FP+INT, INT)

Clang OpenMP front-end

C code:

    int main() {
    #pragma omp parallel
      {
        int p = omp_get_thread_num();
        printf("%d", p);
      }
    }

The Clang OpenMP front end outlines the parallel region into a microtask handed to the runtime's fork call (LLVM simplified IR, thread execution model):

    define i32 @main() {
    entry:
      call @__kmpc_fork_call @omp_microtask()
    }

    define internal void @omp_microtask() {
    entry:
      %p = alloca i32, align 4
      %call = call i32 @omp_get_thread_num()
      store i32 %call, i32* %p, align 4
      %1 = load i32* %p, align 4
      call @printf(%1)
    }


Accuracy Summary SPEC FP 2006

Figure: per-benchmark bar chart on Haswell (gamess, sphinx3, dealII, wrf, povray, calculix, soplex, leslie3d, tonto, gromacs, zeusmp, lbm, milc, cactusADM, namd, bwaves, gemsfdtd, specrand), in % of execution time. The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

- low coverage (sphinx3, wrf, povray): regions of < 2000 cycles, or IO
- low matching (soplex, calculix, gamess): warmup "bugs"

CERE page capture dump size

Figure: dump size (MB) per NAS benchmark (bt, cg, ep, ft, is, lu, mg, sp), comparing the full dump (Codelet Finder) with the page-granularity dump (CERE). The CERE page-granularity dump only contains the pages accessed by a codelet; therefore it is much smaller than a full memory dump.

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 24: CERE: LLVM based Codelet Extractor and REplayer for

CERE page capture dump size

94

27

106

2651

1

367

117131

46

89

8

472

101 96

22

0

50

100

150

200

bt cg ep ft is lu mg sp

Size

(MB)

full dump (Codelet Finder) page granularity dump (CERE)

Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump

23 16

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 25: CERE: LLVM based Codelet Extractor and REplayer for

CERE page capture overhead

CERE page capture is coarser but faster

CERE ATOM 325 PIN 171 Dyninst 40

cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366

Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]

24 16

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 26: CERE: LLVM based Codelet Extractor and REplayer for

CERE cache warmup

for (i=0 i lt size i++) a[i] += b[i]

array a[] pages array b[] pages

21 22 23 50 51 52

pages addresses

21 5051 20

46 17 47 18 48 19 49

22

Reprotect 20

memory

FIFO (most recently unprotected)

warmup page trace

25 16

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Same Cluster = Same Speedup

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

29 16

Feature Selection

I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge

I Validated on NAS + Core 2I The feature set is still among the best on NAS

Atom

Core 2

Sandy Bridge

1020304050

10203040

10203040

0 5 10 15 20 25Number of clusters

Med

ian

e

rror

Worst

Median

Best

GA features

30 16

Reduction Factor Breakdown

Reduction Total Reduced invocations Clustering

Atom 443 times12 times37Core 2 247 times87 times28

Sandy Bridge 225 times63 times36

Table Benchmarking reduction factor breakdown with 18representatives

31 16

Profiling Features

SP

codelets

Likwid

Maqao

Performance counters per codelet

Static dissassembly and analysis

4 dynamic features

8 static features

Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)

FLOPSL2 BandwidthL3 Miss RateMem Bandwidth

32 16

Clang OpenMP front-end

void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)

C code

Clang OpenMPfront end

define i32 main() entrycall __kmpc_fork_call omp_microtask()

define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)

LLVM simplified IR Thread execution model

33 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
      • Applications
        • Architecture selection
        • Compiler flags tuning
        • Scalability prediction
          • Demo
            • Capture and replay in NAS BT
            • Simple flag replay for NAS FT
              • Conclusion
              • Appendix
Page 27: CERE: LLVM based Codelet Extractor and REplayer for

Test architectures

Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell

CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16

32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+

26 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation PatternDP 2 simultaneous reductions

DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product

DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions

SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order

MP First step FFTDP First order recurrenceDP First order recurrence

SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix

DP Red Black Sweeps Laplacian operatorDP Vector divide element wise

DP Norm + Vector divideDP Sum of the absolute values of a matrix column

SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector

DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row

DP Linear combination of matrix columnsDP Vector multiply element wise

27 16

Clustering NR Codelets

cut for K = 14

Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3

lop_13toeplz_2four1_2tridag_2tridag_1

ludcmp_4hqr_15

relax2_26svdcmp_14svdcmp_13

hqr_13hqr_12_sqjacobi_5hqr_12

svdcmp_11elmhes_11mprove_9

matadd_16svdcmp_6elmhes_10balanc_3

Computation Pattern

DP Vector divide element wiseDP Norm + Vector divide

DP Sum of the absolute values of a matrix columnSP Sum of a square matrix

SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix

ReductionSums

Long LatencyOperations

Similar computation patterns

27 16

Capturing Architecture Change

LUerhsf 49

FTappftf 45

BTrhsf 266

SPrhsf 275

Cluster A triple-nestedhigh latency operations(div and exp)

Cluster B stencil on fiveplanes (memory bound)

Core 2rarr 293 GHzrarr 3 MB

Atomrarr 166 GHzrarr 1 MB

Nehalem (Ref)Freq 186 GHzLLC 12 MB

50

100

200

400

slower

50

100

200

400

8001500

3000

slower

40 50 60

slower

50

100

200

400faster

Reference Target

ms

ms

28 16

Clustering NR Codelets

cut for K = 14

Codelet      Computation Pattern
toeplz_1     DP: 2 simultaneous reductions
rstrct_29    DP: MG Laplacian fine-to-coarse mesh transition
mprove_8     MP: Dense matrix x vector product
toeplz_4     DP: Vector multiply in asc/desc order
realft_4     DP: FFT butterfly computation
toeplz_3     DP: 3 simultaneous reductions
svbksb_3     SP: Dense matrix x vector product
lop_13       DP: Laplacian finite difference, constant coefficients
toeplz_2     DP: Vector multiply element-wise in asc/desc order
four1_2      MP: First step FFT
tridag_2     DP: First-order recurrence
tridag_1     DP: First-order recurrence
ludcmp_4     SP: Dot product over lower half of a square matrix
hqr_15       SP: Addition on the diagonal elements of a matrix
relax2_26    DP: Red-Black sweeps Laplacian operator
svdcmp_14    DP: Vector divide element-wise
svdcmp_13    DP: Norm + vector divide
hqr_13       DP: Sum of the absolute values of a matrix column
hqr_12_sq    SP: Sum of a square matrix
jacobi_5     SP: Sum of the upper half of a square matrix
hqr_12       SP: Sum of the lower half of a square matrix
svdcmp_11    DP: Multiplying a matrix row by a scalar
elmhes_11    DP: Linear combination of matrix rows
mprove_9     DP: Subtracting a vector from a vector
matadd_16    DP: Sum of two square matrices element-wise
svdcmp_6     DP: Sum of the absolute values of a matrix row
elmhes_10    DP: Linear combination of matrix columns
balanc_3     DP: Vector multiply element-wise

27 / 16

Clustering NR Codelets

cut for K = 14

Codelets with similar computation patterns fall in the same cluster, for example:

Long latency operations: DP vector divide element-wise; DP norm + vector divide

Reduction sums: DP sum of the absolute values of a matrix column; SP sum of a square matrix; SP sum of the upper half of a square matrix; SP sum of the lower half of a square matrix

27 / 16

Capturing Architecture Change

Codelets: LU erhs.f:49, FT appft.f:45, BT rhs.f:266, SP rhs.f:275

Cluster A: triple-nested loops, high-latency operations (div and exp)
Cluster B: stencil on five planes (memory bound)

Architectures: Nehalem (reference): 1.86 GHz, 12 MB LLC; Core 2: 2.93 GHz, 3 MB LLC; Atom: 1.66 GHz, 1 MB LLC

[Figure: per-codelet replay times (ms) on the reference versus each target, with slower/faster annotations showing how each codelet's behavior shifts with the architecture change]

28 / 16

Same Cluster = Same Speedup

Codelets: LU erhs.f:49, FT appft.f:45, BT rhs.f:266, SP rhs.f:275

Cluster A: triple-nested loops, high-latency operations (div and exp)
Cluster B: stencil on five planes (memory bound)

Architectures: Nehalem (reference): 1.86 GHz, 12 MB LLC; Core 2: 2.93 GHz, 3 MB LLC; Atom: 1.66 GHz, 1 MB LLC

[Figure: same data as the previous slide; codelets belonging to the same cluster experience the same speedup or slowdown when moving from the reference to a target architecture]

29 / 16

Feature Selection

I Genetic Algorithm: trained on Numerical Recipes + Atom + Sandy Bridge
I Validated on NAS + Core 2
I The feature set is still among the best on NAS

[Figure: median prediction error versus number of clusters (0 to 25) on Atom, Core 2, and Sandy Bridge, comparing the worst, median, and best feature sets against the GA-selected features]

30 / 16

Reduction Factor Breakdown

Reduction       Total   Reduced invocations   Clustering
Atom            443     ×12                   ×37
Core 2          247     ×8.7                  ×28
Sandy Bridge    225     ×6.3                  ×36

Table: Benchmarking reduction factor breakdown with 18 representatives

31 / 16

Profiling Features

SP codelets are profiled with two tools:

I Likwid (performance counters per codelet): 4 dynamic features: FLOPS, L2 bandwidth, L3 miss rate, memory bandwidth
I Maqao (static disassembly and analysis): 8 static features: bytes stored per cycle, stalls, estimated IPC, number of DIV, number of SD, pressure in P1, ratio (ADD+SUB)/MUL, vectorization (FP, FP+INT, INT)

32 / 16

Clang OpenMP front-end

C code:

    void main() {
        #pragma omp parallel
        {
            int p = omp_get_thread_num();
            printf("%d", p);
        }
    }

The Clang OpenMP front end lowers this to a simplified LLVM IR that encodes the thread execution model:

    define i32 @main() {
    entry:
      call @__kmpc_fork_call(@omp_microtask, ...)
    }

    define internal void @omp_microtask() {
    entry:
      %p = alloca i32, align 4
      %call = call i32 @omp_get_thread_num()
      store i32 %call, i32* %p, align 4
      %1 = load i32* %p, align 4
      call @printf(%1)
    }

33 / 16

  • Extracting and Replaying Codelets
    • Faithful
    • Retargetable
  • Applications
    • Architecture selection
    • Compiler flags tuning
    • Scalability prediction
  • Demo
    • Capture and replay in NAS BT
    • Simple flag replay for NAS FT
  • Conclusion
  • Appendix