DYNAMO vs. ADORE: A Tale of Two Dynamic Optimizers
Wei Chung Hsu (徐慰中), Computer Science Department, 交通大學
(work was done at the University of Minnesota, Twin Cities)
3/05/2010
Dynamo
• Dynamo is a dynamic optimizer.
• It won the best paper award at PLDI 2000 and has been cited 612 times.
• Work started at HP Labs and the HP systems lab. MIT took over and ported it to x86, calling it DynamoRIO. This group later started a company, Determina (since acquired by VMware).
• Considered revolutionary, since optimization had always been performed statically (i.e., at compile time).
SPEC CINT2006 for Opteron X4
Name        Description                    IC×10^9    CPI   Tc (ns)  Exec time  Ref time  SPECratio
perl        Interpreted string processing    2,118   0.75    0.40       637       9,777      15.3
bzip2       Block-sorting compression        2,389   0.85    0.40       817       9,650      11.8
gcc         GNU C Compiler                   1,050   1.72    0.40       724       8,050      11.1
mcf         Combinatorial optimization         336  10.00    0.40     1,345       9,120       6.8
go          Go game (AI)                     1,658   1.09    0.40       721      10,490      14.6
hmmer       Search gene sequence             2,783   0.80    0.40       890       9,330      10.5
sjeng       Chess game (AI)                  2,176   0.96    0.40       837      12,100      14.5
libquantum  Quantum computer simulation      1,623   1.61    0.40     1,047      20,720      19.8
h264avc     Video compression                3,102   0.80    0.40       993      22,130      22.3
omnetpp     Discrete event simulation          587   2.94    0.40       690       6,250       9.1
astar       Games/path finding               1,082   1.79    0.40       773       7,020       9.1
xalancbmk   XML parsing                      1,058   2.70    0.40     1,143       6,900       6.0
Geometric mean                                                                               11.7
Very high cache miss rates; the ideal CPI would be 0.33.
Time = CPI × Instruction count × Clock period
Where have all the cycles gone?
• Cache misses
  – Capacity, Compulsory/Cold, Conflict, Coherence
  – I-cache and D-cache
  – TLB misses
• Branch mis-predictions
  – Static and dynamic prediction
  – Mis-speculation
• Pipeline stalls
  – Ineffective code scheduling, often caused by memory aliasing
These causes are unpredictable and hard to deal with at compile time.
Trend of Multi-cores
Exploiting these potentials demands thread-level parallelism.
[Intel Core i7 die photo]
Exploiting Thread-Level Parallelism
[Figure: three execution timelines for a Store *p followed by a Load *q.
– Sequential: the load waits for the store because of a possible dependence.
– Traditional parallelization: the compiler runs them in parallel only if it can prove p != q; when it cannot tell (p != q ??), the compiler gives up.
– Thread-Level Speculation (TLS): execution speculates that p != q and runs in parallel. If the speculation is wrong (p == q: Store 88 while the speculative thread loads the stale value 20), the speculation fails and the load re-executes, this time seeing 88.]
Potentially more parallelism with speculation, but the aliasing is unpredictable.
Dynamic Optimizers
• Java VM (JVM) with a JIT compiler (dynamic compilation or adaptive optimization)
• Dynamic Binary Optimizers (DBO)
  – Native-to-native dynamic binary optimizers (x86 → x86, x86-32 → x86-64, IA64 → IA64)
  – Non-native dynamic binary translators (e.g., x86 → IA64, ARM → MIPS, PPC → x86; QEMU, VMware, Rosetta)
More on why dynamic binary optimization
• New architecture/micro-architecture features offer more opportunity for performance, but are not effectively exploited by legacy binaries.
  – x86 → P5/P6/PII/PIII, x86-32/x86-64, PA 7200/8000, …
• Software evolution and ISV behaviors reduce the effectiveness of traditional static optimizers.
  – DLLs, middleware, binary distribution, …
• Profile-sensitive optimizations would be more effective if performed at runtime.
  – predication, speculation, branch prediction, prefetching
• A multi-core environment with dynamic resource sharing makes static optimization challenging.
  – shared caches, off-chip bandwidth, shared FUs
How Dynamo Works
[Flowchart: Dynamo interprets until a taken branch, then looks up the branch target in the code cache. On a hit, it jumps to the code cache (control returns through a signal handler). On a miss, it increments a counter for the branch target; when the counter exceeds a threshold, the start-of-trace condition is met: Dynamo interprets with code generation until the end-of-trace condition, then creates the trace, optimizes it, and emits it into the code cache.]
Dynamo is VM based.
Trace Selection
[Figure: a control-flow graph with basic blocks A through I, including a call and its return. Trace selection picks the hot path A, C, D, F, G, I, E and lays it out contiguously in the trace cache; trace-exit branches lead to B, to H, and back to the runtime.]
Backpatching
[Figure: the trace A, C, D, F, G, I, E with exit branches to B and H and back to the runtime. When H becomes hot, a new trace (H, I, E) is selected starting from H, and the trace-exit branch in block F is backpatched to branch to the new trace instead of exiting to the runtime.]
Execution Migrates to Code Cache
[Figure: regions 1, 2, and 3 of a.out are turned into traces 0 through 4 in the code cache by the interpreter/emulator, the trace selector, and the optimizer; over time, execution migrates from a.out into the code cache.]
Trace Based Optimizations
• Full and partial redundancy elimination
• Dead code elimination
• Trace scheduling
• Instruction cache locality improvement
• Dynamic procedure inlining (or procedure outlining)
• Some loop-based optimizations
Summary of Dynamo
• Dynamic binary optimization customizes performance delivery:
  – Code is optimized by how the code is used (dynamic trace formation and trace-based optimizations).
  – Code is optimized for the machine it runs on.
  – Code is optimized when all executables are available.
  – Only the parts of the code that really matter are optimized.
ADORE
• ADORE means ADaptive Object code RE-optimization.
• It was developed at the CSE department, University of Minnesota, Twin Cities.
• It applied a very different model for dynamic optimization systems.
• Considered evolutionary; cited 61 times.
Dynamic Binary Optimizer's Models
[Figure: two software stacks.
– In the first model, the DBO sits between the application binaries and the operating system. It translates most execution paths and keeps them in a code cache, which makes control easy to maintain: Dynamo (PA-RISC), DynamoRIO (x86).
– In the second model, the DBO runs alongside the application binaries on top of the operating system. It translates only hot execution paths and keeps them in a code cache, for lower overhead: ADORE (IA64, SPARC), COBRA (IA64, x86; ongoing).]
ADORE Framework
[Figure: the hardware performance monitoring unit (PMU) raises interrupts on events; the kernel collects samples and interrupts on kernel-buffer overflow. The main thread initializes the PMU and the code cache. A separate dynamic optimization thread performs phase detection, trace selection, optimization, and deployment: on a phase change it passes traces to the optimizer, places optimized traces in the code cache, and patches the original traces to branch to them.]
Thread Level View
[Figure: the application initializes ADORE; the kernel-buffer overflow handler fills a per-thread user buffer. Each time a user buffer fills, ADORE is invoked on that thread (threads 1 and 2 shown) and then sleeps until the next buffer-full event. The user-buffer-full condition is maintained for one main event, usually CPU_CYCLES.]
Perf. of ADORE/Itanium on SPEC2000
Performance on BLAST
[Chart: % speed-up of ADORE on BLAST queries (blastn nt.1, blastn nt.10(4), blastn nt.10(5), blastn nt.10(7), blastp aa.1, blastx nt.1, tblastn aa.1) for binaries compiled with GCC -O2, ORC -O2, and ECC -O2; speed-ups range from about -15% to 60%.]
ADORE vs. Dynamo
Tasks                    Dynamo                                  ADORE
Observation (profiling)  Interpretation/instrumentation based    HPM sampling based
Optimization             Trace layout and classic optimizations  I/D-cache related optimizations (prefetching + trace layout)
Code cache               Needs a large code cache                A small code cache is sufficient
Re-direction             Interpretation and trace chaining       Code patching
ADORE on Multi-cores
• COBRA (Continuous Object code Re-Adaptation) is a follow-up framework, implemented on Itanium Montecito and on x86's new multi-core machines.
• ADORE on SPARC Panther (UltraSPARC IV+) multi-core machines.
• ADORE for TLS tuning.
COBRA Framework
• Optimization thread: centralized control; performs initialization, trace selection, trace optimization, and trace patching.
• Monitor threads: localized control; maintain per-thread profiles.
[Figure: on a single system image, a perfmon sampling kernel driver fills a kernel sampling buffer (KSB) from the hardware performance counters of processors 0 through 3. Per-thread monitoring threads drain the KSB into per-thread user sampling buffers (USB) and profile buffers (PB); a per-thread phase and profile manager feeds the optimization thread (main controller, trace selection and optimization, trace patcher), which manages the trace cache. The COBRA monitoring and optimizing threads run in the same address space as the multi-threaded program's main/working threads.]
Startup of 4 thread OpenMP Program
[Figure: the main process (a worker thread) vforks a monitoring process, which starts an optimizer thread and one monitor thread per worker; OMP monitor threads and additional worker threads are created with pthread_create. Numbered steps 1 through 6 show the startup order. All threads share the same address space.]
Prefetch vs. NoPrefetch
• The prefetch version, when running with 4 threads, suffers significantly from L2_OZQ_FULL stalls.
[Chart: scalability of the DAXPY kernel on a 4-way Itanium 2 machine; normalized execution time relative to the baseline for 1, 2, and 4 threads, with and without prefetch, at data working-set sizes of 128K, 512K, and 2M. At 4 threads, removing prefetch improves execution time by 26% to 34%.]
Prefetch vs. Prefetch with .excl
• The .excl hint prefetches a cache line in exclusive state instead of shared state (invalidation-based cache coherence protocol).
[Chart: scalability of the DAXPY kernel on a 4-way Itanium 2 machine; normalized execution time for 1, 2, and 4 threads, prefetch without and with the .excl hint, at data working-set sizes of 128K, 512K, and 2M. At 4 threads, the .excl hint improves execution time by 12% to 15%.]
Execution time on 4-way SMP
[Chart: speedup relative to the prefetch baseline on the NPB OMP v3.0 benchmarks (bt.S, sp.S, lu.S, ft.S, mg.S, cg.S, and average) for 4 threads with noprefetch and prefetch.excl.]
noprefetch: up to 15%, average 4.7% speedup
prefetch.excl: up to 8%, average 2.7% speedup
Execution time on cc-NUMA
[Chart: speedup relative to the prefetch baseline on the NPB OMP v3.0 benchmarks (bt.S, sp.S, lu.S, ft.S, mg.S, cg.S, and average) for 8 threads with noprefetch and prefetch.excl.]
noprefetch: up to 68%, average 17.5% speedup
prefetch.excl: up to 18%, average 8.5% speedup
Summary of Results from COBRA
• We showed that coherence misses caused by aggressive prefetching can limit the scalability of multithreaded programs on scalable shared-memory multiprocessors.
• Guided by runtime profiles, we experimented with two optimizations:
  – Reducing the aggressiveness of prefetching
    • Up to 15%, average 4.7% speedup on a 4-way SMP
    • Up to 68%, average 17.5% speedup on an SGI Altix cc-NUMA
  – Using the exclusive hint for prefetches
    • Up to 8%, average 2.7% speedup on a 4-way SMP
    • Up to 18%, average 8.5% speedup on an SGI Altix cc-NUMA
ADORE/SPARC
• ADORE has been ported to the SPARC/Solaris platform since 2005.
• Some porting issues:
  – ADORE uses the libcpc interface on Solaris to conduct runtime profiling. A kernel buffer enhancement was added to Solaris 10.0 to reduce profiling and phase-detection overhead.
  – Reachability is a real problem (e.g., Oracle, Dyna3D).
  – The lack of a branch trace buffer is painful (e.g., BLAST).
Performance of In-Thread Opt. (USIII+)
[Chart: speed-ups of in-thread optimization on Base and Peak binaries, ranging from about -10% to 60%.]
Helper Thread Prefetching for Multi-Core
[Figure: the main thread runs on the first core. A helper thread on the second core spin-waits for a trigger (about 65 cycles of activation delay), initiates prefetches ahead of the main thread so that its L2 cache misses are avoided, then spins again waiting for the next trigger.]
Performance of Dynamic Helper Thread (on Sun UltraSPARC IV+)
[Chart: speed-ups on Base and Peak binaries, ranging from about -20% to 100%.]
Evaluation Environment for TLS
Benchmarks:              SPEC2000 written in C, -O3 optimization
Underlying architecture: 4-core chip multiprocessor (CMP); speculation supported by the coherence protocol
Simulator:               superscalar with a detailed memory model; simulates communication latency; models bandwidth and contention; detailed, cycle-accurate simulation
[Figure: four processors (P), each with a private cache (C), connected by an interconnect.]
Dynamic Tuning for TLS
[Chart: TLS speedups of 1.17x, 1.23x, and 1.37x, with parallel code overhead broken out.]
Summary of ADORE
• ADORE uses hardware performance monitoring (HPM) to implement a lightweight runtime profiling system. Efficient profiling and phase detection is the key to the success of dynamic native binary optimizers.
• ADORE can speed up real-world large applications optimized by production compilers.
• ADORE works on two architectures: Itanium and SPARC. COBRA is a follow-up system of ADORE; it works on Itanium and x86.
• ADORE/COBRA can also optimize for multi-cores.
• ADORE has recently been applied to dynamic TLS tuning.
Conclusion
"It was the best of times, it was the worst of times…" -- opening line of "A Tale of Two Cities"
• Best of times for research: new areas where innovations are needed.
• Worst of times for research: saturated areas where technologies are mature or well understood, and it is hard to innovate.