Utsunomiya University · Seminar @ UW-Madison · November 12, 2004
Profile-Based Dynamic Optimization Research for Future Computer Systems
Takanobu Baba
Department of Information Science
Utsunomiya University, Japan
http://aquila.is.utsunomiya-u.ac.jp
November 12, 2004
Brief history of ‘my’ research
• 1970s: The MPG System, a machine-independent efficient microprogram generator
• 1980s: MUNAP, a two-level microprogrammed multiprocessor computer
• 1990s: A-NET, a language-architecture integrated approach for parallel object-oriented computation
A Two-Level Microprogrammed Multiprocessor Computer: MUNAP
A 28-bit vertical microinstruction activates up to 4 nanoprograms in 4 PUs every machine cycle.
[Figure: the MUNAP machine]
A Parallel Object-Oriented Total Architecture: A-NET (Actors-NETwork)
• Massively parallel computation
• Each node consists of a PE and a router.
• The PE has a language-oriented, typical CISC architecture.
• The programmable router is topology-independent.
[Figure: the A-NET multicomputer]
Current dynamic optimization projects
• Computation-oriented:
  – YAWARA: a meta-level optimizing computer system
  – HAGANE: binary-level multithreading
• Communication-oriented:
  – Spec-All: an aggressive read/write access speculation method for DSM systems
  – Cross-Line: an adaptive router using dynamic information
YAWARA: A Meta-Level Optimizing Computer System
Background
• Moore's Law will be sustained by semiconductor technology.
• How can we use this huge number of transistors to speed up program execution?
• Our idea is to devote some chip area to dynamically and autonomously tuning the configuration of an on-chip multiprocessor.
[Figure: the base-level/meta-level organization. The base-level processor executes instructions and data from memory and produces the results of computation; the meta-level processor receives a profile of control and data from the base level and feeds back the results of optimization.]
Design considerations
• HW vs. SW reconfiguration → SW reconfiguration
• Static vs. dynamic reconfiguration → both static and dynamic reconfiguration capabilities
• Homogeneous vs. heterogeneous architecture → a unified homogeneous structure
Basic concepts of thread-level reconfiguration
MT: Management Thread, PT: Profiling Thread, OT: Optimizing Thread, CT: Computing Thread
[Figure: the meta level runs one MT together with PTs and OTs for profiling and optimization; the base level runs the application's CTs. All threads share memory.]
Execution model
[Figure: two execution styles. Profiling-centric: the Management Thread (MT) activates a Profiling Thread (PT) that collects a profile alongside the Computing Thread (CT); when the optimization-initiation condition is satisfied, the Optimizing Thread (OT) wakes up, optimizes, and goes back to sleep. Computing-centric: the CT itself collects the profile, and the PT and OT are activated only once the optimization-initiation condition is satisfied.]
Change of configurations by meta-level optimization
[Figure: three snapshots of the thread-engine grid. The mix of MT, OT, PT, and CT instances on the meta level and base level shifts over time; as optimization proceeds, fewer engines are devoted to profiling and optimizing threads and more to computing threads.]
The YAWARA System
• an implementation of the computation model
• the SW system consists of static and dynamic optimization systems
• the HW system comprises uniformly structured thread engines (TEs); each TE can execute both base- and meta-level threads
The spirit of YAWARA: "A flexible method prevails where a rigid one fails."
Software System
[Figure: source code (C/C++, Java, Fortran, ...) enters the SOS (Static Optimization System), which combines code-analysis information with an execution profile (static feedback) to produce an executable image. At run time, the DOS (Dynamic Optimization System) uses the run-time profile (dynamic feedback) while the image runs on the thread engines (TEs), yielding the execution results.]
Hardware System
[Figure: a 4x4 grid of thread engines (TEs) connected by a network. Each TE contains execution control, a register file (INT x 4 + FP x 1), I$ and D$ caches, a thread-data cache, a thread-code cache holding threads 0..N, a profiling buffer with a profiling controller, feedback-directed resource control, and network-IN/network-OUT ports to and from the network.]
Example application: compress
[Figure: control-flow graphs showing a hot loop (basic blocks 8, 9, 11, 12, 13, 21, 22), the hot paths through it, and the program's phased behavior.]
• hot loop / hot path detection (PT, OT)
• speculative multithreading profiling (PT)
• speculative multithreading code generation, helper thread generation, and path predictor generation (OT)
• speculative multithreading using the path prediction mechanism (CT) under a management thread (MT): on a hit, the speculative thread for hot path #0 continues; on a miss of #1, execution switches paths.
Conclusion: YAWARA
• We proposed an autonomous reconfiguration mechanism based on dynamic program behavior.
• We also proposed a software and hardware system, called YAWARA, that implements the reconfiguration efficiently.
• We are now developing the software system and the simulator.
Prediction and Execution Methods of Frequently Executed Two Paths for Speculative Multithreading
YAWARA@PDCS2004
Occurrence ratios of the top-two paths

loop                 #1 path   #2 path   other paths
compress/compress    54.5%     22.4%     23.1%
ijpeg/forward_DCT    48.2%     42.1%     9.7%
m88ksim/killtime     97.0%     3.0%      0.0%
li/sweep             80.7%     19.3%     0.0%

The top two paths occupy 80-100% of execution.
Two-level path prediction
• Introducing two-level branch prediction:
  – a history register keeps the sequence of #1-path executions (1: the #1 path, 0: any other path)
  – a counter table counts #1-path executions
[Figure: Single Path Predictor (SPP). A 4-bit history register (e.g. 1101) indexes a counter table v0..v15; with threshold X, predict #1 if v13 >= X, otherwise predict #2.]
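The SPP scheme above can be sketched in a few lines. This is our own minimal model, not the authors' implementation: it assumes saturating counters indexed by a k-bit history of #1-path outcomes, and the class and parameter names are ours.

```python
# Minimal sketch of the Single Path Predictor (SPP): a history register
# of #1-path outcomes (1 = #1 path, 0 = any other path) indexes a
# counter table; predict #1 when the counter reaches the threshold X.

class SinglePathPredictor:
    def __init__(self, history_len=4, threshold=2, cmax=3):
        self.history_len = history_len
        self.threshold = threshold            # X in the slide
        self.cmax = cmax                      # saturating-counter maximum
        self.history = 0                      # history register
        self.table = [0] * (1 << history_len) # counter table v0..v(2^k - 1)

    def predict(self):
        # Predict path #1 iff the counter selected by the current
        # history pattern has reached the threshold.
        return 1 if self.table[self.history] >= self.threshold else 2

    def update(self, took_path1):
        # Train the selected counter, then shift the outcome into history.
        c = self.table[self.history]
        self.table[self.history] = min(c + 1, self.cmax) if took_path1 else max(c - 1, 0)
        mask = (1 << self.history_len) - 1
        self.history = ((self.history << 1) | (1 if took_path1 else 0)) & mask
```

With an all-#1 path stream the predictor warms up quickly: starting cold it predicts #2, and after a few #1 outcomes the trained counter pushes it to predict #1.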
Another path predictor
[Figure: Dual Path Predictor (DPP). A #1-path history register (e.g. 1101) indexes a #1-path counter table v0..v15, while a #2-path history register (e.g. 0010) indexes a #2-path counter table; predict #1 if v13 >= v2, otherwise predict #2.]
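The DPP can be sketched analogously: two history registers and two counter tables, one per path, with the prediction decided by comparing the two selected counters. Again this is our own illustrative model with invented names, not the authors' code.

```python
# Minimal sketch of the Dual Path Predictor (DPP): track #1-path and
# #2-path outcomes separately; predict #1 when its selected counter is
# at least as large as the #2 path's (v1 >= v2, as in the slide).

class DualPathPredictor:
    def __init__(self, history_len=4, cmax=7):
        self.hl = history_len
        self.cmax = cmax
        self.hist = [0, 0]                                  # histories for #1, #2
        self.table = [[0] * (1 << history_len) for _ in range(2)]

    def predict(self):
        v1 = self.table[0][self.hist[0]]
        v2 = self.table[1][self.hist[1]]
        return 1 if v1 >= v2 else 2

    def update(self, taken_path):
        # taken_path: 1, 2, or 0 for "some other path".
        mask = (1 << self.hl) - 1
        for p in (0, 1):
            hit = (taken_path == p + 1)
            c = self.table[p][self.hist[p]]
            self.table[p][self.hist[p]] = min(c + 1, self.cmax) if hit else max(c - 1, 0)
            self.hist[p] = ((self.hist[p] << 1) | (1 if hit else 0)) & mask
```

Unlike the SPP's fixed threshold X, the DPP adapts: whichever path has been winning under its own recent history wins the prediction.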
Single Speculation (SS)
When a thread fails:
• abort the succeeding threads
• run the recovery process
• execute the thread non-speculatively
• then continue speculative execution along the #1 path
Speculation failure degrades performance.
Double Speculation (DS)
• Even when the first speculation fails, the secondary choice has a high probability of success, because the top two paths are dominant.

loop                 #1 path   #2 path   expected #2 hit
compress/compress    54.5%     22.4%     49.2%
ijpeg/forward_DCT    48.2%     42.1%     81.3%
m88ksim/killtime     97.0%     3.0%      100%
li/sweep             80.7%     19.3%     100%
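The "expected #2 hit" figures follow from simple conditioning, which we can reproduce: given that the #1 path was not taken, the #2 path runs with probability p2 / (1 - p1). The snippet below is our own arithmetic check, not the authors' tool.

```python
# Reproduce the expected #2-hit ratios from the top-two path
# occurrence ratios: conditioned on a #1-path miss, the #2 path
# hits with probability p2 / (1 - p1).

paths = {                      # (p1, p2): top-two path occurrence ratios
    "compress/compress": (0.545, 0.224),
    "ijpeg/forward_DCT": (0.482, 0.421),
    "m88ksim/killtime":  (0.970, 0.030),
    "li/sweep":          (0.807, 0.193),
}

for name, (p1, p2) in paths.items():
    hit = 100 * p2 / (1 - p1)
    print(f"{name}: expected #2 hit = {hit:.1f}%")
```

For compress, for example, 22.4 / (100 - 54.5) gives the 49.2% shown above.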
Double Speculation (DS)
If the secondary speculation succeeds, the performance loss is not so large.
[Figure: on a #1-path miss, instead of falling back to non-speculative execution, a secondary speculative thread is launched on the #2 path after the recovery process; speculative execution then continues along the #1 path.]
Evaluation flow
• hot-path detection (SIMCA)
• path-history acquisition (SIMCA) → path execution history
• thread-code generation → thread codes: the #1-path speculative thread, the #2-path speculative thread, and the non-speculative thread
• performance estimator → speculation hit ratio and speed-up ratio
Prediction success ratio
[Figure: prediction success ratio (%) versus history length (1-16) for SPP and DPP, on compress and forward_DCT.]
Prediction success ratio
[Figure: prediction success ratio (%) versus history length (1-16) for SPP and DPP, on killtime and sweep.]
Speed-up ratio
[Figure: speed-up ratio versus history length (1-16) for SS and DS on compress and forward_DCT, with S100 and #1-path-only reference points.]
Speed-up ratio
[Figure: speed-up ratio versus history length (1-16) for SS and DS on killtime and sweep, with S100 and #1-path-only reference points.]
Conclusions: Two-Path-Limited Speculative Multithreading
• We proposed a path prediction method, the predictors, and speculation methods for path-based speculative multithreading.
• Preliminary performance estimation results were shown.
Current and future work
• Accurate and detailed evaluation on various applications (SPEC 2000, MediaBench, ...)
• Integration into our dynamic optimization framework, YAWARA
Current dynamic optimization projects
• Computation-oriented:
  – YAWARA: a meta-level optimizing computer system
  – HAGANE: binary-level multithreading
• Communication-oriented:
  – Spec-All: an aggressive read/write access speculation method for DSM systems
  – Cross-Line: an adaptive router using dynamic information
HAGANE: Binary-Level Multithreading
Background
• Multithreaded programming is not so easy.
  → automatic multithreading systems
However...
• Source code is not always available.
  → multithreading at the binary level
Binary Translator & Optimizer System
[Figure: the source binary code passes through the STO (Static Translator & Optimizer), which uses analysis information and an execution profile to emit statically translated multithreaded binary code. At run time, the DTO (Dynamic Translator & Optimizer) uses execution profile information gathered from the process memory image to produce dynamically translated multithreaded binary code for the multithread processor.]
Thread Pipelining Model
• Loop iterations are mapped onto threads.
• Each thread i executes four stages in order: Continuation, TSAG (Target Store Address Generation), Computation, and Write-back; successive threads i+1, i+2, ... execute the same stages, overlapped in pipeline fashion.
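The stage overlap above can be sketched with a toy schedule. This is our own simplification, assuming unit-time stages and that thread i+1 launches right after thread i's Continuation stage; the function and names are ours, not part of HAGANE.

```python
# Toy schedule for the thread pipelining model: each thread runs
# Continuation -> TSAG -> Computation -> Write-back, and thread i+1
# starts one step after thread i (after its continuation stage).

STAGES = ["Continuation", "TSAG", "Computation", "Write-back"]

def schedule(num_threads):
    # table[(i, stage)] = time slot in which thread i runs that stage
    table = {}
    for i in range(num_threads):
        start = i  # successor launched after predecessor's continuation
        for s, stage in enumerate(STAGES):
            table[(i, stage)] = start + s
    return table

t = schedule(3)
# Thread 1's Continuation overlaps thread 0's TSAG:
assert t[(1, "Continuation")] == t[(0, "TSAG")]
```

Under these assumptions n threads finish in n + 3 time slots instead of 4n run sequentially, which is the point of mapping loop iterations onto pipelined threads.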
Example translation
• thread management instructions
• overhead code for multithreading

Source binary code:
    mtc1  $zero[0],$f4
    addu  $v1[3],$zero[0],$zero[0]
$BB1:
    l.s   $f0,0($a0[4])
    l.s   $f2,0($a1[5])
    mul.s $f0,$f0,$f2
    addiu $v1[3],$v1[3],1
    add.s $f4,$f4,$f0
    slti  $v0[2],$v1[3],5000
    addiu $a1[5],$a1[5],4
    addiu $a0[4],$a0[4],4
    bne   $v0[2],$zero[0],$BB1
$BB2:
    mov.s $f0,$f4
    jr    $ra[31]

Translated code (Cont. / TSAG / Comp. / W.B. stages):
    mtc1  $zero[0],$f4
    addu  $v1[3],$zero[0],$zero[0]
    bstr
    slti  $v0[2],$v1[3],5000
    beq   $v0[2],$zero[0],$ST_LL0
    addu  $t0[8],$a0[4],$zero[0]
    addu  $t1[9],$a1[5],$zero[0]
    addi  $v1[3],$v1[3],1
    addi  $a0[4],$a0[4],4
    addi  $a1[5],$a1[5],4
    lfrk
    wtsagd
    addu  $t2[10],$sp[28],$zero[0]
    altsw $t2[10]
    tsagd
    l.s   $f0,0($t0[8])
    l.s   $f2,0($t1[9])
    l.s   $f4,0($t2[10])
    mul.s $f0,$f0,$f2
    add.s $f4,$f4,$f0
    sttsw $t2[10],$f4
$ST_LL0:
    estr
    mov.s $f0,$f4
    jr    $ra[31]
Superthreaded Architecture
[Figure: multiple thread processing units share an L1 instruction cache and an L1 data cache; each unit contains an execution unit, a communication unit, a memory buffer, and a write-back unit.]
m88ksim (SPECint95)
[Figure: speedup ratio (0-3) for 4, 8, and 16 thread units, with no unrolling and unroll factors 4, 8, and 16.]
• poor speedup ratios
• loop unrolling does not affect the performance
• the number of iterations is quite small
ijpeg (SPECint95)
[Figure: speedup ratio (0-5) for 4, 8, and 16 thread units, with no unrolling and unroll factors 4, 8, and 16.]
• the thread code size is too small to hide the thread management overhead
• loop unrolling is effective in achieving good speedup ratios
• excessive loop unrolling causes performance degradation
• the number of iterations is not so large
swim (SPECfp95)
[Figure: speedup ratio (0-11) for 4, 8, and 16 thread units, with no unrolling and unroll factors 4, 8, and 16.]
• good speedup ratios
• loop unrolling is effective in achieving linear speedup
• the number of iterations is large
Conclusion: HAGANE
• We evaluated binary-level multithreading using several SPEC95 benchmark programs.
• The performance evaluation results indicate:
  – the thread code size should be large enough to improve performance
  – loop unrolling is effective for small loop bodies
  – excessive loop unrolling degrades performance
A Methodology of Binary-Level Variable Analysis for Multithreading
HAGANE@PDCS2004
Background and Objective
Usually, loop iterations are interrelated through memory variables, such as induction variables. However, it is difficult to analyze this kind of dependency at the binary level. A binary-level variable analysis method is therefore strongly required for binary-level multithreading.
Example Binary Code

Source:
    for (i = 1; i < N; i++) {
        z = i * 2;
        x = a[i-1];
        y = x * 3;
        a[i] = z + y;
    }

Binary code:
    lw    $a1[5], 16($s8[30])
    lw    $v1[3], 16($s8[30])
    lw    $a0[4], 16($s8[30])
    sll   $v1[3], $v1[3], 0x2
    addu  $v1[3], $v1[3], $a2[6]
    lw    $v0[2], 16($s8[30])
    lw    $v1[3], -4($v1[3])
    addiu $v0[2], $v0[2], 1
    sw    $v0[2], 16($s8[30])
    lw    $v0[2], 16($s8[30])
    sll   $a1[5], $a1[5], 0x1
    sll   $a0[4], $a0[4], 0x2
    sll   $v0[2], $v1[3], 0x1
    addu  $v0[2], $v0[2], $v1[3]
    lw    $v1[3], 16($s8[30])
    addu  $a0[4], $a0[4], $a2[6]
    addu  $a1[5], $a1[5], $v0[2]
    sw    $a1[5], 0($a0[4])
    slt   $v1[3], $v1[3], $a3[7]

[Figure: thread j executes i++, load a[i-1] (at -4($v1[3])), and store a[i] (at 0($a0[4])); thread j+1 repeats the same pattern, so thread j's store a[i] feeds thread j+1's load a[i-1] across iterations.]
Binary-Level Variable Analysis
(1) Register values are analyzed using data-flow trees.
(2) When the register values used for memory references are judged to be the same, the memory location is regarded as a virtual register.
(3) Using the virtual registers, steps (1) and (2) are repeated.
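The three steps above amount to an iterative fixpoint over memory references. The sketch below is our own simplified rendering of that idea (the representation and names are hypothetical): memory references carry a normalized symbolic address, equal addresses are promoted to a shared virtual register, and in a full analysis the register-value pass would then rerun with those virtual registers.

```python
# Simplified sketch of binary-level variable analysis: references whose
# normalized address expressions compare equal are promoted to the same
# virtual register, iterating until no new virtual registers appear.

def analyze(refs):
    # refs: list of (kind, addr_expr), kind in {"load", "store"},
    # addr_expr a hashable normalized expression, e.g. ("+", "$s8", 16).
    virtual_regs = {}
    changed = True
    while changed:                       # steps (1)-(3): iterate to a fixpoint
        changed = False
        for kind, addr in refs:
            if addr not in virtual_regs:
                virtual_regs[addr] = f"$V{len(virtual_regs)}"
                changed = True
        # a fuller analysis would now re-derive register values using the
        # new virtual registers and re-normalize the address expressions
    return virtual_regs

# the sw/lw pair to 16($s8) in the example maps to one virtual register:
vr = analyze([("load", ("+", "$s8", 16)), ("store", ("+", "$s8", 16))])
assert len(vr) == 1
```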
Construction of the Dataflow Tree
    addiu $29#1, $29#0, -8
    sw    $0, 0($29#1)
    addu  $5#1, $0, $0
    lw    $2#1, 0($29#1)
    addu  $3#1, $5#1, $4#0
    addiu $5#2, $5#1, 1
    addu  $2#2, $2#1, $3#1
    sw    $2#2, 0($29#1)
    slti  $2#3, $5#2, 100
    bne   $2#3, $0, L1
[Figure: the dataflow tree for $2#2: $2#2 = $2#1 + $3#1, where $3#1 = $5#1 + $4#0 and $5#1 = $0 + $0.]
Example Normalization
[Figure: the dataflow tree $2#2 = (($4#0 + 0) << 2) + 14 is rewritten into the normalized form $2#2 = $4#0 * 4 + 14.]
Detection of Loop Induction Variables
A loop induction variable is a register that
– has an inter-iteration dependency, and
– increases by a fixed value between iterations (e.g. $V2#2 = $V2#1 + 1).
The concept of a virtual register makes it possible to detect induction variables held in memory.
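The induction-variable test above can be sketched directly: a (virtual) register is inductive when its next-iteration value is its current value plus a fixed constant. The encoding of defs and the version-numbering convention below are our own illustrative choices.

```python
# Sketch of induction-variable detection: a def (dest, src, inc) meaning
# dest = src + inc is inductive when dest is the next SSA-style version
# of the same base register as src (an inter-iteration add by a constant).

def induction_variables(defs):
    result = {}
    for dest, src, inc in defs:
        base_d, _, ver_d = dest.partition("#")
        base_s, _, ver_s = src.partition("#")
        # same base register, consecutive versions -> induction variable
        if base_d == base_s and int(ver_d) == int(ver_s) + 1:
            result[base_d] = inc
    return result

# the slide's example, $V2#2 = $V2#1 + 1, on a virtual register:
ivs = induction_variables([("$V2#2", "$V2#1", 1)])
assert ivs == {"$V2": 1}
```

Because the defs may name virtual registers, this test also catches induction variables that live in memory, which is exactly what the virtual-register concept buys.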
Application
• 101.tomcatv from the SPECfp95 benchmark suite
• Fortran-to-C translator ver. 19940927
• GCC cross compiler ver. 2.7.2.3 for SIMCA
• data set: test
• the six innermost loops (#1-#6) are selected
• they have induction variables held in memory
Speedup Ratios

Loop:          #1      #2      #3      #4      #5      #6      ALL
Speedup ratio: 9.804   1.643   5.178   1.800   3.583   2.611   5.361
Conclusion: Binary-Level Variable Analysis
• We proposed a binary-level variable analysis method.
• The method makes it possible to detect induction variables and their increment/decrement values.
• The detected information allows us to multithread binary codes that could not be multithreaded without our algorithm.
• We attained speedups of up to 9.8x through this multithreading.
Summary
• dynamic optimization projects at our laboratory
• the results quantify the performance improvement in each project
What's the next step for computer architecture research?
• from performance to reliability, or to low power?
  e.g. dependable computing
• architectures for new device technologies?
  e.g. quantum computing
However, if we stick to conventional high-performance computing research, what is the promising way forward?