Utsunomiya University · Seminar @ UW-Madison · November 12, 2004
Profile-Based Dynamic Optimization Research for Future Computer Systems
Takanobu Baba
Department of Information Science
Utsunomiya University, Japan
http://aquila.is.utsunomiya-u.ac.jp
November 12, 2004
Brief history of ‘my’ research
• 1970s: The MPG System, a machine-independent efficient microprogram generator
• 1980s: MUNAP, a two-level microprogrammed multiprocessor computer
• 1990s: A-NET, a language-architecture integrated approach for parallel object-oriented computation
A Two-Level Microprogrammed Multiprocessor Computer: MUNAP
A 28-bit vertical microinstruction activates up to 4 nanoprograms in 4 PUs every machine cycle.
[Figure: the MUNAP machine]
A Parallel Object-Oriented Total Architecture: A-NET (Actors-NETwork)
• Massively parallel computation
• Each node consists of a PE and a router.
• The PE has a language-oriented, typical CISC architecture.
• The programmable router is topology-independent.
[Figure: the A-NET multicomputer]
Current dynamic optimization projects
• Computation-oriented:
  – YAWARA: a meta-level optimizing computer system
  – HAGANE: binary-level multithreading
• Communication-oriented:
  – Spec-All: an aggressive read/write access speculation method for DSM systems
  – Cross-Line: an adaptive router using dynamic information
YAWARA: A Meta-Level Optimizing Computer System
Background
• Moore's Law will be sustained by semiconductor technology.
• How can we use this huge number of transistors to speed up program execution?
• Our idea is to devote some chip area to dynamically and autonomously tuning the configuration of an on-chip multiprocessor.
[Figure: the base-level/meta-level organization. The base-level processor executes instructions and data from memory and produces the results of computation; the meta-level processor receives a profile of control and data from the base level and feeds back the results of optimization.]
Design considerations
• HW vs. SW reconfiguration → SW reconfiguration
• Static vs. dynamic reconfiguration → both static and dynamic reconfiguration capabilities
• Homogeneous vs. heterogeneous architecture → a unified homogeneous structure
Basic concepts of thread-level reconfiguration
MT: Management Thread, PT: Profiling Thread, OT: Optimizing Thread, CT: Computing Thread
[Figure: the meta level runs one MT together with PTs and OTs for profiling and optimization; the base level runs the application's CTs. All threads share memory.]
Execution model
[Figure: two execution styles. Profiling-centric: the Management Thread (MT) activates a Profiling Thread (PT) that collects a profile alongside the Computing Thread (CT); when the optimization-initiation condition is satisfied, the Optimizing Thread (OT) wakes up, optimizes, and goes back to sleep. Computing-centric: the CT itself collects the profile, and the PT and OT are activated only once the optimization-initiation condition is satisfied.]
Change of configurations by meta-level optimization
[Figure: three snapshots of the thread-engine grid. The mix of MT, OT, PT, and CT instances on the meta level and base level shifts over time; as optimization proceeds, fewer engines are devoted to profiling and optimizing threads and more to computing threads.]
The YAWARA System
• an implementation of the computation model
• the SW system consists of static and dynamic optimization systems
• the HW system comprises uniformly structured thread engines (TEs); each TE can execute both base- and meta-level threads
The spirit of YAWARA: "A flexible method prevails where a rigid one fails."
Software System
[Figure: source code (C/C++, Java, Fortran, ...) enters the SOS (Static Optimization System), which combines code-analysis information with an execution profile (static feedback) to produce an executable image. At run time, the DOS (Dynamic Optimization System) uses the run-time profile (dynamic feedback) while the image runs on the thread engines (TEs), yielding the execution results.]
Hardware System
[Figure: a 4x4 grid of thread engines (TEs) connected by a network. Each TE contains execution control, a register file (INT x 4 + FP x 1), I$ and D$ caches, a thread-data cache, a thread-code cache holding threads 0..N, a profiling buffer with a profiling controller, feedback-directed resource control, and network-IN/network-OUT ports to and from the network.]
Example application: compress
[Figure: control-flow graphs showing a hot loop (basic blocks 8, 9, 11, 12, 13, 21, 22), the hot paths through it, and the program's phased behavior.]
• hot loop / hot path detection (PT, OT)
• speculative multithreading profiling (PT)
• speculative multithreading code generation, helper thread generation, and path predictor generation (OT)
• speculative multithreading using the path prediction mechanism (CT) under a management thread (MT): on a hit, the speculative thread for hot path #0 continues; on a miss of #1, execution switches paths.
Conclusion: YAWARA
• We proposed an autonomous reconfiguration mechanism based on dynamic program behavior.
• We also proposed a software and hardware system, called YAWARA, that implements the reconfiguration efficiently.
• We are now developing the software system and the simulator.
Prediction and Execution Methods of Frequently Executed Two Paths for Speculative Multithreading
YAWARA@PDCS2004
Occurrence ratios of the top-two paths

loop                 #1 path   #2 path   other paths
compress/compress    54.5%     22.4%     23.1%
ijpeg/forward_DCT    48.2%     42.1%     9.7%
m88ksim/killtime     97.0%     3.0%      0.0%
li/sweep             80.7%     19.3%     0.0%

The top two paths occupy 80-100% of execution.
Two-level path prediction
• Introducing two-level branch prediction:
  – a history register keeps the sequence of #1-path executions (1: the #1 path, 0: any other path)
  – a counter table counts #1-path executions
[Figure: Single Path Predictor (SPP). A 4-bit history register (e.g. 1101) indexes a counter table v0..v15; with threshold X, predict #1 if v13 >= X, otherwise predict #2.]
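The SPP scheme above can be sketched in a few lines. This is our own minimal model, not the authors' implementation: it assumes saturating counters indexed by a k-bit history of #1-path outcomes, and the class and parameter names are ours.

```python
# Minimal sketch of the Single Path Predictor (SPP): a history register
# of #1-path outcomes (1 = #1 path, 0 = any other path) indexes a
# counter table; predict #1 when the counter reaches the threshold X.

class SinglePathPredictor:
    def __init__(self, history_len=4, threshold=2, cmax=3):
        self.history_len = history_len
        self.threshold = threshold            # X in the slide
        self.cmax = cmax                      # saturating-counter maximum
        self.history = 0                      # history register
        self.table = [0] * (1 << history_len) # counter table v0..v(2^k - 1)

    def predict(self):
        # Predict path #1 iff the counter selected by the current
        # history pattern has reached the threshold.
        return 1 if self.table[self.history] >= self.threshold else 2

    def update(self, took_path1):
        # Train the selected counter, then shift the outcome into history.
        c = self.table[self.history]
        self.table[self.history] = min(c + 1, self.cmax) if took_path1 else max(c - 1, 0)
        mask = (1 << self.history_len) - 1
        self.history = ((self.history << 1) | (1 if took_path1 else 0)) & mask
```

With an all-#1 path stream the predictor warms up quickly: starting cold it predicts #2, and after a few #1 outcomes the trained counter pushes it to predict #1.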
Another path predictor
[Figure: Dual Path Predictor (DPP). A #1-path history register (e.g. 1101) indexes a #1-path counter table v0..v15, while a #2-path history register (e.g. 0010) indexes a #2-path counter table; predict #1 if v13 >= v2, otherwise predict #2.]
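The DPP can be sketched analogously: two history registers and two counter tables, one per path, with the prediction decided by comparing the two selected counters. Again this is our own illustrative model with invented names, not the authors' code.

```python
# Minimal sketch of the Dual Path Predictor (DPP): track #1-path and
# #2-path outcomes separately; predict #1 when its selected counter is
# at least as large as the #2 path's (v1 >= v2, as in the slide).

class DualPathPredictor:
    def __init__(self, history_len=4, cmax=7):
        self.hl = history_len
        self.cmax = cmax
        self.hist = [0, 0]                                  # histories for #1, #2
        self.table = [[0] * (1 << history_len) for _ in range(2)]

    def predict(self):
        v1 = self.table[0][self.hist[0]]
        v2 = self.table[1][self.hist[1]]
        return 1 if v1 >= v2 else 2

    def update(self, taken_path):
        # taken_path: 1, 2, or 0 for "some other path".
        mask = (1 << self.hl) - 1
        for p in (0, 1):
            hit = (taken_path == p + 1)
            c = self.table[p][self.hist[p]]
            self.table[p][self.hist[p]] = min(c + 1, self.cmax) if hit else max(c - 1, 0)
            self.hist[p] = ((self.hist[p] << 1) | (1 if hit else 0)) & mask
```

Unlike the SPP's fixed threshold X, the DPP adapts: whichever path has been winning under its own recent history wins the prediction.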
Single Speculation (SS)
When a thread fails:
• abort the succeeding threads
• run the recovery process
• execute the thread non-speculatively
• then continue speculative execution along the #1 path
Speculation failure degrades performance.
Double Speculation (DS)
• Even when the first speculation fails, the secondary choice has a high probability of success, because the top two paths are dominant.

loop                 #1 path   #2 path   expected #2 hit
compress/compress    54.5%     22.4%     49.2%
ijpeg/forward_DCT    48.2%     42.1%     81.3%
m88ksim/killtime     97.0%     3.0%      100%
li/sweep             80.7%     19.3%     100%
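The "expected #2 hit" figures follow from simple conditioning, which we can reproduce: given that the #1 path was not taken, the #2 path runs with probability p2 / (1 - p1). The snippet below is our own arithmetic check, not the authors' tool.

```python
# Reproduce the expected #2-hit ratios from the top-two path
# occurrence ratios: conditioned on a #1-path miss, the #2 path
# hits with probability p2 / (1 - p1).

paths = {                      # (p1, p2): top-two path occurrence ratios
    "compress/compress": (0.545, 0.224),
    "ijpeg/forward_DCT": (0.482, 0.421),
    "m88ksim/killtime":  (0.970, 0.030),
    "li/sweep":          (0.807, 0.193),
}

for name, (p1, p2) in paths.items():
    hit = 100 * p2 / (1 - p1)
    print(f"{name}: expected #2 hit = {hit:.1f}%")
```

For compress, for example, 22.4 / (100 - 54.5) gives the 49.2% shown above.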
Double Speculation (DS)
If the secondary speculation succeeds, the performance loss is not so large.
[Figure: on a #1-path miss, instead of falling back to non-speculative execution, a secondary speculative thread is launched on the #2 path after the recovery process; speculative execution then continues along the #1 path.]
Evaluation flow
• hot-path detection (SIMCA)
• path-history acquisition (SIMCA) → path execution history
• thread-code generation → thread codes: the #1-path speculative thread, the #2-path speculative thread, and the non-speculative thread
• performance estimator → speculation hit ratio and speed-up ratio
Prediction success ratio
[Figure: prediction success ratio (%) versus history length (1-16) for SPP and DPP, on compress and forward_DCT.]
Prediction success ratio
[Figure: prediction success ratio (%) versus history length (1-16) for SPP and DPP, on killtime and sweep.]
Speed-up ratio
[Figure: speed-up ratio versus history length (1-16) for SS and DS on compress and forward_DCT, with S100 and #1-path-only reference points.]
Speed-up ratio
[Figure: speed-up ratio versus history length (1-16) for SS and DS on killtime and sweep, with S100 and #1-path-only reference points.]
Conclusions: Two-Path-Limited Speculative Multithreading
• We proposed a path prediction method, the predictors, and speculation methods for path-based speculative multithreading.
• Preliminary performance estimation results were shown.
Current and future work
• Accurate and detailed evaluation on various applications (SPEC 2000, MediaBench, ...)
• Integration into our dynamic optimization framework, YAWARA
Current dynamic optimization projects
• Computation-oriented:
  – YAWARA: a meta-level optimizing computer system
  – HAGANE: binary-level multithreading
• Communication-oriented:
  – Spec-All: an aggressive read/write access speculation method for DSM systems
  – Cross-Line: an adaptive router using dynamic information
HAGANE: Binary-Level Multithreading
Background
• Multithreaded programming is not so easy.
  → automatic multithreading systems
However...
• Source code is not always available.
  → multithreading at the binary level
Binary Translator & Optimizer System
[Figure: the source binary code passes through the STO (Static Translator & Optimizer), which uses analysis information and an execution profile to emit statically translated multithreaded binary code. At run time, the DTO (Dynamic Translator & Optimizer) uses execution profile information gathered from the process memory image to produce dynamically translated multithreaded binary code for the multithread processor.]
Thread Pipelining Model
• Loop iterations are mapped onto threads.
• Each thread i executes four stages in order: Continuation, TSAG (Target Store Address Generation), Computation, and Write-back; successive threads i+1, i+2, ... execute the same stages, overlapped in pipeline fashion.
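The stage overlap above can be sketched with a toy schedule. This is our own simplification, assuming unit-time stages and that thread i+1 launches right after thread i's Continuation stage; the function and names are ours, not part of HAGANE.

```python
# Toy schedule for the thread pipelining model: each thread runs
# Continuation -> TSAG -> Computation -> Write-back, and thread i+1
# starts one step after thread i (after its continuation stage).

STAGES = ["Continuation", "TSAG", "Computation", "Write-back"]

def schedule(num_threads):
    # table[(i, stage)] = time slot in which thread i runs that stage
    table = {}
    for i in range(num_threads):
        start = i  # successor launched after predecessor's continuation
        for s, stage in enumerate(STAGES):
            table[(i, stage)] = start + s
    return table

t = schedule(3)
# Thread 1's Continuation overlaps thread 0's TSAG:
assert t[(1, "Continuation")] == t[(0, "TSAG")]
```

Under these assumptions n threads finish in n + 3 time slots instead of 4n run sequentially, which is the point of mapping loop iterations onto pipelined threads.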
Example translation
• thread management instructions
• overhead code for multithreading

Source binary code:
    mtc1  $zero[0],$f4
    addu  $v1[3],$zero[0],$zero[0]
$BB1:
    l.s   $f0,0($a0[4])
    l.s   $f2,0($a1[5])
    mul.s $f0,$f0,$f2
    addiu $v1[3],$v1[3],1
    add.s $f4,$f4,$f0
    slti  $v0[2],$v1[3],5000
    addiu $a1[5],$a1[5],4
    addiu $a0[4],$a0[4],4
    bne   $v0[2],$zero[0],$BB1
$BB2:
    mov.s $f0,$f4
    jr    $ra[31]

Translated code (Cont. / TSAG / Comp. / W.B. stages):
    mtc1  $zero[0],$f4
    addu  $v1[3],$zero[0],$zero[0]
    bstr
    slti  $v0[2],$v1[3],5000
    beq   $v0[2],$zero[0],$ST_LL0
    addu  $t0[8],$a0[4],$zero[0]
    addu  $t1[9],$a1[5],$zero[0]
    addi  $v1[3],$v1[3],1
    addi  $a0[4],$a0[4],4
    addi  $a1[5],$a1[5],4
    lfrk
    wtsagd
    addu  $t2[10],$sp[28],$zero[0]
    altsw $t2[10]
    tsagd
    l.s   $f0,0($t0[8])
    l.s   $f2,0($t1[9])
    l.s   $f4,0($t2[10])
    mul.s $f0,$f0,$f2
    add.s $f4,$f4,$f0
    sttsw $t2[10],$f4
$ST_LL0:
    estr
    mov.s $f0,$f4
    jr    $ra[31]
Superthreaded Architecture
[Figure: multiple thread processing units share an L1 instruction cache and an L1 data cache; each unit contains an execution unit, a communication unit, a memory buffer, and a write-back unit.]
m88ksim (SPECint95)
[Figure: speedup ratio (0-3) for 4, 8, and 16 thread units, with no unrolling and unroll factors 4, 8, and 16.]
• poor speedup ratios
• loop unrolling does not affect the performance
• the number of iterations is quite small
ijpeg (SPECint95)
[Figure: speedup ratio (0-5) for 4, 8, and 16 thread units, with no unrolling and unroll factors 4, 8, and 16.]
• the thread code size is too small to hide the thread management overhead
• loop unrolling is effective in achieving good speedup ratios
• excessive loop unrolling causes performance degradation
• the number of iterations is not so large
swim (SPECfp95)
[Figure: speedup ratio (0-11) for 4, 8, and 16 thread units, with no unrolling and unroll factors 4, 8, and 16.]
• good speedup ratios
• loop unrolling is effective in achieving linear speedup
• the number of iterations is large
Conclusion: HAGANE
• We evaluated binary-level multithreading using several SPEC95 benchmark programs.
• The performance evaluation results indicate:
  – the thread code size should be large enough to improve performance
  – loop unrolling is effective for small loop bodies
  – excessive loop unrolling degrades performance
A Methodology of Binary-Level Variable Analysis for Multithreading
HAGANE@PDCS2004
Background and Objective
Usually, loop iterations are interrelated through memory variables, such as induction variables. However, it is difficult to analyze this kind of dependency at the binary level. A binary-level variable analysis method is therefore strongly required for binary-level multithreading.
Example Binary Code

Source:
    for (i = 1; i < N; i++) {
        z = i * 2;
        x = a[i-1];
        y = x * 3;
        a[i] = z + y;
    }

Binary code:
    lw    $a1[5], 16($s8[30])
    lw    $v1[3], 16($s8[30])
    lw    $a0[4], 16($s8[30])
    sll   $v1[3], $v1[3], 0x2
    addu  $v1[3], $v1[3], $a2[6]
    lw    $v0[2], 16($s8[30])
    lw    $v1[3], -4($v1[3])
    addiu $v0[2], $v0[2], 1
    sw    $v0[2], 16($s8[30])
    lw    $v0[2], 16($s8[30])
    sll   $a1[5], $a1[5], 0x1
    sll   $a0[4], $a0[4], 0x2
    sll   $v0[2], $v1[3], 0x1
    addu  $v0[2], $v0[2], $v1[3]
    lw    $v1[3], 16($s8[30])
    addu  $a0[4], $a0[4], $a2[6]
    addu  $a1[5], $a1[5], $v0[2]
    sw    $a1[5], 0($a0[4])
    slt   $v1[3], $v1[3], $a3[7]

[Figure: thread j executes i++, load a[i-1] (at -4($v1[3])), and store a[i] (at 0($a0[4])); thread j+1 repeats the same pattern, so thread j's store a[i] feeds thread j+1's load a[i-1] across iterations.]
Binary-Level Variable Analysis
(1) Register values are analyzed using data-flow trees.
(2) When the register values used for memory references are judged to be the same, the memory location is regarded as a virtual register.
(3) Using the virtual registers, steps (1) and (2) are repeated.
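The three steps above amount to an iterative fixpoint over memory references. The sketch below is our own simplified rendering of that idea (the representation and names are hypothetical): memory references carry a normalized symbolic address, equal addresses are promoted to a shared virtual register, and in a full analysis the register-value pass would then rerun with those virtual registers.

```python
# Simplified sketch of binary-level variable analysis: references whose
# normalized address expressions compare equal are promoted to the same
# virtual register, iterating until no new virtual registers appear.

def analyze(refs):
    # refs: list of (kind, addr_expr), kind in {"load", "store"},
    # addr_expr a hashable normalized expression, e.g. ("+", "$s8", 16).
    virtual_regs = {}
    changed = True
    while changed:                       # steps (1)-(3): iterate to a fixpoint
        changed = False
        for kind, addr in refs:
            if addr not in virtual_regs:
                virtual_regs[addr] = f"$V{len(virtual_regs)}"
                changed = True
        # a fuller analysis would now re-derive register values using the
        # new virtual registers and re-normalize the address expressions
    return virtual_regs

# the sw/lw pair to 16($s8) in the example maps to one virtual register:
vr = analyze([("load", ("+", "$s8", 16)), ("store", ("+", "$s8", 16))])
assert len(vr) == 1
```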
Construction of the Dataflow Tree
    addiu $29#1, $29#0, -8
    sw    $0, 0($29#1)
    addu  $5#1, $0, $0
    lw    $2#1, 0($29#1)
    addu  $3#1, $5#1, $4#0
    addiu $5#2, $5#1, 1
    addu  $2#2, $2#1, $3#1
    sw    $2#2, 0($29#1)
    slti  $2#3, $5#2, 100
    bne   $2#3, $0, L1
[Figure: the dataflow tree for $2#2: $2#2 = $2#1 + $3#1, where $3#1 = $5#1 + $4#0 and $5#1 = $0 + $0.]
Example Normalization
[Figure: the dataflow tree $2#2 = (($4#0 + 0) << 2) + 14 is rewritten into the normalized form $2#2 = $4#0 * 4 + 14.]
Detection of Loop Induction Variables
A loop induction variable is a register that
– has an inter-iteration dependency, and
– increases by a fixed value between iterations (e.g. $V2#2 = $V2#1 + 1).
The concept of a virtual register makes it possible to detect induction variables held in memory.
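The induction-variable test above can be sketched directly: a (virtual) register is inductive when its next-iteration value is its current value plus a fixed constant. The encoding of defs and the version-numbering convention below are our own illustrative choices.

```python
# Sketch of induction-variable detection: a def (dest, src, inc) meaning
# dest = src + inc is inductive when dest is the next SSA-style version
# of the same base register as src (an inter-iteration add by a constant).

def induction_variables(defs):
    result = {}
    for dest, src, inc in defs:
        base_d, _, ver_d = dest.partition("#")
        base_s, _, ver_s = src.partition("#")
        # same base register, consecutive versions -> induction variable
        if base_d == base_s and int(ver_d) == int(ver_s) + 1:
            result[base_d] = inc
    return result

# the slide's example, $V2#2 = $V2#1 + 1, on a virtual register:
ivs = induction_variables([("$V2#2", "$V2#1", 1)])
assert ivs == {"$V2": 1}
```

Because the defs may name virtual registers, this test also catches induction variables that live in memory, which is exactly what the virtual-register concept buys.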
Application
• 101.tomcatv from the SPECfp95 benchmark suite
• Fortran-to-C translator ver. 19940927
• GCC cross compiler ver. 2.7.2.3 for SIMCA
• data set: test
• the six innermost loops (#1-#6) are selected
• they have induction variables held in memory
Speedup Ratios

Loop:          #1      #2      #3      #4      #5      #6      ALL
Speedup ratio: 9.804   1.643   5.178   1.800   3.583   2.611   5.361
Conclusion: Binary-Level Variable Analysis
• We proposed a binary-level variable analysis method.
• The method makes it possible to detect induction variables and their increment/decrement values.
• The detected information allows us to multithread binary codes that could not be multithreaded without our algorithm.
• We attained speedups of up to 9.8x through this multithreading.
Summary
• dynamic optimization projects at our laboratory
• the results quantify the performance improvement in each project
What's the next step for computer architecture research?
• from performance to reliability, or to low power?
  e.g. dependable computing
• architectures for new device technologies?
  e.g. quantum computing
However, if we stick to conventional high-performance computing research, what is the promising way forward?