General Overview of An Adaptive Dynamic Extensible Processor Hamid Noori, Kazuaki Murakami, Koji Inoue & Victor Goulart Kyushu University Department of Informatics Workshop on Introspective Architecture (WISA06)

Page 1

General Overview of An Adaptive Dynamic Extensible Processor

Hamid Noori, Kazuaki Murakami, Koji Inoue & Victor Goulart

Kyushu University

Department of Informatics

Workshop on Introspective Architecture (WISA06)

Hamid Noori
Page 2

WISA06@Austin, Kyushu University

Agenda

• Background
• Research goal
• General overview of the architecture
  • Modes of operation
  • Profiler
  • Accelerator
  • Sequencer
• Generation of Custom Instructions
• Configuration Data for the Accelerator
• Experiments and Results
• Conclusions & Future work

Page 3

Background

                         GPP   ASIC  ASIP  Ext. Proc.  Our Proc.
Power consumption        ×     ◎     ◎     ○           ○
Performance (Specific)   ×     ◎     ○     ○           ○
Performance (General)    ○     ×     ×     ×           ○
Flexibility              ◎     ×     ×     ×           ◎
Design time              ○     ×     ×     △           ○
Design cost              ○     ×     △     △           ○
Programmability          ◎     ×     ◎     ○           ◎
Productivity             ◎     ×     △     △           ◎

Page 4

Some definitions

• Hot Basic Block (HBB): a basic block whose execution frequency is greater than a given threshold specified in the profiler
• Custom Instructions (CIs): extensions to the Instruction Set Architecture (ISA) that are executed on the ACC
• Accelerator (ACC): custom hardware for executing CIs
• Training mode: operation mode for detecting HBBs and generating CIs
• Normal mode: normal operation mode in which CIs are executed on the ACC

Page 5

Research Goal

• Proposal of an Adaptive Dynamic Extensible Processor for Embedded Systems
  • Custom instructions are adaptable to the applications
  • Custom instructions are detected and created during execution/training
  • Generation of custom instructions is done transparently and automatically
• Advantages of the novel approach
  • Higher performance than GPPs
  • Higher flexibility compared to Extensible Processors
  • Shorter TAT and lower design and verification cost compared to ASIPs and Extensible Processors

Page 6

General overview of the architecture

[Figure: block diagram of the Adaptive Dynamic Extensible Processor]

• Base Processor (N-way in-order general RISC): Reg File, Fetch, Decode, Execute, Memory, Write
• Augmented Hardware
  • Profiler: detects start addresses of Hot Basic Blocks (HBBs)
  • ACC: executes Custom Instructions
  • Sequencer: switches between main processor and ACC

Page 7

General overview of the architecture

Modes of operation

• Training mode
  • Profiling
  • Detecting start address of Hot Basic Blocks (HBBs)
  • Generating Custom Instructions
  • Generating Configuration Data for the ACC
  • Binary rewriting
  • Initializing the Sequencer Table
  ♦ Online
    • Needs simple hardware for profiling
    • All tasks are run on the base processor
  ♦ Offline
    • Needs a PC trace after taken branches/jumps
• Normal mode
  • Profiling (optional)
  • Executing Custom Instructions on the ACC and other parts of the code on the base processor

Page 8

Components

[Figure: component diagram for online training, divided into GPP and Augmented HW. Block labels: Register File, ID/EXE Reg, Accelerator, Multi-Context Memory, Cache, Functional Unit, Mux, Sequencer, Sequencer Table, EXE/MEM Reg, Profiler, DMA, Profiler Table (HWT).]

Page 9

Operation modes

[Figure: three panels showing Applications running on the Processor together with the Profiler, ACC and Sequencer.]

• Training mode (profiling): the Profiler performs binary-level profiling and detects the start addresses of HBBs
• Training mode (tools): running tools for generating Custom Instructions, generating Configuration Data for the ACC, initializing the Sequencer Table, and binary rewriting
• Normal mode: the Sequencer monitors the PC and switches between main processor and ACC; the ACC executes CIs

Page 10

Profiler

[Figure: flowchart of the profiler]

• Compare Current PC with Previous PC
• If the difference is greater than the instruction length: is Current PC in the Profiler Table?
  • No: add it as a new entry and set the counter to one
  • Yes: increment the counter
• Otherwise: nothing
• Profiler Table columns: Basic Block Start Addr (BBSA), Counter

After a taken branch or jump we look up the BBSA column to see whether the target PC is in the table. On a miss we add this address and initialize its counter to 1; otherwise we increment its counter.
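The profiler flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `ProfilerTable`, the method `observe`, and the 8-byte `INSTR_LEN` (chosen to match the 8-byte address steps in the listings of this deck) are hypothetical, and any non-sequential PC step is treated as a taken branch or jump.

```python
from typing import Dict, Optional

# Assumption: 8 bytes per instruction, matching the address steps in the listings.
INSTR_LEN = 8


class ProfilerTable:
    """Hypothetical software model of the hardware profiler table (BBSA -> counter)."""

    def __init__(self) -> None:
        self.counters: Dict[int, int] = {}
        self.prev_pc: Optional[int] = None

    def observe(self, pc: int) -> None:
        """Feed one program-counter value per executed instruction."""
        prev = self.prev_pc
        self.prev_pc = pc
        # Sequential execution (or the very first PC): nothing to record.
        if prev is None or pc - prev == INSTR_LEN:
            return
        # Taken branch/jump: a miss creates the entry with counter 1,
        # a hit increments the existing counter.
        self.counters[pc] = self.counters.get(pc, 0) + 1


profiler = ProfilerTable()
trace = [0x400D10, 0x400D18, 0x400DB8, 0x400DC0, 0x400D10, 0x400D18, 0x400DB8]
for pc in trace:
    profiler.observe(pc)
```

In this made-up trace the block starting at `0x400db8` is entered twice through a taken branch and the block at `0x400d10` once, so only those two addresses appear in the table.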

Page 11

Detecting Start Addr of HBBs

    400d10: addiu $29,$29,-8
    400d18: addu $8,$0,$4
    400d20: sw $0,0($29)
    400d28: addu $4,$0,$0
    400d30: addu $7,$0,$0
    400d38: lui $9,49152
    400d40: sll $4,$4,0x2
    400d48: and $2,$8,$9
    400d50: bne $2,$0,400db8 <usqrt+0xa8>
    400d58: srl $2,$2,0x1e
    400d60: lw $3,0($29)
    400d68: addu $4,$4,$2
    400d70: sll $8,$8,0x2
    400d78: sll $6,$3,0x1
    400d80: sll $3,$3,0x2
    400d88: addiu $3,$3,1
    400d90: sltu $2,$4,$3
    400d98: sw $6,0($29)

The instructions after the bne are labeled "Not taken part".

[Figure: entries of the Profiler Table (columns BBSA, Counter) pass the test "Counter > Threshold" to enter the HBB Table (columns HBBSA, Counter, BTA, Taken Freq, Exec Freq, subHot?). Values shown: Threshold = 100; 400d10 500; 400db8 500 X; 400db8 50; label "HBB".]
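The "Counter > Threshold" test can be illustrated as a simple filter over the profiler table. This is a minimal sketch: the function name `hot_basic_blocks` is hypothetical, and the table contents reuse two of the example values visible on the slide (counters 500 and 50, threshold 100).

```python
from typing import Dict


def hot_basic_blocks(profiler_table: Dict[int, int], threshold: int) -> Dict[int, int]:
    """Keep only the entries whose counter exceeds the threshold."""
    return {addr: count for addr, count in profiler_table.items() if count > threshold}


# Example values taken from the slide; the pairing of address and counter
# is an assumption about how the figure should be read.
profiler_table = {0x400D10: 500, 0x400DB8: 50}
hbbs = hot_basic_blocks(profiler_table, threshold=100)
```

With these values only the block starting at `0x400d10` qualifies as hot.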

Page 12

Size of Profiler Table

Number of basic blocks with Exec Freq more than Threshold

Exec Freq Threshold   128    256    512    1024   2048
adpcm (enc)           28     28     28     28     28
basicmath             126    125    121    120    118
cjpeg                 290    216    192    127    114
djpeg                 163    154    108    48     35
lame                  1109   978    929    852    537
dijkstra              117    116    103    101    101
patricia              290    290    255    228    216
blowfish              87     87     84     23     17
rijndael (enc)        107    107    106    37     37
sha                   73     73     61     17     13
crc                   37     37     36     36     36
fft                   68     68     65     65     65
gsm                   364    362    329    328    319

Page 13

Accelerator (ACC)

• ACC is a matrix of Functional Units (FUs)
• ACC has a two-level configuration memory
  • A multi-context memory (keeps two or four configurations)
  • A cache
• FUs support only logical operations, add/subtract, shifts and compare
• ACC updates the PC
• ACC has a variable delay, which depends on the size of the Custom Instruction

Page 14

Connecting ACC to the Base Processor

[Figure: block diagram. Labels: Decoder, DEC/EXE Pipeline Registers, FU1 to FU4, ACC, register file (Reg0 ... Reg31), Sequencer, EXE/MEM Pipeline Registers, Config Mem.]

Page 15

Connecting ACC to the Base Processor

[Figure: block diagram. Labels: Decoder, DEC/EXE Pipeline Registers, FU1 to FU4, ACC, register file (Reg0 ... Reg31), Sequencer, EXE/MEM Pipeline Registers, Config Mem.]

Page 16

Sequencer

• The sequencer mainly determines the microcode execution sequence
• Selects between decoder and config memory for reading the RF
• Selects between the output of the Functional Unit and the Accelerator
• Determines when to switch between different contexts of the multi-context memory
• Determines when to load configuration data from the cache to the multi-context memory
• Checks the configuration data of a custom instruction
  • If it is in the multi-context memory, the custom instruction is executed on the accelerator
  • If it is not in the multi-context memory
    ♦ If there is enough time to load it from the cache to the multi-context memory, the sequencer loads it and executes the CI on the ACC
    ♦ If there is not enough time, the original code is executed
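The configuration check described above amounts to a three-way decision. The sketch below is an assumption-laden illustration, not the sequencer's actual logic: `Action`, `sequencer_decide`, and the cycle-budget parameters `cycles_available` and `load_cycles` are hypothetical names, and "enough time" is modeled as a simple comparison of cycle counts.

```python
from enum import Enum
from typing import Set


class Action(Enum):
    RUN_ON_ACC = "run on ACC"
    LOAD_THEN_RUN_ON_ACC = "load config from cache, then run on ACC"
    RUN_ORIGINAL_CODE = "run original code on base processor"


def sequencer_decide(ci_id: str,
                     multi_context: Set[str],
                     cache: Set[str],
                     cycles_available: int,
                     load_cycles: int) -> Action:
    """Decide how a custom instruction is executed (hypothetical model)."""
    if ci_id in multi_context:
        return Action.RUN_ON_ACC
    # Not resident: load from the cache only if there is enough time.
    if ci_id in cache and cycles_available >= load_cycles:
        return Action.LOAD_THEN_RUN_ON_ACC
    # Otherwise fall back to the original instructions, so no stall is needed.
    return Action.RUN_ORIGINAL_CODE


resident = {"ci1"}
cached = {"ci1", "ci2"}
decision_hit = sequencer_decide("ci1", resident, cached, cycles_available=0, load_cycles=4)
decision_load = sequencer_decide("ci2", resident, cached, cycles_available=6, load_cycles=4)
decision_fallback = sequencer_decide("ci2", resident, cached, cycles_available=2, load_cycles=4)
```

The fallback branch is what allows the deck to claim that missing configuration data carries no penalty: the original code is still present and is simply executed as usual.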

Page 17

Generation of Custom Instructions

• Custom instructions
  • Exclude floating point, multiply, divide and load instructions
  • Include at most one STORE, at most one BRANCH/JUMP and all other fixed point instructions
• Simple algorithm for generating custom instructions
  • HBBs usually include 10~40 instructions for Mibench
  • The custom instruction generator is going to be executed on the base processor (in online training mode)

Page 18

Generating Custom Instructions

    4052c0 addiu $29,$29,-32
    4052c8 mov.d $f0,$f12
    4052d0 sw $18,24($29)
    4052d8 addu $18,$0,$6
    4052e0 sw $31,28($29)
    4052e8 sw $16,16($29)
    4052f0 mfc1 $16,$f0
    4052f8 mfc1 $17,$f1
    405300 srl $6,$17,0x14
    405308 andi $6,$6,2047
    405310 sltiu $2,$6,2047
    405318 addu $6,$6,$18
    405320 sltiu $2,$6,2047
    405328 lui $2,32783
    405330 and $17,$17,$2
    405338 andi $2,$6,2047
    405340 sll $2,$2,0x14
    405348 or $17,$17,$2
    405350 mtc1 $16,$f0
    405358 mtc1 $17,$f1
    405360 lw $31,28($29)
    405370 lw $16,16($29)
    405378 addiu $29,$29,32
    405380 jr $31

• Finding the biggest sequence of instructions in the HBB that can be executed on the ACC
• Moving instructions and appending supportable instructions to the head of the detected instruction sequence after checking flow dependency and anti-dependency
• Moving instructions and appending supportable instructions to the tail of the detected instruction sequence after checking flow dependency and anti-dependency
• Rewriting object code if instructions have been moved
• Moving instructions should not modify the logic of the application
• Custom instruction generation is done without considering any other constraints
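The first step, finding the biggest supportable instruction sequence, can be sketched as a sliding window over the mnemonics of an HBB. This is an illustrative sketch and not the authors' tool: the mnemonic sets below are small hand-picked assumptions rather than a complete instruction list, and the function name `longest_supported_run` is hypothetical. The rules applied are the ones stated on the slides: no floating point, multiply, divide or load, at most one store, and at most one branch/jump.

```python
from typing import List, Tuple

# Assumed, incomplete mnemonic sets for illustration only.
LOADS = {"lw", "lh", "lhu", "lb", "lbu"}
STORES = {"sw", "sh", "sb"}
BRANCHES = {"beq", "bne", "blez", "bgtz", "bltz", "bgez", "j", "jal", "jr", "jalr"}
MUL_DIV = {"mult", "multu", "div", "divu"}
FP_MOVES = {"mfc1", "mtc1", "lwc1", "swc1"}


def is_excluded(mnemonic: str) -> bool:
    """True for instructions the ACC cannot execute according to the slides."""
    return (mnemonic in LOADS or mnemonic in MUL_DIV or mnemonic in FP_MOVES
            or "." in mnemonic)  # e.g. mov.d, add.s: floating point formats


def longest_supported_run(mnemonics: List[str]) -> Tuple[int, int]:
    """Return the half-open index range [start, end) of the longest window
    with no excluded instruction, at most one store and at most one branch."""
    best = (0, 0)
    start = 0
    stores = branches = 0
    for i, m in enumerate(mnemonics):
        if is_excluded(m):
            # An unsupported instruction ends the current window.
            start, stores, branches = i + 1, 0, 0
            continue
        stores += m in STORES
        branches += m in BRANCHES
        # Shrink from the left until the store/branch limits hold again.
        while stores > 1 or branches > 1:
            dropped = mnemonics[start]
            stores -= dropped in STORES
            branches -= dropped in BRANCHES
            start += 1
        if i + 1 - start > best[1] - best[0]:
            best = (start, i + 1)
    return best


# Mnemonics of the listing on this page, in order.
hbb = ["addiu", "mov.d", "sw", "addu", "sw", "sw", "mfc1", "mfc1",
       "srl", "andi", "sltiu", "addu", "sltiu", "lui", "and", "andi",
       "sll", "or", "mtc1", "mtc1", "lw", "lw", "addiu", "jr"]
run = longest_supported_run(hbb)
```

For this listing the window covers the ten instructions from `srl` (405300) to `or` (405348). The later steps on the slide (moving neighboring supportable instructions to the head or tail) are not modeled here.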

Page 19

Generating Custom Instructions

• Block 3 (B3) is selected as the biggest instruction sequence that can be executed on the ACC
• Block 2 (B2) cannot be executed on the ACC
• Block 1 (B1) can be executed on the ACC
• If there is no flow dependency or anti-dependency between B1 and B2, exchange them
• The same is done for B4 and B5

[Figure: an HBB divided into blocks B1 (supported), B2 (not supported), B3 (supported), B4 (not supported) and B5 (supported), shown before and after B1 and B2 are exchanged so that B1 becomes adjacent to B3.]
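The exchange condition can be expressed with register read and write sets. The sketch below is a simplified illustration under assumptions: each instruction is modeled as a hypothetical `(destination, sources)` pair, memory dependencies are ignored, and only the two dependency kinds named on the slide are tested. A complete tool would likely also have to consider output (write-after-write) dependencies, which the slide does not mention.

```python
from typing import List, Optional, Set, Tuple

# (destination register or None, list of source registers)
Instr = Tuple[Optional[str], List[str]]

ZERO_REG = "$0"  # writes to $0 are discarded on MIPS-like ISAs, reads are constant


def _writes(block: List[Instr]) -> Set[str]:
    return {d for d, _ in block if d is not None and d != ZERO_REG}


def _reads(block: List[Instr]) -> Set[str]:
    return {s for _, srcs in block for s in srcs if s != ZERO_REG}


def can_exchange(first: List[Instr], second: List[Instr]) -> bool:
    """True if `second` may be moved in front of `first` (program order: first, second)."""
    flow = _writes(first) & _reads(second)   # second reads what first wrote
    anti = _reads(first) & _writes(second)   # second overwrites what first reads
    return not flow and not anti


# Hypothetical blocks built from instructions that appear in the listings of this deck.
b1 = [("$4", ["$4"])]                 # sll $4,$4,0x2
b2 = [("$f0", ["$f12"])]              # mov.d $f0,$f12
b_flow = [("$3", ["$4"])]             # reads $4, which b1 writes
b_anti = [("$4", ["$0", "$0"])]       # addu $4,$0,$0 overwrites b1's source

ok = can_exchange(b1, b2)
blocked_by_flow = can_exchange(b1, b_flow)
blocked_by_anti = can_exchange(b1, b_anti)
```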

Page 20

Example 1

    400d10: addiu $29,$29,-8
    400d18: addu $8,$0,$4
    400d20: sw $0,0($29)
    400d28: addu $4,$0,$0
    400d30: addu $7,$0,$0
    400d38: lui $9,49152
    400d40: sll $4,$4,0x2
    400d48: and $2,$8,$9
    400d50: srl $29,$2,0x1e
    400d58: lw $3,0($29)
    400d60: addu $4,$4,$3
    400d68: sll $8,$8,0x2
    400d70: sll $6,$3,0x1
    400d78: sll $3,$3,0x2
    400d80: addiu $3,$3,1
    400d88: sltu $2,$4,$3
    400d90: sw $6,0($29)
    400d98: bne $2,$0,400db8 <usqrt+0xa8>

[Figure annotations on the listing: Customized Instruction 1, Customized Instruction 2]

Page 21

Example 2 (rewriting obj code)

    400d10: addiu $29,$29,-8
    400d18: addu $8,$0,$4
    400d20: addu $7,$0,$0
    400d28: lui $9,49152
    400d30: sll $4,$4,0x2
    400d38: and $2,$8,$9
    400d40: srl $2,$2,0x1e
    400d48: lw $22,0($29)
    400d50: addu $4,$4,$2
    400d58: sll $8,$8,0x2
    400d60: sll $6,$3,0x1
    400d68: sll $3,$3,0x2
    400d70: sltu $2,$4,$3
    400d78: bne $2,$0,400db8 <usqrt+0xa8>

Page 22

ACC Config Data Generation Flow

[Figure: flow diagram. Mibench applications run on Simplescalar (PISA configuration); the Profiler on the Base Processor detects the start addresses of HBBs; HBBs are read from the object code; a DFG is built for each custom instruction and mapped onto the ACC.]

A Custom Instruction

    1: SUBU R3, R0, R3
    2: ADDU R10, R0, R0
    3: SRA R8, R10, 0x3
    4: SLT R2, R3, R8
    5: BNE R0, 400488, R2

[Figure: Data Flow Graph of instructions 1 to 5 (nodes SUBU, ADDU, SRA, SLT, BNE; values R0, R3, R10, R8, R2, 0x3 and 400488) and the corresponding ACC Map.]
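A data flow graph like the one in the figure can be derived from the five-instruction custom instruction by tracking the last writer of each register. This is a minimal sketch with assumed conventions: instructions are hypothetical `(opcode, destination, register_sources)` tuples, immediates and branch targets are left out of the source lists, and `build_dfg` is an illustrative name.

```python
from typing import Dict, List, Optional, Set, Tuple

Instr = Tuple[str, Optional[str], List[str]]  # (opcode, destination, register sources)


def build_dfg(instrs: List[Instr]) -> Tuple[List[Tuple[int, int, str]], Set[str]]:
    """Return (edges, external_inputs).

    An edge (p, c, reg) means instruction number c reads `reg` produced by
    instruction number p (numbering starts at 1, as on the slide).
    """
    last_writer: Dict[str, int] = {}
    edges: List[Tuple[int, int, str]] = []
    inputs: Set[str] = set()
    for number, (_, dest, srcs) in enumerate(instrs, start=1):
        for reg in srcs:
            if reg in last_writer:
                edges.append((last_writer[reg], number, reg))
            else:
                inputs.add(reg)  # value comes from outside the custom instruction
        if dest is not None:
            last_writer[dest] = number
    return edges, inputs


custom_instruction: List[Instr] = [
    ("SUBU", "R3", ["R0", "R3"]),
    ("ADDU", "R10", ["R0", "R0"]),
    ("SRA", "R8", ["R10"]),        # 0x3 is an immediate
    ("SLT", "R2", ["R3", "R8"]),
    ("BNE", None, ["R0", "R2"]),   # 400488 is the branch target
]
edges, inputs = build_dfg(custom_instruction)
```

The resulting edges chain ADDU to SRA to SLT to BNE, with SUBU feeding SLT; R0 and R3 are the only values read from outside the custom instruction.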

Page 23

Preliminary Performance Evaluation

    400d10: addiu $29,$29,-8
    400d18: addu $8,$0,$4
    400d20: sw $0,0($29)
    400d28: addu $4,$0,$0
    400d30: addu $7,$0,$0
    400d38: lui $9,49152
    400d40: sll $4,$4,0x2
    400d48: and $2,$8,$9
    400d50: srl $2,$2,0x1e

[Figure: the nine instructions mapped onto the ACC's matrix of FUs]

• Depth = 3
• 1st row = 1 clock, the two following rows = 0.5 clock each, total = 2 clocks
• 9 – 2 = 7 clock cycles
• 7 × freq = reduced clock cycles
• 7 × 50K = 350K clock cycles
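The arithmetic on this slide can be restated as a short calculation. The variable names are illustrative, and the baseline of one clock per instruction for the nine instructions is an assumption inferred from the "9 – 2" on the slide.

```python
# Assumption: the base processor needs one clock per instruction.
baseline_cycles = 9                      # nine instructions in the listing

row_delays = [1.0, 0.5, 0.5]             # depth 3: clocks per ACC row, from the slide
acc_cycles = sum(row_delays)             # 2.0 clocks on the ACC

saved_per_execution = baseline_cycles - acc_cycles   # 7.0 clocks
execution_frequency = 50_000                         # "50K" on the slide
reduced_cycles = saved_per_execution * execution_frequency

print(f"{reduced_cycles:,.0f} clock cycles saved")
```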

Page 24

Results – Number of CIs considering their length

[Figure: bar chart. Y axis: Number (0 to 70, with one bar labeled 82). X axis: benchmarks with their thresholds: basicmath_large_64K, cjpeg_32K, djpeg_8K, lame_32K, dijkstra_64K, patricia_128K, blowfish_128K, rijndael-enc_128K, rijndael-dec_128K, sha_64K, adpcm_enc_2000K, adpcm_dec_2000K, crc_2000K, fft_128K, fft-inv_128K, gsm (cod)_128K. Legend: Length of CIs in bins 1~5, 6~10, 11~15, 16~20, 21~25, 26~30, 31~35, 36~40, 41~45.]

Page 25

Results – Percentage of CIs considering their length

[Figure: bar chart. Y axis: Percent (0 to 100). X axis: benchmarks: basicmath_large_64K, cjpeg_32K, djpeg_8K, lame_32K, dijkstra_64K, patricia_128K, blowfish_128K, rijndael-enc_128K, rijndael-dec_128K, sha_64K, adpcm_enc_2000K, adpcm_dec_2000K, crc_2000K, fft, fft-inv, gsm (cod). Legend: Length of CIs in bins 1~5, 6~10, 11~15, 16~20, 21~25, 26~30, 31~35, 36~40, 41~45.]

Page 26

More info on Custom Instructions

App.              Exe Instr (M)  Threshold (K)  # HBB  # CI  % Speedup  % code size  % exec time
basicmath_large   170            64             37     18    19.6       1.4          31.6
cjpeg             101            32             42     52    27         1.5          44
djpeg             25             8              22     32    31.5       0.8          48
lame              260            32             142    104   8.6        1.1          16
dijkstra          254            64             34     20    21.4       0.7          38.6
patricia          217            128            51     17    7.8        0.6          14.6
blowfish          260            128            18     28    33         2.7          59
rijndael (enc)    260            128            63     92    36         6.1          51.7
rijndael (dec)    259            128            63     78    36         4.5          51.7
sha               154            64             9      13    52         1.1          73
adpcm (enc)       260            2000           14     8     21         0.32         42
adpcm (dec)       265            2000           12     5     24         0.24         41
crc               265            512            4      2     20         0.1          44.9
fft               189            128            43     19    18.6       0.93         30
fft (inv)         190            128            43     19    18.6       0.93         30
gsm (cod)         265            128            34     41    25.1       1.53         47.2
Average                                         39     34    25         1.53         41.45
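As a quick cross-check of the summary row, the speedup column can be averaged directly. The snippet below only re-uses the "% Speedup" values from the table; the dictionary name is illustrative.

```python
# "% Speedup" column of the table above, keyed by benchmark.
speedup_percent = {
    "basicmath_large": 19.6, "cjpeg": 27, "djpeg": 31.5, "lame": 8.6,
    "dijkstra": 21.4, "patricia": 7.8, "blowfish": 33, "rijndael (enc)": 36,
    "rijndael (dec)": 36, "sha": 52, "adpcm (enc)": 21, "adpcm (dec)": 24,
    "crc": 20, "fft": 18.6, "fft (inv)": 18.6, "gsm (cod)": 25.1,
}

average = sum(speedup_percent.values()) / len(speedup_percent)
lowest = min(speedup_percent, key=speedup_percent.get)
highest = max(speedup_percent, key=speedup_percent.get)

print(f"average {average:.1f}%, min {speedup_percent[lowest]}% ({lowest}), "
      f"max {speedup_percent[highest]}% ({highest})")
```

The mean comes out at about 25%, with patricia (7.8%) and sha (52%) as the extremes, which agrees with the figures quoted in the conclusions.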

Page 27

Conclusions

• An Adaptive Dynamic Extensible Processor
  • Training mode and Normal mode
• Advantages
  • It has a simple profiler
  • CIs are detected and added after production
  • There is no need for a new compiler
  • There is no need for new opcodes for CIs
  • There is no penalty for the absence of CI config data
  • Lower design cost and shorter design time
• By accelerating a small part of the code that has a high execution frequency, an average speedup of 25% can be obtained. Compared with a single-issue processor, the speedup ranges from 7.8% to 52%.

Page 28

Future Work

• Linking HBBs
• Providing more details on the architecture (Accelerator, sequencer, etc.)
• Designing an Accelerator to support conditional execution
• Developing a complete framework
• Extending the ACC for floating point operations
• Substituting the in-order base processor with an out-of-order one

Page 29

Thank you for listening

Page 30

Example

• Application X
  • CIx1, 100, input = 3
  • CIx2, 200, input = 6
  • Total executed instructions = 400,000
• Application Y
  • CIy1, 50, input = 4
  • CIy2, 400, input = 6
  • Total executed instructions = 800,000

Input < 5:

\[ \frac{(100 \times 2) + 50}{(100 \times 2) + (200 \times 2) + 50 + 400} \]
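One way to read this example is as a frequency-weighted coverage for the constraint "input < 5". The sketch below rests on several assumptions that the slide does not spell out: the second number of each CI is taken as its execution frequency, "input" as its number of inputs, and each application's weight as the factor that scales its executed-instruction count up to that of the largest application (2 for X, 1 for Y). All identifiers are illustrative.

```python
# (name, execution frequency, number of inputs, application): assumed reading of the slide
custom_instructions = [
    ("CIx1", 100, 3, "X"),
    ("CIx2", 200, 6, "X"),
    ("CIy1", 50, 4, "Y"),
    ("CIy2", 400, 6, "Y"),
]
total_executed = {"X": 400_000, "Y": 800_000}

# Weight each application so that all applications count as equally long.
largest = max(total_executed.values())
weight = {app: largest / count for app, count in total_executed.items()}

MAX_INPUTS = 5  # constraint "Input < 5"
covered = sum(freq * weight[app]
              for _, freq, n_inputs, app in custom_instructions
              if n_inputs < MAX_INPUTS)
total = sum(freq * weight[app] for _, freq, _, app in custom_instructions)
coverage = covered / total

print(f"coverage for input < {MAX_INPUTS}: {coverage:.1%}")
```

Under these assumptions the numerator is 100 × 2 + 50 = 250 and the denominator is 100 × 2 + 200 × 2 + 50 + 400 = 1050, so the coverage is roughly 23.8%.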

Page 31

Mapping Tool - Example

A Custom Instruction

    1: SUBU R3, R0, R3
    2: ADDU R10, R0, R0
    3: SRA R8, R10, 0x3
    4: SLT R2, R3, R8
    5: BNE R0, 400488, R2

[Figure: Data Flow Graph of instructions 1 to 5 (nodes SUBU, ADDU, SRA, SLT, BNE; values R0, R3, R10, R8, R2, 0x3 and 400488) and the corresponding ACC Map.]

Page 32

RFU Design: A Quantitative Approach

• RFU or Accelerator is a matrix of ALUs
  • No. of inputs
  • No. of outputs
  • No. of ALUs
  • Connections
  • Location of inputs & outputs
• Some definitions
  • Considering frequency and weight in measurement
    ♦ CI execution frequency
    ♦ Weight (to equalize the number of executed instructions)
    ♦ Average = for all CIs (Σ Freq × Weight)
  • Rejection: percentage of CIs that could not be mapped on the RFU
  • Coverage: percentage of CIs that could be mapped on the RFU
  • Basic block: a sequence of instructions that terminates in a control instruction
  • Hot Basic Block: a basic block executed more often than a threshold

Page 33

RFU Inputs (no constraint)

[Figure: "Input No. Analysis, Optimized Version". Coverage (axis 0 to 120) versus Input No. (1 to 19). Labeled values: 96.37, 89.37, 98.48; the value 8 is marked.]

Page 34

RFU Outputs (no constraint)

[Figure: "Output No. Analysis, Optimized Version". Coverage (axis 0 to 120) versus Output No. (1 to 14). Labeled value: 96.58; the value 6 is marked.]

Page 35

RFU Node No (Input=8, Output=8)

[Figure: "Node No. Analysis, Optimized Version". Coverage (axis 0 to 120) versus Node No. (1 to 35). Series: coverage based on total CIs, coverage based on remaining CIs. Labeled value: 94.74; the value 16 is marked.]

Page 36

RFU Width (Inp=8, Out=8, Node=16)

[Figure: "ACC Width Analysis, Optimized Version". Coverage (axis 0 to 120) versus ACC Width (1 to 10). Labeled values: 97.65, 95.65; the value 6 is marked.]

Page 37

RFU Depth (Inp=8, Out=8, Node=16)

[Figure: "ACC Height Analysis, Optimized Version". Coverage (axis 0 to 120) versus ACC Height (1 to 14). Labeled value: 93.41; the value 6 is marked.]

Page 38

RFU Configuration

• Input = 8
• Output = 8
• Node = 16
• Width = 6, 4, 3, 2, 1
• Depth = 5

Page 39

General overview of RFU (Architecture 1)

• Inputs are applied to the first row
• Outputs of each row are connected only to the inputs of the subsequent row
• MOVE is used for transferring data
• Rejection is 22.47%

Page 40

General overview of RFU (Architecture 2)

• Distributing inputs in different rows
  • Row 1 = 7
  • Row 2 = 2
  • Row 3 = 2
  • Row 4 = 2
  • Row 5 = 1
• Connections with variable length
  • row1 → row3 = 1
  • row1 → row4 = 1
  • row1 → row5 = 1
  • row2 → row4 = 1
• Rejection is 9.52%

Page 41

Functional Units

• Types of FUs
  • Type 1: logical (xor, nor, and, or)
  • Type 2: add, sub, compare
  • Type 3: shift (left/right)
• Number of each type in the RFU
  • Type 1 = 6
  • Type 2 = 14
  • Type 3 = 9

Page 42

RFU with 8 outputs

[Figure: block diagram. Labels: Accelerator, FU1-Output, FU2-Output, FU3-Output, FU4-Output, four registers (Reg), Sequencer/control bits.]

Page 43

Control Bits & Immediate Data

• 287 bits are needed as control bits for
  • Multiplexers
  • Functional Units
• 204 bits are needed for immediates
• Each CI configuration needs 287 + 204 = 491 bits

Page 44

CI Configuration Memory

• 2K x 1-bit multi-context memory
  • 4 CI configurations
• 8K x 1-bit cache
  • 16 CI configurations
• In total, 20 CI configurations can be kept in the configuration memories
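The capacities above follow from the per-CI configuration size given on the previous page (control bits plus immediate bits). A short check, assuming 2K and 8K mean 2048 and 8192 one-bit words and that configurations are stored whole:

```python
CONTROL_BITS = 287      # multiplexer and functional-unit control bits
IMMEDIATE_BITS = 204    # immediate data
bits_per_ci = CONTROL_BITS + IMMEDIATE_BITS

# Assumption: "2K" and "8K" are powers of two, counted in bits.
multi_context_bits = 2 * 1024
cache_bits = 8 * 1024

multi_context_cis = multi_context_bits // bits_per_ci
cache_cis = cache_bits // bits_per_ci
total_cis = multi_context_cis + cache_cis

print(bits_per_ci, multi_context_cis, cache_cis, total_cis)
```

The result matches the slide: 491 bits per configuration, 4 configurations in the multi-context memory, 16 in the cache, 20 in total.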

Page 45

Extension of Custom Instructions over HBBs – Motivating Example

[Figure: control flow graph with nodes labeled B1 to B12, S1 to S10 and J1 to J3.]

Name of the block   No. of Exe. (M)   No. of Instr
B1                  11.6              5
B2                  5.8               1
B3                  5.8               4
B4                  8.6               3
B5                  5.2               3
B6                  5.6               1
B7                  5.8               2
B8                  11.6              2
B9                  11.6              6
B10                 11.6              2
B11                 11.6              4
B12                 5.8               3

Page 46

Multi-Exit Custom Instructions

Page 47

Conclusions

• Adaptive Dynamic Extensible Processor
  • Binary Profiler
  • RFU (Inp=8, Out=6, Nodes=16, Width=6,4,3,2,1, Depth=5)
  • Sequencer
• Adaptive Dynamic Extensible Processor
  • No design time
  • No extra read port and write port
  • No design and verification cost
  • No compiler
  • No new opcode
  • No penalty for absence of configuration data of a custom instruction in the multi-context memory

Page 48

Custom Instruction

• Generated from HBBs
  • Using HBB table
  • Object code
• Custom instruction can include
  • Logical operations
  • add/sub
  • Shift
  • At most one store
  • At most one control instruction (jump/branch)
  • No load
  • No floating point instructions
• New object code
  • Is logically equivalent

[Figure: Profiler Table with columns BBSA and Counter]

Page 49

Processor modes (1/2)

• Training mode
  • Profiling applications
  • Detecting critical regions of code
  • Generating DFGs for critical regions
  • Generating custom instructions from DFGs
  • Generating new object code
  • Generating data for the accelerator configuration memories and initializing the sequencer table
  • Training can be done in the gap between two consecutive executions of the application if possible, otherwise just once before the processor starts its normal operation

Page 50

Processor modes (2/2)

• Normal mode
  • Profiling applications
  • Using the data generated in training mode to execute custom instructions on the accelerator
  • Critical regions of the code are executed as custom instructions on the accelerator, and the remaining parts of the code are executed on the processor's functional unit as usual

Page 51

Online Profiler-Components

• Profiler
  • Hardware
  • Software
• Hardware
  • Comparator: compares the current value and the previous value of the Program Counter (PC)
  • Profiler Table: for each taken branch/jump target address, the table holds a corresponding counter. The counter counts how many taken branches or jumps have been made to the target address.
• Software
  • Hot Basic Block (HBB) detector

* A basic block is a sequence of instructions that ends in a branch or jump.

Page 52

Architecture Advantages

• No compiler
• No new opcode
• No penalty for absence of configuration data of a custom instruction in the multi-context memory
• The ability to use the processor functional unit and the accelerator in parallel
• Custom instruction detection and execution are done fully automatically and transparently

Page 53

General overview of the architecture

• Base processor (1, 2 or 4-way in-order general RISC)
• Profiler
  • Detects start addresses of Hot Basic Blocks (HBBs)
• Accelerator (ACC)
  • Executes Custom Instructions
• Sequencer
  • Determines the microcode execution sequence using the sequencer table