ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

Koji Inoue†, Hamid Noori‡, Farhad Mehdipour†, Takaaki Hanada†, and Kazuaki Murakami†

†Department of Advanced Information Technology, Kyushu University, Fukuoka, Japan
‡School of Electrical and Computer Engineering, University of Tehran

Source: cpc.ait.kyushu-u.ac.jp/~koji.inoue/paper/2009/...



Outline

• Introduction
• ADEXOR: Adaptive Extensible Processor
  – Overview
  – Microarchitecture
  – Coarse-grained Reconfigurable Functional Unit
• Evaluation
• Conclusions

Motivation and Solution

• Embedded processors have to achieve
  – Low cost
  – High performance
  – Low power or low energy consumption
• Key point
  – How can processors adapt to target applications?
• Solution: ASIP with reconfigurability
  – Application-specific ISA
    • Provide custom instructions (CIs)
  – Implement reconfigurable FUs

ADaptive EXtensible processOR (ADEXOR)

• Has a coarse-grained reconfigurable functional unit
• Supports efficient "Multi-Exits CIs"
• Achieves high performance and low energy

[Figure: GPP with augmented hardware. Labeled blocks: Register File, ID/EXE Reg, ALU, CRFU, EXE/MEM Reg, Configuration Memory (indexed by mtc1 or sequencer), MUX, Counter (triggered by mtc1 or sequencer).]

Hot basic block shown in the figure:

    400680 subiu $25,$25,1
    400688 lbu $13,0($7)
    400690 lbu $2,0($4)
    400698 sll $2,$2,0x18
    4006a0 sra $14,$2,0x18
    4006a8 addiu $4,$4,1
    4006b0 srl $8,$2,0x1c
    4006b8 sll $2,$8,0x2
    4006c0 addu $2,$2,$25
    4006c8 bgez $10,4006f0
    4006d0 xori $13,$13,1
    4006d8 addu $10,$10,$2
    4006e0 bgez $10,4006f0
    ....

GPP: General Purpose Processor
CRFU: Coarse-grained Reconfigurable Functional Unit

CRFU Microarchitecture

• 16 FUs controlled by configuration bits
• MUX-based interconnection between FUs
• Early-stage data can be transferred to output ports

[Figure: array of FUs arranged in rows, labeled Row 1 through Row 5. Each FU contains an adder/subtractor, AND/OR/XOR logic, and a barrel shifter, and is controlled by configuration bits.]
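The behavior listed on this slide can be illustrated with a small software model. This is only a sketch under assumptions: the operation names, the `FUConfig` fields, and the rule that an FU may read any earlier value are hypothetical stand-ins for the real configuration-bit encoding and row-based MUX network, which the slide does not specify.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF
MAX_FUS = 16  # the slide states 16 FUs

# Hypothetical operation table: the slide names the FU contents
# (adder/subtractor, AND, OR, XOR, barrel shifter) but not the encoding.
OPS = {
    "add": lambda a, b: (a + b) & MASK32,
    "sub": lambda a, b: (a - b) & MASK32,
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
    "xor": lambda a, b: a ^ b,
    "sll": lambda a, b: (a << (b & 31)) & MASK32,
    "srl": lambda a, b: a >> (b & 31),
}


@dataclass(frozen=True)
class FUConfig:
    """Assumed per-FU configuration: an operation plus two MUX selects."""
    op: str
    src_a: int  # index into the values available so far
    src_b: int


def run_crfu(inputs, configs, out_selects):
    """Evaluate the configured FUs in order and return the selected outputs.

    values[0:len(inputs)] are the CRFU inputs; each FU appends its result.
    out_selects may point at any value, so an early-stage result can be
    routed directly to an output port.
    """
    if len(configs) > MAX_FUS:
        raise ValueError("configuration uses more than 16 FUs")
    values = [x & MASK32 for x in inputs]
    for cfg in configs:
        values.append(OPS[cfg.op](values[cfg.src_a], values[cfg.src_b]))
    return [values[i] for i in out_selects]


# Two FUs: the first adds the inputs, the second XORs that sum with input 0.
# Both the early-stage sum and the final value are sent to output ports.
print(run_crfu([5, 3], [FUConfig("add", 0, 1), FUConfig("xor", 2, 0)], [2, 3]))
# → [8, 13]
```

Flattening the array into a single ordered list keeps the sketch short; a closer model would restrict each FU's MUX selects to the rows above it.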

Supporting Multi-Exits Custom Instructions (MECIs)

Multiple-Exits Custom Instruction: Conditional Execution + Hot-Path Selection

[Figure: data-flow graph from adpcm with two exit points. #Required nodes: 16]

Assume at most 16 nodes can be included in one CI.
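One way to read "conditional execution + hot-path selection" is that a custom instruction follows the hot path across branches and leaves through an exit as soon as a branch outcome deviates from that path. The sketch below models only this control behavior, sequentially, in software. The step format, the `run_meci` name, and the assumed hot directions are hypothetical; the hardware would evaluate the nodes inside the CRFU rather than step by step.

```python
def run_meci(steps, regs, final_exit):
    """Run a multi-exit custom instruction over a register dict.

    steps is a list of tuples:
      ("op", dest, fn)                            regs[dest] = fn(regs)
      ("branch", cond, hot_taken, exit_target)    leave the CI through
          exit_target if cond(regs) differs from the hot direction.
    Returns the updated registers and the address to continue from.
    """
    regs = dict(regs)
    for step in steps:
        if step[0] == "op":
            _, dest, fn = step
            regs[dest] = fn(regs)
        else:
            _, cond, hot_taken, exit_target = step
            if cond(regs) != hot_taken:
                return regs, exit_target  # early exit off the hot path
    return regs, final_exit  # hot path completed


# Tail of the hot basic block from the ADEXOR slide. The fall-through
# direction is ASSUMED to be the hot one for both branches.
steps = [
    ("op", "r2", lambda r: r["r2"] + r["r25"]),             # addu $2,$2,$25
    ("branch", lambda r: r["r10"] >= 0, False, 0x4006F0),   # bgez $10,4006f0
    ("op", "r13", lambda r: r["r13"] ^ 1),                  # xori $13,$13,1
    ("op", "r10", lambda r: r["r10"] + r["r2"]),            # addu $10,$10,$2
    ("branch", lambda r: r["r10"] >= 0, False, 0x4006F0),   # bgez $10,4006f0
]

regs, next_pc = run_meci(steps, {"r2": 1, "r25": 2, "r10": -5, "r13": 0}, 0x4006E8)
print(regs, hex(next_pc))
```

With `r10 = -5` both branches fall through and the CI completes; with a non-negative `r10` the first branch leaves the CI early and the later updates are not applied.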

Experimental Setup (1/2)

Base Processor Configuration

Issue                          1-way
L1 instruction cache           32K, 4-way, 1-cycle latency, miss penalty 20 cycles
L1 data cache                  16K, 4-way, 1-cycle latency, miss penalty 20 cycles
ALUs                           1 integer unit, 1 floating-point unit
Multiplier                     1 integer (5 cycles)
Divider                        1 integer (8 cycles)
Branch predictor               bimodal
Branch prediction table size   256
Extra branch misprediction     3
Register file                  4 read ports, 2 write ports
Clock frequency                135 MHz

Experimental Setup (2/2)

[Figure: extended datapath with Reg0 ... Reg31, DEC/EXE pipeline registers, CRFU input registers (En), ALU, MUL/DIV, CRFU, counter, config memory (triggered by mtc1 or sequencer), EXE/MEM pipeline registers, and result bus.]

arch1: (4-read/2-write)
• Clock freq: 135 MHz
• RF read/write access
  – Input: 5, 6, 7, or 8: +1 extra cycle
  – Output: 3 or 4: +1 extra cycle
  – Output: 5 or 6: +2 extra cycles
• CRFU execution
  – arch-1-var: variable (1 or 2 cycles)
  – arch-1-fix: 2 cycles

arch2: (8-read/4-write)
• Clock freq: 130 MHz
• RF read/write access
  – Input: no extra cycle
  – Output: 5 or 6: +1 extra cycle
• CRFU execution
  – arch-2-var: variable (1 or 2 cycles)
  – arch-2-fix: 2 cycles
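The extra register-file cycles listed for arch1 and arch2 are consistent with a simple rule: operands beyond the available ports need additional access rounds, giving ceil(n / ports) - 1 extra cycles. The sketch below encodes that rule; the formula is an inference from the listed values rather than something stated on the slide, and the names `ARCHS`, `extra_rf_cycles`, and `ci_overhead` are hypothetical.

```python
# (read ports, write ports) per architecture, taken from the slide.
ARCHS = {"arch1": (4, 2), "arch2": (8, 4)}


def extra_rf_cycles(count, ports):
    """Extra cycles to move `count` operands through `ports` RF ports.

    Assumed rule: ceil(count / ports) - 1, never below zero.
    """
    if ports <= 0:
        raise ValueError("ports must be positive")
    return max(0, -(-count // ports) - 1)


def ci_overhead(arch, n_inputs, n_outputs):
    """Return (extra input cycles, extra output cycles) for one CI."""
    read_ports, write_ports = ARCHS[arch]
    return (extra_rf_cycles(n_inputs, read_ports),
            extra_rf_cycles(n_outputs, write_ports))


# Reproduce the cases listed on the slide.
print(ci_overhead("arch1", 8, 6))  # → (1, 2)
print(ci_overhead("arch2", 8, 6))  # → (0, 1)
```

The doubled port count of arch2 removes most of this overhead, at the cost of the lower 130 MHz clock listed above.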

Performance Evaluation

[Figure: bar chart of speedup (y-axis, 1 to 5) per benchmark. Recovered legend entries: arch1-var, arch2-fix, arch2-var. Benchmarks: basicmath, bitcount, qsort, susan, cjpeg, djpeg, dijkstra, patricia, blowfish, rijndael, stringsearch, sha, adpcm, crc, fft, gsm, avg-seq, avg-mtc1.]

Energy Consumption

Pros.
• Low activity of hardware components
  – I-Cache, Bpred
  – Decoder
  – Register File
  – Functional Unit
• Higher I-Cache hit rates
  – Reduce the energy for off-chip accesses

Cons.
• RFU configuration
  – Accessing the config. memory
  – Setting control signals in the RFU
• Increased complexity
  – Communication between the processor's data-path and the RFU

Total Energy Reduction

[Figure: bar chart of total energy reduction (%) (y-axis, 0 to 80) per benchmark. Legend: clk-gating-arch2-var, arch2-var, arch2-fix, arch1-var. Benchmarks: basicmath, bitcount, qsort, susan, cjpeg, djpeg, dijkstra, patricia, blowfish, rijndael, stringsearch, sha, adpcm, crc, fft, gsm, avg-seq, avg-mtc1.]

Temperature Analysis

[Figure: temperature (℃) (y-axis, 45 to 48) at 130 MHz, 260 MHz, 390 MHz, 520 MHz, and 650 MHz, shown alongside the CRFU floor plan (1.7 x 1.7 [mm2]) with its FU placement.]

Conclusions

• ADEXOR: Adaptive Extensible Processor
  – Has a coarse-grained reconfigurable functional unit
  – Supports multi-exit custom instructions
• Performance / Energy Analysis
  – 5X speedup (best case)
  – 60% energy reduction (best case)
• Future Work
  – Extend for 3D-IC implementation

Acknowledgement

• This research was supported in part by
  – New Energy and Industrial Technology Development Organization
  – The chip fabrication program of VLSI Design and Education Center (VDEC), the University of Tokyo, in collaboration with Hitachi Ltd. and Dai Nippon Printing Corporation.