A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator

Preview:

DESCRIPTION

A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator. Farhad Mehdipour , Hamid Noori, Hiroaki Honda, Koji Inoue, Kazuaki Murakami Kyushu University, Fukuoka, Japan. O UTLINE. Introduction Problem Definition and Basic Concepts - PowerPoint PPT Presentation

Citation preview

A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator

Farhad Mehdipour, Hamid Noori, Hiroaki Honda, Koji Inoue, Kazuaki Murakami

Kyushu University, Fukuoka, Japan

2

OUTLINE

Introduction Problem Definition and Basic Concepts Hybrid DSE Approach for Designing RAC Case study: Designing RAC for a Reconfigurable Processor Conclusion

3

OUTLINE

Introduction Problem Definition and Basic Concepts Hybrid DSE Approach for Designing RAC Case study: Designing RAC for a Reconfigurable Processor Conclusion

4

Designing Embedded Systems

Embedded Microprocessors Application-Specific Integrated Circuits (ASICs) Application-Specific Instruction set Processors (ASIPs) Extensible Processors

CPU

Instruction Dispatcher

Register File

+ & x LD/ST CFU1 CFU2

LD/ST: Load / Store

CFU: Custom Functional Unit

5

Extensible Processors

Improving the performance and energy efficiency of the embedded processors

maintaining compatibility and flexibility. Using an accelerator for accelerating frequently executed portions of

applications Accelerator implementations

custom hardware (such as ASIP or Extensible Processors) reconfigurable fine/coarse grain hw

CPU

Instruction Dispatcher

Register File

+ & x LD/ST CFU1 CFU2LD/ST: Load / Store

CFU: Custom Functional Unit

6

Custom Instructions

Critical segments Most frequently executed portions of the applications

Instruction set customization hardware/software partitioning (Identifying critical segments in applications)

Custom Instructions (CIs) are extracted from critical segments of an application and executed on a Custom Functional Unit (CFU)

A CI is represented as a DFG

1: SUBU R3, R0, R32: ADDU R10, R0, R03: SRA R8, R10, 0x34: SLT R2, R3, R85: BNE R0,400488, R2

ADDU

SRA

SLT

SUBU

BNE

R3R0 R0R0

R10

R8

R2

R2

R30x3

400488

2

3

4

5

1A Custom Instruction

7

Reconfigurable Processors

Using a reconfigurable accelerator (RAC) instead of custom functional units

Adding and generating custom instructions after fabrication

CPU

Instruction Dispatcher

Register File

+ & x LD/ST CFU1 CFU2RFU Config Mem

CFU: Custom Functional Unit

RFU: Reconfigurable Functional Unit

8

Baseline Processor

ALU

Register File

RAC...

...

...

ConfigurationMemory

How a Reconfigurable Processor Works

GPP: General Purpose Processor

RAC: Reconfigurable Accelerator

400680 subiu $25,$25,1400688 lbu $13,0($7)400690 lbu $2,0($4)400698 sll $2,$2,0x184006a0 sra $14,$2,0x184006a8 addiu $4,$4,14006b0 srl $8,$2,0x1c4006b8 sll $2,$8,0x24006c0 addu $2,$2,$254006c8 lw $2,0($2)4006d0 xori $13,$13,14006d8 addu $10,$10,$2400680 subiu $25,$25,1400698 sll $2,$2,0x184006a0 sra $14,$2,0x18400688 lbu $13,0($7)4006e0 bgez $10,4006f0 ... Hot Basic Block

Reconfigurable Processor

9

OUTLINE

Introduction Problem Definition and Basic Concepts Overall Design Approach Hybrid DSE Approach for Designing RAC Case study: Designing RAC for a Reconfigurable Processor Conclusion

10

Designing a Reconfigurable Accelerator (RAC)

Multitude of design parameters e.g. number of functional units, input/output ports, type of

functional units and etc.

Several design parameters high complexity in the RAC design the requirements for a methodological approach

A major challenge finding the right balance between the different quality

requirements (e.g. speedup, area, energy consumption)

11

Traditional Design Process

Describing a reference model Verifying the model functional correctness Obtaining a rough estimates of performance Manual or semi-automatic generation of several

alternative designs Choosing the most suitable design based on various

performance metrics

12

Design Space Exploration

Design Space Exploration (DSE) the process of analyzing several “functionally equivalent”

implementation alternatives to identify an optimal solution Can become too computationally expensive

Example: a design with four tasks on an architecture with three processing

modules and each have four possible configurations results in 500 design alternatives

Our Hybrid (analytical & quantitative) DSE approach reduces substantially the design time and effort

13

Assumptions

RAC: a matrix of FUs the width equal to w and height equal to h basically has a combinational logic

Basic Elements: Functional Units (logic resources) Multiplexers (routing resources)

FUs in RAC are fully connected Except the lack of connections from lower to upper rows

RAC Input Ports

RAC Output Ports

2

2FU

1

1FU

2

1FU

1

2FU Row1

FU FU FU FU

Col1

Row5

...

...

h

w

14

Assumptions

CIs (DFGs) are mapped onto the RAC and executed at runtime

1: SUBU R3, R0, R32: ADDU R10, R0, R03: SRA R8, R10, 0x34: SLT R2, R3, R85: BNE R0,400488, R2

ADDU

SRA

SLT

SUBU

BNE

R3R0 R0R0

R10

R8

R2

R2

R30x3

400488

2

3

4

5

1A Custom Instruction

Data Flow Graph

2

3

4

1

RAC Initial Architecture

5

15

OUTLINE

Introduction Problem Definition and Basic Concepts Hybrid DSE Approach for Designing RAC Case study: Designing RAC for a Reconfigurable Processor Conclusion

16

The Problem

Designing a RAC while optimizing Speedup & Area Design parameters

the RAC’s width and height the number of FUs, the number of input and output ports

Speedup

CCyclesOnRATotalClock

ocessorseCyclesOnBaTotalClockSpeedup

Pr

17

Design Methodology- Tool Chain

Our Approach: A Hybrid (Analytical & Quantitative) DSE Approach

CI ExtractorCIs

(DFGs)

CI (DFG)Analyzer

Required Informationfor DSE

Design SpaceExploration

Accelerator FinalSpecification

Adjusting RACArchitecure

Specification ofAccelerator

Application

18

Problem Formulation

: the fraction of DFGs (CIs) in applications which have the width equal to i and height equal to j.

: percentage of application execution time concerns to all DFGs with the width of 4 and height of 3 is 7%.

ij

W

i

H

j

ij CPUCCCPUCC

1 1

)()(

jiFUCPUCC ij

ij )()(

2

4

6

1 3

5

2

4

5

1 3

ij

%743

19

Problem Formulation

ij

W

i

wh

H

j

ij

wh RACCCRACCC

1 1

)()(

: the number of clock cycles for executing DFG(i,j) on a RAC (w,h)

)( wh

ij RACCC

if one or both dimensions of a DFGs are greater than RAC’s dimensions,

temporal partitioning: divides a DFG into time exclusive smaller DFGs

reconfiguration overhead time for loading subsequent partitions of a DFG from the configuration memory onto the RAC

20

Problem Formulation

hjwi ,)3hjwi ,)2

)()( wh

wh

ij RACDRACCC hjwi ,)1

hjwi ,)4

RACh

w

RAC

DFG

RAC

h

w

RAC

DFGh

w

RAC

DFG

21

Delay of RAC

wkMUXDFUDRACDh

i

ki

h

i

ki

wh ,...,1,)()()(

1

11

)( kjMUXD = delay of mux(

kjs2 to 1)

kjmk

js 2log )1()1( wwjmkjRAC Input Ports

RAC Output Ports

22FU

1

1FU

21FU

1

2FU Row1

FU FU FU FU

Col1

Row5

...

...

h

wRAC’s Delay Path

22

Area of RAC

= delay of mux(kjs2 to 1)

w

i

h

j

ij

w

i

h

j

ij

wh MUXAFUARACA

1

1

11 1

)(*2)()(

)( ijMUXA

RAC Input Ports

RAC Output Ports

2

2FU

1

1FU

2

1FU

1

2FU Row1

FU FU FU FU

Col1

Row5

...

...

h

w

23

Optimization Problem

Objective: Maximize ( ) )

ij

W

i

wh

H

j

ij

ij

W

i

H

j

ij

wh

wh

RACCC

CPUCC

RACCC

CPUCCRACS

1 1

1 1

)(

)(

)(

)()(

)( whRACS

24

OUTLINE

Introduction Problem Definition and Basic Concepts Hybrid DSE Approach for Designing RAC Case study: Designing RAC for a Reconfigurable Processor Conclusion

25

Designing RAC for a Reconfigurable Processor

AMBER: a reconfigurable processor targeted for embedded systems Main components

a base processor (general RISC processor) Sequencer and a coarse-grained reconfigurable functional unit (RFU)

Register File

ID/EXE Reg

RFU

ConfigurationMemory

ALUs

MUX SequencerSequencer

Table

EXE/MEM Reg

GPP Augmented HW

AMBER’s Architecture

26

Quantitative Approach

22 applications of Mibench were attempted Applications were executed on Simplescalar and profiled Hot segments and DFGs are extracted from the applications No limitation in the RFU’s initial architecture DFGs are mapped on the initial RFU architecture Feedbacks of mapping

Specification of the designed architecture for AMBER’s RFU

27

Hybrid DSE Approach

FU and various size multiplexers Synthesized using Hitachi 0.18um Measuring delay and area

DFGs are analyzed to extract required information quantitatively

Reconfiguration penalty time: 1 clock cycle

The base processor clock frequency: 166MHz

28

Speedup Evaluation

14

710

1316

191 4 7

10

13

16

19

0

0.5

1

1.5

2

2.5

Speedup

Width

Height

0-0.5 0.5-1 1-1.5 1.5-2 2-2.5

29

Speedup Evaluation

Increasing RAC’s width more parallelism

Widths larger than 6 negative effect of growing the number of muxes and their sizes no more speedup achievable

Increasing RAC’s Height longer delay speedup declines and area increases

30

Comparison

31

OUTLINE

Introduction Problem Definition and Basic Concepts Hybrid DSE Approach for Designing RAC Case study: Designing RAC for a Reconfigurable Processor Conclusion

32

CONCLUSION

Hybrid DSE approach Uses realistic data from the attempted applications Substantially reduces design time and designer efforts Can be used for shrinking a large design space Easily extendable to apply new design parameters More suitable where the new applications are added

Thanks for your attention!

Recommended