
A Novel Algorithm Combining Temporal Partitioning and Sharing of Functional Units

João M. P. Cardoso

Faculty of Sciences and Technology, University of Algarve, Faro, Portugal

IEEE Symposium on Field-Programmable Custom Computing Machines, Rohnert Park, CA, USA

April 30, 2001

Index

Introduction

Temporal Partitioning

Problem Definition

New vs Previous Approach

Algorithm Working Through an Example

Experimental Results

Related Work

Conclusions

Future Work

Introduction

“Virtual Hardware”: reuse of devices, saving silicon area, and a view of “unlimited resources”; enabled by dynamically reconfigurable FPGAs

Two concepts: context switching among functionalities, and allowing a large “function” to be executed

FPGA devices allowing virtualization: off-chip configurations; on-chip configurations

Several research efforts…

Introduction

Answers: Temporal Partitioning and Sharing of Functional Units

Goal: combining the two...

[Figure: example dataflow graph (a differential-equation kernel) with inputs dx, u, x, y and outputs x_1, u_1, y_1]

What if its size is larger than the available reconfigware area?

Temporal Partitioning

[Figure: the example dataflow graph split into two temporal partitions that execute in sequence by time-sharing the device, communicating through the intermediate value aux1; the first produces aux1, x_1, and y_1, the second consumes aux1 and produces u_1]

Temporal Partitioning

Create temporal partitions to be executed by time-sharing the device

Netlist level (structural): difficulties when dealing with feedback; loss of information; flat structure; intricate for exploiting sharing of functional units

Behavioral level (functional): loops can be explicitly represented; better design decisions; “a must” for compilers for reconfigurable computing

Problem Definition

But what if we decrease the needed area by sharing functional units?

Simultaneous Temporal Partitioning and Sharing of Functional Units

THE PROBLEM:

Given a dataflow graph (representing a behavioral description), a library of components,...

Map the dataflow graph onto the available resources of the FPGA device: considering sharing of functional units; considering temporal partitioning; decreasing the overall execution latency
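As a rough sketch only (the names and structures below are illustrative, not taken from the paper), the inputs of the problem can be captured as a dataflow graph whose nodes carry an operation type, a component library giving the area and delay of each functional unit, and the area of the target device:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    ident: int
    op: str                                      # operation type, e.g. "+" or "*"
    preds: list = field(default_factory=list)    # ids of data predecessors

@dataclass
class Component:
    area: int     # FPGA cells taken by one functional unit of this type
    delay: int    # latency in control steps (cs)

# Component library and device area; the values are the ones used in the
# example that follows.
LIBRARY = {"+": Component(area=1, delay=1),
           "*": Component(area=2, delay=2)}
DEVICE_AREA = 3
```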

New vs Previous Approach

Previous approach: DFG/CDFG and constraints → Temporal Partitioning → High-Level Synthesis (with component library) → circuit generation and logic synthesis

New approach: DFG/CDFG and constraints → simultaneous Temporal Partitioning and High-Level Synthesis (with component library) → circuit generation and logic synthesis

Algorithm Working Through an Example

Suppose the following dataflow graph. Consider:

Area(+) = 1 cell; Area(x) = 2 cells; Delay(+) = 1 control step (cs); Delay(x) = 2 cs

Total area of the DFG: 8 cells

Available Area: 3 cells

[Figure: the example dataflow graph with six nodes, numbered 0 to 5]
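The slide gives only the node count, the per-operator costs, and (on the next slides) the ASAP/ALAP table; the concrete graph below is a hypothetical reconstruction that is merely consistent with those numbers (four adders and two multipliers, 8 cells in total), not the talk's actual example:

```python
AREA = {"+": 1, "*": 2}
DELAY = {"+": 1, "*": 2}

# Hypothetical operators and edges, chosen to match the 8-cell total and the
# ASAP/ALAP table shown on the next slide.
OP = {0: "+", 1: "+", 2: "*", 3: "*", 4: "+", 5: "+"}
PREDS = {0: [], 1: [], 2: [0, 1], 3: [], 4: [3], 5: [4]}

total_area = sum(AREA[OP[n]] for n in OP)
print(total_area)   # 8 cells needed, against 3 cells available
```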

Algorithm Working Through an Example

Calculate ASAP and ALAP values

Node:  0  1  2  3  4  5
ASAP:  0  0  1  0  2  3
ALAP:  1  1  2  0  2  3

[Figure: the dataflow graph annotated with the ASAP and ALAP values]
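A minimal sketch of the ASAP/ALAP computation over such a graph, using the hypothetical edges introduced above (for them it reproduces the table on the slide):

```python
DELAY = {0: 1, 1: 1, 2: 2, 3: 2, 4: 1, 5: 1}              # per-node delay (cs)
PREDS = {0: [], 1: [], 2: [0, 1], 3: [], 4: [3], 5: [4]}  # hypothetical edges
SUCCS = {n: [m for m in PREDS if n in PREDS[m]] for n in PREDS}

# ASAP: a node starts as soon as all its predecessors have finished
# (node ids happen to be a topological order here).
asap = {}
for n in sorted(PREDS):
    asap[n] = max((asap[p] + DELAY[p] for p in PREDS[n]), default=0)

latency = max(asap[n] + DELAY[n] for n in PREDS)          # 4 cs

# ALAP: latest start that still meets the overall latency.
alap = {}
for n in sorted(PREDS, reverse=True):
    alap[n] = min((alap[s] - DELAY[n] for s in SUCCS[n]),
                  default=latency - DELAY[n])

print(asap)   # {0: 0, 1: 0, 2: 1, 3: 0, 4: 2, 5: 3}
print(alap)   # {5: 3, 4: 2, 3: 0, 2: 2, 1: 1, 0: 1}
```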

Algorithm Working Through an Example

Identify the critical path

Node:  0  1  2  3  4  5
ASAP:  0  0  1  0  2  3
ALAP:  1  1  2  0  2  3

[Figure: the dataflow graph with the zero-slack nodes 3, 4, and 5 highlighted as the critical path]
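With the values in the table, the critical path falls out as the nodes with zero slack (ALAP - ASAP = 0), i.e. nodes 3, 4, and 5; a short sketch:

```python
asap = {0: 0, 1: 0, 2: 1, 3: 0, 4: 2, 5: 3}   # values from the table
alap = {0: 1, 1: 1, 2: 2, 3: 0, 4: 2, 5: 3}

critical_path = [n for n in asap if alap[n] - asap[n] == 0]
print(critical_path)   # [3, 4, 5]
```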

Algorithm Working Through an Example

Create an initial number of TPs: suppose 3

[Figure: the dataflow graph next to three empty temporal partitions (1, 2, 3), each bounded by the available area and a time slot (MAXCS)]
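The talk does not state how the initial count of three partitions is chosen; one plausible guess, shown here purely as an assumption, is the ratio between the total DFG area and the available device area:

```python
import math

total_area = 8    # cells needed by the whole dataflow graph
device_area = 3   # cells available in one configuration

# Illustrative heuristic only: this many partitions would be needed if no
# functional unit were shared at all.
initial_tps = math.ceil(total_area / device_area)
print(initial_tps)   # 3
```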

Algorithm Working Through an Example

Map each node of the critical path on each temporal partition

[Figure: critical-path nodes 3, 4, and 5 placed in temporal partitions 1, 2, and 3; their time slots (MAXCS) become 2 cs, 1 cs, and 1 cs]
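A sketch of this step, assuming (as the figure suggests, and following the hypothetical graph above in which node 3 is a multiplier and nodes 4 and 5 are adders) that the critical-path nodes are handed out in order, one per partition, and that each partition's time slot (MAXCS) starts out as the delay of the node it receives; names such as `partitions` and `maxcs` are illustrative:

```python
CRITICAL = [3, 4, 5]          # critical-path nodes, in dataflow order
DELAY = {3: 2, 4: 1, 5: 1}    # hypothetical: node 3 multiplies, 4 and 5 add
NUM_TPS = 3

partitions = [[] for _ in range(NUM_TPS)]   # nodes mapped to each TP
maxcs = [0] * NUM_TPS                       # time slot of each TP, in cs

for tp, node in enumerate(CRITICAL):        # one critical node per partition
    partitions[tp].append(node)
    maxcs[tp] = DELAY[node]

print(partitions)   # [[3], [4], [5]]
print(maxcs)        # [2, 1, 1]  ->  2 cs, 1 cs, 1 cs, as on the slide
```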

Algorithm Working Through an Example

Try to map nodes in each temporal partition (1)

[Figure: step by step, nodes 0 and 1 are added to temporal partition 1 within its 2-cs slot; node 2 does not fit in it]
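A simplified sketch of the fit test behind this step, a reconstruction of the idea rather than the paper's exact procedure: a node fits in a partition if its operands are available, it finishes within the partition's MAXCS, and either an already-allocated functional unit of the right type is idle in those control steps (sharing) or a new unit still fits in the area budget. The `try_map` helper and the partition dictionary layout are my own illustration:

```python
AREA = {"+": 1, "*": 2}
DELAY = {"+": 1, "*": 2}
DEVICE_AREA = 3

def try_map(op, preds_finish, tp):
    """Try to schedule one operation in partition `tp`.

    `preds_finish`: finish times (cs) of predecessors mapped to the same
    partition; predecessors living in earlier partitions are ready at cs 0.
    `tp`: {"maxcs": int, "area": int, "fus": {op: [set of busy cs, ...]}}.
    Returns the start cs, or None if the node does not fit.
    """
    delay = DELAY[op]
    earliest = max(preds_finish, default=0)
    for start in range(earliest, tp["maxcs"] - delay + 1):
        steps = set(range(start, start + delay))
        for busy in tp["fus"].setdefault(op, []):   # reuse an idle unit
            if not busy & steps:
                busy |= steps
                return start
        if tp["area"] + AREA[op] <= DEVICE_AREA:    # or allocate a new one
            tp["fus"][op].append(set(steps))
            tp["area"] += AREA[op]
            return start
    return None

# With the hypothetical graph: after node 3 (a multiplier) fills partition 1,
# adders 0 and 1 both fit there by sharing one adder in different control
# steps, while node 2 (a multiplier) does not fit in any of the partitions.
tp1 = {"maxcs": 2, "area": 2, "fus": {"*": [{0, 1}]}}
print(try_map("+", [], tp1), try_map("+", [], tp1), try_map("*", [1, 2], tp1))
# -> 0 1 None
```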

Algorithm Working Through an Example

Try to map nodes in each temporal partition (2)

[Figure: node 2 does not fit in temporal partition 2, whose slot is only 1 cs]

Algorithm Working Through an Example

Try to map nodes in each temporal partition (3)

[Figure: node 2 does not fit in temporal partition 3 either]

Algorithm Working Through an Example

Relax: add 1 control step to MAXCS

[Figure: the time slots are enlarged by one control step and the unmapped node 2 is retried]
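A sketch of the surrounding relaxation loop; the talk does not detail whether MAXCS is enlarged globally or per partition, so this version (with an illustrative `try_map_all` callback and a safety cap of my own) simply grows every partition's slot by one control step until nothing is left unmapped:

```python
def temporal_partition(nodes, tps, try_map_all, max_relax=32):
    """`try_map_all(nodes, tps)` attempts to place every node and returns the
    list of nodes that still do not fit.  `max_relax` is only a guard for this
    sketch; the full algorithm can also adjust the number of partitions."""
    unmapped = try_map_all(nodes, tps)
    while unmapped and max_relax > 0:
        for tp in tps:
            tp["maxcs"] += 1           # relax: one more control step
        unmapped = try_map_all(unmapped, tps)
        max_relax -= 1
    return tps, unmapped
```

In the example a single relaxation suffices: node 2 then fits in the second partition, whose slot grows to 2 cs.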

Algorithm Working Through an Example

Try to map nodes in each temporal partition (1)

[Figure: after relaxing, node 2 is retried in temporal partition 1 and still does not fit]

Algorithm Working Through an Example

Try to map nodes in each temporal partition (2)

[Figure: node 2 now fits in temporal partition 2, whose slot grows to 2 cs]

Algorithm Working Through an Example

Merge Operation (1)

[Figure: temporal partitions 1 and 2 are merged into a single partition (1,2) executing in 4 cs; partition 3 keeps its 1-cs slot]

Algorithm Working Through an Example

Merge Operation (2)

[Figure: partition 3 is absorbed as well, leaving a single partition (1,2,3) that executes the whole graph in 4 cs]
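A sketch of the merge test, again a reconstruction: two consecutive partitions are merged when the union of their nodes can be re-scheduled into one partition that still respects the device area, trading reconfigurations for more sharing. The `remap` callback stands for a re-run of the mapping step and is assumed, not part of the talk:

```python
def try_merge(tp_a, tp_b, remap):
    """Return the merged partition if it fits the device, else None.

    `remap(nodes, tp)` re-schedules `nodes` into the fresh partition `tp`
    (e.g. with the try_map sketch above) and returns True on success.  The
    merged slot is bounded by the sum of the two slots, but the rescheduled
    nodes may finish earlier, as in the example (4 cs rather than 5).
    """
    merged = {"maxcs": tp_a["maxcs"] + tp_b["maxcs"],
              "area": 0, "fus": {}, "nodes": []}
    if remap(tp_a["nodes"] + tp_b["nodes"], merged):
        return merged        # one reconfiguration fewer
    return None              # keep the two partitions as they are
```

With the hypothetical graph, the merged partition needs only one multiplier and one adder (3 cells) over 4 cs, and absorbing partition 3 still fits, matching the single partition (1,2,3) of 4 cs on the slide.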

Experimental Results

Near-optimal without sharing vs. sharing (EX1, SEHWA, HAL, EWF)

[Chart: for each benchmark, the number of temporal partitions (#TPs, 0 to 18) and the performance improvement (-30% to +30%); series: #p(SA), #p(Our*), %(#cs-Our*), %(#cs-Our**)]

Experimental Results

Near-optimal without sharing vs. sharing (FIR, MAT4x4)

[Chart: number of temporal partitions (#TPs, 0 to 28, with the values 72 and 37 annotated) and performance improvement (-16% to +32%); series: #p(SA), #p(Our*), %(#cs-Our*), %(#cs-Our**)]

Experimental Results

Performance vs. Number of Temporal Partitions

Mult4x4, RMAX=10 (no sharing of adders)

[Chart: final number of TPs (0 to 30) and execution time (64 to 72 #cs) as a function of the initial number of TPs (1 to 25)]

Experimental Results

Is the algorithm good for scheduling?

Comparison to some optimum results

[Chart: number of control steps (#cs, 0 to 35) obtained by the algorithm vs. known scheduling results, for EWF and SEHWA]

Related Work

List-Scheduling considering dynamic reconfiguration [Vasilko et al., FPL’96]

ASAP-based approach [GajjalaPurna et al., IEEE Trans. on Computers, 1999]

Minimize latency taking into account communication costs [Cardoso et al., VLSI’99]: Enhanced Static-List Scheduling; iterative approach (Simulated Annealing)

ILP formulation [SPARCs, DATE’98; RAW’98]

Enhanced Force-Directed List Scheduling [Pandey et al., SPIE’99]

And others [see the Related Work section]

Conclusions

Novel algorithm simultaneously performing temporal partitioning and sharing of functional units: low complexity; heuristic approach; based on gradually enlarging the time slots

Permits exploiting the duality between the number of temporal partitions and resource sharing

Close-to-optimum results with some examples

Results show that the algorithm is also competitive when used purely for scheduling

Future Work

Enhancements to the algorithm: consider functional units with pipelining; consider pipelining between execution and reconfiguration

Study the possibility of taking communication and reconfiguration costs into account

Test the results on a reconfigurable computing system (commercial board)

Contact Author

João M. P. Cardoso

jmpc@acm.org

http://w3.ualg.pt/~jmcardo

THANK YOU!
