Bitwidth-Aware Scheduling and Binding in High-Level Synthesis

X. Cheng+, J. Cong, Y. Fan, G. Han, J. Lin, J. Xu+, Z. Zhang

Computer Science Department, UCLA

+Microprocessor Development and Research Center, PKU

Outline

Motivation
Bitwidth-aware synthesis flow
  Scheduling and binding to minimize total bits of functional units (FU)
  Minimum weighted-interval-graph coloring problem for register allocation and binding
Experimental results
Conclusion

Motivation

High-level languages
  Big gap between design productivity and complexity
  Alleviate the design complexity
  Need to produce high-quality products

Need to consider multi-bitwidth
  Recent research shows that 40% of the bits in programs written in high-level languages are redundant [Stephenson et al, SIGPLAN’00]
  Hardware resource cost will be reduced with consideration of multi-bitwidth
    Area is proportional to input bitwidth for adders and registers, and is proportional to the square of input bitwidth for multipliers
    Wire-length is reduced accordingly

Conventional high-level synthesis only focuses on resources with uniform bitwidth

Motivational Example-Impact of Bitwidth

[Figure: a DFG with four additions (bitwidths 18, 5, 26, 16) and four multiplications (18*6, 16*4, 24*16, 32*16). A multiplication takes 3 clock cycles and an addition takes 1 clock cycle; execution time is 8 clock cycles. Two scheduling and binding solutions are compared.]

Solution 1: adders 26, 18; multipliers 32x16, 24x16
Solution 2: adders 26, 5; multipliers 32x16, 18x6

30% saving (adders)
31% saving (multipliers)
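The two savings can be checked with a few lines of arithmetic. The sketch below assumes the area model from the Motivation slide (adder area proportional to bitwidth, multiplier area proportional to the product of its two input bitwidths). It also assumes that the figure labels pair up as adders {26, 18} with multipliers {32x16, 24x16} for the first binding and adders {26, 5} with multipliers {32x16, 18x6} for the second; that pairing is read off the figure, not stated in the text.

```python
def adder_area(widths):
    # Assumed model: adder area grows linearly with bitwidth.
    return sum(widths)

def multiplier_area(shapes):
    # Assumed model: multiplier area grows with the product of input bitwidths.
    return sum(a * b for a, b in shapes)

def saving(before, after):
    # Relative saving in percent.
    return 100.0 * (before - after) / before

adders_1, adders_2 = [26, 18], [26, 5]
muls_1, muls_2 = [(32, 16), (24, 16)], [(32, 16), (18, 6)]

adder_saving = saving(adder_area(adders_1), adder_area(adders_2))
mul_saving = saving(multiplier_area(muls_1), multiplier_area(muls_2))

print(f"adder saving: {adder_saving:.0f}%")
print(f"multiplier saving: {mul_saving:.0f}%")
```

Under these assumptions the adder bits drop from 44 to 31 and the multiplier area from 896 to 620, which matches the 30% and 31% quoted on the slide.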

Related Works

High-level synthesis with consideration of bitwidth
  ILP formulation [Constantinides et al, IEEE Electronics Letters’00]
  Heuristic solution [Kum et al ’01] [Constantinides et al, DATE’01]
  Split adders into 1-bit [Molina et al DAC’02]
  Partially guarded computation [Choi et al, ISLPED’00]

Limitation
  No consideration of interconnect delay in scheduling and binding
    Interconnect delays dominate the timing in DSM technologies
  No optimality evaluation of proposed solutions for register allocation and binding




Bitwidth-Aware Synthesis Flow

Multiple bitwidth scheduling and binding problem

Given: (1) A DFG annotated with bitwidths, (2) a time constraint, (3) placement information of functional units, and (4) a resource IP library, where each resource type has arbitrary bitwidth configurations, each of which is associated with an area cost.

Objective: Schedule and bind the DFG into the library with consideration of interconnect delay from placement and without violating the time constraint, such that the final area of the required resources is minimized.

[Figure: bitwidth-aware synthesis flow chart.
 Stage labels: 1. Machine-SUIF; 2. MCAS; 3. Bitwidth-aware Synthesis; 4. Back end implementation.
 Steps: Functional description in C; Compilation & optimization; Bit-width analysis; Scheduling, binding, & placement; Bitwidth-aware scheduling and FU allocation & binding; Bitwidth-aware register allocation & binding; Datapath & FSM generation; RTL VHDL.
 Intermediate data: DFG with bit-width information; DFG with initial scheduling & binding; Scheduled and bound DFG with bitwidth.]

RDR+MCAS

[Figure: RDR micro-architecture. The chip is an array of islands of size Wi x Hi joined by global interconnect. Each island holds a local computational cluster (LCC) with an area constraint (ALU, MUL, MUX), a register file, and an FSM. Communication is annotated with 1 cycle, 2 cycles, ..., K cycles.]

One solution for multi-cycle on-chip communication
  Regular Distributed Register (RDR) micro-architecture [Cong et al, ISPD’03] [Cong et al, ICCAD’03]
  The whole chip is divided into an array of islands
  Choose the island size such that local computation and communication in each island can be done in a single cycle

MCAS: Architectural Synthesis for Multi-cycle Communication
  Efficiently maps the behavioral descriptions to RDR uArch
  Integrates architectural synthesis with physical planning

Placement information of functional units




Scheduling and Binding

Lower bound estimation of FU bitwidth for a DFG
  Prior works focus on the number of FUs

Lower-bound-based simultaneous scheduling and binding
  Time constrained
  Consider the interconnect delay obtained from placement information given by MCAS

Lower Bound Estimation

Extend the interval-based technique of [Sharma et al, 93] to support multi-bitwidth FUs

Main idea
  Compute the minimum resource requirement R(p, q) for each time interval $[p, q] \subseteq [1, T]$
  The maximum of R(p, q) over all intervals is the final bitwidth lower bound

Example of Lower-bound Estimation

The minimum bitwidth requirement for multipliers in interval [4, 7]

Theorem: For any feasible scheduling, the minimum overlap between operation o and interval [p, q] is:

$O(o, p, q) = \min\{\, |\mathrm{Lifetime}_{ASAP}(o) \cap [p, q]|,\ |\mathrm{Lifetime}_{ALAP}(o) \cap [p, q]| \,\}$

[Figure: ASAP and ALAP schedules of the example DFG over control steps 1 to 8, with multiplications a (18*6), b (24*16), c (32*16), d (16*4) and additions of bitwidths 18, 5, 26, 16.]

The minimum overlap between the multiplications a, b, c and d, and interval [4, 7]
  O(a_{18*6}, 4, 7) = 1
  O(b_{24*16}, 4, 7) = 2
  O(c_{32*16}, 4, 7) = 1
  O(d_{16*4}, 4, 7) = 1

The operation bitwidths that must be executed in [4, 7] are {18, 24, 24, 32, 16}

Sorted: {32, 24, 24, 18, 16}

The minimum bitwidth requirement for multipliers in [4, 7] will be R(4, 7) = {32, 16}
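The interval computation can be sketched in Python. `min_overlap` follows the theorem directly. `interval_requirement` is an assumption about how R(p, q) is obtained from the overlaps: each forced control step contributes one slot of the operation's bitwidth, the slots are sorted in descending order, and one FU is charged for every q - p + 1 slots, sized by the widest slot it takes. That reading reproduces R(4, 7) = {32, 16}, but the slides do not spell it out. The start times passed to `min_overlap` are hypothetical, since the figure's exact ASAP/ALAP steps are not recoverable.

```python
def min_overlap(asap_start, alap_start, latency, p, q):
    """Minimum number of control steps of [p, q] that an operation must occupy.

    The overlap is evaluated for the ASAP and the ALAP placement and the
    smaller of the two is returned, as in the theorem.
    """
    def overlap(start):
        lo, hi = max(start, p), min(start + latency - 1, q)
        return max(0, hi - lo + 1)
    return min(overlap(asap_start), overlap(alap_start))

def interval_requirement(overlaps, p, q):
    """overlaps: list of (bitwidth, minimum overlap with [p, q]).

    Assumed grouping: one FU can serve at most q - p + 1 slots in the
    interval, so every (q - p + 1)-th slot in descending order opens a new FU.
    """
    length = q - p + 1
    slots = sorted((w for w, n in overlaps for _ in range(n)), reverse=True)
    return [slots[i] for i in range(0, len(slots), length)]

# Hypothetical 3-cycle multiplication: ASAP start at step 2, ALAP start at step 5.
example_overlap = min_overlap(asap_start=2, alap_start=5, latency=3, p=4, q=7)

# Overlaps from the slide: a (18 bits) 1, b (24 bits) 2, c (32 bits) 1, d (16 bits) 1.
r_4_7 = interval_requirement([(18, 1), (24, 2), (32, 1), (16, 1)], 4, 7)
print(example_overlap, r_4_7)
```

Taking the ASAP and ALAP placements is enough because the overlap with a fixed interval first grows and then shrinks as the start step moves, so its minimum over the mobility range sits at one of the two ends.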

Area Cost

Weighted-area lower bound of an unscheduled DFG is defined as

$A = B_{add} + \beta_{mul} \cdot B_{mul}$

  $B_{add}$: area for adders
  $B_{mul}$: area for multipliers
  $\beta_{mul}$: a ratio weight of multiplier area over adder area

For a partially scheduled DFG, scheduling status S records the control steps for scheduled operations and feasible control steps for un-scheduled operations

A is calculated the same way, denoted as A(S)

Scheduling and Binding Algorithm-1

Goal: Minimize the area cost of required FUs
  Consider interconnect delay

Basic idea
  In each step, schedule an operation at a control step such that the resulting weighted-area lower bound A(S) is kept as small as possible

[Figure: example DFG of 16-bit and 32-bit additions over control steps 1 to 3.]
  add-16: feasible control steps [1,2]; A(16,1) = 48, A(16,2) = 48
  add-32: feasible control steps [2,3]; A(32,2) = 64, A(32,3) = 48

How to choose an operation and one of its feasible control steps
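A minimal sketch of evaluating A(S) for each candidate (operation, control step) pair, restricted to single-cycle adders. The three fixed operations `f1`, `f2`, `f3` are hypothetical: the slide's DFG is not recoverable from the figure, and they were chosen only so that the four values on the slide (48, 48, 64, 48) come out. The lower bound uses an interval-based reading (sort the bitwidths forced into an interval, charge one adder per q - p + 1 operations), which is an assumption, not the authors' stated procedure.

```python
def adder_lower_bound(ops, num_steps):
    """ops: name -> (bitwidth, first feasible step, last feasible step).

    For every interval [p, q], operations whose feasible range lies inside
    the interval must execute there. Sorting their bitwidths in descending
    order and taking every (q - p + 1)-th one gives the adders needed.
    """
    best = 0
    for p in range(1, num_steps + 1):
        for q in range(p, num_steps + 1):
            inside = sorted(
                (w for w, lo, hi in ops.values() if p <= lo and hi <= q),
                reverse=True,
            )
            best = max(best, sum(inside[:: q - p + 1]))
    return best

def bound_after(ops, num_steps, name, step):
    """A(S) after tentatively fixing operation `name` at control step `step`."""
    trial = dict(ops)
    width = trial[name][0]
    trial[name] = (width, step, step)
    return adder_lower_bound(trial, num_steps)

ops = {
    "f1": (16, 1, 1),      # hypothetical, already scheduled
    "f2": (32, 2, 2),      # hypothetical, already scheduled
    "f3": (16, 3, 3),      # hypothetical, already scheduled
    "add-16": (16, 1, 2),  # feasible control steps [1, 2]
    "add-32": (32, 2, 3),  # feasible control steps [2, 3]
}

candidates = {
    (name, step): bound_after(ops, 3, name, step)
    for name in ("add-16", "add-32")
    for step in range(ops[name][1], ops[name][2] + 1)
}
for (name, step), bound in candidates.items():
    print(name, step, bound)
```

In this toy instance only placing add-32 at step 2 raises the bound (to 64), so a greedy pass that keeps A(S) small would avoid that choice. How ties among the remaining candidates are broken is not specified on the slide.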

Scheduling and Binding Algorithm-2

Simultaneous scheduling and binding with consideration of interconnect delay

After operation o and control step c are chosen, FU binding is performed to decide whether o can finally be scheduled at step c
  There is an available FU usable by o at step c
  Data dependence between o and its scheduled and bound predecessors and successors is maintained

[Figure: operations * and + over control steps 1 to 3, bound to a MUL and an ADD unit placed in different islands, with 1 clock cycle of interconnect delay between the islands.]
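The two binding conditions can be illustrated with a small feasibility check. Everything in this sketch is hypothetical: the `FU` record, the `wire_cycles` delay model (Manhattan distance between island coordinates, zero inside an island), and the convention that a result produced at the end of step f is usable from step f + 1 plus the wire delay. Only predecessors are checked here; successors would be handled symmetrically.

```python
from dataclasses import dataclass, field

@dataclass
class FU:
    kind: str                  # "add" or "mul"
    width: int                 # bitwidth of the unit
    island: tuple              # (row, col) of the island holding the unit
    busy: set = field(default_factory=set)  # control steps already taken

def wire_cycles(src, dst):
    # Assumed delay model: one extra cycle per island hop.
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def can_bind(kind, width, latency, step, fu, preds):
    """preds: list of (finish step, island) of scheduled and bound predecessors."""
    if fu.kind != kind or fu.width < width:
        return False  # unit is of the wrong type or too narrow
    if any(s in fu.busy for s in range(step, step + latency)):
        return False  # unit is occupied during the operation
    # Each operand must have arrived, including the inter-island delay.
    return all(finish + wire_cycles(isl, fu.island) < step for finish, isl in preds)

adder = FU(kind="add", width=16, island=(0, 1))
producer = (1, (0, 0))  # a multiplication finishing at step 1 in the neighbouring island

print(can_bind("add", 16, 1, 2, adder, [producer]))  # operand still in flight
print(can_bind("add", 16, 1, 3, adder, [producer]))  # operand has arrived
```

With a one-cycle hop between the two islands, the addition cannot start at step 2 but can at step 3, which is the situation the figure appears to depict.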




Register Allocation and Binding

Problem formulation
  Given: A scheduled DFG annotated with bitwidth
  Objective: Perform register allocation and binding to minimize the total bitwidth of registers

Register allocation
  Decide the minimum required registers

Register binding
  Explicitly map variables to register instances

Preliminaries

Lifetime of a variable
  s(o): the control step where variable o is produced
  e(o): the last control step where variable o is consumed

Weighted interval graph
  A proper coloring of G corresponds to a register allocation and binding scheme

Weight of a coloring scheme
  The weight of color c: $W(c) = \max\{\, w(v) \mid v \text{ is colored with } c \,\}$
  The weight of the coloring scheme P is defined as $W(G, P) = \sum_{c \in P} W(c)$

[Figure: scheduled DFG, lifetimes of variables with bitwidths 24, 18, 16, 5, and the corresponding weighted interval graph. The example coloring shown has weight 24+16+18 = 58.]
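A short sketch of the weight definition. The lifetimes below are hypothetical (the figure's control steps are not recoverable); they were picked so that four variables of 24, 18, 16, and 5 bits admit both a coloring of weight 58 and one of weight 45. Lifetimes are treated as closed intervals that conflict when they share a control step, which is also an assumption.

```python
def overlaps(a, b):
    # Closed intervals (start, end) conflict if they share a control step.
    return a[0] <= b[1] and b[0] <= a[1]

def is_proper(lifetimes, coloring):
    """True if no two variables in the same color (register) overlap."""
    return all(
        not overlaps(lifetimes[u], lifetimes[v])
        for group in coloring
        for i, u in enumerate(group)
        for v in group[i + 1:]
    )

def coloring_weight(widths, coloring):
    """W(G, P): each color costs the widest variable stored in it."""
    return sum(max(widths[v] for v in group) for group in coloring)

widths = {"v24": 24, "v18": 18, "v16": 16, "v5": 5}
lifetimes = {"v24": (1, 3), "v18": (4, 6), "v16": (2, 5), "v5": (2, 3)}  # hypothetical

coloring_a = [["v24"], ["v16"], ["v18", "v5"]]
coloring_b = [["v24", "v18"], ["v16"], ["v5"]]

print(is_proper(lifetimes, coloring_a), coloring_weight(widths, coloring_a))
print(is_proper(lifetimes, coloring_b), coloring_weight(widths, coloring_b))
```

Both colorings use three registers, yet pairing the 18-bit variable with the 5-bit one costs 24+16+18 = 58 bits while pairing it with the 24-bit one costs 24+16+5 = 45 bits, so the number of colors alone does not determine the register cost.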

Coloring Problem

Weighted-interval-graph coloring problem
  Given: A weighted interval graph G(V, E)
  Objective: Find a coloring scheme P of G, such that the weight of the coloring scheme P, W(G, P), is minimized

Uniform weights
  Can be solved in polynomial time (Left-edge)

Various weights
  The complexity remains unknown
  We propose a lower-bound estimation and an efficient algorithm
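For the uniform-weight case, a left-edge-style pass can be sketched as follows: variables are visited in order of their left edge (start step) and each goes into the first register that is already free. This is the textbook first-fit formulation, not code from the authors, and the lifetimes are hypothetical closed intervals.

```python
def left_edge(lifetimes):
    """lifetimes: name -> (start, end), closed intervals.

    Returns a list of registers, each a list of variable names.
    """
    registers = []  # each entry: [end step of last variable, [names]]
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for reg in registers:
            if reg[0] < start:  # register is free before this variable starts
                reg[0] = end
                reg[1].append(name)
                break
        else:
            registers.append([end, [name]])
    return [names for _, names in registers]

lifetimes = {"v24": (1, 3), "v18": (4, 6), "v16": (2, 5), "v5": (2, 3)}  # hypothetical
binding = left_edge(lifetimes)
print(len(binding), binding)
```

The pass minimizes the number of registers, which is the right objective when every register has the same width. It never looks at bitwidths, so with varying weights nothing stops it from placing a narrow variable in a wide register when a better grouping exists.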

Lower-Bound Estimation

[Figure: scheduled DFG and lifetimes of variables with bitwidths 24, 18, 16, 5.]

|C24| = 1
|C18| = 1
|C16| = 2
|C5| = 3

Bitwidth lower bound: 24*1 + 16*1 + 5*1 = 45
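One way to read the numbers on this slide, offered as an assumption: |C_w| is the largest set of simultaneously live variables among those at least w bits wide, so at least |C_w| registers must be w bits or wider. Charging each width only for the registers not already counted at a larger width gives the bound. The lifetimes are hypothetical and chosen to reproduce |C24| = 1, |C18| = 1, |C16| = 2, |C5| = 3.

```python
def max_clique(intervals):
    """Largest number of closed intervals sharing one control step.

    The maximum is always reached at the start step of some interval.
    """
    return max(
        (sum(1 for lo, hi in intervals if lo <= s <= hi) for s, _ in intervals),
        default=0,
    )

def bitwidth_lower_bound(variables):
    """variables: list of (bitwidth, start, end)."""
    bound, previous = 0, 0
    for w in sorted({bw for bw, _, _ in variables}, reverse=True):
        clique = max_clique([(s, e) for bw, s, e in variables if bw >= w])
        bound += w * (clique - previous)  # registers newly forced at width w
        previous = clique
    return bound

variables = [(24, 1, 3), (18, 4, 6), (16, 2, 5), (5, 2, 3)]  # hypothetical lifetimes
cliques = {
    w: max_clique([(s, e) for bw, s, e in variables if bw >= w])
    for w in (24, 18, 16, 5)
}
print(cliques, bitwidth_lower_bound(variables))
```

Here the 18-bit level adds no register beyond the one already required at 24 bits, which is why 18 does not appear in the sum 24*1 + 16*1 + 5*1 = 45.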

Coloring Algorithm

[Figure: scheduled and bound DFG and lifetimes of variables with bitwidths 24, 18, 16, 5, after coloring.]

Weight of coloring: 24*1 + 16*1 + 5*1 = 45


Experimental Results-Weighted Interval-Graph Coloring

Designs    Lower Bound   Left-Edge+PostProcess   [Kum et al ’01]   Weighted IGC
aircraft   1270          1402                    1335              1270
chem       896           962                     929               897
dir        474           487                     505               474
honda      312           328                     368               313
lee        216           216                     232               216
mcm        689           721                     691               689
pr         270           297                     298               270
u5ml       1717          1892                    1778              1717
wang       269           293                     302               269
Ave gap    -             +6.6%                   +7.5%             +0.05%
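The "Ave gap" row can be recomputed from the table. The sketch assumes the gap is the per-design relative excess over the lower bound, averaged over the nine designs with equal weight; the slides do not state the averaging rule, but this one reproduces the reported figures.

```python
# Columns: lower bound, Left-Edge+PostProcess, [Kum et al '01], Weighted IGC
rows = {
    "aircraft": (1270, 1402, 1335, 1270),
    "chem": (896, 962, 929, 897),
    "dir": (474, 487, 505, 474),
    "honda": (312, 328, 368, 313),
    "lee": (216, 216, 232, 216),
    "mcm": (689, 721, 691, 689),
    "pr": (270, 297, 298, 270),
    "u5ml": (1717, 1892, 1778, 1717),
    "wang": (269, 293, 302, 269),
}
methods = ["Left-Edge+PostProcess", "Kum et al '01", "Weighted IGC"]

def average_gap(column):
    # Mean over designs of the percentage excess over the lower bound.
    gaps = [100.0 * (r[column] - r[0]) / r[0] for r in rows.values()]
    return sum(gaps) / len(gaps)

avg = {name: average_gap(i + 1) for i, name in enumerate(methods)}
for name, gap in avg.items():
    print(f"{name}: +{gap:.2f}%")
```

Weighted IGC misses the lower bound on only two designs (chem and honda, by one bit each), which accounts for the whole 0.05% average gap.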

Experimental Results-Three Synthesis Flows

Flow1 (MCAS)
  MCAS generates the scheduling and binding results and placement information. All operations and variables have uniform bitwidth (32 bits).

Flow2 (MCAS+MB-PP)
  Perform a bitwidth post-processing after Flow1 is done, which sets the bitwidth of an FU to the maximum bitwidth of all operations executed on it, and sets the bitwidth of a register to the maximum bitwidth of all variables stored in it.

Flow3 (MCAS-MB)
  After MCAS generates the scheduling and binding results and placement, the lower-bound-based scheduling & binding and the bitwidth-aware register allocation and binding are performed.

All three flows share the same backend to generate datapath and controllers

Altera’s Quartus II version 2.2 0 is used to synthesize the resulting RTL VHDL onto the FPGA device Stratix™ EP1S80F1508C6

[Figure: the three flows drawn on the synthesis flow chart, with paths labeled Flow1, Flow2, and Flow3. Steps shown: Functional description in C; Compilation & optimization; Bit-width analysis; Bitwidth Postprocess; Bitwidth-aware scheduling and FU allocation & binding; Bitwidth-aware register allocation & binding; Datapath & FSM generation; RTL VHDL.]

Experimental Results-Comparison of the Three Synthesis Flows

Design     Node#   MCAS            MCAS+MB-PP      MCAS-MB
                   LE      WL(k)   LE      WL(k)   LE      WL(k)
aircraft   422     -       -       10559   267     6860    181
chem       342     8339    247     7101    191     4814    136
dir        127     2810    91      2075    48      1135    27
honda      107     2433    77      1774    38      1124    24
lee        49      1033    54      722     35      614     25
mcm        94      2562    105     2411    83      2392    75
pr         42      1194    63      1030    45      967     38
u5ml       565     14447   396     12774   318     7143    166
wang       48      1275    73      1078    36      1050    38
Ave Red.   -       1       1       -18.1%  -34.5%  -36.3%  -51.5%

• LE: Area results for datapath and control logic in terms of logic elements
• WL: Wire-length
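The "Ave Red." row can be reproduced from the per-design numbers. The sketch assumes the reduction is computed per design relative to the MCAS flow and then averaged with equal weight, and that aircraft is left out because it has no MCAS baseline in the table. Neither rule is stated on the slide, but together they match the reported values.

```python
# Per design: (LE, WL in k) for MCAS, MCAS+MB-PP, MCAS-MB.
# aircraft is omitted: the table gives no MCAS result for it.
results = {
    "chem": ((8339, 247), (7101, 191), (4814, 136)),
    "dir": ((2810, 91), (2075, 48), (1135, 27)),
    "honda": ((2433, 77), (1774, 38), (1124, 24)),
    "lee": ((1033, 54), (722, 35), (614, 25)),
    "mcm": ((2562, 105), (2411, 83), (2392, 75)),
    "pr": ((1194, 63), (1030, 45), (967, 38)),
    "u5ml": ((14447, 396), (12774, 318), (7143, 166)),
    "wang": ((1275, 73), (1078, 36), (1050, 38)),
}
LE, WL = 0, 1
MB_PP, MB = 1, 2

def average_reduction(flow, metric):
    # Mean over designs of the percentage reduction relative to MCAS.
    reductions = [
        100.0 * (r[0][metric] - r[flow][metric]) / r[0][metric]
        for r in results.values()
    ]
    return sum(reductions) / len(reductions)

summary = {
    ("MCAS+MB-PP", "LE"): average_reduction(MB_PP, LE),
    ("MCAS+MB-PP", "WL"): average_reduction(MB_PP, WL),
    ("MCAS-MB", "LE"): average_reduction(MB, LE),
    ("MCAS-MB", "WL"): average_reduction(MB, WL),
}
for (flow, metric), value in summary.items():
    print(f"{flow} {metric}: -{value:.1f}%")
```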

Conclusions

We presented a complete bitwidth-aware high-level synthesis flow based on the MCAS synthesis system

Experimental results
  Our bitwidth-aware synthesis flow achieves significant reduction in area and wire-length: on average 36.3% fewer logic elements and 51.5% less wire-length than the uniform-bitwidth MCAS flow

Reference

J. Choi, J. Jeon, and K. Choi, “Power Minimization of Functional Units by Partially Guarded Computation,” Proc. of ISLPED, 2000.

J. Cong, Y. Fan, X. Yang, and Z. Zhang, “Architecture and Synthesis for Multi-Cycle Communication,” Proc. of International Symposium on Physical Design, 2003.

J. Cong, Y. Fan, G. Han, X. Yang, and Z. Zhang, “Architecture and Synthesis for On-Chip Multicycle Communication,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 2004.

G. A. Constantinides, P. Y. K. Cheung, and W. Luk, “Optimal Datapath Allocation for Multiple-Wordlength Systems,” IEEE Electronics Letters, 2000.

G. A. Constantinides, P. Y. K. Cheung, and W. Luk, “Heuristic Datapath Allocation for Multiple Wordlength Systems,” Proc. of Design, Automation and Test in Europe (DATE), 2001.

K. Kum and W. Sung, “Combined Word-Length Optimization and High-Level Synthesis of Digital Signal Processing Systems,” IEEE Trans. on Computer Aided Design of Integrated Circuits and Systems, 2001.

M. C. Molina, J. M. Mendias, and R. Hermida, “High-Level Synthesis of Multiple-Precision Circuits Independent of Data-Objects Length,” Proc. of the 39th Design Automation Conference, 2002.

A. Sharma and R. Jain, “Estimating Architectural Resources and Performance for High-Level Synthesis Applications,” IEEE Trans. on VLSI Systems, 1993.

M. Stephenson, J. Babb, and S. Amarasinghe, “Bitwidth Analysis with Application to Silicon Compilation,” Proc. of the ACM SIGPLAN'2000 Conference on Programming Language Design and Implementation, 2000.
